6  Keyness

For this lab, we’ll be following many of the same procedures that we’ve used previously.

For today’s lab we’ll begin some hypothesis testing using new functions from our repository: keyness_table(), keyness_pairs(), and key_keys().

We’ll also look at quanteda’s function textstat_keyness().

6.1 What is keyness?

Keyness is a generic term for various tests that compare observed vs. expected frequencies.

The most commonly used (though not the only option) is called log-likelihood in corpus linguistics, but you will see it elsewhere called a G-test of goodness-of-fit.

The calculation is based on a 2 x 2 contingency table. It is similar to a chi-square test, but performs better when corpora are unequally sized.

Expected frequencies are based on the relative size of each corpus (its total number of words, \(N_i\)) and the total of the observed frequencies:

\[ E_i = \frac{N_i \sum_i O_i}{\sum_i N_i} \]

And log-likelihood is calculated according to the formula:

\[ LL = 2 \sum_i O_i \ln \frac{O_i}{E_i} \]

A good explanation of its implementation in linguistics can be found here: http://ucrel.lancs.ac.uk/llwizard.html
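To make the two formulas concrete, here is a minimal sketch in base R. The function name log_like() and its arguments are illustrative assumptions; it is not part of quanteda or of our repository, and it simply follows the formulas as written above.

```r
# Hypothetical helper: log-likelihood for one token across two corpora.
# o = observed frequencies, n = corpus sizes (total tokens), both of length 2.
log_like <- function(o, n) {
  # Expected frequencies: the total observed count split by relative corpus size
  e <- sum(o) * n / sum(n)
  # A zero observed frequency contributes 0 to the sum
  terms <- ifelse(o > 0, o * log(o / e), 0)
  2 * sum(terms)
}

# A token occurring 10 times in each of two corpora of unequal size
log_like(o = c(10, 10), n = c(100, 300))

# Frequencies exactly proportional to corpus size give a value of 0
log_like(o = c(10, 30), n = c(100, 300))
```

Note that the formula on its own only yields non-negative values. Tools that report negative keyness attach the sign separately, according to whether the token is relatively less frequent in the target.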

In addition to log-likelihood, the textstat_keyness() function in quanteda has other optional measures.

See here: https://quanteda.io/reference/textstat_keyness.html

6.2 Prepare a corpus

We’ll begin, just as we did in the distributions lab.

6.2.1 Load the needed packages


Load the functions:


Load the data:


6.2.2 Pre-process the data & create a corpus

sc <- sample_corpus %>%
  mutate(text = preprocess_text(text)) %>%
  corpus()

6.2.3 Extract meta-data from file names

We’ll extract some meta-data by (1) selecting the doc_id column, (2) extracting the initial letter string before the underscore, and (3) renaming the vector text_type.

doc_categories <- sample_corpus %>%
  dplyr::select(doc_id) %>%
  mutate(doc_id = str_extract(doc_id, "^[a-z]+")) %>%
  rename(text_type = doc_id)
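If the regular expression is unfamiliar, the following base R sketch shows what it captures. The ids are made up for illustration, though they follow the same naming pattern as the sample corpus, and regmatches() stands in here for str_extract().

```r
# Made-up document ids that follow the naming pattern of the sample corpus
ids <- c("acad_01", "fic_12", "news_07")

# "^[a-z]+" matches the run of lowercase letters at the start of each id,
# which stops at the underscore
text_type <- regmatches(ids, regexpr("^[a-z]+", ids))
text_type
```

One difference worth knowing: regmatches() silently drops ids that do not match, whereas str_extract() returns NA for them, which keeps the result aligned with the original rows.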

6.2.4 Assign the meta-data to the corpus

The accessor function docvars() lets us add or modify the document-level variables attached to a corpus. We’re going to use it to assign text_type as a variable. Note that doc_categories could include more than one column and the assignment process would be the same.

docvars(sc) <- doc_categories

And check the result:

sc |>
  summary() |>
  head(10) |>
  gt()
Partial summary of sample corpus.
Text Types Tokens Sentences text_type
acad_01 772 2534 1 acad
acad_02 933 2544 1 acad
acad_03 889 2525 1 acad
acad_04 941 2541 1 acad
acad_05 857 2504 1 acad
acad_06 962 2575 1 acad
acad_07 615 2443 1 acad
acad_08 805 2540 1 acad
acad_09 912 2544 1 acad
acad_10 1063 2532 1 acad

Note the new column (text_type on the right). We could assign any number of categorical variables to our corpus, which could be used for analysis downstream.

6.2.5 Create a dfm

sc_dfm <- sc %>%
  tokens(what="fastestword", remove_numbers=TRUE) %>%
  tokens_compound(pattern = phrase(multiword_expressions)) %>%
  dfm()

6.3 A corpus composition table

It is conventional to report the composition of the corpus or corpora you are using for your study. Here we will sum our tokens by text-type and similarly count the number of texts in each grouping.

corpus_comp <- ntoken(sc_dfm) %>% 
  data.frame(Tokens = .) %>%
  rownames_to_column("Text_Type") %>%
  mutate(Text_Type = str_extract(Text_Type, "^[a-z]+")) %>%
  group_by(Text_Type) %>%
  summarize(Texts = n(),
    Tokens = sum(Tokens)) %>%
  mutate(Text_Type = c("Academic", "Blog", "Fiction", "Magazine", "News", "Spoken", "Television/Movies", "Web"))

Now, using grand_summary_rows(), we can append a row of totals at the bottom of the table.

corpus_comp |> 
  gt() |>
  fmt_integer() |>
  cols_label(
    Text_Type = md("**Text Type**"),
    Texts = md("**Texts**"),
    Tokens = md("**Tokens**")
  ) |>
  grand_summary_rows(
    columns = c(Texts, Tokens),
    fns = list(
      Total ~ sum(.)
    ),
    fmt = ~ fmt_integer(.)
  )
Composition of the sample corpus.

Text Type Texts Tokens
Academic 50 121,442
Blog 50 125,492
Fiction 50 128,644
Magazine 50 126,631
News 50 119,029
Spoken 50 127,156
Television/Movies 50 128,191
Web 50 124,302
Total 400 1,000,887

6.4 Keyness in quanteda

Now that we have a dfm we can perform keyness calculations. First, let’s carry out calculations using textstat_keyness().

The second argument to textstat_keyness() identifies the target. Here we are indicating that we want the texts with text_type equal to “acad” to be our target corpus. Everything else (i.e., the texts for which text_type == “acad” is FALSE) will be the reference corpus.

The specific method we’re using is log-likelihood, which is designated by “lr”. Thus keyness will show the tokens that are more frequent in the academic text-type than in the other text-types.

acad_kw <- textstat_keyness(sc_dfm, docvars(sc_dfm, "text_type") == "acad", measure = "lr")

Note the second argument: docvars(sc_dfm, "text_type") == "acad". That slightly awkward syntax simply produces a logical vector. You could store it and pass the vector to the function, as well.
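As a small illustration of that logical vector, the sketch below uses a made-up character vector as a stand-in for docvars(sc_dfm, "text_type"). The commented lines show how the stored vector could then be passed through the target argument of textstat_keyness().

```r
# Stand-in for docvars(sc_dfm, "text_type"): one label per document
text_type <- c("acad", "acad", "fic", "news", "acad")

# The comparison returns TRUE for target documents and FALSE for the rest
is_acad <- text_type == "acad"
is_acad

# With the real dfm, the stored vector could be passed as the target:
# is_acad <- docvars(sc_dfm, "text_type") == "acad"
# acad_kw <- textstat_keyness(sc_dfm, target = is_acad, measure = "lr")
```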

acad_kw |>
  head(10) |>
  gt() |>
  fmt_number(columns = "G2",
             decimals = 2)
Tokens with the highest keyness values in the academic text-type when compared to the rest of the sample corpus.
feature G2 p n_target n_reference
of 1,703.90 0 4848 17258
the 777.88 0 8273 42712
social 456.00 0 208 133
studies 391.06 0 155 71
study 318.95 0 158 119
in 316.66 0 2715 13341
perfectionism 301.84 0 74 1
by 297.33 0 790 2707
practice 267.90 0 117 68
science 259.37 0 118 75
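The p column shows 0 for every row because these G2 values are very large. As a rough check, and assuming that G2 is referred to a chi-squared distribution with one degree of freedom (the usual choice for a 2 x 2 table), we can compute p-values directly in base R. The value for science comes from the table above; the other two values are made up for contrast.

```r
# Assumption: G2 is compared to a chi-squared distribution with 1 df
g2 <- c(science = 259.37, borderline = 3.84, weak = 1.50)
p <- pchisq(g2, df = 1, lower.tail = FALSE)
p

# Critical value of G2 for p < 0.05 with 1 degree of freedom
qchisq(0.95, df = 1)
```

A G2 in the hundreds produces a p-value so small that it is displayed as 0, which is one reason to pair keyness with an effect size rather than rely on significance alone.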

6.4.1 Creating sub-corpora

If we want to compare one text-type (as our target corpus) to another (as our reference corpus), we can easily subset the data.

sub_dfm <- dfm_subset(sc_dfm, text_type == "acad" | text_type == "fic")

When we do this, the resulting data will still include all the tokens in the sample corpus, including those that do not appear in either the academic or fiction text-type. To deal with this, we will trim the dfm.

sub_dfm <- dfm_trim(sub_dfm, min_termfreq = 1)

Then we’ll calculate keyness with fiction as the target and academic as the reference.

fic_kw <- textstat_keyness(sub_dfm, docvars(sub_dfm, "text_type") == "fic", measure = "lr")
fic_kw |>
  head(10) |>
  gt() |>
  fmt_number(columns = "G2",
             decimals = 2)
Tokens with the highest keyness values in the fiction text-type when compared to the academic text-type.
feature G2 p n_target n_reference
i 2,350.21 0 2428 143
she 1,861.48 0 1763 70
he 1,699.10 0 1978 170
her 1,453.01 0 1559 104
you 1,361.11 0 1286 50
n't 929.29 0 914 43
his 805.06 0 1155 157
my 732.26 0 758 44
me 591.50 0 557 21
him 541.51 0 548 29

Note that if we switch our target and reference corpora (academic as target, fiction as reference), the tail of the keyness table contains the negative values of the original (fiction as target, academic as reference), which you may have already gathered given the formula above.

acad_kw <- textstat_keyness(sub_dfm, docvars(sub_dfm, "text_type") == "acad", measure = "lr")
acad_kw |>
  tail(10) |>
  gt() |>
  fmt_number(columns = "G2",
             decimals = 2)
Tokens with the lowest keyness values in the academic text-type when compared to the fiction text-type.
feature G2 p n_target n_reference
him −541.51 0 29 548
me −591.50 0 21 557
my −732.26 0 44 758
his −805.06 0 157 1155
n't −929.29 0 43 914
you −1,361.11 0 50 1286
her −1,453.01 0 104 1559
he −1,699.10 0 170 1978
she −1,861.48 0 70 1763
i −2,350.21 0 143 2428

6.5 Effect size

While quanteda produces one important piece of information (the amount of evidence we have for an effect), it neglects another (the magnitude of the effect). Whenever we report on significance it is critical to report effect size. Some common effect size measures include:

  • %DIFF - see Gabrielatos and Marchi (2011)
  • Bayes Factor (BIC) - see Wilson (2013)
    • You can interpret the approximate Bayes Factor as degrees of evidence against the null hypothesis as follows:
      • 0-2: not worth more than a bare mention
      • 2-6: positive evidence against H0
      • 6-10: strong evidence against H0
      • >10: very strong evidence against H0
    • For negative scores, the scale is read as “in favor of” instead of “against”.
  • Effect Size for Log Likelihood (ELL) - see Johnston et al. (2006)
    • ELL varies between 0 and 1 (inclusive). Johnston et al. say “interpretation is straightforward as the proportion of the maximum departure between the observed and expected proportions”.
  • Relative Risk
  • Odds Ratio
  • Log Ratio - see Andrew Hardie’s CASS blog for how to interpret this
    • Note that if either word has zero frequency then a small adjustment is automatically applied (0.5 observed frequency which is then normalized) to avoid division by zero errors.
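Several of the measures in the list above are simple functions of raw and relative frequencies. The following is a minimal sketch in base R, with %DIFF computed as the percentage difference between normalized frequencies. The function name effect_sizes() and its arguments are illustrative assumptions and do not come from our repository.

```r
# Hypothetical helper: simple effect sizes from raw frequencies.
# af_tar / af_ref = absolute frequencies; n_tar / n_ref = corpus sizes.
effect_sizes <- function(af_tar, af_ref, n_tar, n_ref) {
  rf_tar <- af_tar / n_tar   # relative frequency in the target
  rf_ref <- af_ref / n_ref   # relative frequency in the reference
  c(
    pct_diff      = (rf_tar - rf_ref) * 100 / rf_ref,
    relative_risk = rf_tar / rf_ref,
    odds_ratio    = (af_tar / (n_tar - af_tar)) / (af_ref / (n_ref - af_ref))
  )
}

# "social": 208 occurrences in academic (121,442 tokens) vs.
# 7 occurrences in fiction (128,644 tokens)
effect_sizes(208, 7, 121442, 128644)
```

This sketch applies no zero-frequency adjustment, so a token that is absent from the reference corpus returns Inf.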

6.5.1 Log Ratio (LR)

You are welcome to use any of these effect size measures. Our repo comes with a function for calculating Hardie’s Log Ratio, which is easy and intuitive.
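To show what the measure does, here is a minimal sketch of Log Ratio as the binary logarithm of the ratio of relative frequencies. This is not the repository’s implementation: the function name log_ratio() is hypothetical, and the zero-frequency handling (substituting 0.5 for a zero count) is an assumption modeled on the adjustment described in the list above.

```r
# Hypothetical helper: Log Ratio as log2 of the ratio of relative frequencies
log_ratio <- function(af_tar, af_ref, n_tar, n_ref) {
  # Assumed zero-frequency adjustment: substitute 0.5 for a zero count
  af_tar <- ifelse(af_tar == 0, 0.5, af_tar)
  af_ref <- ifelse(af_ref == 0, 0.5, af_ref)
  log2((af_tar / n_tar) / (af_ref / n_ref))
}

# "of": 4,848 in academic (121,442 tokens) vs. 2,153 in fiction (128,644 tokens)
log_ratio(4848, 2153, 121442, 128644)
```

Because the logarithm is base 2, each additional point doubles the ratio: a Log Ratio of 1 means a token is twice as frequent (relatively) in the target, 2 means four times, 3 means eight times, and so on.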

6.5.2 The keyness_table() function

We’ll start by creating two dfms: a target and a reference.

acad_dfm <- dfm_subset(sc_dfm, text_type == "acad") %>% dfm_trim(min_termfreq = 1)
fic_dfm <- dfm_subset(sc_dfm, text_type == "fic") %>% dfm_trim(min_termfreq = 1)

Then we will use the keyness_table() function.

acad_kw <- keyness_table(acad_dfm, fic_dfm)

And check the result:

acad_kw |>
  head(10) |>
  gt() |>
  fmt_number(columns = c("LL", "LR", "Per_10.5_Tar", "Per_10.5_Ref", "DP_Tar", "DP_Ref"),
             decimals = 2) |>
  fmt_number(columns = "PV",
             decimals = 5)
Tokens with the highest keyness values in the academic text-type when compared to the fiction text-type.
Token LL LR PV AF_Tar AF_Ref Per_10.5_Tar Per_10.5_Ref DP_Tar DP_Ref
of 1,225.77 1.25 0.00000 4848 2153 3,992.03 1,673.61 0.09 0.15
the 250.03 0.37 0.00000 8273 6768 6,812.31 5,261.03 0.11 0.10
social 248.10 4.98 0.00000 208 7 171.28 5.44 0.65 0.88
are 221.19 1.44 0.00000 707 276 582.17 214.55 0.21 0.30
studies 213.17 7.36 0.00000 155 1 127.63 0.78 0.67 0.98
by 209.00 1.29 0.00000 790 342 650.52 265.85 0.17 0.22
in 185.82 0.58 0.00000 2715 1922 2,235.64 1,494.05 0.11 0.09
students 179.36 5.27 0.00000 146 4 120.22 3.11 0.80 0.94
research 175.19 4.74 0.00000 151 6 124.34 4.66 0.63 0.90
is 171.74 0.87 0.00000 1241 720 1,021.89 559.68 0.23 0.36

The columns are as follows:

  1. LL: the keyness value or log-likelihood, also known as G2 or a goodness-of-fit test.
  2. LR: the effect size, which here is the log ratio
  3. PV: the p-value associated with the log-likelihood
  4. AF_Tar: the absolute frequency in the target corpus
  5. AF_Ref: the absolute frequency in the reference corpus
  6. Per_10.x_Tar: the relative frequency in the target corpus (automatically calibrated to a normalizing factor, which here is per 100,000 tokens)
  7. Per_10.x_Ref: the relative frequency in the reference corpus (automatically calibrated to a normalizing factor, which here is per 100,000 tokens)
  8. DP_Tar: the deviation of proportions (a dispersion measure) in the target corpus
  9. DP_Ref: the deviation of proportions in the reference corpus
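Two of these columns are easy to reproduce by hand. The sketch below normalizes a frequency to per 100,000 tokens and computes the deviation of proportions in its plain form (half the sum of absolute differences between observed and expected shares). The function name dev_prop() is hypothetical, and whether keyness_table() uses exactly this variant of DP is an assumption.

```r
# Relative frequency per 100,000 tokens: "of" in the academic sub-corpus
4848 / 121442 * 1e5

# Hypothetical helper: deviation of proportions (DP).
# counts = a token's frequency in each text; sizes = each text's token total.
dev_prop <- function(counts, sizes) {
  expected <- sizes / sum(sizes)     # share of the corpus held by each text
  observed <- counts / sum(counts)   # share of the token's hits in each text
  0.5 * sum(abs(observed - expected))
}

# Evenly spread token: DP is 0
dev_prop(counts = c(5, 5, 5, 5), sizes = c(100, 100, 100, 100))

# Token confined to one of four equal-sized texts: DP is 0.75
dev_prop(counts = c(20, 0, 0, 0), sizes = c(100, 100, 100, 100))
```

Values near 0 indicate a token spread evenly across texts, while values near 1 indicate a token concentrated in a small part of the corpus.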

6.5.3 Keyness pairs

There is also a function for quickly generating pair-wise keyness comparisons among multiple sub-corpora. To demonstrate, create a third dfm, this time containing news articles.

news_dfm <- dfm_subset(sc_dfm, text_type == "news") %>% dfm_trim(min_termfreq = 1)

To produce a data.frame comparing more than two sub-corpora, use the keyness_pairs() function:

kp <- keyness_pairs(news_dfm, acad_dfm, fic_dfm)

Check the result:

kp |>
  head(10) |>
  gt() |>
  fmt_number(columns = everything(),
             decimals = 2)
Pairwise comparisons of news (target) vs. academic (reference), news (target) vs. fiction (reference), and academic (target) vs. fiction (reference).
Token A_v_B_LL A_v_B_LR A_v_B_PV A_v_C_LL A_v_C_LR A_v_C_PV B_v_C_LL B_v_C_LR B_v_C_PV
he 492.01 2.32 0.00 −394.56 −1.13 0.00 −1,686.80 −3.46 0.00
said 455.34 3.69 0.00 1.94 0.13 0.16 −414.33 −3.56 0.00
i 430.97 2.36 0.00 −853.61 −1.65 0.00 −2,330.44 −4.00 0.00
n't 333.06 3.23 0.00 −173.20 −1.10 0.00 −926.44 −4.33 0.00
you 327.51 3.06 0.00 −410.98 −1.54 0.00 −1,355.34 −4.60 0.00
mr 236.58 5.10 0.00 75.25 1.61 0.00 −60.92 −3.48 0.00
park 226.44 8.36 0.00 139.47 3.20 0.00 −25.26 −5.16 0.00
she 212.56 2.36 0.00 −919.21 −2.21 0.00 −1,850.64 −4.57 0.00
p.m 209.56 8.25 0.00 199.71 6.33 0.00 −2.66 −1.92 0.10
ob 198.31 8.17 0.00 206.63 8.25 0.00 0.00 0.08 1.00

6.6 Key key words

The concept of key key words was introduced by Mike Scott for the WordSmith concordancer. The process compares each text in the target corpus to the reference corpus. Log-likelihood is calculated for each comparison. Then a mean is calculated for keyness and effect size. In addition, a range is provided for the number of texts in which keyness reaches significance for a given threshold. (The default is p < 0.05.) That range is returned as a percentage.

In this way, key key words accounts for the dispersion of key words by indicating whether a keyness value is driven by a relatively high frequency in a few target texts or many.
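The procedure can be sketched in base R for a single token. This is an illustration under stated assumptions rather than the code behind key_keys(): the helper signed_ll() is hypothetical, the counts are toy data, and p-values are assumed to come from a chi-squared distribution with one degree of freedom.

```r
# Hypothetical helper: signed log-likelihood, vectorized over target texts
signed_ll <- function(o_tar, o_ref, n_tar, n_ref) {
  e_tar <- (o_tar + o_ref) * n_tar / (n_tar + n_ref)
  e_ref <- (o_tar + o_ref) * n_ref / (n_tar + n_ref)
  term <- function(o, e) ifelse(o > 0, o * log(o / e), 0)
  ll <- 2 * (term(o_tar, e_tar) + term(o_ref, e_ref))
  # Negative when the token is relatively less frequent in the target text
  ifelse(o_tar / n_tar >= o_ref / n_ref, ll, -ll)
}

# Toy data: one token's counts in four target texts vs. one reference corpus
counts    <- c(12, 0, 9, 15)
sizes     <- c(2500, 2400, 2600, 2500)
ref_count <- 7
ref_size  <- 128644

# One comparison per target text
ll <- signed_ll(counts, ref_count, sizes, ref_size)
p  <- pchisq(abs(ll), df = 1, lower.tail = FALSE)

key_range <- 100 * mean(ll > 0 & p < 0.05)  # % of texts where the token is key
key_mean  <- mean(ll)                       # mean keyness across texts
key_sd    <- sd(ll)                         # spread of keyness across texts
c(key_range = key_range, key_mean = key_mean, key_sd = key_sd)
```

In this toy example the token is key in three of the four texts, so the range is 75 percent. A token with a high overall keyness but a low range would be one whose frequency is driven by only a few texts.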

kk <- key_keys(acad_dfm, fic_dfm)

Again, we can look at the first few rows of the table:

kk |>
  head(10) |>
  gt() |>
  fmt_number(columns = everything(),
             decimals = 2)
Key key words when comparing the academic text-type to the fiction text-type.
token key_range key_mean key_sd effect_mean
of 98.00 59.20 36.40 1.21
social 38.00 25.00 70.11 3.41
studies 44.00 22.56 67.90 5.93
students 16.00 19.64 70.17 3.56
the 64.00 17.80 26.72 0.31
research 34.00 17.57 63.50 3.48
political 26.00 17.00 57.92 3.78
changes 44.00 16.51 62.96 4.71
science 16.00 15.65 60.28 3.46
study 46.00 15.35 27.65 3.27
Pause for Lab Set Question

6.7 Works cited

Gabrielatos, Costas, and Anna Marchi. 2011. “Keyness: Matching Metrics to Definitions.” In Theoretical-Methodological Challenges in Corpus Approaches to Discourse Studies and Some Ways of Addressing Them. https://research.edgehill.ac.uk/en/publications/keyness-matching-metrics-to-definitions-2.
Johnston, Janis E, Kenneth J Berry, and Paul W Mielke Jr. 2006. “Measures of Effect Size for Chi-Squared and Likelihood-Ratio Goodness-of-Fit Tests.” Perceptual and Motor Skills 103 (2): 412–14.
Wilson, Andrew. 2013. “Embracing Bayes Factors for Key Item Analysis in Corpus Linguistics.” New Approaches to the Study of Linguistic Variability 4: 3–11.