library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(quanteda.extras)
library(gt)
6 Keyness
For this lab, we’ll be following many of the same procedures that we’ve done previously:
- attaching metadata to a corpus using docvars()
- tokenizing using tokens()
- handling multiword expressions using tokens_compound()
- creating a document-feature matrix using dfm()
For today’s lab we’ll begin some hypothesis testing using new functions from quanteda.extras:
- keyness_table()
- keyness_pairs()
- key_keys()
We’ll also look at quanteda’s function:
- textstat_keyness()
6.1 What is keyness?
Keyness is a generic term for various tests that compare observed vs. expected frequencies.
The most commonly used measure (though not the only option) is called log-likelihood in corpus linguistics, but elsewhere you will see it called a G-test of goodness-of-fit.
The calculation is based on a 2 × 2 contingency table. It is similar to a chi-square test, but performs better when corpora are unequally sized.
Expected frequencies are based on the relative size of each corpus (its total number of words) and the observed frequencies; in the formula below, R_i and C_j are the row and column totals of the contingency table and N is the combined token count:
\[ E_{ij} =\frac{R_i C_j}{N} \] And log-likelihood is calculated according to the formula:
\[ LL = 2 \times \sum_i \sum_j O_{ij} \log \frac{O_{ij}}{E_{ij}} \] A good explanation of its implementation in linguistics can be found here: http://ucrel.lancs.ac.uk/llwizard.html
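To make the calculation concrete, here is a minimal sketch that computes log-likelihood by hand for a single hypothetical word. The counts are invented purely for illustration: 150 occurrences in a 250,000-word target corpus and 100 in a 750,000-word reference corpus.
# hypothetical counts; rows are word/not-word, columns are target/reference
O <- matrix(c(150, 100,
              250000 - 150, 750000 - 100),
            nrow = 2, byrow = TRUE)
E <- outer(rowSums(O), colSums(O)) / sum(O)  # E_ij = (R_i * C_j) / N
2 * sum(O * log(O / E))                      # the log-likelihood (G2) statistic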
In addition to log-likelihood, the textstat_keyness() function in quanteda.textstats has other optional measures.
See here: https://quanteda.io/reference/textstat_keyness.html
6.2 Prepare a corpus
We’ll begin just as we did in the distributions lab.
6.2.1 Load the needed packages
6.2.2 Pre-process the data & create a corpus
sc <- sample_corpus |>
  mutate(text = preprocess_text(text)) |>
  corpus()
6.2.3 Extract meta-data from file names
We’ll extract some meta-data by (1) selecting the doc_id column, (2) extracting the initial letter string before the underscore, and (3) renaming the column text_type.
doc_categories <- sample_corpus |>
  dplyr::select(doc_id) |>
  mutate(doc_id = str_extract(doc_id, "^[a-z]+")) |>
  rename(text_type = doc_id)
6.2.4 Assign the meta-data to the corpus
The accessor function docvars() lets us add or modify data in an object. We’re going to use it to assign text_type as a variable. Note that doc_categories could include more than one column and the assignment process would be the same.
docvars(sc) <- doc_categories
And check the result:
Code
sc |>
  summary() |>
  head(10) |>
  gt()
| Text | Types | Tokens | Sentences | text_type |
|---|---|---|---|---|
| acad_01 | 772 | 2534 | 1 | acad | 
| acad_02 | 933 | 2544 | 1 | acad | 
| acad_03 | 889 | 2525 | 1 | acad | 
| acad_04 | 941 | 2541 | 1 | acad | 
| acad_05 | 857 | 2504 | 1 | acad | 
| acad_06 | 962 | 2575 | 1 | acad | 
| acad_07 | 615 | 2443 | 1 | acad | 
| acad_08 | 805 | 2540 | 1 | acad | 
| acad_09 | 912 | 2544 | 1 | acad | 
| acad_10 | 1063 | 2532 | 1 | acad | 
Note the new column (text_type on the right). We could assign any number of categorical variables to our corpus, which could be used for analysis downstream.
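As a sketch of what that would look like, we could add a second (purely hypothetical) variable alongside text_type and assign both columns in exactly the same way:
# hypothetical second variable: a logical flag marking the academic texts
doc_categories <- doc_categories |>
  mutate(is_academic = text_type == "acad")
docvars(sc) <- doc_categories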
6.2.5 Create a dfm
sc_dfm <- sc |>
  tokens(what = "fastestword", remove_numbers = TRUE) |>
  tokens_compound(pattern = phrase(multiword_expressions)) |>
  dfm()
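As an optional sanity check, we can peek at the most frequent features in the new dfm with quanteda’s topfeatures():
# the ten most frequent tokens in the document-feature matrix
topfeatures(sc_dfm, 10)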
6.3 A corpus composition table
It is conventional to report out the composition of the corpus or corpora you are using for your study. Here we will sum our tokens by text-type and similarly count the number of texts in each grouping.
corpus_comp <- ntoken(sc_dfm) |> 
  as.data.frame() |>
  rownames_to_column("Text_Type") |>
  select(Text_Type, Tokens = `ntoken(sc_dfm)`) |>
  mutate(Text_Type = str_extract(Text_Type, "^[a-z]+")) |>
  group_by(Text_Type) |>
  summarize(Texts = n(),
    Tokens = sum(Tokens)) |>
  mutate(Text_Type = c("Academic", "Blog", "Fiction", "Magazine", "News", "Spoken", "Television/Movies", "Web"))Now, using grand_summary_rows(), we can append a row of totals at the bottom of the table.
Code
corpus_comp |> 
  gt() |>
  fmt_integer() |>
  cols_label(
    Text_Type = md("**Text Type**"),
    Texts = md("**Texts**"),
    Tokens = md("**Tokens**")
  ) |>
  grand_summary_rows(
    columns = c(Texts, Tokens),
    fns = list(
      Total ~ sum(.)
    ) ,
    fmt = ~ fmt_integer(.)
    )
| Text Type | Texts | Tokens |
|---|---|---|
| Academic | 50 | 121,442 |
| Blog | 50 | 125,492 |
| Fiction | 50 | 128,644 |
| Magazine | 50 | 126,631 |
| News | 50 | 119,029 |
| Spoken | 50 | 127,156 |
| Television/Movies | 50 | 128,191 |
| Web | 50 | 124,302 |
| Total | 400 | 1,000,887 |
6.4 Keyness in quanteda
Now that we have a dfm, we can perform keyness calculations. First, let’s carry out the calculations using textstat_keyness().
When we use it, we are indicating that we want the texts whose text_type is “acad” to be our target corpus. Everything else (i.e., where text_type == “acad” is FALSE) will be the reference corpus.
The specific method we’re using is the likelihood ratio test, which is designated by “lr”. Keyness will thus show the tokens that are more frequent in the academic text-type than in the other text-types.
acad_kw <- textstat_keyness(sc_dfm, docvars(sc_dfm, "text_type") == "acad", 
                            measure = "lr")
Note the second argument: docvars(sc_dfm, "text_type") == "acad". That slightly awkward syntax simply produces a logical vector. You could store the vector and pass it to the function, as well.
Code
acad_kw |>
  head(10) |>
  gt() |>
  fmt_number(columns = "G2",
             decimals = 2)
| feature | G2 | p | n_target | n_reference |
|---|---|---|---|---|
| of | 1,703.90 | 0 | 4848 | 17258 | 
| the | 777.88 | 0 | 8273 | 42712 | 
| social | 456.00 | 0 | 208 | 133 | 
| studies | 391.06 | 0 | 155 | 71 | 
| study | 318.95 | 0 | 158 | 119 | 
| in | 316.66 | 0 | 2715 | 13341 | 
| perfectionism | 301.84 | 0 | 74 | 1 | 
| by | 297.33 | 0 | 790 | 2707 | 
| practice | 267.90 | 0 | 117 | 68 | 
| science | 259.37 | 0 | 118 | 75 | 
6.4.1 Creating sub-corpora
If we want to compare one text-type (as our target corpus) to another (as our reference corpus), we can easily subset the data.
sub_dfm <- dfm_subset(sc_dfm, text_type == "acad" | text_type == "fic")
When we do this, the resulting dfm will still include all the tokens in the sample corpus, including those that do not appear in either the academic or fiction text-type. To deal with this, we will trim the dfm.
sub_dfm <- dfm_trim(sub_dfm, min_termfreq = 1)
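If you want to see what trimming did, a quick optional check is to compare feature counts: the untrimmed subset carries over every feature from sc_dfm, while the trimmed version keeps only those that actually occur in the academic or fiction texts.
# number of features before and after trimming
nfeat(sc_dfm)
nfeat(sub_dfm)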
Now we’ll run the comparison with fiction as the target and academic writing as the reference.
fic_kw <- textstat_keyness(sub_dfm, docvars(sub_dfm, "text_type") == "fic", 
                           measure = "lr")
Code
fic_kw |>
  head(10) |>
  gt() |>
  fmt_number(columns = "G2",
             decimals = 2)
| feature | G2 | p | n_target | n_reference |
|---|---|---|---|---|
| i | 2,350.21 | 0 | 2428 | 143 | 
| she | 1,861.48 | 0 | 1763 | 70 | 
| he | 1,699.10 | 0 | 1978 | 170 | 
| her | 1,453.01 | 0 | 1559 | 104 | 
| you | 1,361.11 | 0 | 1286 | 50 | 
| n't | 929.29 | 0 | 914 | 43 | 
| his | 805.06 | 0 | 1155 | 157 | 
| my | 732.26 | 0 | 758 | 44 | 
| me | 591.50 | 0 | 557 | 21 | 
| him | 541.51 | 0 | 548 | 29 | 
Note that if we switch our target and reference corpora (academic as target, fiction as reference), the tail of the keyness table contains the negative values of the original (fiction as target, academic as reference), which you may have already gathered from the formula above.
acad_kw <- textstat_keyness(sub_dfm, docvars(sub_dfm, "text_type") == "acad", measure = "lr")
Code
acad_kw |>
  tail(10) |>
  gt() |>
  fmt_number(columns = "G2",
             decimals = 2)
| feature | G2 | p | n_target | n_reference |
|---|---|---|---|---|
| him | −541.51 | 0 | 29 | 548 | 
| me | −591.50 | 0 | 21 | 557 | 
| my | −732.26 | 0 | 44 | 758 | 
| his | −805.06 | 0 | 157 | 1155 | 
| n't | −929.29 | 0 | 43 | 914 | 
| you | −1,361.11 | 0 | 50 | 1286 | 
| her | −1,453.01 | 0 | 104 | 1559 | 
| he | −1,699.10 | 0 | 170 | 1978 | 
| she | −1,861.48 | 0 | 70 | 1763 | 
| i | −2,350.21 | 0 | 143 | 2428 | 
6.5 Effect size
While quanteda produces one important piece of information (the amount of evidence we have for an effect), it neglects another (the magnitude of the effect). Whenever we report significance, it is critical to also report effect size. Some common effect size measures include:
- %DIFF - see Gabrielatos and Marchi (2011)
  - Costas Gabrielatos has also provided an FAQ with more details: http://ucrel.lancs.ac.uk/ll/DIFF_FAQ.pdf
- Bayes Factor (BIC) - see Wilson (2013)
  - You can interpret the approximate Bayes Factor as degrees of evidence against the null hypothesis as follows:
    - 0-2: not worth more than a bare mention
    - 2-6: positive evidence against H0
    - 6-10: strong evidence against H0
    - greater than 10: very strong evidence against H0
  - For negative scores, the scale is read as “in favor of” instead of “against”.
- Effect Size for Log Likelihood (ELL) - see Johnston et al. (2006)
  - ELL varies between 0 and 1 (inclusive). Johnston et al. say “interpretation is straightforward as the proportion of the maximum departure between the observed and expected proportions”.
- Relative Risk
- Odds Ratio
- Log Ratio - see Andrew Hardie’s CASS blog for how to interpret this
  - Note that if either word has zero frequency, a small adjustment is automatically applied (0.5 is used as the observed frequency, which is then normalized) to avoid division-by-zero errors.
6.5.1 Log Ratio (LR)
You are welcome to use any of these effect size measures. Our repo comes with a function for calculating Hardie’s Log Ratio, which is easy and intuitive.
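As a rough illustration of the measure (not the packaged implementation itself), a Log Ratio can be computed from absolute frequencies and corpus sizes as the binary log of the ratio of relative frequencies, with the 0.5 adjustment mentioned above applied to any zero count:
# a sketch of Hardie's Log Ratio from raw counts (illustrative only)
log_ratio <- function(af_target, af_ref, n_target, n_ref) {
  af_target <- ifelse(af_target == 0, 0.5, af_target)  # zero-frequency adjustment
  af_ref    <- ifelse(af_ref == 0, 0.5, af_ref)
  log2((af_target / n_target) / (af_ref / n_ref))
}
# e.g., 150 hits in 250,000 target tokens vs. 100 hits in 750,000 reference tokens
log_ratio(150, 100, 250000, 750000)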
6.5.2 The keyness_table() function
We’ll start by creating two dfms, a target and a reference:
acad_dfm <- dfm_subset(sc_dfm, text_type == "acad") |> 
  dfm_trim(min_termfreq = 1)
fic_dfm <- dfm_subset(sc_dfm, text_type == "fic") |> 
  dfm_trim(min_termfreq = 1)
Then we will use the keyness_table() function.
acad_kw <- keyness_table(acad_dfm, fic_dfm)
And check the result:
Code
acad_kw |>
  arrange(desc(abs(LL))) |>
  head(10) |>
  gt() |>
  fmt_number(columns = c("LL", "LR", "Per_10.5_Tar", "Per_10.5_Ref", "DP_Tar", "DP_Ref"),
             decimals = 2) |>
  fmt_number(columns = "PV",
             decimals = 5)
| Token | LL | LR | PV | AF_Tar | AF_Ref | Per_10.5_Tar | Per_10.5_Ref | DP_Tar | DP_Ref |
|---|---|---|---|---|---|---|---|---|---|
| i | −2,350.21 | −4.00 | 0.00000 | 143 | 2428 | 117.75 | 1,887.38 | 0.54 | 0.32 | 
| she | −1,861.48 | −4.57 | 0.00000 | 70 | 1763 | 57.64 | 1,370.45 | 0.79 | 0.37 | 
| he | −1,699.10 | −3.46 | 0.00000 | 170 | 1978 | 139.98 | 1,537.58 | 0.58 | 0.26 | 
| her | −1,453.01 | −3.82 | 0.00000 | 104 | 1559 | 85.64 | 1,211.87 | 0.77 | 0.38 | 
| you | −1,361.11 | −4.60 | 0.00000 | 50 | 1286 | 41.17 | 999.66 | 0.76 | 0.24 | 
| of | 1,260.33 | 1.25 | 0.00000 | 4848 | 2153 | 3,992.03 | 1,673.61 | 0.09 | 0.15 | 
| n't | −929.29 | −4.33 | 0.00000 | 43 | 914 | 35.41 | 710.49 | 0.74 | 0.20 | 
| his | −805.06 | −2.80 | 0.00000 | 157 | 1155 | 129.28 | 897.83 | 0.54 | 0.22 | 
| my | −732.26 | −4.02 | 0.00000 | 44 | 758 | 36.23 | 589.22 | 0.74 | 0.41 | 
| me | −591.50 | −4.65 | 0.00000 | 21 | 557 | 17.29 | 432.98 | 0.78 | 0.38 | 
The columns are as follows:
- LL: the keyness value or log-likelihood, also known as G2 or a goodness-of-fit test.
- LR: the effect size, which here is the log ratio
- PV: the p-value associated with the log-likelihood
- AF_Tar: the absolute frequency in the target corpus
- AF_Ref: the absolute frequency in the reference corpus
- Per_10.x_Tar: the relative frequency in the target corpus (automatically calibrated to a normalizing factor, which here is per 100,000 tokens, hence the column name Per_10.5_Tar)
- Per_10.x_Ref: the relative frequency in the reference corpus (automatically calibrated to a normalizing factor, which here is per 100,000 tokens, hence the column name Per_10.5_Ref)
- DP_Tar: the deviation of proportions (a dispersion measure) in the target corpus
- DP_Ref: the deviation of proportions in the reference corpus
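Since keyness_table() returns an ordinary data frame, we can manipulate it with dplyr. For instance, a quick sketch that keeps only tokens with small p-values and sorts by effect size rather than by log-likelihood (the 0.01 cutoff is arbitrary):
# sort significant key words by effect size (Log Ratio) instead of keyness
acad_kw |>
  filter(PV < 0.01) |>
  arrange(desc(LR)) |>
  head(10)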
6.5.3 Keyness pairs
There is also a function for quickly generating pair-wise keyness comparisons among multiple sub-corpora. To demonstrate, we’ll create a third dfm, this time containing news articles.
news_dfm <- dfm_subset(sc_dfm, text_type == "news") |> 
  dfm_trim(min_termfreq = 1)
To produce a data.frame comparing more than two sub-corpora, use the keyness_pairs() function:
kp <- keyness_pairs(news_dfm, acad_dfm, fic_dfm)
Check the result:
Code
kp |>
  head(10) |>
  gt() |>
  fmt_number(everything(),
             decimals = 2)
| Token | A_v_B_LL | A_v_B_LR | A_v_B_PV | A_v_C_LL | A_v_C_LR | A_v_C_PV | B_v_C_LL | B_v_C_LR | B_v_C_PV |
|---|---|---|---|---|---|---|---|---|---|
| he | 493.90 | 2.32 | 0.00 | −398.94 | −1.13 | 0.00 | −1,699.10 | −3.46 | 0.00 | 
| said | 456.17 | 3.69 | 0.00 | 1.95 | 0.13 | 0.16 | −415.04 | −3.56 | 0.00 | 
| i | 432.39 | 2.36 | 0.00 | −863.93 | −1.65 | 0.00 | −2,350.21 | −4.00 | 0.00 | 
| n't | 333.59 | 3.23 | 0.00 | −174.09 | −1.10 | 0.00 | −929.29 | −4.33 | 0.00 | 
| you | 328.06 | 3.06 | 0.00 | −413.66 | −1.54 | 0.00 | −1,361.11 | −4.60 | 0.00 | 
| mr | 236.74 | 5.10 | 0.00 | 75.33 | 1.61 | 0.00 | −60.94 | −3.48 | 0.00 | 
| park | 226.55 | 8.36 | 0.00 | 139.56 | 3.20 | 0.00 | −25.26 | −5.16 | 0.00 | 
| she | 212.91 | 2.36 | 0.00 | −926.40 | −2.21 | 0.00 | −1,861.48 | −4.57 | 0.00 | 
| p.m | 209.66 | 8.25 | 0.00 | 199.80 | 6.33 | 0.00 | −2.66 | −1.92 | 0.10 | 
| ob | 198.40 | 8.17 | 0.00 | 206.72 | 8.25 | 0.00 | 0.00 | 0.08 | 1.00 | 
6.6 Key key words
The concept of “key key words” was introduced by Mike Scott for the WordSmith concordancer. The process compares each text in the target corpus to the reference corpus. Log-likelihood is calculated for each comparison. Then a mean is calculated for keyness and effect size. In addition, a range is provided for the number of texts in which keyness reaches significance at a given threshold. (The default is p < 0.05.) That range is returned as a percentage.
In this way, key key words account for the dispersion of key words by indicating whether a keyness value is driven by relatively high frequencies in just a few target texts or in many of them.
kk <- key_keys(acad_dfm, fic_dfm)
Again, we can look at the first few rows of the table:
Code
kk |>
  head(10) |>
  gt() |>
  fmt_number(c(-token, -key_range),
             decimals = 2) |>
  fmt_percent(key_range, decimals = 0, scale_values = FALSE)
| token | key_range | key_mean | key_sd | effect_mean |
|---|---|---|---|---|
| of | 98% | 60.71 | 37.44 | 1.21 | 
| social | 38% | 25.05 | 70.30 | 3.41 | 
| studies | 44% | 22.59 | 68.07 | 5.93 | 
| students | 16% | 19.67 | 70.34 | 3.56 | 
| the | 64% | 19.04 | 28.59 | 0.31 | 
| research | 34% | 17.60 | 63.70 | 3.48 | 
| political | 26% | 17.02 | 58.07 | 3.78 | 
| changes | 44% | 16.54 | 63.12 | 4.71 | 
| science | 16% | 15.68 | 60.42 | 3.46 | 
| study | 46% | 15.37 | 27.68 | 3.27 | 
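Because key_range reports the percentage of target texts in which a token reaches significance, one simple way to focus on well-dispersed key words is to filter on that column before sorting (the 50% cutoff here is arbitrary):
# keep tokens that are key in at least half of the academic texts
kk |>
  filter(key_range >= 50) |>
  arrange(desc(key_mean)) |>
  head(10)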
Complete Task 1 in Lab Set 2.