6 Keyness

For this lab, we’ll be following many of the same procedures that we’ve done previously:

attaching metadata to a corpus using docvars()
tokenizing using tokens()
handling multiword expressions using tokens_compound()
creating a document-feature matrix using dfm()

For today’s lab we’ll begin some hypothesis testing using new functions from our repository:

keyness_table()
keyness_pairs()
key_keys()

We’ll also look at quanteda’s function:

textstat_keyness()

6.1 What is keyness?

Keyness is a generic term for various tests that compare observed vs. expected frequencies.

The most commonly used (though not the only option) is called log-likelihood in corpus linguistics, but you will see it elsewhere called a G-test of goodness-of-fit.

The calculation is based on a 2 x 2 contingency table. It is similar to a chi-square test, but performs better when corpora are unequally sized.

Expected frequencies are based on the relative size of each corpus (in total number of words N_i) and the sum of the observed frequencies O_i across the corpora:

\[ E_i = N_i \times \frac{\sum_j O_j}{\sum_j N_j} \]

And log-likelihood is calculated according to the formula:

\[ LL = 2 \times \sum_i O_i \ln \frac{O_i}{E_i} \]

A good explanation of its implementation in linguistics can be found here: http://ucrel.lancs.ac.uk/llwizard.html
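To make the two formulas concrete, here is a minimal base R sketch that works through them for a single token. The function name log_like() is hypothetical (it is not one of the repository functions), and the counts are the ones reported for social later in this chapter: 208 occurrences in the academic texts (121,442 tokens) and 7 in the fiction texts (128,644 tokens).

```r
# Hypothetical helper: two-corpus log-likelihood from observed counts (o)
# and corpus sizes (n), following the formulas above.
log_like <- function(o, n) {
  e <- n * sum(o) / sum(n)                    # expected frequencies E_i
  terms <- ifelse(o == 0, 0, o * log(o / e))  # treat 0 * log(0) as 0
  2 * sum(terms)
}

o <- c(target = 208, reference = 7)            # observed counts of "social"
n <- c(target = 121442, reference = 128644)    # corpus sizes in tokens

ll <- log_like(o, n)
round(ll, 2)  # about 248.1
```

The result lines up with the LL value that keyness_table() reports for social in the academic vs. fiction comparison, which suggests that function uses this two-term form of the statistic.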

In addition to log-likelihood, the textstat_keyness() function in quanteda has other optional measures.

See here: https://quanteda.io/reference/textstat_keyness.html

6.2 Prepare a corpus

We’ll begin, just as we did in the distributions lab.

6.2.1 Load the needed packages

library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(gt)

Load the functions:

source("../R/keyness_functions.R")
source("../R/helper_functions.R")

Load the data:

load("../data/sample_corpus.rda")
load("../data/multiword_expressions.rda")

6.2.2 Pre-process the data & create a corpus

sc <- sample_corpus %>%
  mutate(text = preprocess_text(text)) %>%
  corpus()

6.2.3 Extract meta-data from file names

We’ll extract some meta-data by (1) selecting the doc_id column, (2) extracting the initial letter string before the underscore, and (3) renaming the vector text_type.

doc_categories <- sample_corpus %>%
  dplyr::select(doc_id) %>%
  mutate(doc_id = str_extract(doc_id, "^[a-z]+")) %>%
  rename(text_type = doc_id)
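As a quick illustration of what the regular expression does, the sketch below applies the same pattern to a few made-up document ids (the ids are hypothetical, chosen to look like the ones in the sample corpus).

```r
library(stringr)

# Hypothetical document ids in the same style as the sample corpus
doc_ids <- c("acad_01", "fic_12", "news_03")

# "^[a-z]+" matches the run of lowercase letters at the start of the string,
# which stops at the underscore
text_type <- str_extract(doc_ids, "^[a-z]+")
text_type
```

Each id is reduced to its text-type prefix ("acad", "fic", "news").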

6.2.4 Assign the meta-data to the corpus

The accessor function docvars() lets us add or modify data in an object. We’re going to use it to assign text_type as a variable. Note that doc_categories could include more than one column and the assignment process would be the same.

docvars(sc) <- doc_categories

And check the result:

sc |>
  summary() |>
  head(10) |>
  gt()

Partial summary of sample corpus.
Text	Types	Tokens	Sentences	text_type
acad_01	772	2534	1	acad
acad_02	933	2544	1	acad
acad_03	889	2525	1	acad
acad_04	941	2541	1	acad
acad_05	857	2504	1	acad
acad_06	962	2575	1	acad
acad_07	615	2443	1	acad
acad_08	805	2540	1	acad
acad_09	912	2544	1	acad
acad_10	1063	2532	1	acad

Note the new column (text_type on the right). We could assign any number of categorical variables to our corpus, which could be used for analysis downstream.
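The same assignment can be tried on a toy corpus. This is only a sketch: the two documents and the year variable are invented for illustration.

```r
library(quanteda)

# Two toy documents; the vector names become the document ids
toy <- corpus(c(acad_01 = "the study reports results",
                fic_01  = "she looked at him"))

# Assign a one-column data frame of metadata, as we did for sc
docvars(toy) <- data.frame(text_type = c("acad", "fic"))

# Add a second, purely hypothetical variable by name
docvars(toy, "year") <- c(2010, 2012)

docvars(toy)
```

Assigning a whole data frame replaces all document variables at once, while naming a field adds or overwrites a single column.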

6.2.5 Create a dfm

sc_dfm <- sc %>%
  tokens(what="fastestword", remove_numbers=TRUE) %>%
  tokens_compound(pattern = phrase(multiword_expressions)) %>%
  dfm()
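To see what tokens_compound() contributes to this pipeline, here is a small sketch on a single made-up sentence. The expressions in toy_mwe are hypothetical stand-ins for the multiword_expressions list.

```r
library(quanteda)

# Hypothetical multiword expressions
toy_mwe <- c("in spite of", "as well as")

toy_dfm <- tokens("we went out in spite of the rain", what = "fastestword") |>
  tokens_compound(pattern = phrase(toy_mwe)) |>
  dfm()

# The three-word sequence is now a single feature joined by underscores
featnames(toy_dfm)
```

Because the expression is joined before the dfm is built, it is counted as one token rather than three.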

6.3 A corpus composition table

It is conventional to report the composition of the corpus or corpora you are using for your study. Here we will sum our tokens by text-type and similarly count the number of texts in each grouping.

corpus_comp <- ntoken(sc_dfm) %>% 
  data.frame(Tokens = .) %>%
  rownames_to_column("Text_Type") %>%
  mutate(Text_Type = str_extract(Text_Type, "^[a-z]+")) %>%
  group_by(Text_Type) %>%
  summarize(Texts = n(),
    Tokens = sum(Tokens)) %>%
  mutate(Text_Type = c("Academic", "Blog", "Fiction", "Magazine", "News", "Spoken", "Television/Movies", "Web"))

Now, using grand_summary_rows(), we can append a row of totals at the bottom of the table.

corpus_comp |> 
  gt() |>
  fmt_integer() |>
  cols_label(
    Text_Type = md("**Text Type**"),
    Texts = md("**Texts**"),
    Tokens = md("**Tokens**")
  ) |>
  grand_summary_rows(
    columns = c(Texts, Tokens),
    fns = list(
      Total ~ sum(.)
    ) ,
    fmt = ~ fmt_integer(.)
    )

Composition of the sample corpus.
	Text Type	Texts	Tokens
	Academic	50	121,442
	Blog	50	125,492
	Fiction	50	128,644
	Magazine	50	126,631
	News	50	119,029
	Spoken	50	127,156
	Television/Movies	50	128,191
	Web	50	124,302
Total	—	400	1,000,887

6.4 Keyness in quanteda

Now that we have a dfm, we can perform keyness calculations. First, let’s carry out calculations using textstat_keyness().

The second argument to textstat_keyness() is a logical vector that identifies the target corpus. Here we are indicating that we want the texts with text_type equal to “acad” to be our target corpus. Everything else (i.e., where text_type == "acad" is FALSE) will be the reference corpus.

The specific method we’re using is log-likelihood, which is designated by “lr”. Thus keyness will show the tokens that are more frequent in texts of the academic text-type vs. texts of the other text-types.

acad_kw <- textstat_keyness(sc_dfm, docvars(sc_dfm, "text_type") == "acad", measure = "lr")

Note the second argument: docvars(sc_dfm, "text_type") == "acad". That slightly awkward syntax simply produces a logical vector. You could store it and pass the vector to the function, as well.
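Storing the logical vector first might look like the sketch below. It runs on a four-document toy corpus invented for illustration, so the statistics themselves are not meaningful; the point is only the shape of the call.

```r
library(quanteda)
library(quanteda.textstats)

# Hypothetical toy corpus with two text-types
toy <- corpus(c(acad_01 = "the study of social practice",
                acad_02 = "the study of science",
                fic_01  = "she said he saw her",
                fic_02  = "he told her you lied"))
docvars(toy, "text_type") <- c("acad", "acad", "fic", "fic")
toy_dfm <- dfm(tokens(toy))

# Store the logical vector: TRUE marks the target texts
is_acad <- docvars(toy_dfm, "text_type") == "acad"
is_acad

# Pass the stored vector as the target
toy_kw <- textstat_keyness(toy_dfm, target = is_acad, measure = "lr")
head(toy_kw)
```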

acad_kw |>
  head(10) |>
  gt() |>
  fmt_number(columns = "G2",
             decimals = 2)

Tokens with the highest keyness values in the academic text-type when compared to the rest of the sample corpus.
feature	G2	n_target	n_reference
of	1,703.90	4848	17258
the	777.88	8273	42712
social	456.00	208	133
studies	391.06	155	71
study	318.95	158	119
in	316.66	2715	13341
perfectionism	301.84	74	1
by	297.33	790	2707
practice	267.90	117	68
science	259.37	118	75

6.4.1 Creating sub-corpora

If we want to compare one text-type (as our target corpus) to another (as our reference corpus), we can easily subset the data.

sub_dfm <- dfm_subset(sc_dfm, text_type == "acad" | text_type == "fic")

When we do this, the resulting data will still include all the tokens in the sample corpus, including those that do not appear in either the academic or fiction text-type. To deal with this, we will trim the dfm.

sub_dfm <- dfm_trim(sub_dfm, min_termfreq = 1)
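The effect of trimming is easy to check on a toy example. In this sketch (three invented one-sentence documents), subsetting keeps every feature of the full dfm, and trimming then drops the features that occur only in the excluded text.

```r
library(quanteda)

# Hypothetical toy corpus with three text-types
toy <- corpus(c(acad_01 = "the study of social practice",
                fic_01  = "she said he saw her",
                news_01 = "the mayor said the park opens today"))
docvars(toy, "text_type") <- c("acad", "fic", "news")
toy_dfm <- dfm(tokens(toy))

# Subsetting removes documents but keeps all features
toy_sub <- dfm_subset(toy_dfm, text_type == "acad" | text_type == "fic")
nfeat(toy_sub) == nfeat(toy_dfm)

# Trimming removes features with a total frequency of zero
toy_trimmed <- dfm_trim(toy_sub, min_termfreq = 1)
setdiff(featnames(toy_sub), featnames(toy_trimmed))
```

The features reported by setdiff() are the ones that appear only in the news text.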

Now we can calculate keyness with fiction as the target and the academic text-type as the reference.

fic_kw <- textstat_keyness(sub_dfm, docvars(sub_dfm, "text_type") == "fic", measure = "lr")

fic_kw |>
  head(10) |>
  gt() |>
  fmt_number(columns = "G2",
             decimals = 2)

Tokens with the highest keyness values in the fiction text-type when compared to the academic text-type.
feature	G2	n_target	n_reference
i	2,350.21	2428	143
she	1,861.48	1763	70
he	1,699.10	1978	170
her	1,453.01	1559	104
you	1,361.11	1286	50
n't	929.29	914	43
his	805.06	1155	157
my	732.26	758	44
me	591.50	557	21
him	541.51	548	29

Note that if we switch our target and reference corpora (academic as target, fiction as reference), the tail of the keyness table contains the same values as the original (fiction as target, academic as reference) with the sign reversed. The magnitude of the statistic does not depend on which corpus is the target; the sign only marks the direction of the difference.

acad_kw <- textstat_keyness(sub_dfm, docvars(sub_dfm, "text_type") == "acad", measure = "lr")

acad_kw |>
  tail(10) |>
  gt() |>
  fmt_number(columns = "G2",
             decimals = 2)

Tokens with the lowest keyness values in the academic text-type when compared to the fiction text-type.
feature	G2	n_target	n_reference
him	−541.51	29	548
me	−591.50	21	557
my	−732.26	44	758
his	−805.06	157	1155
n't	−929.29	43	914
you	−1,361.11	50	1286
her	−1,453.01	104	1559
he	−1,699.10	170	1978
she	−1,861.48	70	1763
i	−2,350.21	143	2428
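The sign reversal can be checked directly. The sketch below uses an invented four-document corpus, computes keyness in both directions, and compares the two sets of G2 values feature by feature.

```r
library(quanteda)
library(quanteda.textstats)

# Hypothetical toy corpus with two text-types
toy <- corpus(c(acad_01 = "the study of social practice",
                acad_02 = "the study of the science",
                fic_01  = "she said he saw the study",
                fic_02  = "he told her you lied"))
docvars(toy, "text_type") <- c("acad", "acad", "fic", "fic")
toy_dfm <- dfm(tokens(toy))
is_acad <- docvars(toy_dfm, "text_type") == "acad"

# Keyness in both directions
kw_acad <- textstat_keyness(toy_dfm, target = is_acad, measure = "lr")
kw_fic  <- textstat_keyness(toy_dfm, target = !is_acad, measure = "lr")

# Line up the features and compare: the values should be mirror images
g2_fic <- kw_fic$G2[match(kw_acad$feature, kw_fic$feature)]
all.equal(kw_acad$G2, -g2_fic)
```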

6.5 Effect size

While quanteda produces one important piece of information (the amount of evidence we have for an effect), it neglects another (the magnitude of the effect). Whenever we report on significance it is critical to report effect size. Some common effect size measures include:

- %DIFF - see Gabrielatos and Marchi (2011)
  - Costas Gabrielatos has also provided an FAQ with more details: http://ucrel.lancs.ac.uk/ll/DIFF_FAQ.pdf
- Bayes Factor (BIC) - see Wilson (2013)
  - You can interpret the approximate Bayes Factor as degrees of evidence against the null hypothesis as follows:
    - 0-2: not worth more than a bare mention
    - 2-6: positive evidence against H₀
    - 6-10: strong evidence against H₀
    - > 10: very strong evidence against H₀
  - For negative scores, the scale is read as “in favor of” instead of “against”.
- Effect Size for Log Likelihood (ELL) - see Johnston et al. (2006)
  - ELL varies between 0 and 1 (inclusive). Johnston et al. say “interpretation is straightforward as the proportion of the maximum departure between the observed and expected proportions”.
- Relative Risk
- Odds Ratio
- Log Ratio - see Andrew Hardie’s CASS blog for how to interpret this
  - Note that if either word has zero frequency then a small adjustment is automatically applied (0.5 observed frequency which is then normalized) to avoid division by zero errors.

6.5.1 Log Ratio (LR)

You are welcome to use any of these effect size measures. Our repo comes with a function for calculating Hardie’s Log Ratio, which is easy and intuitive.
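As a rough sketch of the idea, Log Ratio is the base-2 logarithm of the ratio of the two relative frequencies, so each additional point means the token is twice as frequent again in the target. The function log_ratio() below is hypothetical and is not the repository's implementation; the 0.5 adjustment for zero counts follows the note above and is an assumption about how to handle that case. The counts are the ones reported for social in this chapter.

```r
# Hypothetical helper: log2 of the ratio of relative frequencies
log_ratio <- function(o_tar, o_ref, n_tar, n_ref) {
  # Assumed adjustment: swap a zero count for 0.5 to avoid dividing by zero
  if (o_tar == 0) o_tar <- 0.5
  if (o_ref == 0) o_ref <- 0.5
  log2((o_tar / n_tar) / (o_ref / n_ref))
}

# "social": 208 of 121,442 academic tokens vs. 7 of 128,644 fiction tokens
lr <- log_ratio(208, 7, 121442, 128644)
round(lr, 2)  # about 4.98
```

A value near 5 means social is roughly 2^5 = 32 times as frequent, relative to corpus size, in the academic texts as in fiction.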

6.5.2 The keyness_table() function

We’ll start by creating 2 dfms–a target and a reference:

acad_dfm <- dfm_subset(sc_dfm, text_type == "acad") %>% dfm_trim(min_termfreq = 1)
fic_dfm <- dfm_subset(sc_dfm, text_type == "fic") %>% dfm_trim(min_termfreq = 1)

Then we will use the keyness_table() function.

acad_kw <- keyness_table(acad_dfm, fic_dfm)

And check the result:

acad_kw |>
  head(10) |>
  gt() |>
  fmt_number(columns = c("LL", "LR", "Per_10.5_Tar", "Per_10.5_Ref", "DP_Tar", "DP_Ref"),
             decimals = 2) |>
  fmt_number(columns = "PV",
             decimals = 5)

Tokens with the highest keyness values in the academic text-type when compared to the fiction text-type.
Token	LL	LR	AF_Tar	AF_Ref	Per_10.5_Tar	Per_10.5_Ref	DP_Tar	DP_Ref
of	1,225.77	1.25	4848	2153	3,992.03	1,673.61	0.09	0.15
the	250.03	0.37	8273	6768	6,812.31	5,261.03	0.11	0.10
social	248.10	4.98	208	7	171.28	5.44	0.65	0.88
are	221.19	1.44	707	276	582.17	214.55	0.21	0.30
studies	213.17	7.36	155	1	127.63	0.78	0.67	0.98
by	209.00	1.29	790	342	650.52	265.85	0.17	0.22
in	185.82	0.58	2715	1922	2,235.64	1,494.05	0.11	0.09
students	179.36	5.27	146	4	120.22	3.11	0.80	0.94
research	175.19	4.74	151	6	124.34	4.66	0.63	0.90
is	171.74	0.87	1241	720	1,021.89	559.68	0.23	0.36

The columns are as follows:

LL: the keyness value or log-likelihood, also known as G2 or a goodness-of-fit test
LR: the effect size, which here is the log ratio
PV: the p-value associated with the log-likelihood
AF_Tar: the absolute frequency in the target corpus
AF_Ref: the absolute frequency in the reference corpus
Per_10.x_Tar: the relative frequency in the target corpus (automatically calibrated to a normalizing factor, which here is per 100,000 tokens)
Per_10.x_Ref: the relative frequency in the reference corpus (automatically calibrated to a normalizing factor, which here is per 100,000 tokens)
DP_Tar: the deviation of proportions (a dispersion measure) in the target corpus
DP_Ref: the deviation of proportions in the reference corpus
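The relationship between LL and PV can be sketched in base R, assuming the p-value comes from a chi-square distribution with one degree of freedom applied to the absolute value of LL. That assumption is consistent with the values printed in this chapter (for example, an LL of 1.94 paired with a p-value of 0.16), but it is not confirmed from the function's source.

```r
# Assumed relationship: p-value from a chi-square distribution, df = 1
ll_values <- c(1.94, 3.84, 6.63, 248.10)
p_values <- pchisq(abs(ll_values), df = 1, lower.tail = FALSE)

data.frame(LL = ll_values, PV = round(p_values, 5))
```

Under this assumption, an LL of about 3.84 corresponds to p = 0.05 and about 6.63 to p = 0.01, which is why those values are often used as cutoffs.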

6.5.3 Keyness pairs

There is also a function for quickly generating pair-wise keyness comparisons among multiple sub-corpora. To demonstrate, create a third dfm, this time containing news articles.

news_dfm <- dfm_subset(sc_dfm, text_type == "news") %>% dfm_trim(min_termfreq = 1)

To produce a data.frame comparing more than two sub-corpora, use the keyness_pairs() function:

kp <- keyness_pairs(news_dfm, acad_dfm, fic_dfm)

Check the result:

kp |>
  head(10) |>
  gt() |>
  fmt_number(everything(),
             decimals = 2)

Pairwise comparisons of news (target) vs. academic (reference), news (target) vs. fiction (reference), and academic (target) vs. fiction (reference).
Token	A_v_B_LL	A_v_B_LR	A_v_C_LL	A_v_C_LR	A_v_C_PV	B_v_C_LL	B_v_C_LR	B_v_C_PV
he	492.01	2.32	−394.56	−1.13	0.00	−1,686.80	−3.46	0.00
said	455.34	3.69	1.94	0.13	0.16	−414.33	−3.56	0.00
i	430.97	2.36	−853.61	−1.65	0.00	−2,330.44	−4.00	0.00
n't	333.06	3.23	−173.20	−1.10	0.00	−926.44	−4.33	0.00
you	327.51	3.06	−410.98	−1.54	0.00	−1,355.34	−4.60	0.00
mr	236.58	5.10	75.25	1.61	0.00	−60.92	−3.48	0.00
park	226.44	8.36	139.47	3.20	0.00	−25.26	−5.16	0.00
she	212.56	2.36	−919.21	−2.21	0.00	−1,850.64	−4.57	0.00
p.m	209.56	8.25	199.71	6.33	0.00	−2.66	−1.92	0.10
ob	198.31	8.17	206.63	8.25	0.00	0.00	0.08	1.00
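The column prefixes follow the order in which the dfms are passed: the first is A, the second B, the third C. The base R sketch below shows how such pair labels can be enumerated; it only illustrates the naming scheme and is not how keyness_pairs() is necessarily implemented.

```r
# The dfms in the order they were passed to keyness_pairs()
corpora <- c(A = "news", B = "acad", C = "fic")

# Every pair, with the earlier corpus as target and the later as reference
pair_labels <- combn(names(corpora), 2,
                     FUN = function(p) paste(p[1], "v", p[2], sep = "_"))
pair_labels

# The same pairs spelled out with the text-type names
combn(corpora, 2, FUN = function(p) paste(p[1], "vs.", p[2]))
```

With three sub-corpora there are three pairs; with four there would be six, so the table grows quickly as more dfms are added.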

6.6 Key key words

The concept of “key key words” was introduced by Mike Scott for the WordSmith Tools concordancer. The process compares each text in the target corpus to the reference corpus. Log-likelihood is calculated for each comparison. Then a mean is calculated for keyness and effect size. In addition, a range is provided for the number of texts in which keyness reaches significance for a given threshold. (The default is p < 0.05.) That range is returned as a percentage.

In this way, key key words accounts for the dispersion of key words by indicating whether a keyness value is driven by a relatively high frequency in a few target texts or many.

kk <- key_keys(acad_dfm, fic_dfm)

Again, we can look at the first few rows of the table:

kk |>
  head(10) |>
  gt() |>
  fmt_number(everything(),
             decimals = 2)

Key key words when comparing the academic text-type to the fiction text-type.
token	key_range	key_mean	key_sd	effect_mean
of	98.00	59.20	36.40	1.21
social	38.00	25.00	70.11	3.41
studies	44.00	22.56	67.90	5.93
students	16.00	19.64	70.17	3.56
the	64.00	17.80	26.72	0.31
research	34.00	17.57	63.50	3.48
political	26.00	17.00	57.92	3.78
changes	44.00	16.51	62.96	4.71
science	16.00	15.65	60.28	3.46
study	46.00	15.35	27.65	3.27
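The per-text procedure described in this section can be sketched in base R for a single token. Everything here is hypothetical: the helper ll_two(), the per-text counts, and the choice to count a text only when the token is over-represented in it. The key_keys() function may differ in its details.

```r
# Hypothetical helper: two-corpus log-likelihood, vectorized over target texts
ll_two <- function(o_tar, o_ref, n_tar, n_ref) {
  e_tar <- n_tar * (o_tar + o_ref) / (n_tar + n_ref)
  e_ref <- n_ref * (o_tar + o_ref) / (n_tar + n_ref)
  term <- function(o, e) ifelse(o == 0, 0, o * log(o / e))
  2 * (term(o_tar, e_tar) + term(o_ref, e_ref))
}

# Invented counts of one token in five target texts of 2,500 tokens each
text_counts <- c(12, 0, 9, 1, 15)
text_sizes  <- rep(2500, 5)

# Invented reference corpus: 7 occurrences in 128,644 tokens
ref_count <- 7
ref_size  <- 128644

# One comparison per target text
ll <- ll_two(text_counts, ref_count, text_sizes, ref_size)
pv <- pchisq(ll, df = 1, lower.tail = FALSE)

# Share of texts where the token is over-represented and significant
is_key <- pv < 0.05 & (text_counts / text_sizes) > (ref_count / ref_size)
key_range <- 100 * mean(is_key)
key_mean  <- mean(ll)

key_range
```

In this toy case the token is key in three of the five texts, so a high overall keyness value would be resting on 60 percent of the target texts rather than on all of them.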

Pause for Lab Set Question

Complete Task 1 in Lab Set 2.

6.7 Works cited

Gabrielatos, Costas, and Anna Marchi. 2011. “Keyness: Matching Metrics to Definitions.” In Theoretical-Methodological Challenges in Corpus Approaches to Discourse Studies and Some Ways of Addressing Them. https://research.edgehill.ac.uk/en/publications/keyness-matching-metrics-to-definitions-2.

Johnston, Janis E, Kenneth J Berry, and Paul W Mielke Jr. 2006. “Measures of Effect Size for Chi-Squared and Likelihood-Ratio Goodness-of-Fit Tests.” Perceptual and Motor Skills 103 (2): 412–14.

Wilson, Andrew. 2013. “Embracing Bayes Factors for Key Item Analysis in Corpus Linguistics.” New Approaches to the Study of Linguistic Variability 4: 3–11.
