library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(gt)
6 Keyness
For this lab, we’ll be following many of the same procedures that we’ve done previously:
- attaching metadata to a corpus using docvars()
- tokenizing using tokens()
- handling multiword expressions using tokens_compound()
- creating a document-feature matrix using dfm()
For today’s lab we’ll begin some hypothesis testing using new functions from our repository:
keyness_table()
keyness_pairs()
key_keys()
We’ll also look at quanteda’s function:
textstat_keyness()
6.1 What is keyness?
Keyness is a generic term for various tests that compare observed vs. expected frequencies.
The most commonly used (though not the only option) is called log-likelihood in corpus linguistics, but you will see it elsewhere called a G-test of goodness-of-fit.
The calculation is based on a 2 x 2 contingency table. It is similar to a chi-square test, but performs better when corpora are unequally sized.
Expected frequencies are based on the relative size of each corpus (in total number of words, \(N_i\)) and the total observed frequency of the token across the corpora:
\[ E_i = \frac{N_i \sum_j O_j}{\sum_j N_j} \]
And log-likelihood is calculated according to the formula:
\[ LL = 2 \times \sum_i O_i \ln \frac{O_i}{E_i} \]
A good explanation of its implementation in linguistics can be found here: http://ucrel.lancs.ac.uk/llwizard.html
In addition to log-likelihood, the textstat_keyness() function in quanteda has other optional measures.
See here: https://quanteda.io/reference/textstat_keyness.html
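To make the two formulas above concrete, here is a minimal base-R sketch that works through the calculation by hand for a single token. The counts for social (208 in the academic texts, 7 in fiction) and the two corpus sizes are taken from the tables reported later in this lab; the variable names are illustrative and are not part of the repository functions.

```r
# Hand computation of log-likelihood for one token in two corpora.
observed <- c(target = 208, reference = 7)             # O_i
corpus_size <- c(target = 121442, reference = 128644)  # N_i

# Expected frequencies: the token's total count shared out by corpus size
expected <- sum(observed) * corpus_size / sum(corpus_size)

# LL = 2 * sum(O_i * ln(O_i / E_i)); a zero count contributes nothing
terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
ll <- 2 * sum(terms)

# p-value from a chi-square distribution with 1 degree of freedom
p_value <- pchisq(ll, df = 1, lower.tail = FALSE)

round(ll, 2)
```

The result should land close to the LL of 248.10 that keyness_table() reports for social further on, which suggests that function works from this two-term version of the formula.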
6.2 Prepare a corpus
We’ll begin, just as we did in the distributions lab.
6.2.1 Load the needed packages
Load the functions:
source("../R/keyness_functions.R")
source("../R/helper_functions.R")
Load the data:
load("../data/sample_corpus.rda")
load("../data/multiword_expressions.rda")
6.2.2 Pre-process the data & create a corpus
sc <- sample_corpus %>%
  mutate(text = preprocess_text(text)) %>%
  corpus()
6.2.3 Extract meta-data from file names
We’ll extract some meta-data by (1) selecting the doc_id column, (2) extracting the initial letter string before the underscore, and (3) renaming the vector text_type.
doc_categories <- sample_corpus %>%
  dplyr::select(doc_id) %>%
  mutate(doc_id = str_extract(doc_id, "^[a-z]+")) %>%
  rename(text_type = doc_id)
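To see what the regular expression in step (2) does on its own, here is a small sketch using stringr. The file names are hypothetical examples of the text-type-plus-number pattern, not necessarily names from the sample corpus.

```r
library(stringr)

# Hypothetical file names following the <text type>_<number> pattern
doc_ids <- c("acad_01", "blog_17", "fic_03")

# "^[a-z]+" matches the run of lowercase letters at the start of the
# string, so the match stops at the underscore
text_types <- str_extract(doc_ids, "^[a-z]+")
text_types
```

Because the pattern is anchored to the start of the string and allows only lowercase letters, digits and underscores never end up in the extracted category.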
6.2.4 Assign the meta-data to the corpus
The accessor function docvars() lets us add or modify data in an object. We’re going to use it to assign text_type as a variable. Note that doc_categories could include more than one column and the assignment process would be the same.
docvars(sc) <- doc_categories
And check the result:
sc |>
  summary() |>
  head(10) |>
  gt()
Text | Types | Tokens | Sentences | text_type |
---|---|---|---|---|
acad_01 | 772 | 2534 | 1 | acad |
acad_02 | 933 | 2544 | 1 | acad |
acad_03 | 889 | 2525 | 1 | acad |
acad_04 | 941 | 2541 | 1 | acad |
acad_05 | 857 | 2504 | 1 | acad |
acad_06 | 962 | 2575 | 1 | acad |
acad_07 | 615 | 2443 | 1 | acad |
acad_08 | 805 | 2540 | 1 | acad |
acad_09 | 912 | 2544 | 1 | acad |
acad_10 | 1063 | 2532 | 1 | acad |
Note the new column (text_type on the right). We could assign any number of categorical variables to our corpus, which could be used for analysis downstream.
6.2.5 Create a dfm
sc_dfm <- sc %>%
  tokens(what = "fastestword", remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase(multiword_expressions)) %>%
  dfm()
6.3 A corpus composition table
It is conventional to report the composition of the corpus or corpora you are using for your study. Here we will sum our tokens by text-type and similarly count the number of texts in each grouping.
corpus_comp <- ntoken(sc_dfm) %>%
  data.frame(Tokens = .) %>%
  rownames_to_column("Text_Type") %>%
  mutate(Text_Type = str_extract(Text_Type, "^[a-z]+")) %>%
  group_by(Text_Type) %>%
  summarize(Texts = n(),
            Tokens = sum(Tokens)) %>%
  mutate(Text_Type = c("Academic", "Blog", "Fiction", "Magazine", "News", "Spoken", "Television/Movies", "Web"))
Now, using grand_summary_rows(), we can append a row of totals at the bottom of the table.
corpus_comp |>
  gt() |>
  fmt_integer() |>
  cols_label(
    Text_Type = md("**Text Type**"),
    Texts = md("**Texts**"),
    Tokens = md("**Tokens**")
  ) |>
  grand_summary_rows(
    columns = c(Texts, Tokens),
    fns = list(
      Total ~ sum(.)
    ),
    fmt = ~ fmt_integer(.)
  )
Text Type | Texts | Tokens |
---|---|---|
Academic | 50 | 121,442 |
Blog | 50 | 125,492 |
Fiction | 50 | 128,644 |
Magazine | 50 | 126,631 |
News | 50 | 119,029 |
Spoken | 50 | 127,156 |
Television/Movies | 50 | 128,191 |
Web | 50 | 124,302 |
Total | 400 | 1,000,887 |
6.4 Keyness in quanteda
Now that we have a dfm, we can perform keyness calculations. First, let’s carry out calculations using textstat_keyness().
The second argument we pass to textstat_keyness() indicates that we want the texts with text_type equal to “acad” to be our target corpus. Everything else (i.e., text_type == “acad” is FALSE) will be the reference corpus.
The specific method we’re using is log-likelihood, which is designated by “lr”. Thus keyness will show the tokens that are more frequent in the academic text-type vs. the other text-types.
acad_kw <- textstat_keyness(sc_dfm, docvars(sc_dfm, "text_type") == "acad", measure = "lr")
Note the second argument: docvars(sc_dfm, "text_type") == "acad". That slightly awkward syntax simply produces a logical vector. You could also store it and pass the vector to the function.
acad_kw |>
  head(10) |>
  gt() |>
  fmt_number(columns = "G2",
             decimals = 2)
feature | G2 | p | n_target | n_reference |
---|---|---|---|---|
of | 1,703.90 | 0 | 4848 | 17258 |
the | 777.88 | 0 | 8273 | 42712 |
social | 456.00 | 0 | 208 | 133 |
studies | 391.06 | 0 | 155 | 71 |
study | 318.95 | 0 | 158 | 119 |
in | 316.66 | 0 | 2715 | 13341 |
perfectionism | 301.84 | 0 | 74 | 1 |
by | 297.33 | 0 | 790 | 2707 |
practice | 267.90 | 0 | 117 | 68 |
science | 259.37 | 0 | 118 | 75 |
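The note above about the second argument mentions that the logical vector can be stored first. As a small illustration, the sketch below builds a toy dfm from four invented texts (they stand in for the lab's sc_dfm, which is not recreated here) and passes a stored vector as the target. The object names toy, toy_dfm, and is_acad are made up for the example, and with so little data the statistics themselves are not meaningful.

```r
library(quanteda)
library(quanteda.textstats)

# Toy corpus of invented texts, named like the lab's files
toy <- corpus(c(
  acad_01 = "the study of social practice in the social sciences",
  acad_02 = "studies of the practice of science",
  fic_01  = "she said he saw her in the park",
  fic_02  = "he told me she had left him"
))

# Derive the text type from the document names and attach it as a docvar
docvars(toy, "text_type") <- sub("_.*$", "", docnames(toy))

toy_dfm <- dfm(tokens(toy))

# Store the logical vector once, then reuse it as the target
is_acad <- docvars(toy_dfm, "text_type") == "acad"
toy_kw <- textstat_keyness(toy_dfm, target = is_acad, measure = "lr")

head(toy_kw, 3)
```

Storing the vector makes the call easier to read and lets the same target definition be reused, for example to count how many documents fall in the target with sum(is_acad).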
6.4.1 Creating sub-corpora
If we want to compare one text-type (as our target corpus) to another (as our reference corpus), we can easily subset the data.
sub_dfm <- dfm_subset(sc_dfm, text_type == "acad" | text_type == "fic")
When we do this, the resulting data will still include all the tokens in the sample corpus, including those that do not appear in either the academic or fiction text-type. To deal with this, we will trim the dfm.
sub_dfm <- dfm_trim(sub_dfm, min_termfreq = 1)
We’ll then calculate keyness as before, this time with fiction as the target.
fic_kw <- textstat_keyness(sub_dfm, docvars(sub_dfm, "text_type") == "fic", measure = "lr")
fic_kw |>
  head(10) |>
  gt() |>
  fmt_number(columns = "G2",
             decimals = 2)
feature | G2 | p | n_target | n_reference |
---|---|---|---|---|
i | 2,350.21 | 0 | 2428 | 143 |
she | 1,861.48 | 0 | 1763 | 70 |
he | 1,699.10 | 0 | 1978 | 170 |
her | 1,453.01 | 0 | 1559 | 104 |
you | 1,361.11 | 0 | 1286 | 50 |
n't | 929.29 | 0 | 914 | 43 |
his | 805.06 | 0 | 1155 | 157 |
my | 732.26 | 0 | 758 | 44 |
me | 591.50 | 0 | 557 | 21 |
him | 541.51 | 0 | 548 | 29 |
Note that if we switch our target and reference corpora (academic as target, fiction as reference), the tail of the keyness table contains the negative values of the original (fiction as target, academic as reference), which you may have already gathered given the formula above.
acad_kw <- textstat_keyness(sub_dfm, docvars(sub_dfm, "text_type") == "acad", measure = "lr")
acad_kw |>
  tail(10) |>
  gt() |>
  fmt_number(columns = "G2",
             decimals = 2)
feature | G2 | p | n_target | n_reference |
---|---|---|---|---|
him | −541.51 | 0 | 29 | 548 |
me | −591.50 | 0 | 21 | 557 |
my | −732.26 | 0 | 44 | 758 |
his | −805.06 | 0 | 157 | 1155 |
n't | −929.29 | 0 | 43 | 914 |
you | −1,361.11 | 0 | 50 | 1286 |
her | −1,453.01 | 0 | 104 | 1559 |
he | −1,699.10 | 0 | 170 | 1978 |
she | −1,861.48 | 0 | 70 | 1763 |
i | −2,350.21 | 0 | 143 | 2428 |
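The sign flip can be checked by hand. The formula in section 6.1 always returns a non-negative value, so the sign is a convention: keyness is reported as negative when a token is relatively less frequent in the target than in the reference. The sketch below illustrates that convention in base R under the two-term formula. It is not quanteda's implementation, signed_ll is a hypothetical helper, and the counts for she and the corpus sizes are taken from the tables above. The values need not match the G2 column exactly, since textstat_keyness() may compute the statistic differently (for instance from the full contingency table).

```r
# Signed log-likelihood for a single token (illustrative helper)
signed_ll <- function(o_tar, o_ref, n_tar, n_ref) {
  o <- c(o_tar, o_ref)
  e <- sum(o) * c(n_tar, n_ref) / (n_tar + n_ref)
  ll <- 2 * sum(ifelse(o > 0, o * log(o / e), 0))
  # Negative when the token is relatively less frequent in the target
  if (o_tar / n_tar < o_ref / n_ref) -ll else ll
}

# "she": 1763 occurrences in fiction, 70 in academic
fic_target  <- signed_ll(1763, 70, n_tar = 128644, n_ref = 121442)
acad_target <- signed_ll(70, 1763, n_tar = 121442, n_ref = 128644)

c(fic_as_target = fic_target, acad_as_target = acad_target)
```

Swapping target and reference only reorders the two terms of the sum, so the magnitude is unchanged and only the sign differs.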
6.5 Effect size
While quanteda produces one important piece of information (the amount of evidence we have for an effect), it neglects another (the magnitude of the effect). Whenever we report on significance it is critical to report effect size. Some common effect size measures include:
- %DIFF - see Gabrielatos and Marchi (2011)
  - Costas has also provided an FAQ with more details http://ucrel.lancs.ac.uk/ll/DIFF_FAQ.pdf
- Bayes Factor (BIC) - see Wilson (2013)
  - You can interpret the approximate Bayes Factor as degrees of evidence against the null hypothesis as follows:
    - 0-2: not worth more than a bare mention
    - 2-6: positive evidence against H0
    - 6-10: strong evidence against H0
    - > 10: very strong evidence against H0
  - For negative scores, the scale is read as “in favor of” instead of “against”.
- Effect Size for Log Likelihood (ELL) - see Johnston et al. (2006)
  - ELL varies between 0 and 1 (inclusive). Johnston et al. say “interpretation is straightforward as the proportion of the maximum departure between the observed and expected proportions”.
- Relative Risk
- Odds Ratio
- Log Ratio - see Andrew Hardie’s CASS blog for how to interpret this
  - Note that if a word has zero frequency in either corpus then a small adjustment is automatically applied (0.5 observed frequency which is then normalized) to avoid division by zero errors.
6.5.1 Log Ratio (LR)
You are welcome to use any of these effect size measures. Our repo comes with a function for calculating Hardie’s Log Ratio, which is easy and intuitive.
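As a rough illustration of what the measure does, Log Ratio is the binary logarithm of the ratio of the two relative frequencies, so each additional point means the token is twice as frequent again in the target. The base-R sketch below is a hand calculation, not the repository's function. log_ratio is a hypothetical helper, and the way it handles zero counts (substituting 0.5 before normalizing) is an assumption based on the adjustment described in the list above. The counts for social and the corpus sizes come from the tables in this lab.

```r
# Log Ratio: binary log of the ratio of relative frequencies
log_ratio <- function(o_tar, o_ref, n_tar, n_ref, adjust = 0.5) {
  # Assumed zero handling: substitute a count of 0.5 before normalizing
  if (o_tar == 0) o_tar <- adjust
  if (o_ref == 0) o_ref <- adjust
  log2((o_tar / n_tar) / (o_ref / n_ref))
}

# "social": 208 in academic (121,442 tokens), 7 in fiction (128,644 tokens)
lr_social <- log_ratio(208, 7, n_tar = 121442, n_ref = 128644)
round(lr_social, 2)
```

A value just under 5 means social is roughly 2^5 = 32 times as frequent, relative to corpus size, in the academic texts as in fiction.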
6.5.2 The keyness_table() function
We’ll start by creating 2 dfms–a target and a reference:
acad_dfm <- dfm_subset(sc_dfm, text_type == "acad") %>% dfm_trim(min_termfreq = 1)
fic_dfm <- dfm_subset(sc_dfm, text_type == "fic") %>% dfm_trim(min_termfreq = 1)
Then we will use the keyness_table() function.
acad_kw <- keyness_table(acad_dfm, fic_dfm)
And check the result:
acad_kw |>
  head(10) |>
  gt() |>
  fmt_number(columns = c("LL", "LR", "Per_10.5_Tar", "Per_10.5_Ref", "DP_Tar", "DP_Ref"),
             decimals = 2) |>
  fmt_number(columns = "PV",
             decimals = 5)
Token | LL | LR | PV | AF_Tar | AF_Ref | Per_10.5_Tar | Per_10.5_Ref | DP_Tar | DP_Ref |
---|---|---|---|---|---|---|---|---|---|
of | 1,225.77 | 1.25 | 0.00000 | 4848 | 2153 | 3,992.03 | 1,673.61 | 0.09 | 0.15 |
the | 250.03 | 0.37 | 0.00000 | 8273 | 6768 | 6,812.31 | 5,261.03 | 0.11 | 0.10 |
social | 248.10 | 4.98 | 0.00000 | 208 | 7 | 171.28 | 5.44 | 0.65 | 0.88 |
are | 221.19 | 1.44 | 0.00000 | 707 | 276 | 582.17 | 214.55 | 0.21 | 0.30 |
studies | 213.17 | 7.36 | 0.00000 | 155 | 1 | 127.63 | 0.78 | 0.67 | 0.98 |
by | 209.00 | 1.29 | 0.00000 | 790 | 342 | 650.52 | 265.85 | 0.17 | 0.22 |
in | 185.82 | 0.58 | 0.00000 | 2715 | 1922 | 2,235.64 | 1,494.05 | 0.11 | 0.09 |
students | 179.36 | 5.27 | 0.00000 | 146 | 4 | 120.22 | 3.11 | 0.80 | 0.94 |
research | 175.19 | 4.74 | 0.00000 | 151 | 6 | 124.34 | 4.66 | 0.63 | 0.90 |
is | 171.74 | 0.87 | 0.00000 | 1241 | 720 | 1,021.89 | 559.68 | 0.23 | 0.36 |
The columns are as follows:
- LL: the keyness value or log-likelihood, also known as G2 or a goodness-of-fit test.
- LR: the effect size, which here is the log ratio
- PV: the p-value associated with the log-likelihood
- AF_Tar: the absolute frequency in the target corpus
- AF_Ref: the absolute frequency in the reference corpus
- Per_10.x_Tar: the relative frequency in the target corpus (automatically calibrated to a normalizing factor, which here is per 100,000 tokens)
- Per_10.x_Ref: the relative frequency in the reference corpus (automatically calibrated to a normalizing factor, which here is per 100,000 tokens)
- DP_Tar: the deviation of proportions (a dispersion measure) in the target corpus
- DP_Ref: the deviation of proportions in the reference corpus
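The dispersion columns may be the least familiar. As a sketch of the idea, deviation of proportions compares the share of a token's occurrences found in each text with the share of the corpus that text makes up, sums the absolute differences, and halves the total. Values near 0 indicate an even spread and values near 1 indicate that the token is concentrated in a small part of the corpus. The base-R helper below is hypothetical and uses invented counts; whether keyness_table() reports this raw version or a normalized variant is not stated here, so treat it as an illustration of the measure rather than of the function.

```r
# Deviation of proportions (DP) for one token across the texts of a corpus
deviation_of_proportions <- function(token_counts, text_sizes) {
  expected_share <- text_sizes / sum(text_sizes)      # each text's share of the corpus
  observed_share <- token_counts / sum(token_counts)  # each text's share of the token's hits
  sum(abs(observed_share - expected_share)) / 2
}

# Invented counts for a corpus of four equally sized texts
sizes <- c(2500, 2500, 2500, 2500)
dp_even   <- deviation_of_proportions(c(10, 10, 10, 10), sizes)  # evenly spread
dp_clumpy <- deviation_of_proportions(c(40, 0, 0, 0), sizes)     # all in one text

c(even = dp_even, clumpy = dp_clumpy)
```

The evenly spread token scores 0, while the token confined to one of four texts scores 0.75, which is why a high DP alongside a high LL is a signal to check whether a few texts are driving the result.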
6.5.3 Keyness pairs
There is also a function for quickly generating pair-wise keyness comparisons among multiple sub-corpora. To demonstrate, create a third dfm, this time containing news articles.
news_dfm <- dfm_subset(sc_dfm, text_type == "news") %>% dfm_trim(min_termfreq = 1)
To produce a data.frame comparing more than two sub-corpora, use the keyness_pairs() function:
kp <- keyness_pairs(news_dfm, acad_dfm, fic_dfm)
Check the result:
kp |>
  head(10) |>
  gt() |>
  fmt_number(everything(),
             decimals = 2)
Token | A_v_B_LL | A_v_B_LR | A_v_B_PV | A_v_C_LL | A_v_C_LR | A_v_C_PV | B_v_C_LL | B_v_C_LR | B_v_C_PV |
---|---|---|---|---|---|---|---|---|---|
he | 492.01 | 2.32 | 0.00 | −394.56 | −1.13 | 0.00 | −1,686.80 | −3.46 | 0.00 |
said | 455.34 | 3.69 | 0.00 | 1.94 | 0.13 | 0.16 | −414.33 | −3.56 | 0.00 |
i | 430.97 | 2.36 | 0.00 | −853.61 | −1.65 | 0.00 | −2,330.44 | −4.00 | 0.00 |
n't | 333.06 | 3.23 | 0.00 | −173.20 | −1.10 | 0.00 | −926.44 | −4.33 | 0.00 |
you | 327.51 | 3.06 | 0.00 | −410.98 | −1.54 | 0.00 | −1,355.34 | −4.60 | 0.00 |
mr | 236.58 | 5.10 | 0.00 | 75.25 | 1.61 | 0.00 | −60.92 | −3.48 | 0.00 |
park | 226.44 | 8.36 | 0.00 | 139.47 | 3.20 | 0.00 | −25.26 | −5.16 | 0.00 |
she | 212.56 | 2.36 | 0.00 | −919.21 | −2.21 | 0.00 | −1,850.64 | −4.57 | 0.00 |
p.m | 209.56 | 8.25 | 0.00 | 199.71 | 6.33 | 0.00 | −2.66 | −1.92 | 0.10 |
ob | 198.31 | 8.17 | 0.00 | 206.63 | 8.25 | 0.00 | 0.00 | 0.08 | 1.00 |
6.6 Key key words
The concept of “key key words” was introduced by Mike Scott for the WordSmith concordancer. The process compares each text in the target corpus to the reference corpus. Log-likelihood is calculated for each comparison. Then a mean is calculated for keyness and effect size. In addition, a range is provided for the number of texts in which keyness reaches significance for a given threshold. (The default is p < 0.05.) That range is returned as a percentage.
In this way, key key words accounts for the dispersion of key words by indicating whether a keyness value is driven by a relatively high frequency in a few target texts or many.
kk <- key_keys(acad_dfm, fic_dfm)
Again, we can look at the first few rows of the table:
kk |>
  head(10) |>
  gt() |>
  fmt_number(everything(),
             decimals = 2)
token | key_range | key_mean | key_sd | effect_mean |
---|---|---|---|---|
of | 98.00 | 59.20 | 36.40 | 1.21 |
social | 38.00 | 25.00 | 70.11 | 3.41 |
studies | 44.00 | 22.56 | 67.90 | 5.93 |
students | 16.00 | 19.64 | 70.17 | 3.56 |
the | 64.00 | 17.80 | 26.72 | 0.31 |
research | 34.00 | 17.57 | 63.50 | 3.48 |
political | 26.00 | 17.00 | 57.92 | 3.78 |
changes | 44.00 | 16.51 | 62.96 | 4.71 |
science | 16.00 | 15.65 | 60.28 | 3.46 |
study | 46.00 | 15.35 | 27.65 | 3.27 |
Complete Task 1 in Lab Set 2.