[1] 0.6740684
Fall 2025
Get Out is an extraordinary movie, a great catapulting leap forward…
\[ \rho(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^n (x_i - \bar x) (y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2} \sqrt{\sum_{i=1}^n (y_i - \bar y)^2}} \]
For example,
Document | Adjectives | Adj. rank | Nouns | Noun rank |
---|---|---|---|---|
news_35 | 3.1 | 1 | 19.7 | 1 |
news_02 | 13.2 | 2 | 69.3 | 3 |
acad_48 | 18.4 | 3 | 136.3 | 107 |
news_26 | 23.1 | 4 | 154.9 | 168 |
tvm_10 | 26.4 | 5 | 111.2 | 41 |
news_44 | 26.5 | 6 | 138.7 | 112 |
tvm_42 | 26.8 | 7 | 90.5 | 9 |
tvm_16 | 27.8 | 8 | 97.2 | 19 |
spok_14 | 30.0 | 9 | 131.1 | 86 |
tvm_40 | 30.0 | 10 | 99.6 | 23 |
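
As a minimal sketch, the formula above can be checked in R against the built-in `cor()` using only the ten rows shown. The vector names `adj` and `noun` are illustrative. The rank columns in the table were computed over the full corpus, so the Spearman value from these ten rows alone will differ from the full-corpus value.

```r
# The ten rows shown in the table (frequencies as listed)
adj  <- c(3.1, 13.2, 18.4, 23.1, 26.4, 26.5, 26.8, 27.8, 30.0, 30.0)
noun <- c(19.7, 69.3, 136.3, 154.9, 111.2, 138.7, 90.5, 97.2, 131.1, 99.6)

# Pearson: sum of cross-products over the product of the root sums of squares
dx <- adj - mean(adj)
dy <- noun - mean(noun)
r_manual  <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
r_builtin <- cor(adj, noun)

# Spearman: the same formula applied to ranks (ties receive average ranks)
rs_manual  <- cor(rank(adj), rank(noun))
rs_builtin <- cor(adj, noun, method = "spearman")

print(c(pearson = r_builtin, spearman = rs_builtin))
```

Both hand-computed values should agree with the built-in ones up to floating-point error.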
Rough cut-off values are used to indicate the size of correlation (Cohen 1988: 79-80); this is a measure of effect size:
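
Cohen's commonly cited benchmarks for the absolute value of a correlation are roughly .10 (small), .30 (medium), and .50 (large). The helper below is a sketch that applies those cut-offs; the function name `cohen_size` and the label "negligible" for values under .10 are assumptions made for illustration, not part of Cohen's scheme.

```r
# Classify correlations by Cohen's rough benchmarks (sign is ignored).
# "negligible" is an illustrative label for |r| < .10.
cohen_size <- function(r) {
  cut(
    abs(r),
    breaks = c(0, 0.1, 0.3, 0.5, 1),
    labels = c("negligible", "small", "medium", "large"),
    right = FALSE,          # intervals are [lower, upper)
    include.lowest = TRUE   # with right = FALSE, this keeps |r| = 1 in "large"
  )
}

sizes <- cohen_size(c(0.05, -0.17, 0.45, -0.81, 1))
print(sizes)
```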
Every pairwise correlation between a set of variables can be presented in a matrix:
 | ADJ | ADV | NOUN | PRON | VERB |
---|---|---|---|---|---|
ADJ | 1.00 | −0.17 | 0.72 | −0.64 | −0.40 |
ADV | −0.17 | 1.00 | −0.46 | 0.65 | 0.61 |
NOUN | 0.72 | −0.46 | 1.00 | −0.81 | −0.45 |
PRON | −0.64 | 0.65 | −0.81 | 1.00 | 0.75 |
VERB | −0.40 | 0.61 | −0.45 | 0.75 | 1.00 |
 | ADJ | ADV | NOUN | PRON | VERB |
---|---|---|---|---|---|
ADJ | 1.00 | −0.24 | 0.77 | −0.68 | −0.52 |
ADV | −0.24 | 1.00 | −0.51 | 0.64 | 0.51 |
NOUN | 0.77 | −0.51 | 1.00 | −0.82 | −0.55 |
PRON | −0.68 | 0.64 | −0.82 | 1.00 | 0.77 |
VERB | −0.52 | 0.51 | −0.55 | 0.77 | 1.00 |
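
A matrix like this can be produced with a single call to `cor()` on a data frame of counts. The sketch below uses simulated data, not the corpus behind the tables above; the data frame `pos` and its generating model are purely hypothetical.

```r
set.seed(123)
n_docs <- 100

# Hypothetical per-document frequencies driven by one shared latent trait
latent <- rnorm(n_docs)
pos <- data.frame(
  ADJ  = 60  + 8  * latent + rnorm(n_docs, sd = 6),
  NOUN = 200 + 30 * latent + rnorm(n_docs, sd = 15),
  PRON = 80  - 20 * latent + rnorm(n_docs, sd = 10),
  VERB = 150 - 12 * latent + rnorm(n_docs, sd = 12)
)

# Every pairwise correlation at once; method selects Pearson or Spearman
pearson  <- round(cor(pos), 2)
spearman <- round(cor(pos, method = "spearman"), 2)

print(pearson)
print(spearman)
```

The result is symmetric with ones on the diagonal, so only the lower or upper triangle carries information.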
The matrix
There is a strong positive correlation (\(r = .55\), 95% CI [.48, .61]) between the number of nouns and the number of adjectives used in American English. This value, however, is not as large as the correlation between verbs and pronouns (\(r = .75\), 95% CI [.71, .78]), each of which explains more than half the variance of the other (\(R^2 = 56\)%).
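
The quantities in a report like this can be reproduced in R. The sketch below assumes the usual Fisher z-transformation for the confidence interval, which is what `cor.test()` reports for Pearson correlations. The helper `fisher_ci` and the simulated data are hypothetical, and the sample size is made up, so the interval will not match the one quoted above.

```r
# Confidence interval for a Pearson correlation via Fisher's z-transformation
fisher_ci <- function(r, n, level = 0.95) {
  z    <- atanh(r)                     # transform to an approximately normal scale
  se   <- 1 / sqrt(n - 3)              # standard error on the z scale
  crit <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  tanh(z + c(-1, 1) * crit * se)       # back-transform to the r scale
}

# Simulated data with a true correlation of about .75
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 0.75 * x + sqrt(1 - 0.75^2) * rnorm(n)

r          <- cor(x, y)
ci_manual  <- fisher_ci(r, n)
ci_builtin <- as.numeric(cor.test(x, y)$conf.int)

# Shared variance for r = .75: .75^2 = .5625, i.e. about 56%
r_squared <- 0.75^2
```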
There is a strong negative correlation (\(r_s = -.84\), \(p < .01\)) between the number of nouns and the number of pronouns used in American English. These show a complementary distribution.
American English verbs and adjectives are in an inverse proportional relationship (\(r = -.61\)**), as are verbs and nouns (\(r = -.67\)**). The negative correlation between verbs and coordinators is near zero, however (\(r = -.04\)), indicating those lexical classes have no observable relationship.
(Stargazing!)
What should this remind you of?
The factor model is that \[ \underbrace{X}_{n \times p} = \underbrace{F}_{n \times m} \underbrace{L}_{m \times p} + \underbrace{\epsilon}_{n \times p} \] where \(X\) has been centered so its columns have mean 0.
Biber defined 67 linguistic features, like
These features distinguish different types of written and spoken English.
For example, suppose we have \(p = 6\) variables and choose \(m = 2\)
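
With \(p = 6\) and \(m = 2\), \(X\) is \(n \times 6\), \(F\) is \(n \times 2\), and \(L\) is \(2 \times 6\). A small simulation can make this concrete. Everything below is a sketch: the loadings, the noise level, and the sample size are invented, and `factanal()` from base R with a varimax rotation is just one possible estimator.

```r
set.seed(1)
n <- 500; p <- 6; m <- 2

# Factor scores (n x m) and a loading matrix (m x p) with simple structure:
# variables 1-3 load on factor 1, variables 4-6 on factor 2
scores <- matrix(rnorm(n * m), nrow = n, ncol = m)
L <- rbind(c(0.9, 0.8, 0.7, 0.0, 0.0, 0.0),
           c(0.0, 0.0, 0.0, 0.9, 0.8, 0.7))
eps <- matrix(rnorm(n * p, sd = 0.5), nrow = n, ncol = p)

# X = F L + epsilon, then centre the columns
X <- scores %*% L + eps
X <- scale(X, center = TRUE, scale = FALSE)
colnames(X) <- paste0("V", 1:p)

# Estimate a two-factor model; factanal() returns loadings as p x m,
# i.e. the transpose of L in the notation above
fit <- factanal(X, factors = m, rotation = "varimax")
est <- unclass(fit$loadings)
print(round(est, 2))
```

The estimated factors may come back in a different order or with flipped signs relative to `L`, since the model is only identified up to rotation.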
Feature | Factor1 | Factor2 | Factor3 |
---|---|---|---|
f_39_prepositions | −0.68 | −0.23 | −0.06 |
f_07_second_person_pronouns | 0.66 | −0.11 | −0.14 |
f_44_mean_word_length | −0.65 | −0.36 | 0.04 |
f_59_contractions | 0.61 | 0.25 | −0.03 |
f_03_present_tense | 0.60 | −1.06 | −0.04 |
f_67_neg_analytic | 0.57 | 0.20 | 0.38 |
f_42_adverbs | 0.57 | 0.21 | 0.51 |
f_12_proverb_do | 0.54 | 0.04 | 0.14 |
f_37_if | 0.53 | −0.27 | 0.12 |
f_27_past_participle_whiz | −0.53 | 0.08 | −0.11 |
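
One common way in Biber-style analyses to turn loadings like these into a score per document is to standardize each feature, keep the features whose loading exceeds a cut-off, and add or subtract their z-scores according to the sign of the loading. The sketch below illustrates that idea using three Factor 1 loadings from the table. The feature counts are random placeholders, the cut-off of 0.35 is an assumption, and this is not necessarily how the scores analyzed later were computed.

```r
set.seed(42)

# Placeholder feature frequencies for five hypothetical documents
feats <- matrix(
  rnorm(5 * 3, mean = 50, sd = 10), nrow = 5,
  dimnames = list(
    paste0("doc", 1:5),
    c("f_39_prepositions", "f_07_second_person_pronouns", "f_59_contractions")
  )
)

# Factor 1 loadings for those features, copied from the table
loadings_f1 <- c(f_39_prepositions = -0.68,
                 f_07_second_person_pronouns = 0.66,
                 f_59_contractions = 0.61)
threshold <- 0.35  # assumed cut-off for a "salient" loading

z       <- scale(feats)                      # z-score each feature
salient <- abs(loadings_f1) >= threshold     # which features count
signs   <- sign(loadings_f1[salient])        # +1 or -1 per salient feature

# Score = sum of z-scores for positive loadings minus those for negative ones
dim_score <- as.vector(z[, salient, drop = FALSE] %*% signs)
names(dim_score) <- rownames(feats)
print(round(dim_score, 2))
```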
\[ X = FL + \epsilon \]
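```r
library(gtsummary)

# One linear model per factor score, with text category (group) as predictor.
# bc_mda is assumed to have been created earlier in the analysis.
f1 <- lm(Factor1 ~ group, data = bc_mda)
f2 <- lm(Factor2 ~ group, data = bc_mda)
f3 <- lm(Factor3 ~ group, data = bc_mda)

# Place the three regression tables side by side, one spanner per factor
tbl_merge(
  list(
    tbl_regression(f1, intercept = TRUE),
    tbl_regression(f2, intercept = TRUE),
    tbl_regression(f3, intercept = TRUE)
  ),
  tab_spanner = c("Factor 1", "Factor 2", "Factor 3")
)
```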
Characteristic | Factor 1: Beta | Factor 1: 95% CI | Factor 1: p-value | Factor 2: Beta | Factor 2: 95% CI | Factor 2: p-value | Factor 3: Beta | Factor 3: 95% CI | Factor 3: p-value |
---|---|---|---|---|---|---|---|---|---|
(Intercept) | -1.2 | -3.7, 1.2 | 0.3 | -0.81 | -2.1, 0.51 | 0.2 | 1.7 | 0.20, 3.3 | 0.027 |
group | | | | | | | | | |
BELLES-LETTRES | — | — | | — | — | | — | — | |
FICTION: ADVENTURE | 14 | 9.0, 18 | <0.001 | 15 | 12, 17 | <0.001 | -0.81 | -3.7, 2.1 | 0.6 |
FICTION: GENERAL | 12 | 7.1, 16 | <0.001 | 13 | 11, 16 | <0.001 | -0.23 | -3.1, 2.7 | 0.9 |
FICTION: MYSTERY | 20 | 15, 25 | <0.001 | 14 | 12, 17 | <0.001 | 3.1 | -0.02, 6.2 | 0.051 |
FICTION: ROMANCE | 22 | 17, 27 | <0.001 | 14 | 12, 17 | <0.001 | 2.9 | 0.02, 5.8 | 0.048 |
FICTION: SCIENCE | 16 | 7.1, 25 | <0.001 | 8.1 | 3.3, 13 | 0.001 | 5.4 | -0.21, 11 | 0.059 |
HUMOR | 12 | 4.8, 20 | 0.001 | 7.3 | 3.2, 11 | <0.001 | 2.9 | -1.7, 7.6 | 0.2 |
LEARNED | -9.7 | -13, -6.3 | <0.001 | -7.8 | -9.6, -5.9 | <0.001 | -0.11 | -2.2, 2.0 | >0.9 |
MISCELLANEOUS: GOVERNMENT & HOUSE ORGANS | -14 | -19, -9.6 | <0.001 | -9.6 | -12, -7.1 | <0.001 | -9.2 | -12, -6.3 | <0.001 |
POPULAR LORE | -1.2 | -5.1, 2.7 | 0.5 | -0.58 | -2.7, 1.5 | 0.6 | -2.4 | -4.8, 0.09 | 0.059 |
PRESS: EDITORIAL | 3.2 | -1.6, 7.9 | 0.2 | -2.3 | -4.8, 0.28 | 0.081 | -0.87 | -3.8, 2.1 | 0.6 |
PRESS: REPORTAGE | -6.9 | -11, -2.9 | <0.001 | 0.71 | -1.5, 2.9 | 0.5 | -10 | -13, -7.8 | <0.001 |
PRESS: REVIEWS | -0.57 | -6.2, 5.1 | 0.8 | -2.9 | -6.0, 0.18 | 0.065 | -3.9 | -7.4, -0.29 | 0.034 |
RELIGION | 3.3 | -2.3, 9.0 | 0.2 | -3.8 | -6.9, -0.75 | 0.015 | 4.3 | 0.73, 7.9 | 0.018 |
SKILL AND HOBBIES | -0.51 | -4.8, 3.8 | 0.8 | -5.6 | -7.9, -3.2 | <0.001 | -5.1 | -7.8, -2.4 | <0.001 |
Abbreviation: CI = Confidence Interval
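
When reading a table like this, it helps to recall how `lm()` codes a categorical predictor by default: the intercept is the mean of the reference level (here BELLES-LETTRES, shown with dashes), and each other row is that group's mean minus the reference mean. The toy data below are invented purely to demonstrate this; `toy`, `group`, and `score` are hypothetical names.

```r
# Three groups of three observations with means 2, 6, and -1
toy <- data.frame(
  group = factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C")),
  score = c(1, 2, 3, 5, 6, 7, 0, -1, -2)
)

fit <- lm(score ~ group, data = toy)
group_means <- tapply(toy$score, toy$group, mean)

# Intercept = mean of reference level "A" (2)
# groupB    = mean(B) - mean(A) = 6 - 2 = 4
# groupC    = mean(C) - mean(A) = -1 - 2 = -3
print(coef(fit))

# Changing the reference level changes what each coefficient is compared to
toy$group_b_ref <- relevel(toy$group, ref = "B")
fit_b <- lm(score ~ group_b_ref, data = toy)
```

A non-significant row therefore means "not clearly different from the reference category", not "no effect at all".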