Introduction to association measures
Fall 2024
Question
What pattern (or patterns) do you recognize?
Question
What statistical problems do you see with using simple frequencies as a measure?
Question
What factors affect the likelihood of “cause” and “effect” being together in the same group?
Pre-node | Node | Post-node |
---|---|---|
Does chocolate | cause | acne? After many studies, the answer is … ‘complicated’ |
Shortage of sailors a | cause | for concern for Royal Canadian Navy |
Did wine | cause | a full-scale revolution in Armenia? |
Motorcycle explosion likely | cause | of Milton motel fire, manager reports |
Heavy rainfall, power outages | cause | school districts to close Friday - |
Bales of hay | cause | traffic backup along I-55, truck fire |
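The formula for calculating MI (pointwise mutual information, PMI) is as follows:
\[ PMI = \log_{2} \left( \frac{O_{11}}{E_{11}} \right) \]
Here \(O_{11}\) is the observed frequency of the collocate within a given window around the node word (i.e., occurrences of the collocate in the span divided by the total number of words in the corpus), and \(E_{11}\) is the frequency expected by chance. The expected frequency is given by the following, where, in the usual contingency-table notation, \(R_{1}\) and \(C_{1}\) are the row and column totals (the overall frequencies of the node word and the collocate) and \(N\) is the total number of words in the corpus: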
\[ E_{11} = \frac{R_{1} \times C_{1}}{N} \]
\[ \begin{aligned} PMI^{k} = \log_{2} \left( \frac{O_{11}^{k}}{E_{11}} \right) = PMI - (1 - k)~\times~\log_{2} \left( O_{11} \right) \\ \text{where}~2 \le k \le 3 \end{aligned} \]
\[ NPMI = \frac{PMI}{-\log_{2}(O_{11})} \]
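To make the three measures concrete, here is a minimal base R sketch. All of the counts are hypothetical and were picked so the logarithms come out as whole numbers. It reads \(O_{11}\) as a raw co-occurrence count in the PMI ratio and, following the parenthetical definition of observed frequency above, uses the relative frequency \(O_{11}/N\) in the NPMI denominator. That reading is an assumption, and window-size adjustments are ignored.

```r
# Hypothetical counts, chosen so the logs come out as whole numbers
N   <- 1024  # total tokens in the corpus
R1  <- 32    # overall frequency of the node word
C1  <- 16    # overall frequency of the collocate
O11 <- 8     # co-occurrences of node and collocate within the window

# Expected co-occurrence frequency: E11 = R1 * C1 / N
E11 <- R1 * C1 / N

# PMI = log2(O11 / E11)
pmi <- log2(O11 / E11)

# PMI^k = PMI - (1 - k) * log2(O11); k = 1 gives back plain PMI
pmi_k <- function(k) pmi - (1 - k) * log2(O11)

# NPMI: PMI divided by -log2 of the joint relative frequency (assumed O11 / N)
npmi <- pmi / -log2(O11 / N)

print(c(PMI = pmi, PMI2 = pmi_k(2), PMI3 = pmi_k(3), NPMI = npmi))
```

With these numbers the expected frequency is 0.5, so PMI is 4, the two PMI^k variants are 7 and 10, and NPMI is 4/7. Raising k rewards pairs with a larger observed count, while NPMI rescales the score so that it cannot exceed 1.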
Question
What differences do you notice from the earlier result using frequencies?
Note
When measuring collocations, you must have a data structure of sequential tokens. In other words, something like a document-feature matrix won’t work, because it records how often each token occurs but discards the order in which tokens appear.
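A minimal base R sketch of why order matters: collocate counts come from looking at the positions around each occurrence of the node word, which is only possible when the tokens are kept in sequence. The toy token vector, the node word, and the window size below are all hypothetical, and this is not the function used in the lab.

```r
# Hypothetical toy text; the tokens must stay in their original order
tokens <- c("did", "wine", "cause", "a", "revolution", "or",
            "did", "bread", "cause", "a", "riot")
node   <- "cause"
window <- 2  # number of tokens to inspect on each side of the node

# Positions where the node word occurs
hits <- which(tokens == node)

# Offsets to the left and right of a hit, excluding the node itself
offsets <- setdiff(-window:window, 0)

# All positions that fall inside a window, clipped to the text boundaries
positions <- unlist(lapply(hits, function(i) i + offsets))
positions <- positions[positions >= 1 & positions <= length(tokens)]

# Frequency of each collocate within the window
col_freq <- sort(table(tokens[positions]), decreasing = TRUE)
print(col_freq)
```

Here "a" and "did" each fall inside the window twice, and the other collocates once. The same tokens reduced to a bag of counts would give no way to tell which words were near "cause".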
Note
We’re using the preprocess_text() function that we introduced in an earlier lab.
token | col_freq | total_freq | MI_1 |
---|---|---|---|
10:29 | 1 | 1 | 11.08049 |
38th | 1 | 1 | 11.08049 |
allocations | 1 | 1 | 11.08049 |
americanizing | 1 | 1 | 11.08049 |
anthedon | 1 | 1 | 11.08049 |
assignats | 1 | 1 | 11.08049 |
bamboozling | 1 | 1 | 11.08049 |
baser | 1 | 1 | 11.08049 |
borrowers | 1 | 1 | 11.08049 |
bridegrooms | 1 | 1 | 11.08049 |
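One way to read the table above is to split the PMI formula into two terms. This assumes that MI_1 is computed as \(\log_{2}(O_{11}/E_{11})\) with \(E_{11} = R_{1} \times C_{1} / N\), that col_freq plays the role of \(O_{11}\), and that total_freq plays the role of \(C_{1}\):

\[ PMI = \log_{2} \left( \frac{O_{11}}{E_{11}} \right) = \log_{2} \left( \frac{O_{11}}{C_{1}} \right) + \log_{2} \left( \frac{N}{R_{1}} \right) \]

The second term is the same for every collocate of a given node word. For a token that occurs once in the corpus and once in the window, the first term is \(\log_{2}(1) = 0\), so under these assumptions 11.08049 is the constant term alone and the highest score any collocate of this node can reach. A token such as "owe", with a col_freq of 5 and a total_freq of 21, gets \(\log_{2}(5/21) + 11.08049 \approx 9.0101\), which agrees with the value reported for it in the next table.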
token | col_freq | total_freq | MI_1 |
---|---|---|---|
owe | 5 | 21 | 9.010102 |
raise | 10 | 79 | 8.098639 |
extra | 6 | 64 | 7.665454 |
spend | 10 | 111 | 7.608004 |
insurance | 5 | 64 | 7.402420 |
spent | 9 | 122 | 7.319679 |
amount | 6 | 109 | 6.897270 |
making | 14 | 343 | 6.465782 |
cost | 6 | 154 | 6.398668 |
buy | 5 | 150 | 6.173601 |
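The contrast between the two tables above can be reproduced by filtering on a minimum collocate frequency and a minimum MI score. The sketch below copies a few rows from those tables into a data frame and applies two cutoffs. The cutoff values are hypothetical (every row of the second table has a col_freq of at least 5, but the settings actually used are not shown), and the variable names are made up for illustration.

```r
# A few rows copied from the two tables above
collocates <- data.frame(
  token      = c("10:29", "bamboozling", "owe", "raise", "buy"),
  col_freq   = c(1, 1, 5, 10, 5),
  total_freq = c(1, 1, 21, 79, 150),
  MI_1       = c(11.08049, 11.08049, 9.010102, 8.098639, 6.173601),
  stringsAsFactors = FALSE
)

# Hypothetical thresholds; suitable values depend on corpus size and purpose
min_col_freq <- 5
min_mi       <- 3

# Keep collocates that pass both thresholds, ranked by MI score
kept <- subset(collocates, col_freq >= min_col_freq & MI_1 >= min_mi)
kept <- kept[order(-kept$MI_1), ]
print(kept$token)
```

The frequency cutoff removes the tokens that occur only once, leaving "owe", "raise", and "buy" in descending order of MI. Changing either threshold and rerunning the filter shows how sensitive the resulting list is to that choice.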
Question
How do you determine where to set these kinds of thresholds?
library(ggraph)  # also attaches ggplot2

# `net` is assumed to be a network object created earlier
ggraph(net, weight = link_weight, layout = "stress") +
  geom_edge_link(color = "gray80", alpha = .75) +
  # size is a fixed value, so it is set outside aes()
  geom_node_point(aes(alpha = node_weight, color = n_intersects), size = 3) +
  geom_node_text(aes(label = label), repel = TRUE, size = 3) +
  scale_alpha(range = c(0.2, 0.9)) +
  theme_graph() +
  theme(legend.position = "none")