library(tidyverse)
library(quanteda)
library(quanteda.textstats)
17 Lab Set 2
The preview of this lab set is rendered in HTML. However, all assignments must be rendered in PDF for submission on Canvas. The textstat_tools repository is already set up to do this for you. Be sure to follow the directions, including the installation of tinytex.
17.1 Distributions
17.1.1 Task 1
load("../data/sample_corpus.rda")
source("../R/dispersion_functions.R")
source("../R/helper_functions.R")
sc_tokens <- sample_corpus %>%
  corpus() %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, what = "word") %>%
  tokens_tolower()
sc_dfm <- sc_tokens %>%
  dfm()
sc_freq <- sc_dfm %>%
  textstat_frequency() %>%
  mutate(RF = (frequency/sum(frequency))*1000000) # normalize to relative frequency per million tokens
Plot a histogram (or histograms) for the 1st, 10th, and 100th most frequent tokens in the sample corpus.
# your code goes here
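If you're unsure where to start, here is one possible approach (a sketch, not the required solution): it pulls the tokens ranked 1, 10, and 100 from sc_freq, converts the dfm counts to per-document proportions with quanteda's dfm_weight(), and facets the histograms.
# A sketch: histograms of per-document relative frequencies for the
# tokens ranked 1, 10, and 100 in sc_freq (your solution may differ)
target_tokens <- sc_freq$feature[c(1, 10, 100)]

sc_dfm %>%
  dfm_weight(scheme = "prop") %>%        # counts -> per-document proportions
  convert(to = "data.frame") %>%
  select(all_of(target_tokens)) %>%
  pivot_longer(everything(), names_to = "token", values_to = "rel_freq") %>%
  ggplot(aes(x = rel_freq)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ token, scales = "free") +
  theme_classic()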
What do you notice (or what conclusions can you draw) from the plots you’ve generated about the distributions of tokens as their frequency decreases?
Your response
17.1.2 Task 2
the <- dispersions_token(sc_dfm, "the") %>% unlist()
data <- dispersions_token(sc_dfm, "data") %>% unlist()
the['Deviation of proportions DP']
Deviation of proportions DP
                  0.1388907
data['Deviation of proportions DP']
Deviation of proportions DP
                   0.845857
What do you note about the difference in the Deviation of Proportions (DP) for the vs. data? Recall that DP ranges from 0 (a token spread evenly across the corpus) to 1 (a token concentrated in very few documents).
Your response
17.1.3 Task 3
sc_ft <- frequency_table(sc_tokens)
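It can help to inspect the table before answering (a short sketch; the exact column names depend on frequency_table()'s output, so check them before sorting):
# Peek at the table; confirm which columns hold the frequency and
# dispersion measures before sorting on them
names(sc_ft)
head(sc_ft, 10)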
Which token is the most frequent? The most dispersed?
Your response
Write a sentence or two reporting the frequencies and dispersions of the and data, following the examples on page 53 of Brezina:
Your response
ggplot(sc_freq %>% filter(rank < 101), aes(x = rank, y = frequency)) +
geom_point(shape = 1, alpha = .5) +
theme_classic() +
ylab("Absolute frequency") +
xlab("Rank")
The relationship you're seeing between the rank of a token and its frequency holds true for almost any corpus and is referred to as Zipf's Law (see Brezina pg. 44).
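To see the law more directly (a sketch): if frequency is roughly proportional to 1/rank, re-drawing the plot with both axes on log scales should straighten the curve into an approximate line.
# Zipf's Law predicts frequency ~ C/rank, so the same data on log-log
# axes should fall along a roughly straight line
ggplot(sc_freq %>% filter(rank < 1001), aes(x = rank, y = frequency)) +
  geom_point(shape = 1, alpha = .5) +
  scale_x_log10() +
  scale_y_log10() +
  theme_classic() +
  ylab("Absolute frequency (log scale)") +
  xlab("Rank (log scale)")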
17.1.4 Task 4
Describe at least one statistical and one methodological implication of what the plot is illustrating.
Your response
17.2 Collocations and association measures
17.2.1 Task 1
library(tidyverse)
library(quanteda)
library(quanteda.textstats)
source("../R/helper_functions.R")
source("../R/utility_functions.R")
source("../R/collocation_functions.R")
sc_tokens <- sample_corpus %>%
  mutate(text = preprocess_text(text)) %>%
  corpus() %>%
  tokens(what="fastestword", remove_numbers=TRUE)
money_collocations <- collocates_by_MI(sc_tokens, "money")
time_collocations <- collocates_by_MI(sc_tokens, "time")
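Before writing the report, it can help to sort each table by the association measure. MI is known to inflate scores for low-frequency collocates, so a minimum collocate frequency is a common safeguard (a sketch; the col_freq and MI_1 column names are taken from the filtering step in Task 2 below).
# Top 10 collocates of "time" by mutual information, keeping only
# collocates occurring at least 5 times to offset MI's low-frequency bias
time_collocations %>%
  filter(col_freq >= 5) %>%
  arrange(desc(MI_1)) %>%
  head(10)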
Report the collocations of time and money in 2 or 3 sentences following the conventions described in Brezina (pg. 75).
Your response
17.2.2 Task 2
tc <- time_collocations %>% filter(col_freq >= 5 & MI_1 >= 5)
mc <- money_collocations %>% filter(col_freq >= 5 & MI_1 >= 5)
net <- col_network(tc, mc)
library(ggraph)
ggraph(net, weight = link_weight, layout = "stress") +
geom_edge_link(color = "gray80", alpha = .75) +
geom_node_point(aes(alpha = node_weight, size = 3, color = n_intersects)) +
geom_node_text(aes(label = label), repel = T, size = 3) +
scale_alpha(range = c(0.2, 0.9)) +
theme_graph() +
theme(legend.position="none")
Write a 2-4 sentence interpretation of the time vs. money collocational network.
Your response
17.2.3 Task 3
Load the down-sampled screenplays, extract the dialogue, and tokenize the data.
load("../data/screenplays.rda")
sp <- from_play(sp, extract = "dialogue")
sp <- sp %>%
  mutate(text = preprocess_text(text)) %>%
  corpus() %>%
  tokens(what="fastestword", remove_numbers=TRUE)
# left = 3, right = 0 restricts the search to a window of
# 3 tokens to the left of the node word only
b <- collocates_by_MI(sp, "boy", left = 3, right = 0)
b <- b %>% filter(col_freq >= 3 & MI_1 >= 3)

g <- collocates_by_MI(sp, "girl", left = 3, right = 0)
g <- g %>% filter(col_freq >= 3 & MI_1 >= 3)
17.2.3.1 Plot the network
net <- col_network(b, g)
ggraph(net, weight = link_weight, layout = "stress") +
geom_edge_link(color = "gray80", alpha = .75) +
geom_node_point(aes(alpha = node_weight, size = 3, color = n_intersects)) +
geom_node_text(aes(label = label), repel = T, size = 3) +
scale_alpha(range = c(0.2, 0.9)) +
theme_graph() +
theme(legend.position="none")
Write a 3-5 sentence interpretation of the boy vs. girl collocational network, which includes reporting relevant association measures following the example in Brezina (pg. 75).
Your response
17.3 Keyness
17.3.1 Task 1
17.3.1.1 Create a keyness table
source("../R/keyness_functions.R")
source("../R/helper_functions.R")
load("../data/sample_corpus.rda")
- In the code block below, create a document-feature matrix of the blog text-type.
- In the same code block, create a keyness table with the blog text-type as the target corpus and the news text-type as the reference.
# your code goes here
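One possible starting point is sketched below. It assumes that the doc_id values in sample_corpus encode the text-type as a prefix (e.g., blog_01) and that keyness_table() in keyness_functions.R takes a target dfm and a reference dfm; verify both assumptions against your data and the function source before relying on it.
# A sketch (not the required solution); both the doc_id prefix and the
# keyness_table(target_dfm, reference_dfm) signature are assumptions
make_dfm <- function(df) {
  df %>%
    corpus() %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
    tokens_tolower() %>%
    dfm()
}

blog_dfm <- sample_corpus %>%
  filter(str_detect(doc_id, "^blog")) %>%
  make_dfm()

news_dfm <- sample_corpus %>%
  filter(str_detect(doc_id, "^news")) %>%
  make_dfm()

bn_keyness <- keyness_table(blog_dfm, news_dfm)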
- Use the code block below to output the head of the keyness table with an accompanying caption.
# your table goes here
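If you followed the sketch above, knitr::kable() will produce a captioned table from the head of the keyness table (bn_keyness is the name assumed in that sketch).
# A captioned table of the top rows of the keyness table
knitr::kable(
  head(bn_keyness),
  caption = "Tokens with the highest keyness in blog (target) vs. news (reference).",
  digits = 2
)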
17.3.1.2 Answer the following questions
- What are the 2 tokens with the highest keyness values?
Your response
- Posit an explanation for their greater frequency in the blog corpus, being as descriptive as possible. Think about the communicative purposes of these text-types, as opposed to value judgments about the writers or the genres.
Your response
- What are the 2 tokens with the greatest effect sizes?
Your response
- Posit a reason for that result.
Your response