3 Tokenizing with quanteda

Processing pipelines

In the previous lab, we did some back-of-the-napkin text processing. In that lab, you were encouraged to think about what exactly is happening when you split a text into tokens and convert those into counts.

We'll build on that foundational work, but let an R package, quanteda, do some of the heavy lifting for us. So let's load our packages:

library(quanteda)
library(quanteda.textstats)
library(tidyverse)
library(gt)

Why use quanteda rather than something like tidytext? Quite simply, processing speed. While most packages are built around R functions and structures, quanteda does its processing in C++ under the hood. Compiled languages like C++ and Rust are much more efficient for string processing. This is also why polars is useful for processing large data frames.
Load in some useful functions:
source("../R/helper_functions.R")
And again, we’ll start with the first sentence from A Tale of Two Cities.
<- "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair." totc_txt
3.1 Create a corpus
The first step is to create a corpus object:
totc_corpus <- corpus(totc_txt)
And see what we have:
totc_corpus |>
  summary() |>
  gt()
Text | Types | Tokens | Sentences |
---|---|---|---|
text1 | 23 | 70 | 1 |
Note that if we had more than 1 document, we would get a count of how many documents each token appears in, and we could assign documents to grouping variables. This will become useful later.
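To see what that looks like, here's a quick sketch with a second document added (the second text is invented purely for illustration). Each document gets its own row in the summary, and the docfreq counts we'll encounter later report how many documents contain each token.

two_docs <- corpus(c(text1 = totc_txt,
                     text2 = "It was a bright cold day in April."))
summary(two_docs)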
3.2 Tokenize the corpus
totc_tkns <- tokens(totc_corpus, what = "word", remove_punct = TRUE)
3.3 Create a document-feature matrix (dfm)
totc_dfm <- dfm(totc_tkns)
A dfm is an important data structure to understand, as it often serves as the foundation for all kinds of downstream statistical processing. It is a table with rows for documents (or observations) and columns for tokens (or variables).
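Since we removed punctuation, our single sentence yields a dfm with 1 row and 20 feature columns (you can count the same 20 features in the frequency table below). A quick check:

dim(totc_dfm)  # 1 document (row), 20 features (columns)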
totc_dfm |>
  convert(to = "data.frame") |>
  dplyr::select(1:12) |>
  gt()
doc_id | it | was | the | best | of | times | worst | age | wisdom | foolishness | epoch |
---|---|---|---|---|---|---|---|---|---|---|---|
text1 | 10 | 10 | 10 | 1 | 10 | 2 | 1 | 2 | 1 | 1 | 2 |
3.4 And count our tokens
totc_dfm |>
  textstat_frequency() |>
  gt()
feature | frequency | rank | docfreq | group |
---|---|---|---|---|
it | 10 | 1 | 1 | all |
was | 10 | 1 | 1 | all |
the | 10 | 1 | 1 | all |
of | 10 | 1 | 1 | all |
times | 2 | 5 | 1 | all |
age | 2 | 5 | 1 | all |
epoch | 2 | 5 | 1 | all |
season | 2 | 5 | 1 | all |
best | 1 | 9 | 1 | all |
worst | 1 | 9 | 1 | all |
wisdom | 1 | 9 | 1 | all |
foolishness | 1 | 9 | 1 | all |
belief | 1 | 9 | 1 | all |
incredulity | 1 | 9 | 1 | all |
light | 1 | 9 | 1 | all |
darkness | 1 | 9 | 1 | all |
spring | 1 | 9 | 1 | all |
hope | 1 | 9 | 1 | all |
winter | 1 | 9 | 1 | all |
despair | 1 | 9 | 1 | all |
3.6 Using pipes to expedite the process
This time, we will change remove_punct to FALSE.
totc_freq <- totc_corpus %>%
  tokens(what = "word", remove_punct = FALSE) %>%
  dfm() %>%
  textstat_frequency()
totc_freq |>
  gt()
feature | frequency | rank | docfreq | group |
---|---|---|---|---|
it | 10 | 1 | 1 | all |
was | 10 | 1 | 1 | all |
the | 10 | 1 | 1 | all |
of | 10 | 1 | 1 | all |
, | 9 | 5 | 1 | all |
times | 2 | 6 | 1 | all |
age | 2 | 6 | 1 | all |
epoch | 2 | 6 | 1 | all |
season | 2 | 6 | 1 | all |
best | 1 | 10 | 1 | all |
worst | 1 | 10 | 1 | all |
wisdom | 1 | 10 | 1 | all |
foolishness | 1 | 10 | 1 | all |
belief | 1 | 10 | 1 | all |
incredulity | 1 | 10 | 1 | all |
light | 1 | 10 | 1 | all |
darkness | 1 | 10 | 1 | all |
spring | 1 | 10 | 1 | all |
hope | 1 | 10 | 1 | all |
winter | 1 | 10 | 1 | all |
despair | 1 | 10 | 1 | all |
. | 1 | 10 | 1 | all |
3.6 Tokenizing options
In the previous lab, you were asked to consider the questions: What counts as a token/word? And how do you tell the computer to count what you want?
As the above code blocks suggest, the tokens() function in quanteda gives you some measure of control.
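For example, tokens() offers a family of logical switches beyond remove_punct. A quick sketch (see ?tokens for the full set of options):

tokens(totc_corpus,
       what = "word",
       remove_punct = TRUE,
       remove_symbols = TRUE,  # drop symbol characters
       remove_numbers = TRUE)  # drop tokens consisting only of digits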
We’ll read in a more complex string:
<- "Jane Austen was not credited as the author of 'Pride and Prejudice.' In 1813, the title page simply read \"by the author of Sense and Sensibility.\" It wasn't until after Austen's death that her identity was revealed. #MentalFlossBookClub with @HowLifeUnfolds #15Pages https://pbs.twimg.com/media/EBOUqbfWwAABEoj.jpg" text_2
And process it as we did earlier.
text_2_freq <- text_2 %>%
  corpus() %>%
  tokens(what = "word", remove_punct = TRUE) %>%
  dfm() %>%
  textstat_frequency()
text_2_freq |>
  gt()
feature | frequency | rank | docfreq | group |
---|---|---|---|---|
the | 3 | 1 | 1 | all |
was | 2 | 2 | 1 | all |
author | 2 | 2 | 1 | all |
of | 2 | 2 | 1 | all |
and | 2 | 2 | 1 | all |
jane | 1 | 6 | 1 | all |
austen | 1 | 6 | 1 | all |
not | 1 | 6 | 1 | all |
credited | 1 | 6 | 1 | all |
as | 1 | 6 | 1 | all |
pride | 1 | 6 | 1 | all |
prejudice | 1 | 6 | 1 | all |
in | 1 | 6 | 1 | all |
1813 | 1 | 6 | 1 | all |
title | 1 | 6 | 1 | all |
page | 1 | 6 | 1 | all |
simply | 1 | 6 | 1 | all |
read | 1 | 6 | 1 | all |
by | 1 | 6 | 1 | all |
sense | 1 | 6 | 1 | all |
sensibility | 1 | 6 | 1 | all |
it | 1 | 6 | 1 | all |
wasn't | 1 | 6 | 1 | all |
until | 1 | 6 | 1 | all |
after | 1 | 6 | 1 | all |
austen's | 1 | 6 | 1 | all |
death | 1 | 6 | 1 | all |
that | 1 | 6 | 1 | all |
her | 1 | 6 | 1 | all |
identity | 1 | 6 | 1 | all |
revealed | 1 | 6 | 1 | all |
#mentalflossbookclub | 1 | 6 | 1 | all |
with | 1 | 6 | 1 | all |
@howlifeunfolds | 1 | 6 | 1 | all |
#15pages | 1 | 6 | 1 | all |
https://pbs.twimg.com/media/ebouqbfwwaabeoj.jpg | 1 | 6 | 1 | all |
Note that in addition to various logical "remove" arguments (remove_punct, remove_symbols, etc.), the tokens() function has a what argument. The default, "word", is "smarter", but also slower. Another option is "fastestword", which simply splits at spaces.
text_2_freq <- text_2 %>%
  corpus() %>%
  tokens(what = "fastestword", remove_punct = TRUE, remove_url = TRUE) %>%
  dfm() %>%
  textstat_frequency() %>%
  as_tibble() %>%
  dplyr::select(feature, frequency)
text_2_freq |>
  gt()
feature | frequency |
---|---|
the | 3 |
was | 2 |
author | 2 |
of | 2 |
and | 2 |
jane | 1 |
austen | 1 |
not | 1 |
credited | 1 |
as | 1 |
'pride | 1 |
prejudice.' | 1 |
in | 1 |
1813, | 1 |
title | 1 |
page | 1 |
simply | 1 |
read | 1 |
"by | 1 |
sense | 1 |
sensibility." | 1 |
it | 1 |
wasn't | 1 |
until | 1 |
after | 1 |
austen's | 1 |
death | 1 |
that | 1 |
her | 1 |
identity | 1 |
revealed. | 1 |
#mentalflossbookclub | 1 |
with | 1 |
@howlifeunfolds | 1 |
#15pages | 1 |
This, of course, makes no difference with just a few tokens, but it does if you're trying to process millions. Also note that we've used the select() function to choose specific columns.
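To get a rough sense of the speed difference, you could time the two tokenizers on an artificially inflated input. A minimal sketch using base R's system.time() (the replication factor is arbitrary, and timings will vary by machine):

big_corpus <- corpus(rep(totc_txt, 10000))  # 10,000 copies of the sentence

system.time(tokens(big_corpus, what = "word"))
system.time(tokens(big_corpus, what = "fastestword"))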
3.7 Pre-processing
As an alternative to making tokenizing decisions inside the tokenizing process, you can process the text before tokenizing, using functions for manipulating strings in stringr, stringi, textclean, or base R (like grep()). Some common and convenient transformations are wrapped in a function called preprocess_text().
text_2_freq <- text_2 %>%
  preprocess_text() %>%
  corpus() %>%
  tokens(what = "fastestword") %>%
  dfm() %>%
  textstat_frequency() %>%
  as_tibble() %>%
  dplyr::select(feature, frequency) %>%
  rename(Token = feature, AF = frequency) %>%
  mutate(New = NA)
text_2_freq |>
  gt()
Token | AF | New |
---|---|---|
was | 3 | NA |
the | 3 | NA |
austen | 2 | NA |
author | 2 | NA |
of | 2 | NA |
and | 2 | NA |
jane | 1 | NA |
not | 1 | NA |
credited | 1 | NA |
as | 1 | NA |
pride | 1 | NA |
prejudice | 1 | NA |
in | 1 | NA |
1813 | 1 | NA |
title | 1 | NA |
page | 1 | NA |
simply | 1 | NA |
read | 1 | NA |
by | 1 | NA |
sense | 1 | NA |
sensibility | 1 | NA |
it | 1 | NA |
n't | 1 | NA |
until | 1 | NA |
after | 1 | NA |
s | 1 | NA |
death | 1 | NA |
that | 1 | NA |
her | 1 | NA |
identity | 1 | NA |
revealed | 1 | NA |
mentalflossbookclub | 1 | NA |
with | 1 | NA |
howlifeunfolds | 1 | NA |
15pages | 1 | NA |
httpspbs.twimg.com/media/ebouqbfwwaabeoj.jpg | 1 | NA |
Note how the default arguments treat negation and possessive markers. As with the tokens() function, many of these [options](http://htmlpreview.github.io/?https://raw.githubusercontent.com/browndw/quanteda.extras/main/vignettes/preprocess_introduction.html) are logical.
Note, too, that we've renamed the columns and added a new one using mutate().
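If you need more control than preprocess_text() offers, you can assemble a rough equivalent from stringr functions (loaded with tidyverse). The sketch below shows the general approach; the specific replacement rules are illustrative assumptions, not a reproduction of preprocess_text()'s actual behavior:

text_2_manual <- text_2 %>%
  str_to_lower() %>%                          # lowercase everything
  str_replace_all("https?://\\S+", " ") %>%   # strip URLs
  str_replace_all("[^a-z0-9' ]", " ") %>%     # replace punctuation and symbols with spaces
  str_squish()                                # collapse repeated whitespace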
Complete Task 1 in Lab Set 1.
3.8 Creating a corpus composition table
Whenever you report the results of a corpus-based analysis, it is best practice to include a table that summarizes the composition of your corpus (or corpora) and any relevant variables. Most often this would include token counts aggregated by relevant categorical variables and a row of totals.
3.9 Adding a grouping variable
We have 2 short texts (one from fiction and one from Twitter). Let's first combine them into a single corpus. First, a data frame is created that has 2 columns (doc_id and text). Then, the text column is passed to the preprocess_text() function before creating the corpus.
comb_corpus <- data.frame(doc_id = c("text_1", "text_2"),
                          text = c(totc_txt, text_2)) %>%
  mutate(text = preprocess_text(text)) %>%
  corpus()
Next, we'll assign a grouping variable using docvars(). In later labs, we'll use a similar process to assign variables from tables of metadata.
docvars(comb_corpus) <- data.frame(text_type = c("Fiction", "Twitter"))
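As a preview of that metadata workflow, document variables can also be pulled from a separate table by matching on document names. A hypothetical sketch (the meta data frame is invented for illustration):

meta <- data.frame(doc_id = c("text_1", "text_2"),
                   text_type = c("Fiction", "Twitter"))

docvars(comb_corpus, "text_type") <- meta$text_type[match(docnames(comb_corpus), meta$doc_id)]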
Now we can tokenize.
comb_tkns <- comb_corpus %>%
  tokens(what = "fastestword")
Once we have done this, we can use that grouping variable to manipulate the data in a variety of ways. We could use dfm_group() to aggregate by group instead of by individual text. (Though because we have only 2 texts here, it amounts to the same thing.)
comb_dfm <- dfm(comb_tkns) %>%
  dfm_group(groups = text_type)

corpus_comp <- ntoken(comb_dfm) %>%
  data.frame(frequency = .) %>%
  rownames_to_column("group") %>%
  group_by(group) %>%
  summarize(Texts = n(),
            Tokens = sum(frequency))
corpus_comp |>
  gt() |>
  fmt_integer() |>
  cols_label(
    group = md("**Text Type**"),
    Texts = md("**Texts**"),
    Tokens = md("**Tokens**")
  ) |>
  grand_summary_rows(
    columns = c(Texts, Tokens),
    fns = list(
      Total ~ sum(.)
    ),
    fmt = ~ fmt_integer(.)
  )
Text Type | Texts | Tokens |
---|---|---|
Fiction | 1 | 60 |
Twitter | 1 | 44 |
Total | 2 | 104 |
Complete Task 2 in Lab Set 1.