3  Tokenizing with quanteda

Processing pipelines

In the previous lab, we did some back-of-the-napkin text processing, and you were encouraged to think about what exactly happens when you split a text into tokens and convert those tokens into counts.

We’ll build on that foundational work, but let the R package quanteda do some of the heavy lifting for us. So let’s load our packages:

Note

Why use quanteda rather than something like tidytext? Quite simply, processing speed. While most packages are built around R functions and structures, quanteda does its processing in C++ under the hood. Compiled languages like C++ and Rust are much more efficient for string processing. This is also why polars is useful for processing large data frames.

library(quanteda)
library(quanteda.textstats)
library(tidyverse)
library(gt)

Load in some useful functions:

source("../R/helper_functions.R")

And again, we’ll start with the first sentence from A Tale of Two Cities.

totc_txt <- "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair."

3.1 Create a corpus

The first step is to create a corpus object:

totc_corpus <- corpus(totc_txt)

And see what we have:

Code
totc_corpus |>
  summary() |>
  gt()
Summary of a corpus.
Text Types Tokens Sentences
text1 23 70 1

Note that if we had more than one document, we would get a count of how many documents each token appears in, and that we can assign documents to a grouping variable. This will become useful later.
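To see what that looks like, here is a minimal sketch using two invented mini-documents; with more than one text, summary() returns a row for each document.

# A minimal sketch with two invented mini-documents: summary() now
# returns one row per text.
two_docs <- corpus(c(doc1 = "It was the best of times.",
                     doc2 = "It was the worst of times."))
summary(two_docs)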

3.2 Tokenize the corpus

totc_tkns <- tokens(totc_corpus, what = "word", remove_punct = TRUE)
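Printing the tokens object is a quick way to check the result and confirm that the punctuation has been dropped:

# Inspect the tokens and count how many there are per document.
print(totc_tkns)
ntoken(totc_tkns)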

3.3 Create a document-feature matrix (dfm)

totc_dfm <- dfm(totc_tkns)

A dfm is an important data structure to understand, as it often serves as the foundation for all kinds of downstream statistical processing. It is a table with rows for documents (or observations) and columns for tokens (or variables).
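Before converting it for display, you can also inspect the dfm directly with a few standard quanteda helpers:

# Dimensions of the dfm (documents x features) and its most frequent features.
ndoc(totc_dfm)
nfeat(totc_dfm)
topfeatures(totc_dfm, 5)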

Code
totc_dfm |>
  convert(to = "data.frame") |>
  dplyr::select(1:12) |>
  gt()
Part of a document-feature matrix.
doc_id it was the best of times worst age wisdom foolishness epoch
text1 10 10 10 1 10 2 1 2 1 1 2

3.4 And count our tokens

Code
totc_dfm |>
  textstat_frequency() |>
  gt()
Token counts of sample sentence.
feature frequency rank docfreq group
it 10 1 1 all
was 10 1 1 all
the 10 1 1 all
of 10 1 1 all
times 2 5 1 all
age 2 5 1 all
epoch 2 5 1 all
season 2 5 1 all
best 1 9 1 all
worst 1 9 1 all
wisdom 1 9 1 all
foolishness 1 9 1 all
belief 1 9 1 all
incredulity 1 9 1 all
light 1 9 1 all
darkness 1 9 1 all
spring 1 9 1 all
hope 1 9 1 all
winter 1 9 1 all
despair 1 9 1 all

3.5 Using pipes to expedite the process

This time, we will change remove_punct to FALSE.

totc_freq <- totc_corpus %>%
  tokens(what = "word", remove_punct = FALSE) %>%
  dfm() %>%
  textstat_frequency()
Code
totc_freq |>
  gt()
Token counts of sample sentence.
feature frequency rank docfreq group
it 10 1 1 all
was 10 1 1 all
the 10 1 1 all
of 10 1 1 all
, 9 5 1 all
times 2 6 1 all
age 2 6 1 all
epoch 2 6 1 all
season 2 6 1 all
best 1 10 1 all
worst 1 10 1 all
wisdom 1 10 1 all
foolishness 1 10 1 all
belief 1 10 1 all
incredulity 1 10 1 all
light 1 10 1 all
darkness 1 10 1 all
spring 1 10 1 all
hope 1 10 1 all
winter 1 10 1 all
despair 1 10 1 all
. 1 10 1 all

3.6 Tokenizing options

In the previous lab, you were asked to consider the questions: What counts as a token/word? And how do you tell the computer to count what you want?

As the above code block suggests, the tokens() function in quanteda gives you some measure of control.
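For reference, here is an illustrative call showing several of the logical arguments that tokens() accepts (the particular settings are arbitrary choices, not recommendations):

# Illustrative only: some of the logical switches available in tokens().
tokens(totc_corpus,
       what = "word",
       remove_punct = TRUE,
       remove_symbols = TRUE,
       remove_numbers = TRUE,
       remove_url = TRUE)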

We’ll read in a more complex string:

text_2 <- "Jane Austen was not credited as the author of 'Pride and Prejudice.' In 1813, the title page simply read \"by the author of Sense and Sensibility.\" It wasn't until after Austen's death that her identity was revealed. #MentalFlossBookClub with @HowLifeUnfolds #15Pages https://pbs.twimg.com/media/EBOUqbfWwAABEoj.jpg"

And process it as we did earlier.

text_2_freq <- text_2 %>%
  corpus() %>%
  tokens(what = "word", remove_punct = TRUE) %>%
  dfm() %>%
  textstat_frequency()
Code
text_2_freq |>
  gt()
Token counts of sample Tweet
feature frequency rank docfreq group
the 3 1 1 all
was 2 2 1 all
author 2 2 1 all
of 2 2 1 all
and 2 2 1 all
jane 1 6 1 all
austen 1 6 1 all
not 1 6 1 all
credited 1 6 1 all
as 1 6 1 all
pride 1 6 1 all
prejudice 1 6 1 all
in 1 6 1 all
1813 1 6 1 all
title 1 6 1 all
page 1 6 1 all
simply 1 6 1 all
read 1 6 1 all
by 1 6 1 all
sense 1 6 1 all
sensibility 1 6 1 all
it 1 6 1 all
wasn't 1 6 1 all
until 1 6 1 all
after 1 6 1 all
austen's 1 6 1 all
death 1 6 1 all
that 1 6 1 all
her 1 6 1 all
identity 1 6 1 all
revealed 1 6 1 all
#mentalflossbookclub 1 6 1 all
with 1 6 1 all
@howlifeunfolds 1 6 1 all
#15pages 1 6 1 all
https://pbs.twimg.com/media/ebouqbfwwaabeoj.jpg 1 6 1 all

Note that in addition to various logical “remove” arguments (remove_punct, remove_symbols, etc.), the tokens() function has a what argument. The default, “word”, is “smarter”, but also slower. Another option is “fastestword”, which splits at spaces.

text_2_freq <- text_2 %>%
  corpus() %>%
  tokens(what = "fastestword", remove_punct = TRUE, remove_url = TRUE) %>%
  dfm() %>%
  textstat_frequency()  %>%
  as_tibble() %>%
  dplyr::select(feature, frequency)
Code
text_2_freq |>
  gt()
Token counts of sample Tweet
feature frequency
the 3
was 2
author 2
of 2
and 2
jane 1
austen 1
not 1
credited 1
as 1
'pride 1
prejudice.' 1
in 1
1813, 1
title 1
page 1
simply 1
read 1
"by 1
sense 1
sensibility." 1
it 1
wasn't 1
until 1
after 1
austen's 1
death 1
that 1
her 1
identity 1
revealed. 1
#mentalflossbookclub 1
with 1
@howlifeunfolds 1
#15pages 1

This, of course, makes no difference with just a few tokens, but does if you’re trying to process millions.
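If you want to see the difference on your own machine, a rough timing sketch like the one below will show it (the repetition factor of 10,000 is an arbitrary choice, just enough to make the input measurable):

# Rough timing comparison: repeat the sample sentence to build a larger corpus,
# then time each tokenizer.
big_corpus <- corpus(rep(totc_txt, 10000))
system.time(tokens(big_corpus, what = "word"))
system.time(tokens(big_corpus, what = "fastestword"))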

Also note that we’ve used the select() function to choose specific columns.

3.7 Pre-processing

As an alternative to making tokenizing decisions inside the tokenizing process, you can process the text before tokenizing using functions for manipulating strings in stringr, stringi, textclean, or base R (like grep()). Some common and convenient transformations are wrapped in a function called preprocess_text().

text_2_freq <- text_2 %>%
  preprocess_text() %>%
  corpus() %>%
  tokens(what = "fastestword") %>%
  dfm() %>%
  textstat_frequency() %>%
  as_tibble() %>%
  dplyr::select(feature, frequency) %>%
  rename(Token = feature, AF = frequency) %>%
  mutate(New = NA)
Code
text_2_freq |>
  gt()
Token counts of sample Tweet
Token AF New
was 3 NA
the 3 NA
austen 2 NA
author 2 NA
of 2 NA
and 2 NA
jane 1 NA
not 1 NA
credited 1 NA
as 1 NA
pride 1 NA
prejudice 1 NA
in 1 NA
1813 1 NA
title 1 NA
page 1 NA
simply 1 NA
read 1 NA
by 1 NA
sense 1 NA
sensibility 1 NA
it 1 NA
n't 1 NA
until 1 NA
after 1 NA
s 1 NA
death 1 NA
that 1 NA
her 1 NA
identity 1 NA
revealed 1 NA
mentalflossbookclub 1 NA
with 1 NA
howlifeunfolds 1 NA
15pages 1 NA
httpspbs.twimg.com/media/ebouqbfwwaabeoj.jpg 1 NA

Note how the default arguments treat negation and possessive markers. As with the tokens() function, many of these [options](http://htmlpreview.github.io/?https://raw.githubusercontent.com/browndw/quanteda.extras/main/vignettes/preprocess_introduction.html) are logical.

Note, too, that we’ve renamed the columns and added a new one using mutate().
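If you prefer to handle these steps yourself with the string functions mentioned above, a hand-rolled sketch using stringr (loaded with the tidyverse) might look like the following. It is only an approximation, not a drop-in replacement for preprocess_text() and its defaults:

# A rough stringr-based alternative (sketch only): lower-case the text,
# strip punctuation, and collapse extra whitespace before tokenizing.
text_2 %>%
  str_to_lower() %>%
  str_remove_all("[[:punct:]]") %>%
  str_squish()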

Pause for Lab Set Question

3.8 Creating a corpus composition table

Whenever you report the results of a corpus-based analysis, it is best practice to include a table that summarizes the composition of your corpus (or corpora) and any relevant variables. Most often this would include token counts aggregated by relevant categorical variables and a row of totals.

3.9 Adding a grouping variable

We have two short texts (one from fiction and one from Twitter). Let’s combine them into a single corpus. First, we create a data frame with two columns (doc_id and text). Then, the text column is passed to the preprocess_text() function before creating the corpus.

comb_corpus <- data.frame(doc_id = c("text_1", "text_2"), text = c(totc_txt, text_2)) %>%
  mutate(text = preprocess_text(text)) %>%
  corpus()

Next we’ll assign a grouping variable using docvars(). In later labs, we’ll use a similar process to assign variables from tables of metadata.

docvars(comb_corpus) <- data.frame(text_type = c("Fiction", "Twitter"))
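A quick check with docvars() confirms that each document now carries a text_type value:

# Retrieve the document variables we just assigned.
docvars(comb_corpus)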

Now we can tokenize.

comb_tkns <- comb_corpus %>%
  tokens(what = "fastestword")

Once we have done this, we can use that grouping variable to manipulate the data in a variety of ways. We could use dfm_group() to aggregate by group instead of by individual text. (Though because we have only two texts here, it amounts to the same thing.)

comb_dfm <- dfm(comb_tkns) %>% 
  dfm_group(groups = text_type)

corpus_comp <- ntoken(comb_dfm) %>%
  data.frame(frequency = .) %>%
  rownames_to_column("group") %>%
  group_by(group) %>%
  summarize(Texts = n(),
            Tokens = sum(frequency))
Code
corpus_comp |> 
  gt() |>
  fmt_integer() |>
  cols_label(
    group = md("**Text Type**"),
    Texts = md("**Texts**"),
    Tokens = md("**Tokens**")
  ) |>
  grand_summary_rows(
    columns = c(Texts, Tokens),
    fns = list(
      Total ~ sum(.)
    ) ,
    fmt = ~ fmt_integer(.)
    )
Table 3.1: Composition of corpus.

Text Type Texts Tokens
Fiction 1 60
Twitter 1 44
Total 2 104
Pause for Lab Set Question