2  NLP Basics

2.1 A simple processing pipeline

Let’s begin by creating an object consisting of a character string, in this case the first sentence from A Tale of Two Cities.

totc_txt <- "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair."

And we’ll load the tidyverse libraries, along with gt for formatting tables.

library(gt)
library(tidyverse)

We could then split the vector, say at each space.

totc_tkns <- totc_txt %>% str_split(" ")
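
Note that str_split() returns a list containing one character vector per input string, so the tokens themselves sit in its first (and only) element:

totc_tkns[[1]][1:5] # peek at the first five tokens

[1] "It"   "was"  "the"  "best" "of"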

Then, we can create a table of counts (the column AF holds each token’s absolute frequency).

totc_df <- table(totc_tkns) %>% # make a table of counts
  as_tibble() %>%
  rename(Token = totc_tkns, AF = n) %>% # rename columns
  arrange(-AF) # sort the data by frequency
We can then view the ten most frequent tokens:
totc_df |>
  head(10) |>
  gt()
Token     AF
of        10
the       10
was       10
it         9
age        2
epoch      2
season     2
times,     2
belief,    1
best       1

The process of splitting the string vector into constituent parts is called tokenizing. Think of this as telling the computer how to define a word (or a “token”, which is a more precise, technical term). In this case, we’ve done it in an extremely simple way: by defining a token as any string that is bounded by spaces.
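
Notice, too, that defining tokens this way leaves punctuation attached to words, which is why times, (comma included) appears as its own token in the table above. As a minimal sketch of one alternative, we could strip punctuation with stringr’s str_remove_all() before splitting; the [[:punct:]] pattern here is just one possible choice, and it would also remove apostrophes in contractions:

totc_txt %>%
  str_remove_all("[[:punct:]]") %>% # remove punctuation marks
  str_split(" ")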

We can check how it has been counted by filtering for that token:
totc_df |>
  filter(str_detect(Token, regex("^it$", ignore_case = TRUE))) |>
  gt()
Token   AF
it       9
It       1

Case-sensitive counts of the token it

Note that in doing so, we are counting capitalized and non-capitalized words as distinct tokens.

There may be specific instances when we want to do this. But normally, we’d want it and It to be the same token. To do that, we can add a step in the processing pipeline that converts our vector to lower case before tokenizing.

totc_df <- tolower(totc_txt) %>%
  str_split(" ") %>%
  table() %>% # make a table of counts
  as_tibble() %>%
  rename(Token = ".", AF = n) %>% # rename columns ("." is the dimension name the pipe gives the table)
  arrange(-AF) # sort the data by frequency
Viewing the ten most frequent tokens again:
totc_df |>
  head(10) |>
  gt()
Token     AF
it        10
of        10
the       10
was       10
age        2
epoch      2
season     2
times,     2
belief,    1
best       1

Token counts of the sample sentence.
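
Because we will generate counts like this throughout, the whole pipeline can be bundled into a small helper function. This is just a convenience sketch (the name count_tokens is ours, not from any package), assuming the tidyverse is loaded as above:

count_tokens <- function(txt) {
  tolower(txt) %>% # case-fold so "It" and "it" match
    str_split(" ") %>% # tokenize at spaces
    table() %>% # make a table of counts
    as_tibble() %>%
    rename(Token = ".", AF = n) %>% # the piped table names its dimension "."
    arrange(-AF) # sort by descending frequency
}

count_tokens(totc_txt) %>% head(3) # the three most frequent tokens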

Pause for Lab Set Question