totc_txt <- "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair."2 NLP Basics
2.1 A simple processing pipeline
Let’s begin by creating an object consisting of a character string. In this case, the first sentence from A Tale of Two Cities.
And we’ll load the tidyverse libraries.
library(gt)
library(tidyverse)We could then split the vector, say at each space.
```r
totc_tkns <- totc_txt |>
  str_split(" ") |>
  unlist() # str_split() returns a list, so flatten it to a character vector
```

Then, we can create a table of counts.
```r
totc_df <- table(totc_tkns) |> # make a table of counts
  as_tibble() |>
  rename(Token = totc_tkns, AF = n) |> # rename columns
  arrange(desc(AF)) # sort the data by frequency
```
```r
totc_df |>
  head(10) |>
  gt()
```

| Token | AF |
|---|---|
| of | 10 |
| the | 10 |
| was | 10 |
| it | 9 |
| age | 2 |
| epoch | 2 |
| season | 2 |
| times, | 2 |
| belief, | 1 |
| best | 1 |
The process of splitting the string vector into its constituent parts is called tokenizing. Think of this as telling the computer how to define a word (or a “token”, which is a more precise, technical term). In this case, we’ve done it in an extremely simple way: by defining a token as any string that is bounded by spaces.
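One consequence of that definition, visible in the counts above, is that punctuation travels with the adjacent word: times, (with the comma) is a token in its own right. A minimal sketch of this behavior, using a fragment of the same sentence:

```r
# Splitting on spaces alone keeps punctuation glued to its word:
# "times," (with the comma) comes through as a single token.
str_split("the best of times, it was", " ") |> unlist()
#> [1] "the"    "best"   "of"     "times," "it"     "was"
```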
Another consequence involves capitalization. We can see it by filtering the counts for both case variants of the token it:

```r
totc_df |>
  filter(str_detect(Token, regex("^it$", ignore_case = TRUE))) |>
  gt()
```

| Token | AF |
|---|---|
| it | 9 |
| It | 1 |
Case sensitive counts of the token it
Note that in doing so, we are counting capitalized and non-capitalized words as distinct tokens.
There may be specific instances when we want to do this. But normally, we’d want it and It to be the same token. To do that, we can add a step in the processing pipeline that converts our vector to lower case before tokenizing.
```r
lowercase <- tolower(totc_txt) |> # convert to lower case first
  str_split(" ") |>
  unlist() # flatten to a character vector, as before

totc_df <- table(lowercase) |> # make a table of counts
  as_tibble() |>
  rename(Token = lowercase, AF = n) |> # rename columns
  arrange(desc(AF)) # sort the data by frequency
```
```r
totc_df |>
  head(10) |>
  gt()
```

| Token | AF |
|---|---|
| it | 10 |
| of | 10 |
| the | 10 |
| was | 10 |
| age | 2 |
| epoch | 2 |
| season | 2 |
| times, | 2 |
| belief, | 1 |
| best | 1 |
Token counts of sample sentence.
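Note that tokens like times, and despair. still carry their punctuation. A minimal sketch of one further refinement, stripping punctuation before splitting; this step is our addition here, not something the lab requires:

```r
# Remove punctuation before tokenizing, so trailing commas and
# periods no longer produce distinct tokens.
clean_tkns <- totc_txt |>
  tolower() |>
  str_remove_all("[[:punct:]]") |>
  str_split(" ") |>
  unlist()

# The top counts now match the lowercased table above, minus the punctuation.
head(sort(table(clean_tkns), decreasing = TRUE), 10)
```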
Complete Tasks 1 and 2 in Lab Set 1.