totc_txt <- "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair."2 NLP Basics
2.1 A simple processing pipeline
Let’s begin by creating an object consisting of a character string. In this case, the first sentence from A Tale of Two Cities.
And we’ll load the tidyverse libraries.
library(gt)
library(tidyverse)We could then split the vector, say at each space.
```r
totc_tkns <- totc_txt |>
  str_split(" ") |>
  unlist() # str_split() returns a list, so flatten it to a character vector
```

Then, we can create a table of counts.
```r
totc_df <- table(totc_tkns) |> # make a table of counts
  as_tibble() |>
  rename(Token = totc_tkns, AF = n) |> # rename columns
  arrange(desc(AF)) # sort the data by frequency
```
```r
totc_df |>
  head(10) |>
  gt()
```

| Token | AF |
|---|---|
| of | 10 |
| the | 10 |
| was | 10 |
| it | 9 |
| age | 2 |
| epoch | 2 |
| season | 2 |
| times, | 2 |
| belief, | 1 |
| best | 1 |
The process of splitting the string vector into its constituent parts is called tokenizing. Think of this as telling the computer how to define a word (or a “token”, which is a more precise, technical term). In this case, we’ve done it in an extremely simple way: by defining a token as any string that is bounded by spaces.
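One consequence of that definition, visible in the counts above, is that punctuation travels with the adjacent word: times, (with the comma) is a token in its own right. A minimal sketch of this behavior, using a fragment of the same sentence:

```r
# Splitting on spaces alone keeps punctuation glued to its word:
# "times," (with the comma) comes through as a single token.
str_split("the best of times, it was", " ") |> unlist()
#> [1] "the"    "best"   "of"     "times," "it"     "was"
```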
Another consequence involves capitalization. We can see it by filtering the counts for both case variants of the token it:

```r
totc_df |>
  filter(str_detect(Token, regex("^it$", ignore_case = TRUE))) |>
  gt()
```

| Token | AF |
|---|---|
| it | 9 |
| It | 1 |
Case sensitive counts of the token it
Note that in doing so, we are counting capitalized and non-capitalized words as distinct tokens.
There may be specific instances when we want to do this. But normally, we’d want it and It to be the same token. To do that, we can add a step in the processing pipeline that converts our vector to lower case before tokenizing.
```r
lowercase <- tolower(totc_txt) |> # convert to lower case first
  str_split(" ") |>
  unlist() # flatten to a character vector, as before

totc_df <- table(lowercase) |> # make a table of counts
  as_tibble() |>
  rename(Token = lowercase, AF = n) |> # rename columns
  arrange(desc(AF)) # sort the data by frequency
```
```r
totc_df |>
  head(10) |>
  gt()
```

| Token | AF |
|---|---|
| it | 10 |
| of | 10 |
| the | 10 |
| was | 10 |
| age | 2 |
| epoch | 2 |
| season | 2 |
| times, | 2 |
| belief, | 1 |
| best | 1 |
Token counts of sample sentence.
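Note that tokens like times, and despair. still carry their punctuation. A minimal sketch of one further refinement, stripping punctuation before splitting; this step is our addition here, not something the lab requires:

```r
# Remove punctuation before tokenizing, so trailing commas and
# periods no longer produce distinct tokens.
clean_tkns <- totc_txt |>
  tolower() |>
  str_remove_all("[[:punct:]]") |>
  str_split(" ") |>
  unlist()

# The top counts now match the lowercased table above, minus the punctuation.
head(sort(table(clean_tkns), decreasing = TRUE), 10)
```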
Complete Tasks 1 and 2 in Lab Set 1.