2 NLP Basics
2.1 A simple processing pipeline
Let’s begin by creating an object consisting of a character string: in this case, the first sentence of A Tale of Two Cities.
totc_txt <- "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair."
And we’ll load the tidyverse libraries.
library(tidyverse)
We could then split the vector, say at each space.
totc_tkns <- totc_txt %>% str_split(" ")
Then, we can create a table of counts.
totc_df <- table(totc_tkns) %>% # make a table of counts
  as_tibble() %>%
  rename(Token = totc_tkns, AF = n) %>% # rename columns
  arrange(-AF) # sort the data by frequency
totc_df |> head(10)
Token | AF |
of | 10 |
the | 10 |
was | 10 |
it | 9 |
age | 2 |
epoch | 2 |
season | 2 |
times, | 2 |
belief, | 1 |
best | 1 |
The process of splitting the string vector into constituent parts is called tokenizing. Think of this as telling the computer how to define a word (or a “token”, which is a more precise, technical term). In this case, we’ve done it in an extremely simple way: by defining a token as any string that is bounded by spaces.
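One consequence of this definition is visible in the table above: tokens such as times, and belief, keep their commas, because punctuation is not a space. The following is a minimal base R sketch of the same whitespace split, written so that it runs without any packages. The names sample_txt and sample_tkns are illustrative and do not appear in the original pipeline.

```r
# Base R sketch of whitespace tokenization (no packages required).
# sample_txt and sample_tkns are illustrative names, not part of the pipeline above.
sample_txt <- "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair."

# Split at every single space; strsplit() returns a list, so take its first element
sample_tkns <- strsplit(sample_txt, " ", fixed = TRUE)[[1]]

# Total number of tokens under the "bounded by spaces" definition
length(sample_tkns)

# Distinct tokens that still carry a trailing punctuation mark
unique(sample_tkns[grepl("[[:punct:]]$", sample_tkns)])
```

Under this definition, times, (with its comma) is a token, while the bare string times never occurs in the vector at all.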
totc_df |> filter(str_detect(Token, regex("^it$", ignore_case = TRUE)))
Token | AF |
it | 9 |
It | 1 |
Case-sensitive counts of the token it
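The filter above anchors its pattern with ^ and $, so that only tokens consisting entirely of it (in either case) are matched. As a rough illustration of why the anchors matter, here is a base R sketch that compares the anchored pattern with an unanchored one. It uses grepl() with ignore.case = TRUE in place of the stringr call, and the names sample_txt, sample_tkns, anchored, and unanchored are illustrative.

```r
# Base R sketch: anchored vs. unanchored matching of "it".
# All object names here are illustrative, not part of the pipeline above.
sample_txt <- "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair."
sample_tkns <- strsplit(sample_txt, " ", fixed = TRUE)[[1]]

# Anchored: the whole token must be "it", ignoring case
anchored <- sample_tkns[grepl("^it$", sample_tkns, ignore.case = TRUE)]

# Unanchored: any token that contains "it" anywhere, ignoring case
unanchored <- sample_tkns[grepl("it", sample_tkns, ignore.case = TRUE)]

# Case-sensitive counts of the anchored matches
table(anchored)

# Tokens picked up only by the unanchored pattern
setdiff(unanchored, anchored)
```

Without the anchors, the pattern also matches incredulity, because that token contains the letters it.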
Note that by splitting on spaces alone, we are counting capitalized and non-capitalized words as distinct tokens.
There may be specific instances when we want to do this. But normally, we’d want it and It to be the same token. To do that, we can add a step in the processing pipeline that converts our vector to lower case before tokenizing.
totc_df <- tolower(totc_txt) %>%
  str_split(" ") %>%
  table() %>% # make a table of counts
  as_tibble() %>%
  rename(Token = ".", AF = n) %>% # rename columns
  arrange(-AF) # sort the data by frequency
totc_df |> head(10)
Token | AF |
it | 10 |
of | 10 |
the | 10 |
was | 10 |
age | 2 |
epoch | 2 |
season | 2 |
times, | 2 |
belief, | 1 |
best | 1 |
Token counts of sample sentence.
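As a quick sanity check on the lowercasing step, the sketch below compares the token vector before and after tolower() using base R only. It assumes the same sample sentence. The names sample_txt, raw_tkns, and low_tkns are illustrative.

```r
# Base R sketch: effect of lowercasing before tokenizing.
# sample_txt, raw_tkns and low_tkns are illustrative names.
sample_txt <- "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair."

raw_tkns <- strsplit(sample_txt, " ", fixed = TRUE)[[1]]
low_tkns <- strsplit(tolower(sample_txt), " ", fixed = TRUE)[[1]]

# Lowercasing does not change how many tokens there are...
length(raw_tkns) == length(low_tkns)

# ...but it reduces the number of distinct tokens, since "It" and "it" merge
length(unique(raw_tkns))
length(unique(low_tkns))

# Frequency of "it" after lowercasing
sum(low_tkns == "it")  # 10, matching the table above
```

The total token count is unchanged. Only the grouping of tokens into distinct entries differs.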
Complete Tasks 1 and 2 in Lab Set 1.