7  Part-of-Speech Tagging and Dependency Parsing

In the previous lab, we worked with keyness and effect sizes, specifically using log-likelihood and log ratio measures.

We are now going to add to our toolkit by using the same measures, but applied to data that has been tagged and parsed. To our processing pipeline, we will be adding udpipe: https://bnosac.github.io/udpipe/en/

7.1 What does udpipe do?

Before we start processing in R, let’s get some sense of what “universal dependency parsing” is and what its output looks like.

7.1.1 Parse a sample sentence online

Go to this webpage: http://lindat.mff.cuni.cz/services/udpipe/.

And paste the following sentence into the text field:

The company offers credit cards, loans and interest-generating accounts.

Then, click the “Process Input” button. You should now see the output. If you choose the “Table” tab, you can view the output in tabular format.

7.1.2 Basic parse structure

There is a column for the token and one for the token’s base form or lemma.

Those are followed by a tag for the general lexical class or “universal part-of-speech” (upos) tag, and a treebank-specific (xpos) part-of-speech tag.

The xpos tags are Penn Treebank tags, which you can find here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

The part-of-speech tags are followed by a column of integers that refer to the id of the token that is at the head of the dependency structure, which is followed by the dependency relation identifier.

For a list of all dependency abbreviations, see here: https://universaldependencies.org/u/dep/index.html.

7.1.3 Visualize the dependency

From the “Output Text” tab, copy the output starting with the sent_id line (including the pound sign).

Paste the information into the text field here: https://urd2.let.rug.nl/~kleiweg/conllu/. Then click the “Submit Query” button below the text field. This should generate a visualization of the dependency structure.

7.2 Load the needed packages

library(tidyverse)
library(quanteda)
library(quanteda.textstats)
library(udpipe)
library(gt)

Load the functions:

source("../R/keyness_functions.R")
source("../R/helper_functions.R")

Load the data:

load("../data/sample_corpus.rda")

7.3 Parsing

7.3.1 Preparing a corpus

When we parse texts using a model like the ones available in udpipe or spacy, we need to do very little to prepare the corpus. We could trim extra spaces and line breaks using str_squish() or remove URLs, but generally we want the text to be mostly “as is” so the model can do its job.
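If you do want some light cleanup first, a minimal sketch is below. It assumes the loaded corpus object is called sample_corpus with a text column, and the URL pattern is only an illustration:

# optional light cleanup: collapse extra whitespace/returns and strip URLs
# (the URL regex here is a simple illustration, not an exhaustive pattern)
corpus_clean <- sample_corpus %>%
  mutate(text = str_squish(text),
         text = str_remove_all(text, "https?://\\S+"))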

7.3.2 Download a model

You only need to run this line of code once. To run it, remove the pound sign, run the line, and then add the pound sign back once the model has been downloaded. Or you can run the next chunk, and the model will be downloaded automatically into your working directory.

# udpipe_download_model(language = "english")
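Another option is to download the model into a specific folder and then load it from the path that udpipe_download_model() returns. A sketch is below (the model_dir is just an example location):

# download the English model once into a folder of your choice,
# then load it from the path stored in the returned data frame
# m_eng <- udpipe_download_model(language = "english", model_dir = "../models")
# ud_model <- udpipe_load_model(m_eng$file_model)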

7.3.3 Annotate a sentence

txt <- "The company offers credit cards, loans and interest-generating accounts."
ud_model <- udpipe_load_model("../models/english-ewt-ud-2.5-191206.udpipe")
annotation <- udpipe(txt, ud_model)
annotation[,8:15] |>
  gt() |>
  as_raw_html()
Annotation of a sample sentence.
token_id token lemma upos xpos feats head_token_id dep_rel
1 The the DET DT Definite=Def|PronType=Art 2 det
2 company company NOUN NN Number=Sing 3 nsubj
3 offers offer VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root
4 credit credit NOUN NN Number=Sing 5 compound
5 cards card NOUN NNS Number=Plur 3 obj
6 , , PUNCT , NA 7 punct
7 loans loans NOUN NNS Number=Plur 5 conj
8 and and CCONJ CC NA 12 cc
9 interest interest NOUN NN Number=Sing 11 compound
10 - - PUNCT HYPH NA 11 punct
11 generating genera NOUN NN Number=Sing 12 compound
12 accounts account NOUN NNS Number=Plur 5 conj
13 . . PUNCT . NA 3 punct

7.3.4 Plot the annotation

We can also plot the dependency structure using igraph and ggraph:

library(igraph)
library(ggraph)

First we’ll create a plotting function.

plot_annotation <- function(x, size = 3){
  stopifnot(is.data.frame(x) & all(c("sentence_id", "token_id", "head_token_id", "dep_rel",
                                     "token", "lemma", "upos", "xpos", "feats") %in% colnames(x)))
  # keep only tokens with a head, and only the first sentence in the data frame
  x <- x[!is.na(x$head_token_id), ]
  x <- x[x$sentence_id %in% min(x$sentence_id), ]
  # build an edge list from each token to its head, dropping punctuation relations
  edges <- x[x$head_token_id != 0, c("token_id", "head_token_id", "dep_rel")]
  edges <- edges[edges$dep_rel != "punct", ]
  nodes <- x[, c("token_id", "token", "lemma", "upos", "xpos", "feats")]
  edges$label <- edges$dep_rel
  g <- graph_from_data_frame(edges,
                             vertices = nodes,
                             directed = TRUE)
  # lay tokens out on a line and draw labelled arcs for the dependency relations
  ggraph(g, layout = "linear") +
    geom_edge_arc(ggplot2::aes(label = dep_rel, vjust = -0.20), fold = TRUE, linemitre = 2,
                  arrow = grid::arrow(length = unit(3, 'mm'), ends = "last", type = "closed"),
                  end_cap = ggraph::label_rect("wordswordswords"),
                  label_colour = "red", check_overlap = TRUE, label_size = size) +
    geom_node_label(ggplot2::aes(label = token), col = "black", size = size, fontface = "bold") +
    geom_node_text(ggplot2::aes(label = xpos), nudge_y = -0.35, size = size) +
    theme_graph(base_family = "Arial Narrow")
}

And plot the annotation:

plot_annotation(annotation, size = 2.5)

Dependency structure of a sample parsed sentence.

7.4 Annotate a corpus

Parsing text is a computationally intensive process and can take time. So for the purposes of this lab, we’ll create a smaller sub-sample of the data. By adding a column called text_type, which contains information extracted from the file names, we can sample 5 texts from each text-type.

set.seed(123)
sub_corpus <- quanteda.extras::sample_corpus %>%
  mutate(text_type = str_extract(doc_id, "^[a-z]+")) %>%
  group_by(text_type) %>%
  sample_n(5) %>%
  ungroup() %>%
  dplyr::select(doc_id, text)
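As a quick, optional sanity check, we can confirm that we sampled 5 texts from each text-type:

# count the sampled documents by text-type (should be 5 per type)
sub_corpus %>%
  mutate(text_type = str_extract(doc_id, "^[a-z]+")) %>%
  count(text_type)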

7.4.1 Parallel processing

Parallel processing is a method whereby separate parts of an overall complex task are broken up and run simultaneously on multiple CPUs, thereby reducing the amount of time for processing. Part-of-speech tagging and dependency parsing are computationally intensive, so using parallel processing can save valuable time.

The udpipe() function has an argument for assigning cores: parallel.cores = 1L. It’s easy to set up, so feel free to use that option.
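You can check how many cores your machine has using base R’s parallel package. A minimal sketch of this single-function option is below; the core count is only an example, and the model is passed as a file path so that each core can load its own copy:

# how many cores are available on this machine?
parallel::detectCores()

# let udpipe() handle the parallelization internally
# (4 cores is an example value; sub_corpus has doc_id and text columns)
annotation <- udpipe(sub_corpus, "../models/english-ewt-ud-2.5-191206.udpipe",
                     parallel.cores = 4L)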

A second option requires more preparation but is even faster, so we’ll walk through how it works. First, we will split the corpus based on available cores.

# split the corpus into chunks of 10 texts each (the grouping vector is recycled across rows)
corpus_split <- split(sub_corpus, seq(1, nrow(sub_corpus), by = 10))

For parallel processing in R, we’ll use the package future.apply.

library(future.apply)

Next, we set up our parallel session by specifying the number of cores, and creating a simple annotation function.

ncores <- 4L
plan(multisession, workers = ncores)

annotate_splits <- function(corpus_text) {
  # load the model inside the function so that each worker has its own copy
  ud_model <- udpipe_load_model("../models/english-ewt-ud-2.5-191206.udpipe")
  x <- data.table::as.data.table(udpipe_annotate(ud_model, x = corpus_text$text,
                                                 doc_id = corpus_text$doc_id))
  return(x)
}

Finally, we annotate using future_lapply(). On my machine, this takes roughly 32 seconds.

annotation <- future_lapply(corpus_split, annotate_splits, future.seed = T)

As you might guess, the output is a list of data frames, so we’ll combine them using rbindlist().

annotation <- data.table::rbindlist(annotation)
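Since we’re done with the parallel workers, we can (optionally) return to sequential processing:

# shut down the background workers and return to sequential processing
plan(sequential)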

7.5 Process with quanteda

7.5.1 Format the data for quanteda

If we want to do any further processing in quanteda, we need to make a couple of adjustments to our data frame: select and rename the relevant columns, and assign the class spacyr_parsed so that as.tokens() treats the data frame like parsed output it already knows how to convert.

# keep the columns needed for conversion and rename upos/xpos to pos/tag
anno_edit <- annotation %>%
  dplyr::select(doc_id, sentence_id, token_id, token, lemma, upos, xpos, head_token_id, dep_rel) %>%
  rename(pos = upos, tag = xpos)

# assign the spacyr_parsed class so as.tokens() knows how to handle the data frame
anno_edit <- structure(anno_edit, class = c("spacyr_parsed", "data.frame"))

7.5.2 Convert to tokens

sub_tkns <- as.tokens(anno_edit, include_pos = "tag", concatenator = "_")
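To get a sense of what the resulting token_tag forms look like, we can peek at the first few tokens of the first document (an optional check):

# inspect the first few pos-tagged tokens of the first document
head(sub_tkns[[1]], 10)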

7.5.3 Create a dfm

We will also extract and assign the variable text_type to the tokens object.

doc_categories <- names(sub_tkns) %>%
  data.frame(text_type = .) %>%
  mutate(text_type = str_extract(text_type, "^[a-z]+"))

docvars(sub_tkns) <- doc_categories

sub_dfm <- dfm(sub_tkns)
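To confirm that the text_type variable was carried over to the dfm, we can inspect the document variables (an optional check):

# check that each document now carries its text_type
head(docvars(sub_dfm))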

And check the frequencies:

textstat_frequency(sub_dfm, n = 10) |>
  gt()
feature frequency rank docfreq group
._. 6452 1 40 all
,_, 5900 2 40 all
the_dt 5217 3 40 all
and_cc 2596 4 40 all
of_in 2513 5 40 all
a_dt 2256 6 40 all
to_to 1702 7 40 all
in_in 1645 8 40 all
i_prp 1497 9 36 all
you_prp 1202 10 36 all

7.5.4 Filter/select tokens

There are multiple ways to filter/select the tokens we want to count. We could, for example, just filter out all rows in the annotation data frame tagged as PUNCT, if we wanted to exclude punctuation from our counts.

I would, however, advise against altering the original parsed file. We may want to try different options, and we want to avoid having to re-parse our corpus, as that is the most computationally intensive step in the processing pipeline. In fact, if this were part of an actual project, I would advise that you save the parsed data frame as a .csv file using write_csv() for later use.
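A minimal sketch of saving the annotation for later sessions is below (the file path is just an example):

# save the parsed data frame so we never have to re-run the annotation step
# (the path and file name here are examples)
write_csv(annotation, "../data/sub_corpus_annotation.csv")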

So we will try an alternative. We use the tokens_select() function to either keep or remove tokens based on regular expressions.

sub_dfm <- sub_tkns %>%
  # keep only tokens that contain at least one letter or digit and whose tag
  # starts with a letter; this drops punctuation-only tokens such as ._. and ,_,
  tokens_select("^.*[a-zA-Z0-9]+.*_[a-z]", selection = "keep", valuetype = "regex", case_insensitive = T) %>%
  dfm()

And check the frequencies:

textstat_frequency(sub_dfm, n = 10) |>
  gt() |>
  as_raw_html()
Most frequent tokens tagged for part-of-speech in sub-sample of the corpus.
feature frequency rank docfreq group
the_dt 5217 1 40 all
and_cc 2596 2 40 all
of_in 2513 3 40 all
a_dt 2256 4 40 all
to_to 1702 5 40 all
in_in 1645 6 40 all
i_prp 1497 7 36 all
you_prp 1202 8 36 all
it_prp 1168 9 39 all
is_vbz 1042 10 40 all

If we want to compare one text-type (as our target corpus) to another (as our reference corpus), we can easily subset the data.

acad_dfm <- dfm_subset(sub_dfm, text_type == "acad") %>% dfm_trim(min_termfreq = 1)
fic_dfm <- dfm_subset(sub_dfm, text_type == "fic") %>% dfm_trim(min_termfreq = 1)

And finally, we can generate a keyness table:

acad_v_fic <- keyness_table(acad_dfm, fic_dfm) %>%
  separate(col = Token, into = c("Token", "Tag"), sep = "_")
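We can peek at the top rows of the table before filtering (an optional check):

# peek at the top of the keyness table (target = academic, reference = fiction)
head(acad_v_fic, 10)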

From that data, we can filter specific lexical classes, like modal verbs:

acad_v_fic %>% filter(Tag == "md") |>
  gt() |>
  fmt_number(columns = c('LL', 'LR', 'Per_10.4_Tar', 'Per_10.4_Ref'), decimals = 2) |>
  fmt_number(columns = c('DP_Tar', 'DP_Ref'), decimals = 3) |>
  fmt_number(columns = c('PV'), decimals = 5) |>
  as_raw_html()
A keyness comparison of modal verbs in a sub-sample of the academic vs. fiction text-types.
Token Tag LL LR PV AF_Tar AF_Ref Per_10.4_Tar Per_10.4_Ref DP_Tar DP_Ref
may md 3.99 1.43 0.04583 13 5 10.28 3.81 0.165 0.401
will md 3.66 1.14 0.05579 17 8 13.44 6.09 0.539 0.599
ill md 1.42 1.05 0.23275 1 0 0.79 0.00 0.797 NA
ought md 1.42 1.05 0.23275 1 0 0.79 0.00 0.800 NA
must md 0.13 0.32 0.71618 6 5 4.74 3.81 0.431 0.602
wo md 0.00 0.05 0.97895 1 1 0.79 0.76 0.797 0.800
ca md −0.17 −0.53 0.68389 2 3 1.58 2.28 0.797 0.599
should md −0.22 −0.36 0.64138 6 8 4.74 6.09 0.261 0.346
can md −3.34 −0.80 0.06761 16 29 12.65 22.08 0.348 0.224
might md −3.63 −1.95 0.05659 2 8 1.58 6.09 0.598 0.677
could md −6.15 −0.91 0.01316 22 43 17.39 32.74 0.259 0.241
would md −9.22 −1.31 0.00240 14 36 11.07 27.41 0.313 0.215
'll md −12.97 −3.75 0.00032 1 14 0.79 10.66 0.797 0.239
'd md −32.38 −5.53 0.00000 0 24 0.00 18.28 NA 0.234


7.5.5 Extract phrases

We can also extract phrases of specific types. To do so, we first use the function as_phrasemachine() to add a new column to our annotation called phrase_tag.

annotation$phrase_tag <- as_phrasemachine(annotation$upos, type = "upos")
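We can check how the simplified phrase tags line up with the universal part-of-speech tags (an optional peek):

# compare the upos tags with the simplified phrase tags used by the regex patterns
annotation %>%
  dplyr::select(token, upos, phrase_tag) %>%
  head(10)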

Next, we can use the function keywords_phrases() to extract phrase-types based on regular expressions. Refer to the documentation for suggested regex patterns: https://www.rdocumentation.org/packages/udpipe/versions/0.8.6/topics/keywords_phrases.

You can also read examples of use cases: https://bnosac.github.io/udpipe/docs/doc7.html.

First, we’ll subset our data into annotations by text-type.

acad_anno <- annotation %>% filter(str_detect(doc_id, "acad"))
fic_anno <- annotation %>% filter(str_detect(doc_id, "fic"))
acad_nps <- keywords_phrases(x = acad_anno$phrase_tag, term = tolower(acad_anno$token), 
                          pattern = "(A|N)*N(P+D*(A|N)*N)*", 
                          is_regex = TRUE, detailed = T)


fic_nps <- keywords_phrases(x = fic_anno$phrase_tag, term = tolower(fic_anno$token), 
                             pattern = "(A|N)*N(P+D*(A|N)*N)*", 
                             is_regex = TRUE, detailed = T)
acad_nps |>
  head(25) |>
  gt() |>
  as_raw_html()
Noun phrases extracted from a sub-sample of the corpus.
keyword ngram pattern start end
largest creatures 2 AN 2 3
creatures 1 N 3 3
earth 1 N 9 9
animals 1 N 11 11
apatosaurus 1 N 14 14
aka 1 N 16 16
aka brontosaurus 2 NN 16 17
brontosaurus 1 N 17 17
paleontologists 1 N 19 19
picture 1 N 23 23
they 1 N 26 26
english anatomist 2 AN 35 36
anatomist 1 N 36 36
paleontologist 1 N 39 39
paleontologist richard 2 NN 39 40
paleontologist richard owen 3 NNN 39 41
richard 1 N 40 40
richard owen 2 NN 40 41
owen 1 N 41 41
word 1 N 44 44
dinosaur 1 N 46 46
new category 2 AN 51 52
new category of reptiles 4 ANPN 51 54
category 1 N 52 52
category of reptiles 3 NPN 52 54

Note that although the function uses the term keywords, it is NOT executing a hypothesis test of any kind.

7.5.6 Extract only unique phrases

Note that udpipe extracts overlapping constituents of phrase structures. Normally, we would want only unique phrases. To find those, we’ll take advantage of the start and end indexes, using the between() function from the data.table package.

That will generate a logical vector, which we can use to keep only those phrases that are not contained within a longer phrase.

# index each candidate phrase
idx <- seq_len(nrow(acad_nps))

# keep a phrase only if its span falls within exactly one phrase: its own
# (i.e., it is not nested inside a longer phrase)
is_unique <- lapply(idx, function(i) sum(data.table::between(acad_nps$start[i], acad_nps$start, acad_nps$end) & data.table::between(acad_nps$end[i], acad_nps$start, acad_nps$end)) == 1) %>% unlist()

acad_nps <- acad_nps[is_unique, ]

# repeat the same filtering for the fiction noun phrases
idx <- seq_len(nrow(fic_nps))

is_unique <- lapply(idx, function(i) sum(data.table::between(fic_nps$start[i], fic_nps$start, fic_nps$end) & data.table::between(fic_nps$end[i], fic_nps$start, fic_nps$end)) == 1) %>% unlist()

fic_nps <- fic_nps[is_unique, ]

We can also add a rough accounting of the lengths of the noun phrases by counting the spaces and adding 1.

acad_nps <- acad_nps %>%
  mutate(phrase_length = str_count(keyword, " ") + 1)

fic_nps <- fic_nps %>%
  mutate(phrase_length = str_count(keyword, " ") + 1)
fic_nps |>
  head(10) |>
  gt() |>
  as_raw_html()
Unique noun phrases extracted from a sub-sample of the corpus.
keyword ngram pattern start end phrase_length
it 1 N 1 1 1
pleasant summer night 3 ANN 4 6 3
wind off the ocean 4 NPDN 10 13 4
trees along copley square 4 NPNN 16 19 4
i 1 N 21 21 1
boston public library 3 NNN 25 27 3
square 1 N 31 31 1
copley plaza bar 3 NNN 37 39 3
first time 2 AN 44 45 2
sammy 1 N 47 47 1
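With phrase lengths attached, we can make a quick (and optional) comparison of the average noun-phrase length in the two text-types:

# compare average noun-phrase length across the two sub-samples
mean(acad_nps$phrase_length)
mean(fic_nps$phrase_length)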
Pause for Lab Set Question