library(tidyverse)
library(quanteda)
library(rsample)
library(gt)
library(caret) # basic implementation
library(MLeval) # utilities for performance evaluation
15 Random Forest
We’re going to repeat the Federalist classification problem. Rather than a lasso regression model, however, this time we’ll be constructing a random forest classification model.
Broadly, random forests generate many classification trees. Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
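As a toy illustration of the voting idea (the vote counts below are made up, not output from any model):

# 500 hypothetical trees, each casting a vote for one of two authors
tree_votes <- c(rep("Hamilton", 210), rep("Madison", 290))

table(tree_votes)                     # tally the votes across the forest
names(which.max(table(tree_votes)))   # the forest's prediction: the majority class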
If you’re unfamiliar with random forests, you can (and should) read more here:
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
16 Variables
As with the Lab on lasso regression, we’ll start with Mosteller & Wallace’s lists of tokens.
Their first group contains 70 tokens…
<- c("a", "all", "also", "an", "and", "any", "are", "as", "at", "be", "been", "but", "by", "can", "do", "down", "even", "every", "for", "from", "had", "has", "have", "her", "his", "if", "in", "into", "is", "it", "its", "may", "more", "must", "my", "no", "not", "now", "of", "on", "one", "only", "or", "our", "shall", "should", "so", "some", "such", "than", "that", "the", "their", "then", "there", "things", "this", "to", "up", "upon", "was", "were", "what", "when", "which", "who", "will", "with", "would", "your") mw_group1
Their second adds an additional 47…
<- c("affect", "again", "although", "among", "another", "because", "between", "both", "city", "commonly", "consequently", "considerable", "contribute", "defensive", "destruction", "did", "direction", "disgracing", "either", "enough", "fortune", "function", "himself", "innovation", "join", "language", "most", "nor", "offensive", "often", "pass", "perhaps", "rapid", "same", "second", "still", "those", "throughout", "under", "vigor", "violate", "violence", "voice", "where", "whether", "while", "whilst") mw_group2
And their third adds another 48 (though they identify some by lemmas, and one of them, “expence”, doesn’t appear in our data, possibly because of later editing done in our particular edition)…
<- c("about", "according", "adversaries", "after", "aid", "always", "apt", "asserted", "before", "being", "better", "care", "choice", "common", "danger", "decide", "decides", "decided", "deciding", "degree", "during", "expense", "expenses", "extent", "follow", "follows", "followed", "following", "i", "imagine", "imagined", "intrust", "intrusted", "intrusting","kind", "large", "likely", "matter", "matters", "moreover", "necessary", "necessity", "necessities", "others", "particularly", "principle", "probability", "proper", "propriety", "provision", "provisions", "requisite", "substance", "they", "though", "truth", "truths", "us", "usage", "usages", "we", "work", "works") mw_group3
Altogether, they list 165 candidate variables, though that works out to 180 unlemmatized tokens as potential variables for their model.
We’ll concatenate a vector of all their variables into a single vector.
mw_all <- sort(c(mw_group1, mw_group2, mw_group3))
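As a quick sanity check, the combined vector should hold the 180 unlemmatized tokens noted above (the same count the model later reports as its number of predictors):

length(mw_all)  # should return 180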
17 The Federalist Papers
And we’ll start by setting up our data much like we did for lasso. First, we’ll get the metadata…
load("../data/federalist_meta.rda")
load("../data/federalist_papers.rda")
fed_meta <- federalist_meta %>%
  dplyr::select(doc_id, author_id)
18 Preparing the Data
Now, we’ll tokenize the data.
fed_tokens <- federalist_papers %>%
  corpus() %>%
  tokens(remove_punct = T, remove_symbols = T, what = "word")
And create a weighted dfm. The convert() step turns the dfm into a data frame, the mutate() strips the “.txt” extension from doc_id so it can be joined with our metadata, and the select() orders the tokens by their mean frequencies. That ordering isn’t necessary here, but it can be useful when doing quick subsetting of variables.
fed_dfm <- fed_tokens %>%
  dfm() %>%
  dfm_weight(scheme = "prop") %>%
  convert(to = "data.frame") %>%
  mutate(doc_id = str_remove(doc_id, ".txt$")) %>%
  select(doc_id, names(sort(colMeans(.[,-1]), decreasing = TRUE)))
Now let’s join the author_id from the metadata.
fed_dfm <- fed_dfm %>%
  right_join(fed_meta) %>%
  dplyr::select(doc_id, author_id, everything())
19 Training and testing data
Now we can subset out our training and testing data.
train_dfm <- fed_dfm %>%
  filter(author_id == "Hamilton" | author_id == "Madison") %>%
  dplyr::select(doc_id, author_id, all_of(mw_all)) %>%
  mutate(author_id = factor(author_id)) %>%
  column_to_rownames("doc_id")
# Note that some R functions have difficulty when column names match reserved words or base function names like 'in' or 'if'.
# To head off any problems, we'll convert the first letter to upper case.
colnames(train_dfm)[-1] <- colnames(train_dfm)[-1] %>% str_to_title()
test_dfm <- fed_dfm %>%
  filter(author_id == "Disputed") %>%
  dplyr::select(doc_id, author_id, all_of(mw_all)) %>%
  column_to_rownames("doc_id")
colnames(test_dfm)[-1] <- colnames(test_dfm)[-1] %>% str_to_title()
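A quick check on the split: we expect the 65 Hamilton and Madison papers for training and the 12 disputed papers we want to classify.

nrow(train_dfm)  # 65 training documents (Hamilton and Madison)
nrow(test_dfm)   # 12 disputed documents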
20 Create a model
Now let’s generate a random forest model. First, note that when the training set for the current tree is drawn by sampling with replacement, about one-third of the cases are left out of the sample. This out-of-bag (OOB) data is used to get a running unbiased estimate of the classification error as trees are added to the forest. It is also used to get estimates of variable importance.
Thus, we don’t necessarily need to sample out a validation set of our data, though you certainly can.
We’ll set our seed. And generate a model.
set.seed(123)
fed_m1 <- train(author_id ~ ., data = train_dfm,
                method = "rf", trControl = trainControl(savePredictions = "final"))
fed_m1
Random Forest
65 samples
180 predictors
2 classes: 'Hamilton', 'Madison'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 65, 65, 65, 65, 65, 65, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.7749857 0.0000000
91 0.9551764 0.8644520
180 0.9502448 0.8474114
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 91.
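Because caret keeps the fitted randomForest object in the finalModel slot, we can also peek at the OOB error estimate described above. This is just a look at the underlying object, not a required step:

fed_m1$finalModel   # printing the randomForest fit reports the OOB estimate of error rate

# OOB error as trees are added: the first column is the overall OOB error,
# the remaining columns are the per-class error rates
head(fed_m1$finalModel$err.rate)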
Let’s set aside the error rate for a moment and walk through the prediction. Now we use the model on our test data.
pred_fed <- predict(fed_m1, test_dfm)
predict(fed_m1, test_dfm, type = "prob") |>
  rownames_to_column("Disputed") |>
  gt()
Disputed | Hamilton | Madison |
---|---|---|
FEDERALIST_49 | 0.108 | 0.892 |
FEDERALIST_50 | 0.368 | 0.632 |
FEDERALIST_51 | 0.164 | 0.836 |
FEDERALIST_52 | 0.178 | 0.822 |
FEDERALIST_53 | 0.158 | 0.842 |
FEDERALIST_54 | 0.350 | 0.650 |
FEDERALIST_55 | 0.404 | 0.596 |
FEDERALIST_56 | 0.202 | 0.798 |
FEDERALIST_57 | 0.176 | 0.824 |
FEDERALIST_58 | 0.090 | 0.910 |
FEDERALIST_62 | 0.086 | 0.914 |
FEDERALIST_63 | 0.236 | 0.764 |
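The hard class labels we stored in pred_fed follow directly from these probabilities; a minimal sketch for viewing them alongside the document ids:

data.frame(doc_id = rownames(test_dfm), prediction = pred_fed) |>
  gt()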
This is close to what we predicted with lasso regression at the beginning of the semester, with the exception of Federalist 55, which is predicted here as Madison rather than Hamilton! Now let’s look quickly at variable importance.
Every node in a decision tree is a condition on a single feature, designed to split the dataset in two so that similar response values end up in the same set. The measure by which the (locally) optimal condition is chosen is called impurity; for classification, it is typically either Gini impurity or information gain/entropy. Thus, when training a tree, we can compute how much each feature decreases the weighted impurity.
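As a toy illustration of that calculation (the node counts below are invented, not taken from our model): Gini impurity is one minus the sum of squared class proportions, and a split’s contribution is the parent’s impurity minus the weighted impurity of its children.

# Gini impurity of a set of class labels
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# A hypothetical node of 65 documents split on some feature
parent <- c(rep("Hamilton", 40), rep("Madison", 25))
left   <- c(rep("Hamilton", 38), rep("Madison", 5))   # documents where the condition holds
right  <- c(rep("Hamilton", 2),  rep("Madison", 20))  # documents where it doesn't

# Weighted decrease in impurity produced by this split
gini(parent) -
  (length(left) / length(parent)) * gini(left) -
  (length(right) / length(parent)) * gini(right)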
For a forest, the impurity decrease from each feature can be averaged, and the features ranked according to this measure. These are the features that are important to our particular model, and we can retrieve them from the model using varImp().
m1_imp <- varImp(fed_m1, scale = TRUE)
We can put this result in tabular form.
m1_imp$importance |>
  rownames_to_column("Feature") |>
  arrange(-Overall) |>
  head(10) |>
  gt()
Feature | Overall |
---|---|
Upon | 100.000000 |
On | 21.483177 |
There | 19.889467 |
To | 12.564302 |
Whilst | 8.411621 |
By | 4.370450 |
Consequently | 3.417843 |
This | 2.212836 |
Would | 1.715947 |
Both | 1.667829 |
Or we can create a plot.
plot(m1_imp, top = 10)
21 Setting the control hyperparameters
The caret package allows you to set control parameters for the type of cross-validation to use, the number of folds to use, etc.
ctrl_1 <- trainControl(method = "cv",
                       classProbs = T,
                       savePredictions = T)

ctrl_2 <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 3,
                       savePredictions = TRUE,
                       classProbs = TRUE,
                       search = "grid")
fed_m2 <- train(author_id ~ ., data = train_dfm,
                method = "rf", preProc = c("center", "scale"),
                trControl = ctrl_1, metric = "ROC")

fed_m3 <- train(author_id ~ ., data = train_dfm,
                method = "rf", preProc = c("center", "scale"),
                trControl = ctrl_2, metric = "ROC")
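The two tables below give the class probabilities the new models assign to the disputed papers, first for fed_m2 (10-fold CV) and then for fed_m3 (repeated 10-fold CV). The chunk that produced them is folded in the rendered page, but it presumably mirrors the one we used for fed_m1:

predict(fed_m2, test_dfm, type = "prob") |>
  rownames_to_column("Disputed") |>
  gt()

predict(fed_m3, test_dfm, type = "prob") |>
  rownames_to_column("Disputed") |>
  gt()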
Disputed | Hamilton | Madison |
---|---|---|
FEDERALIST_49 | 0.088 | 0.912 |
FEDERALIST_50 | 0.390 | 0.610 |
FEDERALIST_51 | 0.118 | 0.882 |
FEDERALIST_52 | 0.156 | 0.844 |
FEDERALIST_53 | 0.142 | 0.858 |
FEDERALIST_54 | 0.356 | 0.644 |
FEDERALIST_55 | 0.328 | 0.672 |
FEDERALIST_56 | 0.136 | 0.864 |
FEDERALIST_57 | 0.132 | 0.868 |
FEDERALIST_58 | 0.082 | 0.918 |
FEDERALIST_62 | 0.072 | 0.928 |
FEDERALIST_63 | 0.172 | 0.828 |
Disputed | Hamilton | Madison |
---|---|---|
FEDERALIST_49 | 0.102 | 0.898 |
FEDERALIST_50 | 0.382 | 0.618 |
FEDERALIST_51 | 0.166 | 0.834 |
FEDERALIST_52 | 0.190 | 0.810 |
FEDERALIST_53 | 0.152 | 0.848 |
FEDERALIST_54 | 0.366 | 0.634 |
FEDERALIST_55 | 0.402 | 0.598 |
FEDERALIST_56 | 0.170 | 0.830 |
FEDERALIST_57 | 0.166 | 0.834 |
FEDERALIST_58 | 0.120 | 0.880 |
FEDERALIST_62 | 0.100 | 0.900 |
FEDERALIST_63 | 0.212 | 0.788 |
For a more complete evaluation of the models, we can use evalm() from MLeval:
m2_eval <- evalm(fed_m2, silent = TRUE, showplots = FALSE)
m3_eval <- evalm(fed_m3, silent = TRUE, showplots = FALSE)
m2_eval$optres$`Group 1` |>
  data.frame() |>
  rownames_to_column("Statistic") |>
  gt()
Statistic | Score | CI |
---|---|---|
SENS | 0.929 | 0.69-0.99 |
SPEC | 1.000 | 0.93-1 |
MCC | 0.954 | NA |
Informedness | 0.929 | NA |
PREC | 1.000 | 0.77-1 |
NPV | 0.981 | 0.9-1 |
FPR | 0.000 | NA |
F1 | 0.963 | NA |
TP | 13.000 | NA |
FP | 0.000 | NA |
TN | 51.000 | NA |
FN | 1.000 | NA |
AUC-ROC | 0.990 | 0.95-1.03 |
AUC-PR | 0.910 | NA |
AUC-PRG | 0.860 | NA |
m2_eval$roc
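The same summary can be pulled for the repeated-CV model by mirroring the code above (a sketch; only the fed_m2 results are shown here):

m3_eval$optres$`Group 1` |>
  data.frame() |>
  rownames_to_column("Statistic") |>
  gt()

m3_eval$roc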
Okay, we’ve repeated this task a couple of times now. Both times, we’ve taken advantage of Mosteller & Wallace’s filtering of variables, which has made life much easier for us. But what if we didn’t have their candidate words to feed into our models? What if we were starting from the dfm with 8765 words? How would you tackle variable selection, knowing that random forests don’t do well with highly zero-skewed variables? Just feeding the entire dfm into a random forest isn’t going to be the best solution.
On the one hand, more common words are certainly good candidates for a model. On the other, you don’t want to throw out less frequent words that might be highly discriminatory….
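One possible first pass, sketched here with quanteda’s dfm_trim() (the 25% cutoff is arbitrary, and this is only one of many reasonable strategies): drop tokens that occur in very few documents, which removes the most zero-skewed variables while keeping relatively infrequent but widely distributed words in play.

# Keep only tokens that appear in at least 25% of the documents (an arbitrary cutoff worth experimenting with)
fed_dfm_trimmed <- fed_tokens %>%
  dfm() %>%
  dfm_trim(min_docfreq = 0.25, docfreq_type = "prop") %>%
  dfm_weight(scheme = "prop")

nfeat(fed_dfm_trimmed)  # how many candidate variables survive the trim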