# A couple of packages for data wrangling...
library(tidyverse)
library(data.table)
library(gt)
# Our plots will be based on the results of knn...
library(kknn) # for knn
# We have some relatively intensive processing tasks that can be sped up...
library(foreach) # for iterating over elements in a collection
library(doParallel) # for parallel processing
# And we'll be working with spatial data, so we need some specialized packages...
library(sf) # for spatial data
library(rnaturalearth) # map data for plotting
library(housingData) # geolocation data
require(rgeos) # legacy geometry operations
14 Geospatial Mapping: POP? SODA? or COKE?
In 2013, a statistics (!) grad student at NC State named Joshua Katz put up a series of maps on an RStudio server that plotted the geographic distributions of lexical variants across the US. These turned out to be wildly popular, and he published versions at Business Insider:
https://www.businessinsider.com/american-english-dialects-maps-2018-1
And at the New York Times, where he now works:
https://www.nytimes.com/interactive/2014/upshot/dialect-quiz-map.html
His maps are based on data collected by Bert Vaux and Scott Golder for the Harvard Dialect Survey in 2003:
http://www4.uwm.edu/FLL/linguistics/dialect/index.html
Katz’s work is an interesting application of KNN to geospatial data. There is a nice introduction to the KNN algorithm here:
https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/
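If KNN is new to you, here is a minimal, self-contained sketch of a kknn classifier on R's built-in iris data. The object names are ours and this toy example is separate from Katz's workflow; it just shows the train/test/predict pattern we'll use below.

# kknn is loaded in the setup chunk above.
# Split iris into a training and a test set.
set.seed(42)
idx <- sample(nrow(iris), 100)
iris_train <- iris[idx, ]
iris_test  <- iris[-idx, ]

# Classify each test flower by its 5 nearest neighbors in the training set.
iris_knn <- kknn::kknn(Species ~ ., train = iris_train, test = iris_test, k = 5)

# Proportion of test flowers classified correctly.
mean(fitted(iris_knn) == iris_test$Species)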
We’re going to re-create Katz’s pop vs. soda plot. This exercise requires a number of packages.
14.1 Prepare the Data
14.1.1 Load the pop vs. soda data
Read in the data.
pvs_df <- read_csv("../data/pop_vs_soda.csv", show_col_types = FALSE)
Note how the data is structured.
State | County_Name | FIPS_State | FIPS_County | FIPS_combo | id | ID1_GADM | NAME_ST | ID2_GADM | NAME_COU | SUMCOUNT | SUMPOP | SUMSODA | SUMCOKE | SUMOTHER | COUNT | PCTPOP | PCTSODA | PCTCOKE | PCTOTHER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alabama | Autauga | 01 | 001 | 01001 | 1001 | 1 | Alabama | 1 | Autauga | 19 | 0 | 5 | 13 | 1 | 4 | 0.00000 | 0.26316 | 0.68421 | 0.05263 |
Alabama | Baldwin | 01 | 003 | 01003 | 1003 | 1 | Alabama | 2 | Baldwin | 78 | 1 | 5 | 70 | 2 | 12 | 0.01282 | 0.06410 | 0.89744 | 0.02564 |
Alabama | Barbour | 01 | 005 | 01005 | 1005 | 1 | Alabama | 3 | Barbour | 18 | 0 | 3 | 14 | 1 | 3 | 0.00000 | 0.16667 | 0.77778 | 0.05556 |
Alabama | Bibb | 01 | 007 | 01007 | 1007 | 1 | Alabama | 4 | Bibb | 9 | 0 | 1 | 8 | 0 | 4 | 0.00000 | 0.11111 | 0.88889 | 0.00000 |
Alabama | Blount | 01 | 009 | 01009 | 1009 | 1 | Alabama | 5 | Blount | 16 | 0 | 2 | 13 | 1 | 3 | 0.00000 | 0.12500 | 0.81250 | 0.06250 |
Alabama | Bullock | 01 | 011 | 01011 | 1011 | 1 | Alabama | 6 | Bullock | 10 | 0 | 9 | 1 | 0 | 1 | 0.00000 | 0.90000 | 0.10000 | 0.00000 |
Alabama | Butler | 01 | 013 | 01013 | 1013 | 1 | Alabama | 7 | Butler | 9 | 0 | 2 | 7 | 0 | 1 | 0.00000 | 0.22222 | 0.77778 | 0.00000 |
Alabama | Calhoun | 01 | 015 | 01015 | 1015 | 1 | Alabama | 8 | Calhoun | 85 | 1 | 18 | 62 | 4 | 12 | 0.01176 | 0.21176 | 0.72941 | 0.04706 |
Alabama | Chambers | 01 | 017 | 01017 | 1017 | 1 | Alabama | 9 | Chambers | 11 | 0 | 2 | 9 | 0 | 4 | 0.00000 | 0.18182 | 0.81818 | 0.00000 |
Alabama | Cherokee | 01 | 019 | 01019 | 1019 | 1 | Alabama | 10 | Cherokee | 9 | 0 | 0 | 9 | 0 | 3 | 0.00000 | 0.00000 | 1.00000 | 0.00000 |
14.1.2 Convert FIPS codes to longitude and latitude
We have counts by state and county. Importantly, those counts are identified by Federal Information Processing Standards (FIPS) codes. To convert that information to something mappable, we need GIS data for those FIPS codes. For that we can use data from the housingData package.
hd <- housingData::geoCounty
hd$fips <- as.character(hd$fips)
Now let’s select the columns we need from our data, and put that data in a long format.
pvs_df <- pvs_df %>% select(fips = FIPS_combo, POP = SUMPOP, SODA = SUMSODA,
                            COKE = SUMCOKE, OTHER = SUMOTHER)

pvs_df <- pvs_df %>%
  pivot_longer(!fips, names_to = "var", values_to = "freq") %>%
  filter(freq != 0)
pvs_df |>
  head(10) |>
  gt()
fips | var | freq |
---|---|---|
01001 | SODA | 5 |
01001 | COKE | 13 |
01001 | OTHER | 1 |
01003 | POP | 1 |
01003 | SODA | 5 |
01003 | COKE | 70 |
01003 | OTHER | 2 |
01005 | SODA | 3 |
01005 | COKE | 14 |
01005 | OTHER | 1 |
For this task, we need to convert counts into instances. For example, if there are 3 soda users in a county, we want soda, soda, soda in the variable column. For that, we can use the data.table package.
setDT(pvs_df)
pvs_df <- pvs_df[rep(seq(.N), freq), !"freq"]
pvs_df |>
  head(10) |>
  gt()
fips | var |
---|---|
01001 | SODA |
01001 | SODA |
01001 | SODA |
01001 | SODA |
01001 | SODA |
01001 | COKE |
01001 | COKE |
01001 | COKE |
01001 | COKE |
01001 | COKE |
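As an aside, if you'd rather stay in the tidyverse for this step, tidyr::uncount() performs the same count-to-instances expansion. A sketch, where pvs_counts is a hypothetical copy of the long-format data as it stood before the data.table step:

# Hypothetical tidyverse equivalent of the expansion above: repeat each row
# `freq` times; uncount() drops the `freq` column by default.
pvs_instances <- pvs_counts %>%
  tidyr::uncount(freq)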
Now let’s join this with the GIS data to add longitudes and latitudes, and drop any rows without matching coordinates.
pvs_df <- left_join(pvs_df, select(hd, fips, lon, lat), by = "fips")
pvs_df <- pvs_df[!is.na(pvs_df$lon), ]
pvs_df |>
  head(10) |>
  gt()
fips | var | lon | lat |
---|---|---|---|
01001 | SODA | -86.64565 | 32.54009 |
01001 | SODA | -86.64565 | 32.54009 |
01001 | SODA | -86.64565 | 32.54009 |
01001 | SODA | -86.64565 | 32.54009 |
01001 | SODA | -86.64565 | 32.54009 |
01001 | COKE | -86.64565 | 32.54009 |
01001 | COKE | -86.64565 | 32.54009 |
01001 | COKE | -86.64565 | 32.54009 |
01001 | COKE | -86.64565 | 32.54009 |
01001 | COKE | -86.64565 | 32.54009 |
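Before preparing the map, a quick optional sanity check (not part of the original workflow) on how many instances we now have for each variant:

# Count the individual responses per variant after the join.
pvs_df |>
  dplyr::count(var, sort = TRUE)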
14.1.3 Prepare a map
Now let’s get the information we need for our map. We’ll start with a map of the US states.
states_us <- rnaturalearthdata::states50
This layer covers several countries, so we’ll filter on the iso_a2 column to keep only US states. We’ll also drop Alaska and Hawaii, so that we only plot the continental US.
states_us <- states_us[states_us$iso_a2 == 'US', ]
states_us <- states_us[!grepl("Alaska|Hawaii", states_us$name), ]
Finally, we want to convert this into an sf (simple features) object for later plotting. For more about simple features, see here:
https://r-spatial.github.io/sf/articles/sf1.html
states_us <- st_as_sf(states_us)
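If states50 isn’t available in your installation, the same boundaries can usually be pulled with rnaturalearth::ne_states() instead; this is a sketch (it may prompt you to install the rnaturalearthhires package, and it isn’t used in the rest of the chapter):

# Alternative: download US state boundaries directly as an sf object.
states_alt <- rnaturalearth::ne_states(country = "united states of america",
                                       returnclass = "sf")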
14.1.4 Transform the data for plotting
Now we’ll similarly convert the coordinates in our pop vs. soda data. Note that we point the st_as_sf() function to our “lon” and “lat” columns. Additionally, we need to specify the coordinate reference system (crs). This can be tricky… See here for a discussion:
https://ryanpeek.github.io/2017-08-03-converting-XY-data-with-sf-package/
We’re specifying the World Geodetic System 1984, or WGS84.
point_data <- pvs_df %>% st_as_sf(coords = c("lon", "lat"), crs = "+proj=longlat +ellps=WGS84")
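It’s worth a quick check that sf has registered these coordinates as geographic (longitude/latitude) before we re-project:

# Inspect the CRS we just attached; st_is_longlat() should return TRUE here.
st_crs(point_data)
st_is_longlat(point_data)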
Finally, we need to specify a projection. Remember we are “projecting” geographical space into 2 dimensions. So we have to decide what convention to use. We’ll use the North America Lambert Conformal Conic projection:
https://en.wikipedia.org/wiki/Lambert_conformal_conic_projection
nalcc <- "+proj=lcc +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"
Now, we can transform our US data to fit this projection. And likewise our pop vs. soda data.
us <- states_us %>% st_transform(nalcc)
point_data <- point_data %>% st_transform(nalcc)
Next, we’ll make a data frame with three columns (the dialect variant plus the projected longitude and latitude) and convert the dialect variable to a factor.
dialects_train <- data.frame(dialect = point_data$var,
                             lon = st_coordinates(point_data)[, 1],
                             lat = st_coordinates(point_data)[, 2]) %>%
  mutate(dialect = as.factor(dialect))
dialects_train |>
  head(10) |>
  gt()
dialect | lon | lat |
---|---|---|
SODA | 832585.8 | -736015.3 |
SODA | 832585.8 | -736015.3 |
SODA | 832585.8 | -736015.3 |
SODA | 832585.8 | -736015.3 |
SODA | 832585.8 | -736015.3 |
COKE | 832585.8 | -736015.3 |
COKE | 832585.8 | -736015.3 |
COKE | 832585.8 | -736015.3 |
COKE | 832585.8 | -736015.3 |
COKE | 832585.8 | -736015.3 |
To get a sense of our data, we can take a random sample from our training data and plot it.
dialects_sample <- dialects_train %>% sample_n(1000)

ggplot(data = dialects_sample) +
  geom_sf(data = us) +
  geom_point(aes(x = lon, y = lat, color = dialect), alpha = 1) +
  scale_color_brewer(palette = "Dark2", name = "Dialect")
14.2 KNN classification
14.2.1 Case study: Allegheny and Philadelphia counties
The implementation of the KNN classifier is a little different from earlier classification problems, so let’s walk through the basic idea with a quick example. First, let’s subset out the data for two counties, Allegheny and Philadelphia (identified by their FIPS codes).
Then, we’ll construct a simple data frame with 3 columns.
dialects_penn <- point_data[grepl("42003|42101", point_data$fips), ]

dialects_penn <- data.frame(dialect = dialects_penn$var,
                            lon = st_coordinates(dialects_penn)[, 1],
                            lat = st_coordinates(dialects_penn)[, 2]) %>%
  mutate(dialect = as.factor(dialect))
From that, we can split the data into a training and test set.
set.seed(123)
valid_split <- rsample::initial_split(dialects_penn, .75)

penn_train <- rsample::analysis(valid_split)
penn_test <- rsample::assessment(valid_split)
For more conventional classification tasks, we would want to determine the optimal k for our model using the train.kknn() function:
https://rpubs.com/bonibruno/svm-knn
For our purposes, we’ll use a quick rule of thumb: the square root of the number of observations, n. Now we can fit a model in which we try to predict the dialect variant (soda, pop, coke, other) from geolocation.
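As a sketch, here is what that rule of thumb looks like for our training set; the odd-number nudge is our own convention to reduce the chance of ties, not something from the original. Compare the result to the k = 61 used below.

# Rule-of-thumb k: square root of the number of training observations,
# nudged to an odd number to reduce the chance of ties (our own convention).
k_rot <- floor(sqrt(nrow(penn_train)))
if (k_rot %% 2 == 0) k_rot <- k_rot + 1
k_rot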
knn_penn <- kknn::kknn(dialect ~ .,
                       train = penn_train,
                       test = penn_test,
                       kernel = "gaussian",
                       k = 61)
For conventional classification tasks, we would be interested in our confusion matrix and our model’s accuracy.
tab <- table(knn_penn$fitted.values, penn_test$dialect)

accuracy <- function(x) { sum(diag(x) / (sum(rowSums(x)))) * 100 }

accuracy(tab) |>
  data.frame() |>
  gt()
accuracy.tab. |
---|
88.16327 |
For our purposes, we’re interested in the probabilities of each variant being preferred based on location. Allegheny County is at the top of our table, with a very strong preference for pop.
knn_penn$prob |>
  head(1) |>
  data.frame() |>
  gt()
COKE | OTHER | POP | SODA |
---|---|---|---|
0 | 0 | 0.9672131 | 0.03278689 |
And Philadelphia County is at the bottom of our table, with a strong preference for soda.
knn_penn$prob |>
  tail(1) |>
  data.frame() |>
  gt()
COKE | OTHER | POP | SODA |
---|---|---|---|
0 | 0.03278689 | 0.04918033 | 0.9180328 |
14.2.2 Spatial interpolation
Those probabilities can be used to select both color (for the most probable variant) and the alpha of the color (based on the value of the probability).
Sounds easy enough, but we don’t have values for the entire map. We need to use our known values to estimate values at other, unknown points. This is a process called spatial interpolation:
https://docs.qgis.org/2.18/en/docs/gentle_gis_introduction/spatial_analysis_interpolation.html
What follows is wholly indebted to Timo Grossenbacher:
https://timogrossenbacher.ch/2018/03/categorical-spatial-interpolation-with-r/
Essentially, we need to set up an empty grid and then interpolate values for that grid from our point data. Another complication is that this process is quite memory- and compute-intensive, so we’re ultimately going to split that grid up into batches.
We’ll begin setting up our grid by specifying its width in pixels.
width_in_pixels <- 300
The width of a grid cell is calculated from the dimensions of our US map and is defined as dx.
dx <- ceiling((st_bbox(us)["xmax"] - st_bbox(us)["xmin"]) / width_in_pixels)
The height of a grid cell is the same because we’ll use square grid cells (dx == dy).
dy <- dx
Calculate the height in pixels of the resulting grid.
height_in_pixels <- floor((st_bbox(us)["ymax"] - st_bbox(us)["ymin"]) / dy)
And finally construct the grid with the st_make_grid() function.
grid <- st_make_grid(us, cellsize = dx, n = c(width_in_pixels, height_in_pixels), what = "centers")
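A quick optional check that the grid has roughly the expected number of cells and inherits the projection of our US map:

# Number of grid-cell centers (roughly width_in_pixels * height_in_pixels).
length(grid)
# The grid is built from `us`, so the two should share a CRS.
st_crs(grid) == st_crs(us)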
Next, we’ll set our k at 1000. This will result in a smoother-looking map.
k <- 1000
Now, we can define a compute_grid() function that returns, for each grid cell, the most likely variant and its probability.
compute_grid <- function(grid, dialects_train, knn) {
  # create empty result data frame
  dialects_result <- data.frame(dialect = as.factor(NA),
                                lon = st_coordinates(grid)[, 1],
                                lat = st_coordinates(grid)[, 2])
  # run KKNN
  dialects_kknn <- kknn::kknn(dialect ~ .,
                              train = dialects_train,
                              test = dialects_result,
                              kernel = "gaussian",
                              k = knn)
  # bring back to result data frame
  # only retain the probability of the dominant dialect at that grid cell
  dialects_result <- dialects_result %>%
    # extract the interpolated dialect at each grid cell with the
    # kknn::fitted() function
    mutate(dialect = fitted(dialects_kknn),
           # only retain the probability of the interpolated dialect,
           # discard the others
           prob = apply(dialects_kknn$prob, 1, function(x) max(x)))
  return(dialects_result)
}
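Before launching the full parallel run, it can help to smoke-test the function on a small slice of the grid (an optional check, not part of the original workflow). This may still take a little while, since every grid cell is compared against the full training set.

# Interpolate just the first 100 grid cells as a quick smoke test.
test_chunk <- compute_grid(grid[1:100], dialects_train, k)
head(test_chunk)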
We can set the parameters for parallel and batch processing. Adjust the number of cores and the batch size to match your machine.
registerDoParallel(cores = 4) # if you have more processing power you can up this number
# Specify number of batches and resulting size of each batch (in grid cells).
no_batches <- 60 # increase this number if you run into memory problems
batch_size <- ceiling(length(grid) / no_batches)
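If you’re unsure how many cores to register, one common approach (a sketch, not from the original) is to query the machine and leave one core free:

# Use all but one of the available cores; adjust to taste.
n_cores <- max(1, parallel::detectCores() - 1)
registerDoParallel(cores = n_cores)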
Finally, we’ll generate our result using parallel processing. This will take at least a couple of minutes.
dialects_result <- foreach(.packages = c("sf", "tidyverse"),
                           batch_no = 1:no_batches,
                           # after each grid section is computed, rbind the resulting df into one big dialects_result df
                           .combine = rbind,
                           # the order of grid computation doesn't matter: this speeds it up even more
                           .inorder = FALSE) %dopar% {
  # compute indices for each grid section, depending on batch_size and current batch
  start_idx <- (batch_no - 1) * batch_size + 1
  end_idx <- batch_no * batch_size
  # specify which section of the grid to interpolate, using regular subsetting
  grid_batch <- grid[start_idx:ifelse(end_idx > length(grid),
                                      length(grid),
                                      end_idx)]
  # apply the actual computation to each grid section
  df <- compute_grid(grid_batch, dialects_train, k)
}
We’ll now convert the result into an sf object for mapping.
dialects_raster <- st_as_sf(dialects_result,
                            coords = c("lon", "lat"),
                            crs = nalcc,
                            remove = FALSE)
Remember that we’ve calculated a rectangular grid. As a last step, we can use the outline of the US to mask the borders of our map.
dialects_raster <- dialects_raster[us, ]
And plot the result… (Note that we’re dropping other.)
ggplot(data = dialects_raster %>% filter(dialect != "OTHER")) +
  geom_raster(aes(x = lon, y = lat, fill = dialect, alpha = prob)) +
  scale_fill_manual(values = c("forestgreen", "steelblue", "tomato")) +
  scale_alpha(guide = 'none') +
  geom_sf(data = us, alpha = 0, size = 0.25) +
  theme_void() +
  theme(legend.position = c(0.1, 0.2), legend.box = "horizontal") +
  theme(legend.title = element_blank())
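If you’d like to keep a copy of the figure, ggsave() writes the most recent plot to disk; the file name and dimensions here are just placeholders:

# Save the last plot; filename and size are arbitrary choices.
ggsave("pop_vs_soda_map.png", width = 10, height = 7, dpi = 300)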
How does our result compare to Katz’s?
https://www.foodrepublic.com/2013/07/29/a-final-word-on-the-soda-vs-pop-vs-coke-nomenclature-debate/