Fall 2025
Q: How do I do a (regression, classification, etc.) using text as an input?
A: Turn the text into numbers so it’s a standard statistics problem
But how to turn text into numbers?
Imagine a \(V \times V\) matrix whose \((i, j)\) entry counts how often words \(w_i\) and \(w_j\) occur near each other in a corpus:
| | Word 1 | Word 2 | Word 3 | … |
|---|---|---|---|---|
| Word 1 | \(c_{11}\) | \(c_{12}\) | \(c_{13}\) | … |
| Word 2 | \(c_{21}\) | \(c_{22}\) | \(c_{23}\) | … |
| Word 3 | \(c_{31}\) | \(c_{32}\) | \(c_{33}\) | … |
| … | … | … | … | … |
Call this \(C\), the co-occurrence matrix
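A minimal sketch of building \(C\) from a toy corpus (the corpus, the window size, and all variable names here are illustrative assumptions, not part of the slides):

```python
from collections import Counter

import numpy as np

# Toy corpus; in practice C is built from a large collection of documents.
corpus = [
    "it was a dark and stormy night",
    "the rain fell in torrents",
]

vocab = sorted({w for doc in corpus for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}

# Count how often each ordered pair of words occurs within `window` positions.
window = 2
pair_counts = Counter()
for doc in corpus:
    words = doc.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pair_counts[(w, words[j])] += 1

V = len(vocab)
C = np.zeros((V, V))
for (w1, w2), n in pair_counts.items():
    C[index[w1], index[w2]] = n

print(C.shape)  # (V, V)
```

Because "near" is symmetric, \(C\) comes out symmetric: the count for (dark, and) equals the count for (and, dark).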
What does \(C^{T} C\) do?
We could do PCA on the co-occurrence matrix \(C\):
We can produce a \(V \times p\) embedding matrix \(E\) by taking the first \(p\) principal components
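The PCA step can be sketched with an SVD of the centered matrix (the random stand-in for \(C\) and the choice \(p = 5\) are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, p = 50, 5

# Stand-in for a real co-occurrence matrix (symmetric, nonnegative counts).
C = rng.poisson(1.0, size=(V, V)).astype(float)
C = (C + C.T) / 2

# PCA: center the columns, then find the principal directions via SVD.
C_centered = C - C.mean(axis=0)
U, s, Vt = np.linalg.svd(C_centered, full_matrices=False)

# V x p embedding matrix E: project each word's row onto the first p components.
E = C_centered @ Vt[:p].T
print(E.shape)  # (50, 5)
```

Each row of \(E\) is now a \(p\)-dimensional numeric representation of one word, ready for standard statistical methods.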
It was a dark and stormy night; the rain fell in torrents, except at occasional intervals, when it was checked by a violent gust of wind which swept up the streets (for it is in London that our scene lies), rattling along the housetops, and fiercely agitating the scanty flame of the lamps that struggled against the darkness. (Edward George Bulwer-Lytton)
Initialize two matrices, \(E\) and \(E'\), with random numbers
Choose a window size \(N\)
night: and, stormy, the, rain
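The context list above comes from sliding a window of size \(N = 2\) over the passage; a sketch of that extraction (function name is made up for illustration):

```python
def context_words(words, target_index, N):
    """Return the words within N positions of the target word."""
    lo = max(0, target_index - N)
    hi = min(len(words), target_index + N + 1)
    return [w for k, w in enumerate(words[lo:hi], start=lo) if k != target_index]

tokens = "it was a dark and stormy night the rain fell in torrents".split()
i = tokens.index("night")
print(context_words(tokens, i, N=2))  # ['and', 'stormy', 'the', 'rain']
```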
Use the softmax, Luke:
\[ \frac{e^{E_\text{night} \cdot E'_\text{and}}}{\sum_{w'} e^{E_\text{night} \cdot E'_{w'}}} \in (0, 1) \]
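Computing that softmax probability with the two randomly initialized matrices is a one-liner per dot product (the tiny vocabulary and function name are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["and", "stormy", "night", "the", "rain"]
V, p = len(vocab), 4

# Two randomly initialized embedding matrices, as in the slides.
E = rng.normal(size=(V, p))        # target-word embeddings
E_prime = rng.normal(size=(V, p))  # context-word embeddings

def p_context_given_target(target, context):
    """Softmax probability that `context` appears near `target`."""
    t = vocab.index(target)
    scores = E_prime @ E[t]        # one dot product per vocabulary word
    scores -= scores.max()         # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[vocab.index(context)]

print(p_context_given_target("night", "and"))  # a number in (0, 1)
```

Training then adjusts \(E\) and \(E'\) so that observed (target, context) pairs get high probability; the probabilities over all contexts sum to 1 by construction.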
If our subject of interest is a whole (sentence, paragraph, text), how do we get an embedding?
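One common baseline (an assumption here, not the only option) is to average the embeddings of the words in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "rain": 1, "fell": 2, "in": 3, "torrents": 4}
E = rng.normal(size=(len(vocab), 4))  # V x p word embedding matrix

def embed_text(text):
    """Average the embeddings of the in-vocabulary words in the text."""
    rows = [E[vocab[w]] for w in text.split() if w in vocab]
    return np.mean(rows, axis=0)

x = embed_text("the rain fell in torrents")
print(x.shape)  # (4,) -- one p-dimensional vector for the whole sentence
```

The result is a single \(p\)-dimensional vector per (sentence, paragraph, text), usable as a covariate in any standard model.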
Extend to your favorite text classification task
Word embeddings are cool. But: