Compare Corpora
The Compare Corpora page lets you compare two sets of texts (your target and reference corpora) to discover which words or tags are unusually frequent in one set compared to the other. This is done using keyness statistics.
What You Can Do
- Generate a keyness table comparing your target and reference corpora
- Choose to compare tokens (words) or tags (POS or DocuScope)
- Filter results by tag
- View results as a table or as a plot
- Download the results as an Excel file
Step-by-Step Guide
1. Generate a Keyness Table
- If you haven’t already, load both a target and a reference corpus
- Click the Keyness Table button in the sidebar to generate the comparison
2. Choose Table Type
- Tokens: Compare individual words, grouped by tag
- Tags Only: Compare just the tags (POS or DocuScope)
3. Filter and Explore
- Use the Select tags to filter box to focus on specific tags
- Switch between Keyness Table and Keyness Plot tabs to view data as a table or bar chart
4. Download Your Results
- Use the Excel button in the sidebar to download the current table
What You See in the Table
- Only items with statistically significant differences (p < 0.01, LL ≥ 6.63) are shown
What makes a strong keyword?
- High log-likelihood (LL ≥ 10): Strong evidence for a difference
- Meaningful log-ratio (|LR| ≥ 1): At least 2× difference in frequency
- Sufficient frequency: Appears multiple times in your corpus (not just once or twice)
Reading the statistics together:
- High LL, high LR: Strong evidence for a large effect (your best keywords)
- High LL, low LR: Strong evidence for a small effect (consistent but modest difference)
- Low LL, high LR: Weak evidence for a large effect (potentially unreliable due to low frequency)
Learn more: Understanding effect sizes in corpus linguistics
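The rules of thumb above can be sketched as a small helper. This is only an illustration, not the app's own logic: the function name, the label wording, and the minimum frequency of 5 are assumptions (the guide only says an item should appear "multiple times").

```python
def classify_keyword(ll, lr, freq, ll_min=10.0, lr_min=1.0, freq_min=5):
    """Label a keyness result using the rule-of-thumb thresholds above.

    ll   : log-likelihood value
    lr   : log-ratio value (the sign gives the direction; only its size is used)
    freq : raw frequency in the corpus where the item is more common
    freq_min is an assumed cut-off, not a value taken from the app.
    """
    strong_evidence = ll >= ll_min
    large_effect = abs(lr) >= lr_min
    if freq < freq_min:
        return "too rare to trust"
    if strong_evidence and large_effect:
        return "strong keyword"
    if strong_evidence:
        return "consistent but modest difference"
    if large_effect:
        return "large but weakly supported difference"
    return "weak keyword"


print(classify_keyword(ll=28.7, lr=2.1, freq=40))   # strong keyword
print(classify_keyword(ll=15.0, lr=0.3, freq=5000)) # consistent but modest difference
```

The thresholds are keyword arguments so that you can tighten or relax them for your own study.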
Understanding the Statistics
Keyness is a statistical measure that compares observed frequencies in two corpora against what we would expect if both corpora were drawn from the same distribution. It helps you identify linguistic features that distinguish your target texts from your reference texts.
Log-likelihood (LL) - Evidence for an Effect
Log-likelihood tests how well the observed frequencies match what we would expect if the target and reference corpora were samples from the same underlying distribution.
- What it calculates: LL = 2 × (a × ln(a/E₁) + b × ln(b/E₂)), where a and b are the observed frequencies and E₁ and E₂ are the expected frequencies
- What it tells you: How much evidence there is for a difference between the corpora
- How to interpret:
  - Higher LL values = stronger evidence that the corpora differ in their use of this feature
  - LL ≥ 3.84 = statistically significant (p < 0.05)
  - LL ≥ 6.63 = statistically significant (p < 0.01)
  - LL ≥ 10.83 = statistically significant (p < 0.001)
- Important: This is a measure of evidence for an effect, not the size of the effect
- This app only shows results with p < 0.01 (LL ≥ 6.63)
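If you want to check a value by hand, the formula above can be sketched in Python. This is a minimal illustration and may not match the app's internal code. It assumes the usual definition of the expected frequencies, E₁ = N₁ × (a + b) / (N₁ + N₂) and E₂ = N₂ × (a + b) / (N₁ + N₂), where N₁ and N₂ are the total token counts of the target and reference corpora. The function and parameter names are made up for this example.

```python
import math


def log_likelihood(a, b, n_target, n_reference):
    """Log-likelihood for one item, following the two-term formula above.

    a, b                  : observed counts in the target and reference corpora
    n_target, n_reference : total token counts of the two corpora
    """
    total = n_target + n_reference
    e1 = n_target * (a + b) / total      # expected count in the target corpus
    e2 = n_reference * (a + b) / total   # expected count in the reference corpus
    # A zero count contributes nothing, because x * ln(x) tends to 0 as x -> 0.
    term_a = a * math.log(a / e1) if a > 0 else 0.0
    term_b = b * math.log(b / e2) if b > 0 else 0.0
    return 2 * (term_a + term_b)


def p_value(ll):
    """Upper-tail p-value of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(ll, 0.0) / 2))


# Made-up counts: 120 hits in a 100,000-token target, 60 in a 100,000-token reference.
ll = log_likelihood(120, 60, 100_000, 100_000)
print(round(ll, 2), p_value(ll) < 0.01)
```

The `p_value` helper reproduces the thresholds in the list above: an LL of 3.84 gives roughly 0.05, 6.63 gives roughly 0.01, and 10.83 gives roughly 0.001.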
Important caveat: Because log-likelihood measures the amount of evidence for an effect, very frequent words like “the” may show statistical significance even when the actual difference is small (low log-ratio). Conversely, rare words might not reach significance even with large proportional differences because there isn’t enough evidence. This is why you need both measures!
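The caveat above is easier to see with numbers. The sketch below uses invented counts for two equally sized corpora and assumes the common definitions of both measures (expected frequencies proportional to corpus size, and log-ratio as the base-2 logarithm of the ratio of relative frequencies). The `keyness` helper is hypothetical and is not part of the app.

```python
import math


def keyness(a, b, n_target, n_reference):
    """Return (LL, log-ratio) for counts a and b; both counts must be above zero."""
    total = n_target + n_reference
    e1 = n_target * (a + b) / total
    e2 = n_reference * (a + b) / total
    ll = 2 * (a * math.log(a / e1) + b * math.log(b / e2))
    lr = math.log2((a / n_target) / (b / n_reference))
    return ll, lr


CORPUS_SIZE = 1_000_000  # made-up size, the same for both corpora

# A very frequent word with a small proportional difference.
ll_frequent, lr_frequent = keyness(60_000, 58_000, CORPUS_SIZE, CORPUS_SIZE)
# A rare word that is four times as frequent in the target corpus.
ll_rare, lr_rare = keyness(4, 1, CORPUS_SIZE, CORPUS_SIZE)

print(f"frequent word: LL = {ll_frequent:.1f}, LR = {lr_frequent:.2f}")
print(f"rare word:     LL = {ll_rare:.1f}, LR = {lr_rare:.2f}")
```

With these counts the frequent word gets an LL of about 34 but a log-ratio of only about 0.05, while the rare word has a log-ratio of 2 and an LL of about 1.9, well below the 6.63 cut-off.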
For beginners: Think of log-likelihood as asking “How confident can we be that these corpora really do use this word/tag differently?” Higher numbers mean more confidence.
Learn more: Log-likelihood calculator and explanation
Log-ratio (LR) - Magnitude of the Effect
Log-ratio measures how much more frequent a feature is in one corpus compared to another.
- What it tells you: The magnitude (size) of the frequency difference
- How to interpret:
  - Log-ratio of 1 = feature is 2× more frequent in target corpus
  - Log-ratio of 2 = feature is 4× more frequent in target corpus
  - Log-ratio of -1 = feature is 2× more frequent in reference corpus
  - Log-ratio of 0 = equal frequency in both corpora
- Why it matters: You can have strong evidence for an effect (high LL) but a small effect size (low LR), or vice versa
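The interpretation list above follows from treating log-ratio as the base-2 logarithm of the ratio of relative frequencies. The sketch below assumes that definition. It also assumes both counts are above zero; tools differ in how they handle zero counts, and this guide does not say what the app does, so the example simply refuses them.

```python
import math


def log_ratio(a, b, n_target, n_reference):
    """Base-2 log of (relative frequency in target / relative frequency in reference).

    a, b                  : observed counts in the target and reference corpora
    n_target, n_reference : total token counts of the two corpora
    """
    if a <= 0 or b <= 0:
        raise ValueError("this sketch needs non-zero counts in both corpora")
    rel_target = a / n_target
    rel_reference = b / n_reference
    return math.log2(rel_target / rel_reference)


# Made-up counts in two corpora of one million tokens each.
print(log_ratio(200, 100, 1_000_000, 1_000_000))  # twice as frequent in target
print(log_ratio(100, 200, 1_000_000, 1_000_000))  # twice as frequent in reference
```

Because relative frequencies are used, the result stays comparable when the two corpora differ in size.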
Key insight: Use both statistics together! Log-likelihood tells you how much evidence you have for a difference, and log-ratio tells you how large that difference is. Strong keywords have both high LL and meaningful LR values.
Learn more: Understanding effect sizes in corpus linguistics
Reporting Your Results
How to Report Keyness Statistics
When writing about your findings, follow these conventions:
For individual words/features:
- Report both LL and LR values
- Include the significance level
- Provide context about frequency and direction
Examples:
The word evidence was significantly more frequent in academic papers than in student essays (LL = 15.23, p < 0.001, LR = 1.87), appearing approximately 3.7 times more often.
Modal verbs showed significant overuse in student writing compared to published articles (LL = 23.45, p < 0.001, LR = 1.12), occurring about 2.2 times more frequently.
While however appeared significantly more often in the academic corpus (LL = 8.91, p < 0.01), the effect size was modest (LR = 0.65), representing a 1.6-fold increase in frequency.
For patterns across multiple features:
Several features distinguished the two corpora significantly (p < 0.01). The strongest keywords in the target corpus were technical terms like “methodology” (LL = 28.7, LR = 2.1) and “analysis” (LL = 19.3, LR = 1.4), while the reference corpus was characterized by personal pronouns such as “I” (LL = 45.2, LR = -1.8) and “my” (LL = 22.1, LR = -1.5).
Reporting Guidelines
Best practices for academic writing:
- Always report both statistics: LL shows evidence strength, LR shows effect size
- Include significance levels: Use p < 0.05, p < 0.01, or p < 0.001
- Interpret the direction: Specify which corpus shows higher frequency
- Provide context: Convert LR to fold-change (2^LR) for clarity
- Be selective: Focus on the most meaningful differences, not every significant result
- Round appropriately: LL to 2 decimal places, LR to 2-3 decimal places
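The guidelines above can be turned into a small formatting helper. This is a sketch for your own write-ups, not something the app produces: the function names and the sentence template are assumptions. It converts LR to a fold-change with 2^|LR| and picks the significance level from the LL thresholds given earlier in this guide.

```python
def significance_label(ll):
    """Map LL onto the conventional thresholds for 1 degree of freedom."""
    if ll >= 10.83:
        return "p < 0.001"
    if ll >= 6.63:
        return "p < 0.01"
    if ll >= 3.84:
        return "p < 0.05"
    return "n.s."


def report(item, ll, lr):
    """Build a one-line summary with LL, p level, LR, fold-change and direction."""
    stats = f"{item}: LL = {ll:.2f}, {significance_label(ll)}, LR = {lr:.2f}"
    if lr == 0:
        return f"{stats}; equally frequent in both corpora"
    fold = 2 ** abs(lr)  # fold-change implied by the log-ratio
    corpus = "target" if lr > 0 else "reference"
    return f"{stats}; about {fold:.1f} times more frequent in the {corpus} corpus"


print(report("evidence", 15.23, 1.87))
print(report("I", 45.2, -1.8))
```

A negative LR is reported as a fold-change in favour of the reference corpus, which keeps the direction explicit in the sentence.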
Tips for New Users
- Start simple: Begin with “Tags Only” to get an overview of broad patterns
- Look for patterns: Focus on words/tags with both high LL and meaningful log-ratio
- Use filters: Narrow down to specific grammatical categories to find targeted differences
- Compare wisely: Ensure your reference corpus is appropriate for your research question
- If you don’t see any data, make sure both corpora are loaded and processed
- Use the plot view to quickly spot the most distinctive features
- Download your results often so you can experiment without losing your work
If You Get Stuck
- Use the reset button on the Manage Corpus Data page if you need to start over
- If you see warnings, check that both corpora are loaded and processed
- Remember: No results might mean your corpora are actually quite similar!