About the Human-AI Parallel English Corpus
The Human-AI Parallel English (HAP-E) Corpus is a cutting-edge dataset that provides a unique opportunity to compare human and AI writing side-by-side. Created by Reinhart et al. (2025), this corpus offers parallel texts where humans and multiple AI models respond to the same writing prompts, making it perfect for exploring the similarities and differences between human and artificial intelligence writing.
What Makes HAP-E Special?
HAP-E is the first corpus designed specifically to enable direct comparisons between human and AI writing:
- Parallel Structure: Each text appears in 7 versions - one human-written and six AI-generated versions from the same prompt
- Multiple AI Models: Includes text from GPT-4o, GPT-4o Mini, and four variants of Meta Llama 3
- Diverse Text Types: Covers 6 major text categories - Academic articles, News, Fiction, Spoken (podcasts), Blogs, and TV/Movie scripts
- Controlled Conditions: All texts respond to identical 500-word prompts, ensuring fair comparisons
What’s in HAP-E?
The HAP-E corpus contains approximately 8,290 parallel text sets representing:
- Academic Writing: Open-access articles from Elsevier journals
- News Articles: Contemporary journalism from U.S. news organizations
- Fiction: Novels and short stories from Project Gutenberg
- Spoken Language: Podcast transcriptions
- Blog Posts: Posts from blogger.com
- Scripts: Television and movie scripts
Each human text is matched with AI continuations from: - GPT-4o and GPT-4o Mini (OpenAI) - Llama 3 8B and Llama 3 70B (base models) - Llama 3 8B Instruct and Llama 3 70B Instruct (instruction-tuned models)
Available Version in DocuScope
HAP-E Mini |
Down-sampled balanced version |
Exploring human vs. AI writing patterns without overwhelming data |
About the Mini Version: The HAP-E Mini available in DocuScope is a carefully curated subset of the full corpus, designed to be manageable for classroom use while maintaining representative examples across all text types and AI models.
Why Use HAP-E?
HAP-E offers unique research opportunities that weren’t possible before:
- AI Detection Research: Learn to identify features that distinguish AI from human writing
- Writing Style Analysis: Compare how different AI models approach the same writing task
- Cross-Genre AI Performance: See how AI writing varies across academic, creative, and journalistic contexts
- Contemporary Writing Studies: Analyze the newest forms of text generation in our digital age
- Critical AI Literacy: Develop skills to evaluate and understand AI-generated content
Research Questions You Can Explore
HAP-E enables investigations into questions at the forefront of digital writing studies:
- Linguistic Fingerprints: What grammatical patterns reveal AI authorship?
- Model Differences: How does GPT-4o writing differ from Llama 3 writing?
- Genre Adaptation: Do AI models struggle more with creative writing than academic writing?
- Instruction Tuning Effects: How does training AI to follow instructions change its writing style?
- Human vs. Machine Patterns: What makes human writing distinctively “human”?
Key Research Findings
The original HAP-E study revealed fascinating patterns:
- AI models use 2-5 times more present participial clauses than humans (“Running quickly, she…”)
- Instruction-tuned models are easier to detect than base models
- AI models favor specific vocabulary (words like “camaraderie,” “tapestry,” “intricate”)
- Different AI models have distinctive “house styles” that can be identified
Getting Started with HAP-E
- Start with Mini: The HAP-E Mini provides manageable data for initial exploration
- Choose Your Focus: Decide whether you want to compare human vs. AI or different AI models
- Pick a Genre: Start with one text type (like academic or news) before comparing across genres
- Consider Research Ethics: Remember that understanding AI writing helps us use these tools more responsibly
Citations and Further Reading
Primary Citation
For the HAP-E corpus and research findings:
APA Format: Reinhart, A., Markey, B., Laudenbach, M., Pantusen, K., Yurko, R., Weinberg, G., & Brown, D. W. (2025). Do LLMs write like humans? Variation in grammatical and rhetorical styles. Proceedings of the National Academy of Sciences, 122(8), e2422455122. https://doi.org/10.1073/pnas.2422455122
For Course Citations
When using HAP-E data in your coursework, cite the corpus research:
Example: “Using the Human-AI Parallel English Corpus (Reinhart et al., 2025), this analysis compares…”
Educational Context
HAP-E is particularly valuable for courses in: - Digital Writing Studies - Computational Linguistics - AI Ethics and Literacy - Contemporary Rhetoric - Data Science and Text Analysis
Ready to explore the frontier of human-AI writing comparison? Head to the Load Corpus guide to get started with HAP-E analysis.