pybiber
A comprehensive Python package for linguistic feature extraction and Multi-Dimensional Analysis
The pybiber package provides tools for extracting 67 lexicogrammatical and functional features described by Biber (1988) and widely used for text-type, register, and genre classification tasks in corpus linguistics.
Key Features
- Feature Extraction: Automated extraction of 67 linguistic features from text corpora
- Multi-Dimensional Analysis: Implementation of Biber’s MDA methodology for register analysis
- Principal Component Analysis: Alternative dimensionality reduction approaches
- Visualization Tools: Comprehensive plotting functions for exploratory data analysis
- High Performance: Built on spaCy and Polars for efficient processing
- Pipeline Integration: End-to-end workflows from raw text to statistical analysis
Applications
The pybiber package is suitable for:
- Register and genre analysis in corpus linguistics
- Text classification and machine learning preprocessing
- Diachronic language change studies
- Cross-linguistic variation research
- Academic writing analysis and pedagogical applications
- Stylometric analysis and authorship attribution
Technical Foundation
The package uses spaCy's part-of-speech tagging and dependency parsing to extract linguistic features. All data processing leverages the Polars DataFrame library for high-performance analytics.
Feature extraction builds on the outputs of probabilistic taggers, so the accuracy of the resulting counts depends on the accuracy of those models. Texts with irregular spellings, non-normative punctuation, or other departures from standard orthography may produce unreliable outputs unless the taggers have been tuned for those domains.
Quick Start
For users eager to jump in, here’s a minimal example:
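import pybiber as pb

# One-line corpus processing
pipeline = pb.PybiberPipeline()
features = pipeline.run_from_folder("path/to/texts")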
Installation
You can install the released version of pybiber from PyPI:
pip install pybiber
spaCy Model Installation
Install a spaCy model for linguistic analysis:
python -m spacy download en_core_web_sm
You can also install spaCy models directly via pip:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl
For other languages, see the spaCy models documentation.
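A quick way to confirm the model is available is to load it directly (this uses the en_core_web_sm model installed above):

import spacy

# spacy.load raises an OSError if the model has not been downloaded
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']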
Basic Workflow
Data Requirements
The pybiber package works with text corpora structured as DataFrames. The biber function expects a Polars DataFrame with:
- doc_id column: Unique identifiers for each document
- text column: Raw text content

This structure follows conventions established by readtext and quanteda in R.
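If your texts are already in memory, a minimal sketch of the expected structure looks like this (the doc_id values and texts are placeholders):

import polars as pl

corpus = pl.DataFrame({
    "doc_id": ["doc_01", "doc_02"],
    "text": ["First document text.", "Second document text."],
})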
Step-by-Step Processing
1. Import Libraries and Load Data
import spacy
import pybiber as pb
from pybiber.data import micusp_mini
Let’s examine the structure of our sample corpus:
micusp_mini.head()
| doc_id | text |
|---|---|
| str | str |
| "BIO_G0_02_1" | "Ernst Mayr once wrote, "sympat… |
| "BIO_G0_03_1" | "The ability of a species to co… |
| "BIO_G0_06_1" | "Generally, females make a larg… |
| "BIO_G0_12_1" | "In the field of plant biology,… |
| "BIO_G0_21_1" | "Parasites in nonhuman animals … |
The micusp_mini dataset is a subset of the Michigan Corpus of Upper-Level Student Papers, containing academic texts from various disciplines.
To process your own texts, use corpus_from_folder to read all text files from a directory:
corpus = pb.corpus_from_folder("path/to/your/texts")
2. Initialize spaCy Model
The pybiber package requires a spaCy model with part-of-speech tagging and dependency parsing capabilities:
nlp = spacy.load("en_core_web_sm", disable=["ner"])
We disable Named Entity Recognition (ner) to increase processing speed, though this is optional. The essential components for feature extraction are the part-of-speech tagger and dependency parser. Note that PybiberPipeline disables NER by default.
3. Process the Corpus
Use the CorpusProcessor to parse your texts:
processor = pb.CorpusProcessor()
df_spacy = processor.process_corpus(micusp_mini, nlp_model=nlp)
Performance: Corpus processing completed in 49.84s
This returns a token-level DataFrame with linguistic annotations, structured similarly to spacyr output:
df_spacy.head()
| doc_id | sentence_id | token_id | token | lemma | pos | tag | head_token_id | dep_rel |
|---|---|---|---|---|---|---|---|---|
| str | u32 | i64 | str | str | str | str | i64 | str |
| "BIO_G0_02_1" | 1 | 0 | "Ernst" | "Ernst" | "PROPN" | "NNP" | 1 | "compound" |
| "BIO_G0_02_1" | 1 | 1 | "Mayr" | "Mayr" | "PROPN" | "NNP" | 3 | "nsubj" |
| "BIO_G0_02_1" | 1 | 2 | "once" | "once" | "ADV" | "RB" | 3 | "advmod" |
| "BIO_G0_02_1" | 1 | 3 | "wrote" | "write" | "VERB" | "VBD" | 3 | "ROOT" |
| "BIO_G0_02_1" | 1 | 4 | "," | "," | "PUNCT" | "," | 8 | "punct" |
4. Extract Linguistic Features
Finally, aggregate the token-level data into document-level feature counts:
df_biber = pb.biber(df_spacy)
[INFO] Using MATTR for f_43_type_token
[INFO] All features normalized per 1000 tokens except: f_43_type_token and f_44_mean_word_length
The resulting document-feature matrix contains 67 linguistic features plus document identifiers:
print(f"Matrix dimensions: {df_biber.shape}")
df_biber.head()
Matrix dimensions: (170, 68)
| doc_id | f_01_past_tense | f_02_perfect_aspect | f_03_present_tense | f_04_place_adverbials | … | f_63_split_auxiliary | f_64_phrasal_coordination | f_65_clausal_coordination | f_66_neg_synthetic | f_67_neg_analytic |
|---|---|---|---|---|---|---|---|---|---|---|
| str | f64 | f64 | f64 | f64 | … | f64 | f64 | f64 | f64 | f64 |
| "BIO_G0_02_1" | 11.574886 | 9.821115 | 61.381971 | 2.104525 | … | 4.910558 | 6.664328 | 4.209049 | 1.403016 | 2.806033 |
| "BIO_G0_03_1" | 20.300088 | 3.53045 | 59.13504 | 1.765225 | … | 0.882613 | 7.943513 | 2.647838 | 0.882613 | 7.0609 |
| "BIO_G0_06_1" | 9.480034 | 2.585464 | 52.5711 | 0.861821 | … | 6.320023 | 10.054582 | 5.458202 | 0.574548 | 8.905487 |
| "BIO_G0_12_1" | 36.900369 | 2.767528 | 23.98524 | 1.845018 | … | 2.767528 | 0.922509 | 1.845018 | 1.845018 | 5.535055 |
| "BIO_G0_21_1" | 40.050858 | 2.542912 | 26.700572 | 2.542912 | … | 3.17864 | 7.628735 | 6.993007 | 2.542912 | 2.542912 |
Understanding the Output
Feature Normalization
By default, all features are normalized per 1,000 tokens except:
- f_43_type_token: Type-token ratio (0-1 scale)
- f_44_mean_word_length: Average word length in characters
Set normalize=False to return absolute counts instead of normalized frequencies.
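For example, assuming normalize is passed as a keyword argument to pb.biber as described above:

# Absolute counts rather than per-1,000-token rates
df_counts = pb.biber(df_spacy, normalize=False)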
Document Metadata
Encode important metadata into your document IDs (file names) for downstream analysis. In the micusp_mini data, the first three letters represent academic disciplines (e.g., BIO=Biology, ENG=English).
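With IDs structured this way, the metadata can be recovered as a column using a Polars expression (a sketch; the discipline column name is arbitrary):

import polars as pl

# First three characters of doc_id encode the discipline (e.g., "BIO")
df_meta = df_biber.with_columns(
    pl.col("doc_id").str.slice(0, 3).alias("discipline")
)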
Advanced Usage
High-Level Pipeline
For streamlined processing, use the PybiberPipeline:
# Process a folder of .txt files in one step
pipeline = pb.PybiberPipeline(model="en_core_web_sm")
features = pipeline.run_from_folder("path/to/texts", recursive=True)
Working with Pandas
All DataFrames use Polars for performance. To convert to pandas:
df_pandas = df_biber.to_pandas()  # Requires pandas and pyarrow
Next Steps
- Get Started Guide: Detailed walkthrough of the basic workflow
- Feature Categories: Complete list and descriptions of all 67 features
- Biber Analyzer: Multi-Dimensional Analysis and statistical visualization tools
For advanced analytical workflows, explore the BiberAnalyzer class for factor analysis and dimensional visualization.