Get Started
This guide walks you through the complete pybiber workflow, from installing the package to extracting linguistic features from your text corpus.
Overview
Processing a corpus with pybiber involves four main steps:
- Prepare your corpus - Organize texts in the required DataFrame format
- Initialize a spaCy model - Load a model with POS tagging and dependency parsing
- Parse the corpus - Extract token-level linguistic annotations
- Extract features - Aggregate tokens into document-level feature counts
After generating the document-feature matrix, you can proceed to advanced analyses like classification tasks (Reinhart et al. 2024) or Multi-Dimensional Analysis (Biber 1985). See the Biber Analyzer documentation for statistical analysis workflows.
Prerequisites
Installation
Install pybiber from PyPI:
pip install pybiber
spaCy Model
Install a spaCy model with part-of-speech tagging and dependency parsing:
python -m spacy download en_core_web_sm
The pybiber package requires a spaCy model that performs both part-of-speech tagging and dependency parsing. Most en_core_* models meet these requirements. For other languages, check the spaCy models page.
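If you are unsure whether a particular model qualifies, you can inspect its pipeline components directly (a quick optional check, not required by pybiber):
import spacy

# A tagger and a dependency parser are the components pybiber relies on
nlp = spacy.load("en_core_web_sm")
print("tagger" in nlp.pipe_names, "parser" in nlp.pipe_names)  # expect: True True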
Step 1: Preparing a Corpus
Import Libraries
import spacy
import pybiber as pb
import polars as pl
Data Structure Requirements
The pybiber workflow expects a Polars DataFrame with two essential columns:
- doc_id: Unique identifier for each document
- text: Raw text content
This structure follows conventions established by readtext and quanteda in R.
Option 1: Using Sample Data
For this tutorial, we’ll use the included sample dataset:
from pybiber.data import micusp_mini
Let’s examine the structure:
print(f"Corpus shape: {micusp_mini.shape}")
micusp_mini.head()
Corpus shape: (170, 2)
doc_id | text |
---|---|
str | str |
"BIO_G0_02_1" | "Ernst Mayr once wrote, "sympat… |
"BIO_G0_03_1" | "The ability of a species to co… |
"BIO_G0_06_1" | "Generally, females make a larg… |
"BIO_G0_12_1" | "In the field of plant biology,… |
"BIO_G0_21_1" | "Parasites in nonhuman animals … |
The micusp_mini dataset is a subset of the Michigan Corpus of Upper-Level Student Papers, containing academic texts from various disciplines. Document IDs encode discipline information (e.g., BIO=Biology, ENG=English).
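If you later want to group documents by discipline, one option is to derive a label from that prefix; the snippet below is purely illustrative and not part of the pybiber API.
# Derive a discipline label from the doc_id prefix (e.g., "BIO", "ENG")
disciplines = micusp_mini.with_columns(
    pl.col("doc_id").str.extract(r"^([A-Z]+)", 1).alias("discipline")
)
print(disciplines["discipline"].value_counts())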
Option 2: Loading Your Own Data
From CSV/Parquet Files
# From CSV
corpus = pl.read_csv("my_corpus.csv")

# From Parquet (recommended for large datasets)
corpus = pl.read_parquet("my_corpus.parquet")

# From Hugging Face datasets
corpus = pl.read_parquet(
    'hf://datasets/browndw/human-ai-parallel-corpus-mini/hape_mini-text.parquet'
)
From Text Files in Directory
Use corpus_from_folder to read all .txt files from a directory:
# Read all .txt files from a directory
corpus = pb.corpus_from_folder("path/to/text/files")

# For nested directory structures
text_paths = pb.get_text_paths("path/to/corpus", recursive=True)
corpus = pb.readtext(text_paths)
Custom Corpus Creation
# Create corpus from custom data
import polars as pl
corpus = pl.DataFrame({
    "doc_id": ["doc1", "doc2", "doc3"],
    "text": [
        "This is the first document.",
        "Here is another text sample.",
        "And this is the third document."
    ]
})
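Whichever loading route you choose, a quick sanity check can catch structural problems before parsing; this sketch assumes your DataFrame is named corpus.
# Confirm the required columns exist and doc_id values are unique
required = {"doc_id", "text"}
missing = required - set(corpus.columns)
assert not missing, f"Missing columns: {missing}"
assert corpus["doc_id"].n_unique() == corpus.height, "doc_id values must be unique"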
Step 2: Initialize spaCy Model
Load a spaCy model with the required linguistic components:
nlp = spacy.load("en_core_web_sm", disable=["ner"])
Model Configuration Options
# Option 1: Keep all components (slower but complete)
nlp = spacy.load("en_core_web_sm")

# Option 2: Disable unnecessary components for speed (recommended)
nlp = spacy.load("en_core_web_sm", disable=["ner"])

# Option 3: Maximize speed (disable more components)
nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
Disabling Named Entity Recognition (ner) typically provides the best speed/functionality balance for feature extraction, as NER isn’t required for Biber features. This is also the default setting in PybiberPipeline.
Step 3: Parse the Text Data
Using CorpusProcessor
The CorpusProcessor provides efficient, configurable text processing:
processor = pb.CorpusProcessor()
df_tokens = processor.process_corpus(micusp_mini, nlp_model=nlp)
Performance: Corpus processing completed in 49.87s
The processing time depends on corpus size and system specifications. For the micusp_mini corpus (170 documents), expect processing to take roughly a minute on typical hardware.
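If you want to benchmark your own setup, the standard library is sufficient; this optional sketch simply wraps the call shown above.
import time

start = time.perf_counter()
df_tokens = processor.process_corpus(micusp_mini, nlp_model=nlp)
print(f"Corpus processing completed in {time.perf_counter() - start:.2f}s")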
Understanding the Token Output
The processor returns a token-level DataFrame with linguistic annotations:
print(f"Token DataFrame shape: {df_tokens.shape}")
df_tokens.head(10)
Token DataFrame shape: (544570, 9)
doc_id | sentence_id | token_id | token | lemma | pos | tag | head_token_id | dep_rel |
---|---|---|---|---|---|---|---|---|
str | u32 | i64 | str | str | str | str | i64 | str |
"BIO_G0_02_1" | 1 | 0 | "Ernst" | "Ernst" | "PROPN" | "NNP" | 1 | "compound" |
"BIO_G0_02_1" | 1 | 1 | "Mayr" | "Mayr" | "PROPN" | "NNP" | 3 | "nsubj" |
"BIO_G0_02_1" | 1 | 2 | "once" | "once" | "ADV" | "RB" | 3 | "advmod" |
"BIO_G0_02_1" | 1 | 3 | "wrote" | "write" | "VERB" | "VBD" | 3 | "ROOT" |
"BIO_G0_02_1" | 1 | 4 | "," | "," | "PUNCT" | "," | 8 | "punct" |
"BIO_G0_02_1" | 1 | 5 | """ | """ | "PUNCT" | "``" | 8 | "punct" |
"BIO_G0_02_1" | 1 | 6 | "sympatric" | "sympatric" | "ADJ" | "JJ" | 7 | "amod" |
"BIO_G0_02_1" | 1 | 7 | "speciation" | "speciation" | "NOUN" | "NN" | 8 | "nsubj" |
"BIO_G0_02_1" | 1 | 8 | "is" | "be" | "AUX" | "VBZ" | 3 | "ccomp" |
"BIO_G0_02_1" | 1 | 9 | "like" | "like" | "ADP" | "IN" | 8 | "prep" |
Key columns include:
- doc_id: Document identifier
- token: Raw token text
- lemma: Lemmatized form
- pos: Part-of-speech tag (universal)
- tag: Fine-grained POS tag
- dep_rel: Dependency relation
- sentence_id: Sentence identifier
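Because the token table is an ordinary Polars DataFrame, you can explore it directly before extracting features; for example, an illustrative tally of fine-grained tags per document:
# Count fine-grained POS tags per document
tag_counts = (
    df_tokens
    .group_by(["doc_id", "tag"])
    .len()
    .sort(["doc_id", "len"], descending=[False, True])
)
print(tag_counts.head())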
Performance Optimization
You can customize processing parameters for better performance:
processor = pb.CorpusProcessor()
df_tokens = processor.process_corpus(
    corpus,
    nlp_model=nlp,
    n_process=4,         # Use multiple CPU cores
    batch_size=100,      # Optimize batch size
    show_progress=True   # Display progress bar
)
- Small corpora (<1000 docs): batch_size=50-100
- Medium corpora (1000-10000 docs): batch_size=100-200
- Large corpora (>10000 docs): batch_size=200-500
Larger batch sizes may actually slow processing due to memory constraints.
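If you prefer to set this programmatically, the guidelines above can be folded into a small helper; suggest_batch_size below is hypothetical and not part of pybiber.
# Hypothetical helper that encodes the batch-size guidelines above
def suggest_batch_size(n_docs: int) -> int:
    if n_docs < 1_000:
        return 100
    if n_docs <= 10_000:
        return 200
    return 500

df_tokens = processor.process_corpus(
    corpus,
    nlp_model=nlp,
    batch_size=suggest_batch_size(corpus.height),
)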
Step 4: Extract Linguistic Features
Basic Feature Extraction
Transform token-level data into document-level feature counts using biber:
df_features = pb.biber(df_tokens)
[INFO] Using MATTR for f_43_type_token
[INFO] All features normalized per 1000 tokens except: f_43_type_token and f_44_mean_word_length
Understanding the Feature Matrix
The result is a document-feature matrix with 67 linguistic variables:
print(f"Feature matrix shape: {df_features.shape}")
print(f"Features extracted: {df_features.shape[1] - 1}") # Minus doc_id column
df_features.head()
Feature matrix shape: (170, 68)
Features extracted: 67
doc_id | f_01_past_tense | f_02_perfect_aspect | f_03_present_tense | f_04_place_adverbials | … | f_63_split_auxiliary | f_64_phrasal_coordination | f_65_clausal_coordination | f_66_neg_synthetic | f_67_neg_analytic |
---|---|---|---|---|---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | … | f64 | f64 | f64 | f64 | f64 |
"BIO_G0_02_1" | 11.574886 | 9.821115 | 61.381971 | 2.104525 | … | 4.910558 | 6.664328 | 4.209049 | 1.403016 | 2.806033 |
"BIO_G0_03_1" | 20.300088 | 3.53045 | 59.13504 | 1.765225 | … | 0.882613 | 7.943513 | 2.647838 | 0.882613 | 7.0609 |
"BIO_G0_06_1" | 9.480034 | 2.585464 | 52.5711 | 0.861821 | … | 6.320023 | 10.054582 | 5.458202 | 0.574548 | 8.905487 |
"BIO_G0_12_1" | 36.900369 | 2.767528 | 23.98524 | 1.845018 | … | 2.767528 | 0.922509 | 1.845018 | 1.845018 | 5.535055 |
"BIO_G0_21_1" | 40.050858 | 2.542912 | 26.700572 | 2.542912 | … | 3.17864 | 7.628735 | 6.993007 | 2.542912 | 2.542912 |
Feature Normalization Options
By default, features are normalized per 1,000 tokens, except for two features that use different scales:
# Normalized frequencies (default)
df_normalized = pb.biber(df_tokens, normalize=True)

# Raw counts
df_raw = pb.biber(df_tokens, normalize=False)
- Most features: Normalized per 1,000 tokens
- f_43_type_token: Type-token ratio (0-1 scale)
- f_44_mean_word_length: Average characters per word
This normalization enables comparison across documents of different lengths.
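The normalization itself is straightforward; with hypothetical numbers for one feature in one document:
# rate per 1,000 tokens = raw count / document token total * 1000
raw_count = 24          # hypothetical raw count of a feature
doc_token_total = 2073  # hypothetical document length in tokens
rate_per_1000 = raw_count / doc_token_total * 1000
print(round(rate_per_1000, 3))  # 11.577 occurrences per 1,000 tokens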
Type-Token Ratio Options
The package offers two type-token ratio calculations:
# Moving Average Type-Token Ratio (default, recommended)
df_mattr = pb.biber(df_tokens, force_ttr=False)

# Traditional Type-Token Ratio (for specific comparisons)
df_ttr = pb.biber(df_tokens, force_ttr=True)
- MATTR (default): More robust, calculated using 100-token windows
- Traditional TTR: Simple unique tokens / total tokens ratio
- Use consistent measures when comparing corpora processed separately
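For intuition, a moving-average type-token ratio over 100-token windows can be sketched as follows; this is a simplified illustration, not pybiber's internal implementation.
def mattr(tokens: list[str], window: int = 100) -> float:
    """Average the type-token ratio over sliding windows of `window` tokens."""
    if len(tokens) < window:
        # Fall back to the simple TTR for documents shorter than one window
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)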
Alternative Workflow: High-Level Pipeline
For streamlined processing, use the PybiberPipeline:
Complete Pipeline Example
# Initialize pipeline with optimal settings
pipeline = pb.PybiberPipeline(
    model="en_core_web_sm",
    disable_ner=True,
    n_process=4,
    batch_size=100
)

# Process folder of text files
df_features = pipeline.run_from_folder("/path/to/texts", recursive=True)

# Or process existing corpus DataFrame
df_features = pipeline.run(micusp_mini)
Pipeline with Token Retention
If you need both features and token-level data:
# Return both features and tokens
features, tokens = pipeline.run(
    micusp_mini,
    return_tokens=True,
    normalize=True
)
Data Quality and Validation
Examining Feature Distributions
Before analysis, examine your feature distributions:
# Summary statistics
feature_columns = df_features.select(pl.selectors.numeric())
summary = feature_columns.describe()
print(summary)
shape: (9, 68)
┌─────────┬─────────┬─────────┬─────────┬─────────┬───┬────────┬────────┬────────┬────────┬────────┐
│ statist ┆ f_01_pa ┆ f_02_pe ┆ f_03_pr ┆ f_04_pl ┆ … ┆ f_63_s ┆ f_64_p ┆ f_65_c ┆ f_66_n ┆ f_67_n │
│ ic ┆ st_tens ┆ rfect_a ┆ esent_t ┆ ace_adv ┆ ┆ plit_a ┆ hrasal ┆ lausal ┆ eg_syn ┆ eg_ana │
│ --- ┆ e ┆ spect ┆ ense ┆ erbials ┆ ┆ uxilia ┆ _coord ┆ _coord ┆ thetic ┆ lytic │
│ str ┆ --- ┆ --- ┆ --- ┆ --- ┆ ┆ ry ┆ inatio ┆ inatio ┆ --- ┆ --- │
│ ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ --- ┆ n ┆ n ┆ f64 ┆ f64 │
│ ┆ ┆ ┆ ┆ ┆ ┆ f64 ┆ --- ┆ --- ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ f64 ┆ f64 ┆ ┆ │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═══╪════════╪════════╪════════╪════════╪════════╡
│ count ┆ 170.0 ┆ 170.0 ┆ 170.0 ┆ 170.0 ┆ … ┆ 170.0 ┆ 170.0 ┆ 170.0 ┆ 170.0 ┆ 170.0 │
│ null_co ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ … ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 │
│ unt ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ mean ┆ 19.8328 ┆ 4.35336 ┆ 48.3201 ┆ 2.22213 ┆ … ┆ 3.7340 ┆ 9.1199 ┆ 4.9840 ┆ 1.0312 ┆ 5.8196 │
│ ┆ 55 ┆ 5 ┆ 44 ┆ 3 ┆ ┆ 87 ┆ 05 ┆ 01 ┆ 4 ┆ 86 │
│ std ┆ 16.8183 ┆ 2.89646 ┆ 18.7263 ┆ 1.87932 ┆ … ┆ 1.9150 ┆ 4.3323 ┆ 2.6707 ┆ 1.1123 ┆ 2.9414 │
│ ┆ 59 ┆ 5 ┆ 94 ┆ 4 ┆ ┆ 47 ┆ 36 ┆ 22 ┆ 41 ┆ 46 │
│ min ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ … ┆ 0.4775 ┆ 0.9225 ┆ 0.0 ┆ 0.0 ┆ 0.0 │
│ ┆ ┆ ┆ ┆ ┆ ┆ 55 ┆ 09 ┆ ┆ ┆ │
│ 25% ┆ 6.87521 ┆ 2.23813 ┆ 35.4150 ┆ 0.92776 ┆ … ┆ 2.2594 ┆ 5.8593 ┆ 3.2523 ┆ 0.1579 ┆ 3.7582 │
│ ┆ 5 ┆ 8 ┆ 8 ┆ 7 ┆ ┆ 61 ┆ 75 ┆ 44 ┆ 03 ┆ 21 │
│ 50% ┆ 15.4215 ┆ 3.82690 ┆ 50.4201 ┆ 1.76782 ┆ … ┆ 3.5211 ┆ 8.3688 ┆ 4.7780 ┆ 0.8358 ┆ 5.3412 │
│ ┆ 22 ┆ 6 ┆ 68 ┆ 6 ┆ ┆ 27 ┆ 62 ┆ 21 ┆ 41 ┆ 46 │
│ 75% ┆ 27.5049 ┆ 5.80671 ┆ 61.6792 ┆ 2.91439 ┆ … ┆ 4.9455 ┆ 11.562 ┆ 6.2350 ┆ 1.4030 ┆ 7.3333 │
│ ┆ 12 ┆ 9 ┆ 68 ┆ ┆ ┆ 98 ┆ 998 ┆ 12 ┆ 16 ┆ 33 │
│ max ┆ 82.7814 ┆ 14.8514 ┆ 98.5626 ┆ 12.8314 ┆ … ┆ 10.030 ┆ 25.821 ┆ 16.734 ┆ 8.4859 ┆ 14.806 │
│ ┆ 57 ┆ 85 ┆ 28 ┆ 8 ┆ ┆ 09 ┆ 596 ┆ 143 ┆ 31 ┆ 378 │
└─────────┴─────────┴─────────┴─────────┴─────────┴───┴────────┴────────┴────────┴────────┴────────┘
Identifying Potential Issues
# Check for zero-variance features
zero_var_features = (
    feature_columns
    .std()
    .transpose(include_header=True)
    .filter(pl.col("column_0") == 0.0)
)

if zero_var_features.height > 0:
    print("Zero-variance features found:")
    print(zero_var_features)
else:
    print("No zero-variance features detected")
No zero-variance features detected
Document Length Analysis
# Analyze document lengths from tokens
doc_lengths = (
    df_tokens
    .group_by("doc_id")
    .len()
    .sort("len", descending=True)
)
print("Document length distribution:")
print(doc_lengths.describe())
Next Steps
Now that you have extracted linguistic features, you can proceed to:
Statistical Analysis
- Biber Analyzer: Multi-Dimensional Analysis and PCA
- Tutorials: Advanced workflows and applications
Feature Understanding
- Feature Categories: Complete descriptions of all 67 features
Further Processing
# Convert to pandas if needed
df_pandas = df_features.to_pandas()

# Export for external analysis
df_features.write_csv("biber_features.csv")
df_features.write_parquet("biber_features.parquet")
Troubleshooting
Common Issues
Memory errors with large corpora:
- Reduce the batch_size parameter
- Process in smaller chunks (see the tutorials)
- Use n_process=1 to reduce memory overhead

Slow processing:
- Increase n_process (up to your CPU core count)
- Disable unnecessary spaCy components
- Use an optimal batch_size for your system

Model loading errors:
- Verify the spaCy model installation: python -c "import spacy; spacy.load('en_core_web_sm')"
- Reinstall the model if needed: python -m spacy download en_core_web_sm

Feature extraction errors:
- Ensure the corpus has the required columns (doc_id, text)
- Check for empty documents in your corpus (see the snippet below)
- Validate text encoding (should be UTF-8)
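For example, empty or whitespace-only documents can be flagged before processing (illustrative Polars check):
# Flag documents whose text is empty or whitespace-only
empty_docs = corpus.filter(pl.col("text").str.strip_chars().str.len_chars() == 0)
print(empty_docs["doc_id"].to_list())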
Getting Help
- Documentation: Browse the Reference section
- GitHub Issues: Report bugs or request features
- Community: Check existing issues for similar problems