Get Started

This guide walks you through the complete pybiber workflow, from installing the package to extracting linguistic features from your text corpus.

Overview

Processing a corpus with pybiber involves four main steps:

  1. Prepare your corpus - Organize texts in the required DataFrame format
  2. Initialize a spaCy model - Load a model with POS tagging and dependency parsing
  3. Parse the corpus - Extract token-level linguistic annotations
  4. Extract features - Aggregate tokens into document-level feature counts
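
The sketch below previews the whole workflow end to end, using the bundled sample corpus (each step is covered in detail in the sections that follow):

import spacy
import pybiber as pb
from pybiber.data import micusp_mini

corpus = micusp_mini                                  # 1. corpus with doc_id / text columns
nlp = spacy.load("en_core_web_sm", disable=["ner"])   # 2. model with tagging and parsing
df_tokens = pb.CorpusProcessor().process_corpus(corpus, nlp_model=nlp)  # 3. parse the corpus
df_features = pb.biber(df_tokens)                     # 4. document-level feature counts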

After generating the document-feature matrix, you can proceed to advanced analyses like classification tasks (Reinhart et al. 2024) or Multi-Dimensional Analysis (Biber 1985). See the Biber Analyzer documentation for statistical analysis workflows.

Prerequisites

Installation

Install pybiber from PyPI:

pip install pybiber

spaCy Model

Install a spaCy model with part-of-speech tagging and dependency parsing:

python -m spacy download en_core_web_sm
Model Requirements

The pybiber package requires a spaCy model that performs both part-of-speech tagging and dependency parsing. Most en_core_* models meet these requirements. For other languages, check the spaCy models page.
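
To confirm that a loaded model provides both capabilities, you can check its pipeline component names (tagger and parser are spaCy's standard names for these components):

import spacy

nlp = spacy.load("en_core_web_sm")
assert "tagger" in nlp.pipe_names, "model lacks part-of-speech tagging"
assert "parser" in nlp.pipe_names, "model lacks dependency parsing"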

Step 1: Preparing a Corpus

Import Libraries

import spacy
import pybiber as pb
import polars as pl

Data Structure Requirements

The pybiber workflow expects a Polars DataFrame with two essential columns:

  • doc_id: A unique identifier for each document
  • text: The raw text content

This structure follows conventions established by readtext and quanteda in R.
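
Before processing, a quick schema check can catch problems early (a minimal sketch, assuming your corpus DataFrame is bound to corpus):

assert {"doc_id", "text"} <= set(corpus.columns), "missing required columns"
assert corpus["doc_id"].n_unique() == corpus.height, "doc_id values must be unique"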

Option 1: Using Sample Data

For this tutorial, we’ll use the included sample dataset:

from pybiber.data import micusp_mini

Let’s examine the structure:

print(f"Corpus shape: {micusp_mini.shape}")
micusp_mini.head()
Corpus shape: (170, 2)
shape: (5, 2)
doc_id text
str str
"BIO_G0_02_1" "Ernst Mayr once wrote, "sympat…
"BIO_G0_03_1" "The ability of a species to co…
"BIO_G0_06_1" "Generally, females make a larg…
"BIO_G0_12_1" "In the field of plant biology,…
"BIO_G0_21_1" "Parasites in nonhuman animals …
Note

The micusp_mini dataset is a subset of the Michigan Corpus of Upper-Level Student Papers, containing academic texts from various disciplines. Document IDs encode discipline information (e.g., BIO=Biology, ENG=English).
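
Because the discipline is encoded in the doc_id prefix, you can derive a discipline column with a small Polars expression (an illustrative sketch):

corpus_meta = micusp_mini.with_columns(
    pl.col("doc_id").str.extract(r"^([A-Z]+)").alias("discipline")
)
print(corpus_meta["discipline"].value_counts())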

Option 2: Loading Your Own Data

From CSV/Parquet Files

# From CSV
corpus = pl.read_csv("my_corpus.csv")

# From Parquet (recommended for large datasets)
corpus = pl.read_parquet("my_corpus.parquet")

# From Hugging Face datasets
corpus = pl.read_parquet(
    'hf://datasets/browndw/human-ai-parallel-corpus-mini/hape_mini-text.parquet'
)

From Text Files in Directory

Use corpus_from_folder to read all .txt files from a directory:

# Read all .txt files from a directory
corpus = pb.corpus_from_folder("path/to/text/files")

# For nested directory structures
text_paths = pb.get_text_paths("path/to/corpus", recursive=True)
corpus = pb.readtext(text_paths)

Custom Corpus Creation

# Create corpus from custom data
import polars as pl

corpus = pl.DataFrame({
    "doc_id": ["doc1", "doc2", "doc3"],
    "text": [
        "This is the first document.",
        "Here is another text sample.",
        "And this is the third document."
    ]
})

Step 2: Initialize spaCy Model

Load a spaCy model with the required linguistic components:

nlp = spacy.load("en_core_web_sm", disable=["ner"])

Model Configuration Options

# Option 1: Keep all components (slower but complete)
nlp = spacy.load("en_core_web_sm")

# Option 2: Disable unnecessary components for speed (recommended)
nlp = spacy.load("en_core_web_sm", disable=["ner"])

# Option 3: Maximize speed (disable more components)
nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
Performance Tip

Disabling Named Entity Recognition (ner) typically provides the best speed/functionality balance for feature extraction, as NER isn’t required for Biber features. This is also the default setting in PybiberPipeline.

Step 3: Parse the Text Data

Using CorpusProcessor

The CorpusProcessor provides efficient, configurable text processing:

processor = pb.CorpusProcessor()
df_tokens = processor.process_corpus(micusp_mini, nlp_model=nlp)
Performance: Corpus processing completed in 49.87s

Processing time depends on corpus size and system specifications. For the 170-document micusp_mini corpus, expect processing to take roughly a minute.

Understanding the Token Output

The processor returns a token-level DataFrame with linguistic annotations:

print(f"Token DataFrame shape: {df_tokens.shape}")
df_tokens.head(10)
Token DataFrame shape: (544570, 9)
shape: (10, 9)
doc_id sentence_id token_id token lemma pos tag head_token_id dep_rel
str u32 i64 str str str str i64 str
"BIO_G0_02_1" 1 0 "Ernst" "Ernst" "PROPN" "NNP" 1 "compound"
"BIO_G0_02_1" 1 1 "Mayr" "Mayr" "PROPN" "NNP" 3 "nsubj"
"BIO_G0_02_1" 1 2 "once" "once" "ADV" "RB" 3 "advmod"
"BIO_G0_02_1" 1 3 "wrote" "write" "VERB" "VBD" 3 "ROOT"
"BIO_G0_02_1" 1 4 "," "," "PUNCT" "," 8 "punct"
"BIO_G0_02_1" 1 5 """ """ "PUNCT" "``" 8 "punct"
"BIO_G0_02_1" 1 6 "sympatric" "sympatric" "ADJ" "JJ" 7 "amod"
"BIO_G0_02_1" 1 7 "speciation" "speciation" "NOUN" "NN" 8 "nsubj"
"BIO_G0_02_1" 1 8 "is" "be" "AUX" "VBZ" 3 "ccomp"
"BIO_G0_02_1" 1 9 "like" "like" "ADP" "IN" 8 "prep"

Key columns include:

  • doc_id: Document identifier
  • sentence_id: Sentence identifier within the document
  • token: Raw token text
  • lemma: Lemmatized form
  • pos: Universal part-of-speech tag
  • tag: Fine-grained part-of-speech tag
  • dep_rel: Dependency relation
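
With these columns you can slice the annotations directly in Polars; for example, to inspect the verbs in a single document:

verbs = (
    df_tokens
    .filter((pl.col("doc_id") == "BIO_G0_02_1") & (pl.col("pos") == "VERB"))
    .select(["token", "lemma", "tag", "dep_rel"])
)
print(verbs.head())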

Performance Optimization

You can customize processing parameters for better performance:

processor = pb.CorpusProcessor()
df_tokens = processor.process_corpus(
    corpus, 
    nlp_model=nlp,
    n_process=4,        # Use multiple CPU cores
    batch_size=100,     # Optimize batch size
    show_progress=True  # Display progress bar
)
Batch Size Guidelines
  • Small corpora (<1000 docs): batch_size=50-100
  • Medium corpora (1000-10000 docs): batch_size=100-200
  • Large corpora (>10000 docs): batch_size=200-500

Batch sizes beyond these ranges can actually slow processing due to memory constraints.
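
If you want to encode these guidelines programmatically, a simple helper might look like this (pick_batch_size is a hypothetical convenience function, not part of pybiber):

def pick_batch_size(n_docs: int) -> int:
    """Illustrative heuristic following the guidelines above."""
    if n_docs < 1_000:
        return 100
    if n_docs <= 10_000:
        return 200
    return 500

processor = pb.CorpusProcessor()
df_tokens = processor.process_corpus(
    corpus,
    nlp_model=nlp,
    batch_size=pick_batch_size(corpus.height)
)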

Step 4: Extract Linguistic Features

Basic Feature Extraction

Transform token-level data into document-level feature counts using biber:

df_features = pb.biber(df_tokens)
[INFO] Using MATTR for f_43_type_token
[INFO] All features normalized per 1000 tokens except: f_43_type_token and f_44_mean_word_length

Understanding the Feature Matrix

The result is a document-feature matrix with 67 linguistic variables:

print(f"Feature matrix shape: {df_features.shape}")
print(f"Features extracted: {df_features.shape[1] - 1}")  # Minus doc_id column
df_features.head()
Feature matrix shape: (170, 68)
Features extracted: 67
shape: (5, 68)
doc_id f_01_past_tense f_02_perfect_aspect f_03_present_tense f_04_place_adverbials … f_63_split_auxiliary f_64_phrasal_coordination f_65_clausal_coordination f_66_neg_synthetic f_67_neg_analytic
str f64 f64 f64 f64 … f64 f64 f64 f64 f64
"BIO_G0_02_1" 11.574886 9.821115 61.381971 2.104525 … 4.910558 6.664328 4.209049 1.403016 2.806033
"BIO_G0_03_1" 20.300088 3.53045 59.13504 1.765225 … 0.882613 7.943513 2.647838 0.882613 7.0609
"BIO_G0_06_1" 9.480034 2.585464 52.5711 0.861821 … 6.320023 10.054582 5.458202 0.574548 8.905487
"BIO_G0_12_1" 36.900369 2.767528 23.98524 1.845018 … 2.767528 0.922509 1.845018 1.845018 5.535055
"BIO_G0_21_1" 40.050858 2.542912 26.700572 2.542912 … 3.17864 7.628735 6.993007 2.542912 2.542912

Feature Normalization Options

By default, features are normalized per 1,000 tokens, except for two features that use different scales:

# Normalized frequencies (default)
df_normalized = pb.biber(df_tokens, normalize=True)

# Raw counts
df_raw = pb.biber(df_tokens, normalize=False)
Feature Scaling
  • Most features: Normalized per 1,000 tokens
  • f_43_type_token: Type-token ratio (0-1 scale)
  • f_44_mean_word_length: Average characters per word

This normalization enables comparison across documents of different lengths.
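
Because normalized values are rates per 1,000 tokens (rate = count / tokens × 1000), you can approximately recover raw counts from the token table; a sketch (exact totals may differ slightly depending on which tokens pybiber counts):

n_tokens = df_tokens.group_by("doc_id").len().rename({"len": "n_tokens"})
check = (
    df_features
    .join(n_tokens, on="doc_id")
    .with_columns(
        (pl.col("f_01_past_tense") * pl.col("n_tokens") / 1000)
        .round(0)
        .alias("f_01_raw_approx")
    )
    .select(["doc_id", "f_01_past_tense", "n_tokens", "f_01_raw_approx"])
)
print(check.head())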

Type-Token Ratio Options

The package offers two type-token ratio calculations:

# Moving Average Type-Token Ratio (default, recommended)
df_mattr = pb.biber(df_tokens, force_ttr=False)

# Traditional Type-Token Ratio (for specific comparisons)
df_ttr = pb.biber(df_tokens, force_ttr=True)
TTR vs MATTR
  • MATTR (default): More robust, calculated using 100-token windows
  • Traditional TTR: Simple unique tokens / total tokens ratio
  • Use consistent measures when comparing corpora processed separately
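
For intuition, here is a minimal MATTR sketch over 100-token windows (an illustration only; pybiber's internal implementation may differ in details such as case handling):

def mattr(tokens: list[str], window: int = 100) -> float:
    """Moving-average type-token ratio over fixed-size sliding windows."""
    if not tokens:
        return 0.0
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)  # fall back to plain TTR
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)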

Alternative Workflow: High-Level Pipeline

For streamlined processing, use the PybiberPipeline:

Complete Pipeline Example

# Initialize pipeline with optimal settings
pipeline = pb.PybiberPipeline(
    model="en_core_web_sm",
    disable_ner=True,
    n_process=4,
    batch_size=100
)

# Process folder of text files
df_features = pipeline.run_from_folder("/path/to/texts", recursive=True)

# Or process existing corpus DataFrame
df_features = pipeline.run(micusp_mini)

Pipeline with Token Retention

If you need both features and token-level data:

# Return both features and tokens
features, tokens = pipeline.run(
    micusp_mini, 
    return_tokens=True,
    normalize=True
)

Data Quality and Validation

Examining Feature Distributions

Before analysis, examine your feature distributions:

# Summary statistics
feature_columns = df_features.select(pl.selectors.numeric())
summary = feature_columns.describe()
print(summary)
shape: (9, 68)
┌─────────┬─────────┬─────────┬─────────┬─────────┬───┬────────┬────────┬────────┬────────┬────────┐
│ statist ┆ f_01_pa ┆ f_02_pe ┆ f_03_pr ┆ f_04_pl ┆ … ┆ f_63_s ┆ f_64_p ┆ f_65_c ┆ f_66_n ┆ f_67_n │
│ ic      ┆ st_tens ┆ rfect_a ┆ esent_t ┆ ace_adv ┆   ┆ plit_a ┆ hrasal ┆ lausal ┆ eg_syn ┆ eg_ana │
│ ---     ┆ e       ┆ spect   ┆ ense    ┆ erbials ┆   ┆ uxilia ┆ _coord ┆ _coord ┆ thetic ┆ lytic  │
│ str     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆   ┆ ry     ┆ inatio ┆ inatio ┆ ---    ┆ ---    │
│         ┆ f64     ┆ f64     ┆ f64     ┆ f64     ┆   ┆ ---    ┆ n      ┆ n      ┆ f64    ┆ f64    │
│         ┆         ┆         ┆         ┆         ┆   ┆ f64    ┆ ---    ┆ ---    ┆        ┆        │
│         ┆         ┆         ┆         ┆         ┆   ┆        ┆ f64    ┆ f64    ┆        ┆        │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═══╪════════╪════════╪════════╪════════╪════════╡
│ count   ┆ 170.0   ┆ 170.0   ┆ 170.0   ┆ 170.0   ┆ … ┆ 170.0  ┆ 170.0  ┆ 170.0  ┆ 170.0  ┆ 170.0  │
│ null_co ┆ 0.0     ┆ 0.0     ┆ 0.0     ┆ 0.0     ┆ … ┆ 0.0    ┆ 0.0    ┆ 0.0    ┆ 0.0    ┆ 0.0    │
│ unt     ┆         ┆         ┆         ┆         ┆   ┆        ┆        ┆        ┆        ┆        │
│ mean    ┆ 19.8328 ┆ 4.35336 ┆ 48.3201 ┆ 2.22213 ┆ … ┆ 3.7340 ┆ 9.1199 ┆ 4.9840 ┆ 1.0312 ┆ 5.8196 │
│         ┆ 55      ┆ 5       ┆ 44      ┆ 3       ┆   ┆ 87     ┆ 05     ┆ 01     ┆ 4      ┆ 86     │
│ std     ┆ 16.8183 ┆ 2.89646 ┆ 18.7263 ┆ 1.87932 ┆ … ┆ 1.9150 ┆ 4.3323 ┆ 2.6707 ┆ 1.1123 ┆ 2.9414 │
│         ┆ 59      ┆ 5       ┆ 94      ┆ 4       ┆   ┆ 47     ┆ 36     ┆ 22     ┆ 41     ┆ 46     │
│ min     ┆ 0.0     ┆ 0.0     ┆ 0.0     ┆ 0.0     ┆ … ┆ 0.4775 ┆ 0.9225 ┆ 0.0    ┆ 0.0    ┆ 0.0    │
│         ┆         ┆         ┆         ┆         ┆   ┆ 55     ┆ 09     ┆        ┆        ┆        │
│ 25%     ┆ 6.87521 ┆ 2.23813 ┆ 35.4150 ┆ 0.92776 ┆ … ┆ 2.2594 ┆ 5.8593 ┆ 3.2523 ┆ 0.1579 ┆ 3.7582 │
│         ┆ 5       ┆ 8       ┆ 8       ┆ 7       ┆   ┆ 61     ┆ 75     ┆ 44     ┆ 03     ┆ 21     │
│ 50%     ┆ 15.4215 ┆ 3.82690 ┆ 50.4201 ┆ 1.76782 ┆ … ┆ 3.5211 ┆ 8.3688 ┆ 4.7780 ┆ 0.8358 ┆ 5.3412 │
│         ┆ 22      ┆ 6       ┆ 68      ┆ 6       ┆   ┆ 27     ┆ 62     ┆ 21     ┆ 41     ┆ 46     │
│ 75%     ┆ 27.5049 ┆ 5.80671 ┆ 61.6792 ┆ 2.91439 ┆ … ┆ 4.9455 ┆ 11.562 ┆ 6.2350 ┆ 1.4030 ┆ 7.3333 │
│         ┆ 12      ┆ 9       ┆ 68      ┆         ┆   ┆ 98     ┆ 998    ┆ 12     ┆ 16     ┆ 33     │
│ max     ┆ 82.7814 ┆ 14.8514 ┆ 98.5626 ┆ 12.8314 ┆ … ┆ 10.030 ┆ 25.821 ┆ 16.734 ┆ 8.4859 ┆ 14.806 │
│         ┆ 57      ┆ 85      ┆ 28      ┆ 8       ┆   ┆ 09     ┆ 596    ┆ 143    ┆ 31     ┆ 378    │
└─────────┴─────────┴─────────┴─────────┴─────────┴───┴────────┴────────┴────────┴────────┴────────┘

Identifying Potential Issues

# Check for zero-variance features
zero_var_features = (
    feature_columns
    .std()
    .transpose(include_header=True)
    .filter(pl.col("column_0") == 0.0)
)

if zero_var_features.height > 0:
    print("Zero-variance features found:")
    print(zero_var_features)
else:
    print("No zero-variance features detected")
No zero-variance features detected

Document Length Analysis

# Analyze document lengths from tokens
doc_lengths = (
    df_tokens
    .group_by("doc_id")
    .len()
    .sort("len", descending=True)
)

print("Document length distribution:")
print(doc_lengths.describe())
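
Very short documents deserve attention, since MATTR is computed over 100-token windows; a quick check (the 100-token threshold is illustrative):

short_docs = doc_lengths.filter(pl.col("len") < 100)
print(f"Documents under 100 tokens: {short_docs.height}")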

Next Steps

Now that you have extracted linguistic features, you can proceed to:

Statistical Analysis

See the Biber Analyzer documentation for statistical workflows such as Multi-Dimensional Analysis and classification.

Feature Understanding

Consult the package documentation for descriptions of the 67 extracted features.

Further Processing

# Convert to pandas if needed
df_pandas = df_features.to_pandas()

# Export for external analysis
df_features.write_csv("biber_features.csv")
df_features.write_parquet("biber_features.parquet")

Troubleshooting

Common Issues

Memory errors with large corpora:

  • Reduce the batch_size parameter
  • Process the corpus in smaller chunks (see the tutorials)
  • Use n_process=1 to reduce memory overhead

Slow processing:

  • Increase n_process (up to your CPU core count)
  • Disable unnecessary spaCy components
  • Choose a batch_size appropriate for your system

Model loading errors:

  • Verify the spaCy model installation: python -c "import spacy; spacy.load('en_core_web_sm')"
  • Reinstall the model if needed: python -m spacy download en_core_web_sm

Feature extraction errors:

  • Ensure the corpus has the required columns (doc_id, text)
  • Check for empty documents in your corpus (see the snippet below)
  • Validate text encoding (should be UTF-8)
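
To catch empty documents before parsing, filter for whitespace-only text (a minimal sketch, assuming your corpus DataFrame is bound to corpus):

empty_docs = corpus.filter(
    pl.col("text").str.strip_chars().str.len_chars() == 0
)
print(f"Empty documents: {empty_docs.height}")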

Getting Help

If problems persist, consult the pybiber documentation or open an issue on the project's repository.

References

Biber, Douglas. 1985. “Investigating Macroscopic Textual Variation Through Multifeature/Multidimensional Analyses.” Linguistics 23 (2): 337–60. https://doi.org/10.1515/ling.1985.23.2.337.
Reinhart, Alex, David West Brown, Ben Markey, Michael Laudenbach, Kachatad Pantusen, Ronald Yurko, and Gordon Weinberg. 2024. “Do LLMs Write Like Humans? Variation in Grammatical and Rhetorical Styles.” arXiv Preprint arXiv:2410.16107. https://doi.org/10.48550/arXiv.2410.16107.