pybiber

A comprehensive Python package for linguistic feature extraction and Multi-Dimensional Analysis

The pybiber package provides tools for extracting 67 lexicogrammatical and functional features described by Biber (1988) and widely used for text-type, register, and genre classification tasks in corpus linguistics.

Key Features

  • Feature Extraction: Automated extraction of 67 linguistic features from text corpora
  • Multi-Dimensional Analysis: Implementation of Biber’s MDA methodology for register analysis
  • Principal Component Analysis: Alternative dimensionality reduction approaches
  • Visualization Tools: Comprehensive plotting functions for exploratory data analysis
  • High Performance: Built on spaCy and Polars for efficient processing
  • Pipeline Integration: End-to-end workflows from raw text to statistical analysis

Applications

The pybiber package is suitable for:

  • Register and genre analysis in corpus linguistics
  • Text classification and machine learning preprocessing
  • Diachronic language change studies
  • Cross-linguistic variation research
  • Academic writing analysis and pedagogical applications
  • Stylometric analysis and authorship attribution

Technical Foundation

The package uses spaCy part-of-speech tagging and dependency parsing to extract linguistic features. All data processing leverages the Polars DataFrame library for high-performance analytics.
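
To make the connection concrete, here is a simplified sketch (not the package's internal implementation) of how a single feature such as past-tense verbs can be counted from spaCy's tag annotations:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ernst Mayr once wrote that speciation was gradual.")

# Past-tense verbs carry the Penn Treebank tag "VBD"
past_tense = sum(1 for token in doc if token.tag_ == "VBD")
print(past_tense)  # e.g. "wrote" and "was"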

Accuracy Considerations

Feature extraction builds on the outputs of probabilistic taggers, so the accuracy of the resulting counts depends on the accuracy of those underlying models. Texts with irregular spellings, non-normative punctuation, and similar characteristics may produce unreliable outputs unless the taggers have been tuned for those domains.

Quick Start

For users eager to jump in, here’s a minimal example:

import pybiber as pb

# One-line corpus processing
pipeline = pb.PybiberPipeline()
features = pipeline.run_from_folder("path/to/texts")

Installation

You can install the released version of pybiber from PyPI:

pip install pybiber

spaCy Model Installation

Install a spaCy model for linguistic analysis:

python -m spacy download en_core_web_sm

Alternative Installation Methods

You can also install spaCy models directly via pip:

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl

For other languages, see the spaCy models documentation.
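
Before processing, you can confirm that a model is installed by trying to load it (a minimal check; substitute whichever model you installed):

import spacy

try:
    nlp = spacy.load("en_core_web_sm")
    print("Loaded", nlp.meta["lang"], nlp.meta["name"], nlp.meta["version"])
except OSError:
    print("Model not found; run: python -m spacy download en_core_web_sm")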

Basic Workflow

Data Requirements

The pybiber package works with text corpora structured as DataFrames. The biber function expects a Polars DataFrame with:

  • doc_id column: Unique identifiers for each document
  • text column: Raw text content

This structure follows conventions established by readtext and quanteda in R.
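
If your texts are not stored as individual files, you can assemble this structure directly with Polars (a minimal illustration; the column names doc_id and text are what matters):

import polars as pl

corpus = pl.DataFrame({
    "doc_id": ["doc_01", "doc_02"],
    "text": ["This is the first document.", "And this is the second one."],
})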

Step-by-Step Processing

1. Import Libraries and Load Data

import spacy
import pybiber as pb
from pybiber.data import micusp_mini

Let’s examine the structure of our sample corpus:

micusp_mini.head()
shape: (5, 2)
doc_id         text
str            str
"BIO_G0_02_1"  "Ernst Mayr once wrote, "sympat…
"BIO_G0_03_1"  "The ability of a species to co…
"BIO_G0_06_1"  "Generally, females make a larg…
"BIO_G0_12_1"  "In the field of plant biology,…
"BIO_G0_21_1"  "Parasites in nonhuman animals …

The micusp_mini dataset is a subset of the Michigan Corpus of Upper-Level Student Papers, containing academic texts from various disciplines.

Building Your Own Corpus

To process your own texts, use corpus_from_folder to read all text files from a directory:

corpus = pb.corpus_from_folder("path/to/your/texts")

2. Initialize spaCy Model

The pybiber package requires a spaCy model with part-of-speech tagging and dependency parsing capabilities:

nlp = spacy.load("en_core_web_sm", disable=["ner"])

Note

We disable Named Entity Recognition (ner) to increase processing speed, though this is optional. The essential components for feature extraction are the part-of-speech tagger and dependency parser. Note that PybiberPipeline disables NER by default.
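
If you configure the spaCy pipeline yourself, a quick sanity check confirms that the required components are present (pipe_names is a standard spaCy attribute):

import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner"])

# Feature extraction relies on part-of-speech tags and dependency relations
assert "tagger" in nlp.pipe_names and "parser" in nlp.pipe_names
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']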

3. Process the Corpus

Use the CorpusProcessor to parse your texts:

processor = pb.CorpusProcessor()
df_spacy = processor.process_corpus(micusp_mini, nlp_model=nlp)
Performance: Corpus processing completed in 49.84s

This returns a token-level DataFrame with linguistic annotations, structured similarly to spacyr output:

df_spacy.head()
shape: (5, 9)
doc_id         sentence_id  token_id  token    lemma    pos      tag    head_token_id  dep_rel
str            u32          i64       str      str      str      str    i64            str
"BIO_G0_02_1"  1            0         "Ernst"  "Ernst"  "PROPN"  "NNP"  1              "compound"
"BIO_G0_02_1"  1            1         "Mayr"   "Mayr"   "PROPN"  "NNP"  3              "nsubj"
"BIO_G0_02_1"  1            2         "once"   "once"   "ADV"    "RB"   3              "advmod"
"BIO_G0_02_1"  1            3         "wrote"  "write"  "VERB"   "VBD"  3              "ROOT"
"BIO_G0_02_1"  1            4         ","      ","      "PUNCT"  ","    8              "punct"

4. Extract Linguistic Features

Finally, aggregate the token-level data into document-level feature counts:

df_biber = pb.biber(df_spacy)
[INFO] Using MATTR for f_43_type_token
[INFO] All features normalized per 1000 tokens except: f_43_type_token and f_44_mean_word_length

The resulting document-feature matrix contains 67 linguistic features plus document identifiers:

print(f"Matrix dimensions: {df_biber.shape}")
df_biber.head()
Matrix dimensions: (170, 68)
shape: (5, 68)
doc_id f_01_past_tense f_02_perfect_aspect f_03_present_tense f_04_place_adverbials … f_63_split_auxiliary f_64_phrasal_coordination f_65_clausal_coordination f_66_neg_synthetic f_67_neg_analytic
str f64 f64 f64 f64 … f64 f64 f64 f64 f64
"BIO_G0_02_1" 11.574886 9.821115 61.381971 2.104525 … 4.910558 6.664328 4.209049 1.403016 2.806033
"BIO_G0_03_1" 20.300088 3.53045 59.13504 1.765225 … 0.882613 7.943513 2.647838 0.882613 7.0609
"BIO_G0_06_1" 9.480034 2.585464 52.5711 0.861821 … 6.320023 10.054582 5.458202 0.574548 8.905487
"BIO_G0_12_1" 36.900369 2.767528 23.98524 1.845018 … 2.767528 0.922509 1.845018 1.845018 5.535055
"BIO_G0_21_1" 40.050858 2.542912 26.700572 2.542912 … 3.17864 7.628735 6.993007 2.542912 2.542912

Understanding the Output

Feature Normalization

By default, all features are normalized per 1,000 tokens except:

  • f_43_type_token: Type-token ratio (0–1 scale)
  • f_44_mean_word_length: Average word length in characters

Set normalize=False to return absolute counts instead of normalized frequencies.
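
For example, to obtain raw counts rather than per-1,000-token rates:

# Absolute feature counts (no normalization)
df_counts = pb.biber(df_spacy, normalize=False)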

Document Metadata

Encode important metadata into your document IDs (file names) for downstream analysis. In the micusp_mini data, the first three letters represent academic disciplines (e.g., BIO=Biology, ENG=English).
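
Such a prefix can then be recovered as a grouping variable with ordinary Polars string operations (a sketch assuming the discipline code occupies the first three characters of doc_id):

import polars as pl

df_meta = df_biber.with_columns(
    pl.col("doc_id").str.slice(0, 3).alias("discipline")
)
print(df_meta.select("doc_id", "discipline").head())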

Advanced Usage

High-Level Pipeline

For streamlined processing, use the PybiberPipeline:

# Process a folder of .txt files in one step
pipeline = pb.PybiberPipeline(model="en_core_web_sm")
features = pipeline.run_from_folder("path/to/texts", recursive=True)

Working with Pandas

All DataFrames returned by pybiber are Polars DataFrames, chosen for performance. To convert one to pandas:

df_pandas = df_biber.to_pandas()  # Requires pandas and pyarrow

Next Steps

For advanced analytical workflows, explore the BiberAnalyzer class for factor analysis and dimensional visualization.

References

Biber, Douglas. 1988. Variation Across Speech and Writing. Cambridge University Press. https://doi.org/10.1017/CBO9780511621024.