Tutorials and Advanced Usage
This page provides comprehensive tutorials for advanced pybiber workflows, from corpus preparation to statistical analysis and visualization.
Tutorial 1: Building and Processing Large Corpora
Working with Directory Structures
When working with large corpora, organizing your texts in a systematic directory structure is crucial:
corpus/
├── academic/
│ ├── biology/
│ │ ├── paper001.txt
│ │ └── paper002.txt
│ └── literature/
│ ├── essay001.txt
│ └── essay002.txt
└── news/
├── politics/
└── sports/
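If register information is encoded in the folder hierarchy like this, it can be recovered from the file paths while the texts are read. A minimal sketch follows; the register and subregister column names are illustrative, not part of pybiber, and it assumes the two-level layout shown above:
from pathlib import Path

import polars as pl

corpus_root = Path("corpus/")
records = []
for path in sorted(corpus_root.rglob("*.txt")):
    relative = path.relative_to(corpus_root)  # e.g. academic/biology/paper001.txt
    records.append({
        "doc_id": path.stem,
        "register": relative.parts[0],        # e.g. "academic"
        "subregister": relative.parts[1],     # e.g. "biology"
        "text": path.read_text(encoding="utf-8"),
    })

corpus_meta = pl.DataFrame(records)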
Recursive Text Processing
Use the pipeline to process nested directories:
import pybiber as pb

# Initialize pipeline with optimized settings
pipeline = pb.PybiberPipeline(
    model="en_core_web_sm",
    disable_ner=True,   # Faster processing
    n_process=4,        # Use multiple cores
    batch_size=100      # Optimize batch size
)

# Process entire directory structure
features = pipeline.run_from_folder(
    "corpus/",
    recursive=True,
    normalize=True
)
Handling Different Text Formats
While pybiber primarily works with .txt files, you can preprocess other formats:
import polars as pl
from pathlib import Path

# Process CSV with text columns
csv_data = pl.read_csv("articles.csv")
corpus = csv_data.select([
    pl.col("article_id").alias("doc_id"),
    pl.col("content").alias("text")
])

# Process with pipeline
pipeline = pb.PybiberPipeline()
features = pipeline.run(corpus)
Tutorial 2: Corpus Comparison and Classification
Comparing Multiple Corpora
import pybiber as pb
import polars as pl

# Load multiple corpora
academic_corpus = pb.corpus_from_folder("academic_texts/")
news_corpus = pb.corpus_from_folder("news_texts/")

# Add corpus labels
academic_corpus = academic_corpus.with_columns(
    pl.lit("academic").alias("corpus_type")
)
news_corpus = news_corpus.with_columns(
    pl.lit("news").alias("corpus_type")
)

# Combine and process
combined_corpus = pl.concat([academic_corpus, news_corpus])
pipeline = pb.PybiberPipeline()
features = pipeline.run(combined_corpus)
# Attach corpus labels for analysis
features = features.join(
    combined_corpus.select(["doc_id", "corpus_type"]),
    on="doc_id"
)
Feature-Based Classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Prepare data for classification
X = features.select(pl.selectors.numeric()).to_numpy()
y = features.get_column("corpus_type").to_numpy()

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
Tutorial 3: Advanced Multi-Dimensional Analysis
Custom MDA Workflows
import pybiber as pb
import polars as pl
from pybiber.data import micusp_mini

# Process sample data
pipeline = pb.PybiberPipeline()
features = pipeline.run(micusp_mini)

# Extract discipline information
features = features.with_columns(
    pl.col("doc_id").str.extract(r"^([A-Z]+)", 0).alias("discipline")
)

# Initialize analyzer
analyzer = pb.BiberAnalyzer(features, id_column=True)
[INFO] Using MATTR for f_43_type_token
[INFO] All features normalized per 1000 tokens except: f_43_type_token and f_44_mean_word_length
Performance: Corpus processing completed in 74.85s
Customizing Factor Analysis Parameters
# Experiment with different correlation thresholds
analyzer.mda(n_factors=4, cor_min=0.3, threshold=0.4)

# Examine the effect on feature selection
print("Features after correlation filtering:")
print(analyzer.mda_summary.shape)
INFO:pybiber.biber_analyzer:Dropping 11 variable(s) with max |r| <= 0.30: ['f_04_place_adverbials', 'f_05_time_adverbials', 'f_15_gerunds', 'f_18_by_passives', 'f_25_present_participle', 'f_34_sentence_relatives', 'f_35_because', 'f_46_downtoners', 'f_50_discourse_particles', 'f_53_modal_necessity', 'f_64_phrasal_coordination']
Features after correlation filtering:
(4, 6)
Comparing Factor Solutions
import matplotlib.pyplot as plt

# Compare different numbers of factors
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for i, n_factors in enumerate([2, 3, 4, 5]):
    analyzer.mda(n_factors=n_factors)
    ax = axes[i // 2, i % 2]

    # Plot scree plot for each solution
    plt.sca(ax)
    analyzer.mdaviz_screeplot()
    ax.set_title(f"{n_factors} Factor Solution")

plt.tight_layout()
plt.show()
Tutorial 4: Temporal and Diachronic Analysis
Analyzing Language Change Over Time
import polars as pl

# Prepare time-stamped corpus
corpus = pl.read_csv("historical_texts.csv")
corpus = corpus.with_columns([
    pl.col("year").cast(pl.Int32),
    pl.col("decade").cast(pl.String)
])

# Process with pybiber
pipeline = pb.PybiberPipeline()
features = pipeline.run(corpus)

# Add temporal metadata
features = features.join(
    corpus.select(["doc_id", "year", "decade"]),
    on="doc_id"
)

# Analyze by decade
decade_analysis = pb.BiberAnalyzer(
    features.drop("year"),
    id_column=True
)

# Examine temporal dimensions
decade_analysis.mda(n_factors=3)
decade_analysis.mdaviz_groupmeans(factor=1)
Trend Analysis
# Calculate feature trends over time
trends = (
    features
    .group_by("decade")
    .agg([
        pl.selectors.numeric().mean().name.suffix("_mean"),
        pl.selectors.numeric().std().name.suffix("_std")
    ])
    .sort("decade")
)

# Visualize specific feature trends
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(trends["decade"], trends["f_01_past_tense_mean"],
         marker='o', label='Past Tense')
plt.plot(trends["decade"], trends["f_03_present_tense_mean"],
         marker='s', label='Present Tense')
plt.xlabel("Decade")
plt.ylabel("Normalized Frequency")
plt.legend()
plt.title("Tense Usage Over Time")
plt.xticks(rotation=45)
plt.show()
Tutorial 5: Cross-Linguistic and Multilingual Analysis
Comparing Languages
# Process different language corpora
english_pipeline = pb.PybiberPipeline(model="en_core_web_sm")
spanish_pipeline = pb.PybiberPipeline(model="es_core_news_sm")

english_features = english_pipeline.run_from_folder("english_texts/")
spanish_features = spanish_pipeline.run_from_folder("spanish_texts/")

# Add language labels
english_features = english_features.with_columns(
    pl.lit("English").alias("language")
)
spanish_features = spanish_features.with_columns(
    pl.lit("Spanish").alias("language")
)

# Combine for comparative analysis
multilingual_features = pl.concat([english_features, spanish_features])
Language-Specific Adaptations
# Customize feature extraction for specific languages
def extract_language_specific_features(tokens, language="en"):
    base_features = pb.biber(tokens, normalize=True)

    if language == "es":
        # Add Spanish-specific features
        # (extract_spanish_subjunctive is a user-defined helper; one possible
        # sketch follows below)
        spanish_features = extract_spanish_subjunctive(tokens)
        base_features = base_features.join(spanish_features, on="doc_id")

    return base_features
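extract_spanish_subjunctive is a user-supplied helper, not part of the pybiber API. One possible sketch is shown below; it is an assumption, and it works from a raw corpus DataFrame (doc_id and text columns) rather than the parsed token table, because morphological features such as mood are easiest to read directly from spaCy's token.morph.
import polars as pl
import spacy

def extract_spanish_subjunctive(corpus, nlp=None):
    """Hypothetical helper: subjunctive verb forms per 1,000 tokens.

    Assumes `corpus` is a polars DataFrame with doc_id and text columns.
    """
    nlp = nlp or spacy.load("es_core_news_sm")
    rows = []
    for doc_id, text in corpus.select(["doc_id", "text"]).iter_rows():
        doc = nlp(text)
        n_tokens = sum(1 for t in doc if not t.is_space)
        n_subj = sum(1 for t in doc if "Sub" in t.morph.get("Mood"))
        rate = 1000 * n_subj / n_tokens if n_tokens else 0.0
        rows.append({"doc_id": doc_id, "f_es_subjunctive": rate})
    return pl.DataFrame(rows)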
Tutorial 6: Statistical Validation and Robustness
Cross-Validation of Factor Solutions
from sklearn.model_selection import KFold
import numpy as np

def validate_factor_solution(features, n_factors=3, n_splits=5):
    """Cross-validate factor stability."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    loadings_stability = []

    for train_idx, _ in kf.split(features):
        # Sample training data
        train_features = features[train_idx]

        # Fit MDA
        analyzer = pb.BiberAnalyzer(train_features)
        analyzer.mda(n_factors=n_factors)

        # Store loadings
        loadings_stability.append(analyzer.mda_loadings)

    # Calculate loading stability metrics
    return analyze_loading_stability(loadings_stability)
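analyze_loading_stability stands in for whatever stability metric you prefer; it is not provided by pybiber. A hedged sketch is given below, assuming every fold's mda_loadings is a polars DataFrame with the same features in the same row order and the factor loadings in its numeric columns; note that factors can also swap order between folds, so in practice you may want to match factors before comparing them.
import itertools

import numpy as np
import polars as pl

def analyze_loading_stability(loadings_list):
    """Hypothetical helper: mean absolute correlation of corresponding
    factor loadings across folds (higher = more stable).
    """
    mats = [df.select(pl.selectors.numeric()).to_numpy() for df in loadings_list]
    correlations = []
    for a, b in itertools.combinations(mats, 2):
        for k in range(min(a.shape[1], b.shape[1])):
            r = np.corrcoef(a[:, k], b[:, k])[0, 1]
            correlations.append(abs(r))  # absolute value: factor signs can flip
    return float(np.mean(correlations))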
Bootstrap Confidence Intervals
def bootstrap_mda_confidence(features, n_bootstrap=1000):
    """Calculate bootstrap confidence intervals for factor loadings."""
    bootstrap_loadings = []
    n_docs = features.shape[0]

    for i in range(n_bootstrap):
        # Resample with replacement
        sample_idx = np.random.choice(n_docs, n_docs, replace=True)
        boot_features = features[sample_idx]

        # Fit MDA
        analyzer = pb.BiberAnalyzer(boot_features)
        analyzer.mda(n_factors=3)

        bootstrap_loadings.append(analyzer.mda_loadings)

    # Calculate confidence intervals
    return calculate_loading_confidence(bootstrap_loadings)
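Like the stability helper above, calculate_loading_confidence is user-supplied. A minimal percentile-interval sketch follows, under the assumption that every bootstrap replicate retains the same features in the same order, and with the same caveat that factor order and sign can flip between replicates.
import numpy as np
import polars as pl

def calculate_loading_confidence(bootstrap_loadings, alpha=0.05):
    """Hypothetical helper: percentile confidence intervals for loadings."""
    # Stack replicates into a (n_bootstrap, n_features, n_factors) array
    stacked = np.stack([
        df.select(pl.selectors.numeric()).to_numpy()
        for df in bootstrap_loadings
    ])
    lower = np.percentile(stacked, 100 * alpha / 2, axis=0)
    upper = np.percentile(stacked, 100 * (1 - alpha / 2), axis=0)
    return lower, upper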
Tutorial 7: Performance Optimization
Memory-Efficient Processing
# Process large corpora in chunks
def process_large_corpus(corpus_path, chunk_size=1000):
    """Process large corpus in memory-efficient chunks."""
    # Get all text files
    text_files = list(Path(corpus_path).rglob("*.txt"))
    all_features = []

    # Create the pipeline once so the spaCy model is loaded only once
    pipeline = pb.PybiberPipeline()

    # Process in chunks
    for i in range(0, len(text_files), chunk_size):
        chunk_files = text_files[i:i + chunk_size]

        # Create temporary corpus
        chunk_corpus = pb.readtext(chunk_files)

        # Process chunk
        chunk_features = pipeline.run(chunk_corpus)
        all_features.append(chunk_features)

        # Clear memory
        del chunk_corpus, chunk_features

    # Combine all features
    return pl.concat(all_features)
Parallel Processing Optimization
# Optimize parallel processing parameters
def find_optimal_batch_size(corpus, model="en_core_web_sm"):
    """Find optimal batch size for your system."""
    import time

    batch_sizes = [10, 50, 100, 200, 500]
    processing_times = []

    for batch_size in batch_sizes:
        pipeline = pb.PybiberPipeline(
            model=model,
            batch_size=batch_size,
            n_process=4
        )

        start_time = time.time()
        _ = pipeline.run(corpus.head(1000))  # Test subset
        end_time = time.time()

        processing_times.append(end_time - start_time)

    # Find optimal batch size
    optimal_idx = np.argmin(processing_times)
    return batch_sizes[optimal_idx]
Best Practices Summary
Data Preparation
- Organize texts in systematic directory structures
- Encode metadata in filenames or separate files
- Clean text appropriately for your spaCy model
- Validate corpus structure before processing (see the sketch after this list)
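A quick pre-flight check along these lines might look like the sketch below; it assumes the corpus is a polars DataFrame with doc_id and text columns, and validate_corpus is not a pybiber function.
import polars as pl

def validate_corpus(corpus):
    """Hypothetical helper: basic sanity checks before feature extraction."""
    # Duplicate document IDs would silently collide in later joins
    n_dupes = corpus.get_column("doc_id").is_duplicated().sum()
    if n_dupes:
        raise ValueError(f"{n_dupes} duplicate doc_id value(s) found")

    # Empty or whitespace-only texts produce no usable features
    empty = corpus.filter(pl.col("text").str.strip_chars().str.len_chars() == 0)
    if empty.height:
        print("Warning: empty documents:", empty.get_column("doc_id").to_list())

    return corpus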
Feature Extraction
- Choose appropriate spaCy models for your texts
- Consider disabling unnecessary components (like NER) for speed
- Use appropriate normalization (per 1000 tokens vs. absolute counts)
- Monitor memory usage with large corpora
Statistical Analysis
- Examine scree plots before selecting number of factors
- Validate factor solutions with multiple approaches
- Consider cross-validation for robust results
- Document all analytical decisions and parameters
Performance Optimization
- Experiment with batch sizes and parallel processing
- Process large corpora in chunks if memory is limited
- Monitor system resources during processing
- Cache intermediate results when possible (see the sketch after this list)
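For the last point, one simple caching pattern is to write the feature table to Parquet after the first run and reload it in later sessions. A sketch, with an illustrative cache path:
from pathlib import Path

import polars as pl
import pybiber as pb

cache_path = Path("features_cache.parquet")

if cache_path.exists():
    # Reuse previously extracted features
    features = pl.read_parquet(cache_path)
else:
    # Extract once, then cache for later sessions
    features = pb.PybiberPipeline().run_from_folder("corpus/", recursive=True)
    features.write_parquet(cache_path)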
These tutorials provide a foundation for advanced pybiber usage. Adapt these patterns to your specific research questions and computational constraints.