Get started

Processing a corpus using the pybiber package involves the following steps:

- Preparing a corpus of text data.
- Initiating a model instance.
- Parsing the corpus data.
- Extracting the features into a DataFrame.

After the document-feature matrix has been produced, it can be used for further analysis, such as classification tasks (e.g., Reinhart et al. 2024). Additionally, the pybiber package contains functions for carrying out Biber's Multi-Dimensional Analysis (Biber 1985), which is a specific implementation of exploratory factor analysis. Refer to the Biber analyzer documentation.

Preparing a corpus

First we will import our libraries:

```python
import spacy
import pybiber as pb
import polars as pl
```
There are a variety of ways of preparing and reading in a corpus for processing. The `spacy_parse` function requires a polars DataFrame with a `doc_id` column and a `text` column. Such a DataFrame can be prepared ahead of time and read in using one of polars' input functions. For example, the Human-AI Parallel corpus mini can be read directly from huggingface:
```python
df = pl.read_parquet('hf://datasets/browndw/human-ai-parallel-corpus-mini/hape_mini-text.parquet')
df.head()
```
doc_id | text
---|---
str | str
"gpt4_acad_0005" | "This inherent fascination with…
"gpt4_acad_0031" | "This approach allows us to not…
"gpt4_acad_0036" | "Moreover, while turn-final cue…
"gpt4_acad_0038" | "In doing so, we aimed to unrav…
"gpt4_acad_0045" | "In exploring the domain of com…
Alternatively, a corpus of plain text files can be stored in a directory. All of the files can then be read into a DataFrame using `corpus_from_folder`. Here, we will use the MICUSP mini data:

```python
from pybiber.data import micusp_mini
```
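If you would rather not rely on `corpus_from_folder`, the same `doc_id`/`text` structure can be assembled from a directory of plain-text files with the standard library alone. This is a minimal sketch (the helper name `read_corpus_dir` is hypothetical, and `corpus_from_folder` remains the package's supported route):

```python
from pathlib import Path

def read_corpus_dir(path):
    """Read every .txt file in `path` into (doc_id, text) rows.

    The doc_id is the file stem, matching the two-column layout
    that spacy_parse expects.
    """
    rows = []
    for f in sorted(Path(path).glob("*.txt")):
        rows.append({"doc_id": f.stem, "text": f.read_text(encoding="utf-8")})
    return rows
```

The resulting list of dicts can then be passed to `pl.DataFrame(rows)` to obtain the two-column frame described above.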
Initiate a spaCy instance
Initiate a model instance:

```python
nlp = spacy.load("en_core_web_sm", disable=["ner"])
```

A spaCy model must be installed in your working environment using `python -m spacy download en_core_web_sm` or an alternative. See information about spaCy models. Note that the pybiber package requires a model that carries out both part-of-speech tagging and dependency parsing.
Parse the text data
To process the corpus, use `spacy_parse`. Processing the `micusp_mini` corpus should take between 20 and 30 seconds.

```python
df_spacy = pb.spacy_parse(corp=micusp_mini, nlp_model=nlp)
```

The number of cores assigned can be specified using `n_process`, which can increase processing speed. The batch size can also be adjusted with `batch_size`. However, larger batch sizes may actually slow down processing.
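Under the hood, spaCy's `nlp.pipe` streams documents to workers in groups of `batch_size`. The trade-off can be pictured with a plain chunking helper (an illustrative sketch only, not pybiber's actual implementation):

```python
def batched(texts, batch_size):
    """Yield successive lists of at most batch_size texts.

    Each batch is handed to a worker in one go: very large batches
    reduce scheduling overhead but can leave other processes idle,
    which is why bigger is not always faster.
    """
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]
```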
Extract the features
After parsing the corpus, features can be aggregated using `biber`.

```python
df_biber = pb.biber(df_spacy)
```

```
Using MATTR for f_43_type_token
All features normalized per 1000 tokens except:
f_43_type_token and f_44_mean_word_length
```

To return absolute frequencies, set `normalize=False`.
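The two exceptions noted in the output are straightforward to reproduce by hand: a moving-average type-token ratio (MATTR) averages the type-token ratio over a sliding window of tokens, and every other count is scaled per 1,000 tokens. Below is a minimal sketch; the window size and tokenization are assumptions for illustration, and pybiber's own implementation may differ in detail:

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio over a sliding window of tokens."""
    if len(tokens) < window:
        # Fall back to the plain type-token ratio for short texts.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

def per_1000(count, total_tokens):
    """Normalize a raw feature count to a rate per 1,000 tokens."""
    return count / total_tokens * 1000
```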
```python
df_biber.head()
```
doc_id | f_01_past_tense | f_02_perfect_aspect | f_03_present_tense | f_04_place_adverbials | f_05_time_adverbials | f_06_first_person_pronouns | f_07_second_person_pronouns | f_08_third_person_pronouns | f_09_pronoun_it | f_10_demonstrative_pronoun | f_11_indefinite_pronouns | f_12_proverb_do | f_13_wh_question | f_14_nominalizations | f_15_gerunds | f_16_other_nouns | f_17_agentless_passives | f_18_by_passives | f_19_be_main_verb | f_20_existential_there | f_21_that_verb_comp | f_22_that_adj_comp | f_23_wh_clause | f_24_infinitives | f_25_present_participle | f_26_past_participle | f_27_past_participle_whiz | f_28_present_participle_whiz | f_29_that_subj | f_30_that_obj | f_31_wh_subj | f_32_wh_obj | f_33_pied_piping | f_34_sentence_relatives | f_35_because | f_36_though | f_37_if | f_38_other_adv_sub | f_39_prepositions | f_40_adj_attr | f_41_adj_pred | f_42_adverbs | f_43_type_token | f_44_mean_word_length | f_45_conjuncts | f_46_downtoners | f_47_hedges | f_48_amplifiers | f_49_emphatics | f_50_discourse_particles | f_51_demonstratives | f_52_modal_possibility | f_53_modal_necessity | f_54_modal_predictive | f_55_verb_public | f_56_verb_private | f_57_verb_suasive | f_58_verb_seem | f_59_contractions | f_60_that_deletion | f_61_stranded_preposition | f_62_split_infinitive | f_63_split_auxiliary | f_64_phrasal_coordination | f_65_clausal_coordination | f_66_neg_synthetic | f_67_neg_analytic |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
"BIO_G0_02_1" | 11.574886 | 9.821115 | 61.381971 | 2.104525 | 6.313574 | 2.104525 | 0.701508 | 17.537706 | 7.716591 | 2.806033 | 0.0 | 1.753771 | 0.350754 | 37.881445 | 6.313574 | 284.110838 | 9.821115 | 3.156787 | 16.485444 | 0.701508 | 10.17187 | 0.0 | 0.701508 | 13.328657 | 4.209049 | 0.0 | 5.96282 | 2.104525 | 1.753771 | 0.350754 | 1.052262 | 0.0 | 0.350754 | 0.701508 | 0.0 | 0.701508 | 0.350754 | 4.209049 | 99.263416 | 86.285514 | 6.664328 | 62.434234 | 0.742811 | 5.3012 | 9.821115 | 3.507541 | 0.0 | 2.455279 | 5.261312 | 0.0 | 19.291477 | 7.015082 | 0.350754 | 4.209049 | 6.664328 | 22.448264 | 4.559804 | 2.806033 | 0.350754 | 0.0 | 0.0 | 0.701508 | 4.910558 | 6.664328 | 4.209049 | 1.403016 | 2.806033 |
"BIO_G0_03_1" | 20.300088 | 3.53045 | 59.13504 | 1.765225 | 0.882613 | 18.534863 | 0.0 | 3.53045 | 16.769638 | 4.413063 | 0.0 | 1.765225 | 0.0 | 43.248014 | 0.882613 | 235.657546 | 2.647838 | 1.765225 | 44.130627 | 3.53045 | 7.0609 | 3.53045 | 1.765225 | 20.300088 | 0.882613 | 0.0 | 0.882613 | 0.882613 | 2.647838 | 0.0 | 5.295675 | 0.882613 | 2.647838 | 4.413063 | 1.765225 | 1.765225 | 3.53045 | 4.413063 | 120.917917 | 85.613416 | 13.239188 | 47.661077 | 0.700499 | 5.152406 | 3.53045 | 1.765225 | 0.0 | 0.882613 | 6.178288 | 0.0 | 7.0609 | 7.0609 | 0.0 | 8.826125 | 1.765225 | 8.826125 | 3.53045 | 1.765225 | 0.0 | 0.0 | 0.882613 | 0.0 | 0.882613 | 7.943513 | 2.647838 | 0.882613 | 7.0609 |
"BIO_G0_06_1" | 9.480034 | 2.585464 | 52.5711 | 0.861821 | 0.287274 | 0.0 | 0.0 | 14.076415 | 5.745475 | 0.287274 | 0.287274 | 0.287274 | 1.149095 | 20.970985 | 1.436369 | 288.997415 | 11.490951 | 2.872738 | 37.920138 | 2.010916 | 4.883654 | 0.574548 | 0.574548 | 16.374605 | 3.160011 | 0.287274 | 0.574548 | 2.010916 | 2.872738 | 0.0 | 1.149095 | 0.0 | 0.287274 | 3.734559 | 3.734559 | 1.149095 | 5.745475 | 2.585464 | 106.291296 | 81.873025 | 15.512784 | 67.79661 | 0.665338 | 5.156845 | 8.618213 | 2.29819 | 0.574548 | 2.29819 | 6.894571 | 0.0 | 4.883654 | 16.949153 | 1.723643 | 7.181844 | 4.021833 | 8.330939 | 4.309107 | 2.29819 | 0.0 | 0.0 | 0.287274 | 1.723643 | 6.320023 | 10.054582 | 5.458202 | 0.574548 | 8.905487 |
"BIO_G0_12_1" | 36.900369 | 2.767528 | 23.98524 | 1.845018 | 1.845018 | 0.0 | 0.0 | 3.690037 | 11.99262 | 0.922509 | 0.0 | 0.922509 | 0.0 | 21.217712 | 0.0 | 298.892989 | 31.365314 | 5.535055 | 20.295203 | 2.767528 | 4.612546 | 3.690037 | 0.922509 | 11.99262 | 1.845018 | 0.0 | 12.915129 | 2.767528 | 4.612546 | 0.0 | 0.0 | 0.0 | 0.0 | 0.922509 | 0.0 | 0.922509 | 0.0 | 10.147601 | 154.059041 | 51.660517 | 9.225092 | 33.210332 | 0.625839 | 5.160681 | 6.457565 | 0.0 | 0.0 | 0.0 | 5.535055 | 0.0 | 14.760148 | 10.147601 | 0.0 | 8.302583 | 1.845018 | 15.682657 | 4.612546 | 0.0 | 0.0 | 0.0 | 0.0 | 0.922509 | 2.767528 | 0.922509 | 1.845018 | 1.845018 | 5.535055 |
"BIO_G0_21_1" | 40.050858 | 2.542912 | 26.700572 | 2.542912 | 0.635728 | 0.635728 | 0.0 | 7.628735 | 2.542912 | 4.450095 | 0.0 | 1.271456 | 0.0 | 26.064844 | 3.17864 | 336.300064 | 21.614749 | 2.542912 | 27.3363 | 3.17864 | 2.542912 | 0.635728 | 0.0 | 9.535919 | 0.635728 | 0.0 | 5.085823 | 1.907184 | 3.814367 | 0.0 | 1.271456 | 0.0 | 0.0 | 3.17864 | 0.635728 | 0.0 | 0.635728 | 3.17864 | 147.488875 | 52.765416 | 8.900191 | 40.050858 | 0.665966 | 5.129435 | 5.085823 | 1.907184 | 0.0 | 0.635728 | 8.264463 | 0.0 | 5.721551 | 3.814367 | 2.542912 | 1.907184 | 2.542912 | 10.807374 | 5.721551 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.17864 | 7.628735 | 6.993007 | 2.542912 | 2.542912 |