import spacy
import pybiber as pb
from pybiber.data import micusp_mini
pybiber
The pybiber package aggregates the lexicogrammatical and functional features described by Biber (1988) and widely used for text-type, register, and genre classification tasks.
The package uses spaCy part-of-speech tagging and dependency parsing to summarize and aggregate patterns.
Because feature extraction builds on the outputs of probabilistic taggers, the accuracy of the resulting counts is reliant on the accuracy of those models. Thus, texts with irregular spellings, non-normative punctuation, etc. will likely produce unreliable outputs unless the taggers are tuned specifically for those purposes.
All DataFrames are rendered using polars. If you prefer to conduct any post-processing using pandas, please refer to the documentation for converting polars to pandas. Note that conversion requires both pandas and pyarrow to be installed into your working environment.
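For reference, conversion is a single call on any polars DataFrame that pybiber returns. The sketch below assumes pandas and pyarrow are already installed; df_pl simply stands in for one of the package's outputs.
import polars as pl
# a small polars frame standing in for any pybiber output
df_pl = pl.DataFrame({"doc_id": ["BIO_G0_02_1"], "text": ["Ernst Mayr once wrote ..."]})
# to_pandas() requires both pandas and pyarrow in the working environment
df_pd = df_pl.to_pandas()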
See pseudobibeR for the R implementation.
Installation
You can install the released version of pybiber from PyPI:
pip install pybiber
Install a spaCy model:
python -m spacy download en_core_web_sm
# models can also be installed using pip
# pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl
Usage
To use the pybiber package, you must first import spaCy and initiate an instance. You will also need to create a corpus. The biber function expects a polars DataFrame with a doc_id column and a text column. This follows the convention for readtext and corpus processing with quanteda in R.
You can see the simple data structure of a corpus:
micusp_mini.head()
doc_id | text |
---|---|
str | str |
"BIO_G0_02_1" | "Ernst Mayr once wrote, "sympat… |
"BIO_G0_03_1" | "The ability of a species to co… |
"BIO_G0_06_1" | "Generally, females make a larg… |
"BIO_G0_12_1" | "In the field of plant biology,… |
"BIO_G0_21_1" | "Parasites in nonhuman animals … |
To build your own corpus, a good place to start is corpus_from_folder, which reads in all of the text files from a directory.
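A short sketch of the two routes to that structure follows; the directory path is a placeholder, and corpus_from_folder is assumed here to take the directory as its only required argument.
import polars as pl
import pybiber as pb
# read every text file in a directory into a doc_id/text DataFrame
corp = pb.corpus_from_folder("path/to/texts")
# or assemble the same two-column structure by hand
corp = pl.DataFrame({
    "doc_id": ["sample_01", "sample_02"],
    "text": ["This is the first text.", "This is the second text."],
})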
Initiate an instance
The pybiber package requires a model that will carry out part-of-speech tagging and dependency parsing, like one of spaCy’s en_core models.
nlp = spacy.load("en_core_web_sm", disable=["ner"])
Here we are disabling ner, or “Named Entity Recognition”, from the pipeline to increase processing speed, but doing so is not necessary.
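If you want to check what remains active, spaCy exposes the loaded components through nlp.pipe_names; the printed list below is only indicative and will depend on the model version.
# confirm the tagger and parser are present and ner has been dropped
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']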
Process the corpus
To process the corpus, use spacy_parse. Processing the micusp_mini corpus should take between 20 and 30 seconds.
df_spacy = pb.spacy_parse(micusp_mini, nlp)
The function returns a DataFrame, which is structured like a spacyr output.
df_spacy.head()
doc_id | sentence_id | token_id | token | lemma | pos | tag | head_token_id | dep_rel |
---|---|---|---|---|---|---|---|---|
str | u32 | i64 | str | str | str | str | i64 | str |
"BIO_G0_02_1" | 1 | 0 | "Ernst" | "Ernst" | "PROPN" | "NNP" | 1 | "compound" |
"BIO_G0_02_1" | 1 | 1 | "Mayr" | "Mayr" | "PROPN" | "NNP" | 3 | "nsubj" |
"BIO_G0_02_1" | 1 | 2 | "once" | "once" | "ADV" | "RB" | 3 | "advmod" |
"BIO_G0_02_1" | 1 | 3 | "wrote" | "write" | "VERB" | "VBD" | 3 | "ROOT" |
"BIO_G0_02_1" | 1 | 4 | "," | "," | "PUNCT" | "," | 8 | "punct" |
Aggregate features
After parsing the corpus, features can then be aggregated using biber.
df_biber = pb.biber(df_spacy)
Using MATTR for f_43_type_token
All features normalized per 1000 tokens except:
f_43_type_token and f_44_mean_word_length
In the documentation, note the difference between type-token ratio (TTR) and moving-average type-token ratio (MATTR). For most use cases, forcing TTR is unnecessary, but when comparing multiple corpora that haven’t been processed together, it is important to make sure the same measure is being used.
Also, the default is to normalize frequencies per 1000 tokens. However, absolute frequencies can be returned by setting normalize=False.
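As a sketch, the call below returns raw counts; only the normalize argument is taken from the text above, so consult the documentation for any option that controls TTR versus MATTR.
# return absolute counts instead of rates per 1,000 tokens
df_counts = pb.biber(df_spacy, normalize=False)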
The resulting document-feature matrix has 67 variables and a column of document ids.
df_biber.shape
(170, 68)
Encoding metadata into your document ids (i.e., file names) is key to further processing and analysis. In the micusp_mini data, for example, the first three letters before the underscore represent an academic discipline (e.g., BIO for biology, ENG for English, etc.).
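Here is a minimal sketch of pulling that discipline code into its own column with polars; the column name discipline is only illustrative.
import polars as pl
# capture the leading letters before the first underscore in doc_id
df_meta = df_biber.with_columns(
    pl.col("doc_id").str.extract(r"^([A-Z]+)_").alias("discipline")
)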
df_biber.head()
doc_id | f_01_past_tense | f_02_perfect_aspect | f_03_present_tense | f_04_place_adverbials | f_05_time_adverbials | f_06_first_person_pronouns | f_07_second_person_pronouns | f_08_third_person_pronouns | f_09_pronoun_it | f_10_demonstrative_pronoun | f_11_indefinite_pronouns | f_12_proverb_do | f_13_wh_question | f_14_nominalizations | f_15_gerunds | f_16_other_nouns | f_17_agentless_passives | f_18_by_passives | f_19_be_main_verb | f_20_existential_there | f_21_that_verb_comp | f_22_that_adj_comp | f_23_wh_clause | f_24_infinitives | f_25_present_participle | f_26_past_participle | f_27_past_participle_whiz | f_28_present_participle_whiz | f_29_that_subj | f_30_that_obj | f_31_wh_subj | f_32_wh_obj | f_33_pied_piping | f_34_sentence_relatives | f_35_because | f_36_though | f_37_if | f_38_other_adv_sub | f_39_prepositions | f_40_adj_attr | f_41_adj_pred | f_42_adverbs | f_43_type_token | f_44_mean_word_length | f_45_conjuncts | f_46_downtoners | f_47_hedges | f_48_amplifiers | f_49_emphatics | f_50_discourse_particles | f_51_demonstratives | f_52_modal_possibility | f_53_modal_necessity | f_54_modal_predictive | f_55_verb_public | f_56_verb_private | f_57_verb_suasive | f_58_verb_seem | f_59_contractions | f_60_that_deletion | f_61_stranded_preposition | f_62_split_infinitive | f_63_split_auxiliary | f_64_phrasal_coordination | f_65_clausal_coordination | f_66_neg_synthetic | f_67_neg_analytic |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
"BIO_G0_02_1" | 11.574886 | 9.821115 | 61.381971 | 2.104525 | 6.313574 | 2.104525 | 0.701508 | 17.537706 | 7.716591 | 2.806033 | 0.0 | 1.753771 | 0.350754 | 37.881445 | 6.313574 | 284.110838 | 9.821115 | 3.156787 | 16.485444 | 0.701508 | 10.17187 | 0.0 | 0.701508 | 13.328657 | 4.209049 | 0.0 | 5.96282 | 2.104525 | 1.753771 | 0.350754 | 1.052262 | 0.0 | 0.350754 | 0.701508 | 0.0 | 0.701508 | 0.350754 | 4.209049 | 99.263416 | 86.285514 | 6.664328 | 62.434234 | 0.742811 | 5.3012 | 9.821115 | 3.507541 | 0.0 | 2.455279 | 5.261312 | 0.0 | 19.291477 | 7.015082 | 0.350754 | 4.209049 | 6.664328 | 22.448264 | 4.559804 | 2.806033 | 0.350754 | 0.0 | 0.0 | 0.701508 | 4.910558 | 6.664328 | 4.209049 | 1.403016 | 2.806033 |
"BIO_G0_03_1" | 20.300088 | 3.53045 | 59.13504 | 1.765225 | 0.882613 | 18.534863 | 0.0 | 3.53045 | 16.769638 | 4.413063 | 0.0 | 1.765225 | 0.0 | 43.248014 | 0.882613 | 235.657546 | 2.647838 | 1.765225 | 44.130627 | 3.53045 | 7.0609 | 3.53045 | 1.765225 | 20.300088 | 0.882613 | 0.0 | 0.882613 | 0.882613 | 2.647838 | 0.0 | 5.295675 | 0.882613 | 2.647838 | 4.413063 | 1.765225 | 1.765225 | 3.53045 | 4.413063 | 120.917917 | 85.613416 | 13.239188 | 47.661077 | 0.700499 | 5.152406 | 3.53045 | 1.765225 | 0.0 | 0.882613 | 6.178288 | 0.0 | 7.0609 | 7.0609 | 0.0 | 8.826125 | 1.765225 | 8.826125 | 3.53045 | 1.765225 | 0.0 | 0.0 | 0.882613 | 0.0 | 0.882613 | 7.943513 | 2.647838 | 0.882613 | 7.0609 |
"BIO_G0_06_1" | 9.480034 | 2.585464 | 52.5711 | 0.861821 | 0.287274 | 0.0 | 0.0 | 14.076415 | 5.745475 | 0.287274 | 0.287274 | 0.287274 | 1.149095 | 20.970985 | 1.436369 | 288.997415 | 11.490951 | 2.872738 | 37.920138 | 2.010916 | 4.883654 | 0.574548 | 0.574548 | 16.374605 | 3.160011 | 0.287274 | 0.574548 | 2.010916 | 2.872738 | 0.0 | 1.149095 | 0.0 | 0.287274 | 3.734559 | 3.734559 | 1.149095 | 5.745475 | 2.585464 | 106.291296 | 81.873025 | 15.512784 | 67.79661 | 0.665338 | 5.156845 | 8.618213 | 2.29819 | 0.574548 | 2.29819 | 6.894571 | 0.0 | 4.883654 | 16.949153 | 1.723643 | 7.181844 | 4.021833 | 8.330939 | 4.309107 | 2.29819 | 0.0 | 0.0 | 0.287274 | 1.723643 | 6.320023 | 10.054582 | 5.458202 | 0.574548 | 8.905487 |
"BIO_G0_12_1" | 36.900369 | 2.767528 | 23.98524 | 1.845018 | 1.845018 | 0.0 | 0.0 | 3.690037 | 11.99262 | 0.922509 | 0.0 | 0.922509 | 0.0 | 21.217712 | 0.0 | 298.892989 | 31.365314 | 5.535055 | 20.295203 | 2.767528 | 4.612546 | 3.690037 | 0.922509 | 11.99262 | 1.845018 | 0.0 | 12.915129 | 2.767528 | 4.612546 | 0.0 | 0.0 | 0.0 | 0.0 | 0.922509 | 0.0 | 0.922509 | 0.0 | 10.147601 | 154.059041 | 51.660517 | 9.225092 | 33.210332 | 0.625839 | 5.160681 | 6.457565 | 0.0 | 0.0 | 0.0 | 5.535055 | 0.0 | 14.760148 | 10.147601 | 0.0 | 8.302583 | 1.845018 | 15.682657 | 4.612546 | 0.0 | 0.0 | 0.0 | 0.0 | 0.922509 | 2.767528 | 0.922509 | 1.845018 | 1.845018 | 5.535055 |
"BIO_G0_21_1" | 40.050858 | 2.542912 | 26.700572 | 2.542912 | 0.635728 | 0.635728 | 0.0 | 7.628735 | 2.542912 | 4.450095 | 0.0 | 1.271456 | 0.0 | 26.064844 | 3.17864 | 336.300064 | 21.614749 | 2.542912 | 27.3363 | 3.17864 | 2.542912 | 0.635728 | 0.0 | 9.535919 | 0.635728 | 0.0 | 5.085823 | 1.907184 | 3.814367 | 0.0 | 1.271456 | 0.0 | 0.0 | 3.17864 | 0.635728 | 0.0 | 0.635728 | 3.17864 | 147.488875 | 52.765416 | 8.900191 | 40.050858 | 0.665966 | 5.129435 | 5.085823 | 1.907184 | 0.0 | 0.635728 | 8.264463 | 0.0 | 5.721551 | 3.814367 | 2.542912 | 1.907184 | 2.542912 | 10.807374 | 5.721551 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.17864 | 7.628735 | 6.993007 | 2.542912 | 2.542912 |