import spacy
import pybiber as pb
import polars as pl
from pybiber.data import micusp_mini
Biber analyzer
Biber’s Multi-Dimensional Analysis (Biber 1985) is a specific implementation of exploratory factor analysis and has been used in a wide variety of studies. A representative sample of such studies can be seen in the table of contents of a tribute volume.
Multi-Dimensional Analysis (MDA) is a process made up of four main steps:
- Identification of relevant variables
- Extraction of factors from variables
- Functional interpretation of factors as dimensions
- Placement of categories on the dimensions
A description of the procedure can be found here.
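To make the last two steps more concrete, here is a minimal sketch of how dimension scores are commonly computed in Biber-style MDA: each feature is standardized across documents, and for a given factor the z-scores of features with salient positive loadings are added while those with salient negative loadings are subtracted. The function names, the toy feature values, the toy loadings, and the 0.35 salience cutoff are all assumptions made for illustration; this is not pybiber's implementation.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize a list of per-document rates (sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def dimension_scores(features, loadings, threshold=0.35):
    """Sum signed z-scores of features whose |loading| reaches the cutoff.

    features: dict mapping feature name -> list of per-document rates
    loadings: dict mapping feature name -> loading on ONE factor
    threshold: hypothetical salience cutoff (an assumption, not from the source)
    """
    n_docs = len(next(iter(features.values())))
    scores = [0.0] * n_docs
    for name, loading in loadings.items():
        if abs(loading) < threshold:
            continue  # feature is not salient on this factor
        sign = 1.0 if loading > 0 else -1.0
        for i, z in enumerate(z_scores(features[name])):
            scores[i] += sign * z
    return scores


# Toy data: three documents, three features (values are made up).
features = {
    "past_tense": [10.0, 20.0, 30.0],
    "present_tense": [60.0, 50.0, 40.0],
    "adverbs": [5.0, 7.0, 6.0],
}
loadings = {"past_tense": 0.8, "present_tense": -0.7, "adverbs": 0.1}

scores = dimension_scores(features, loadings)
print(scores)
```

Placing categories on a dimension then amounts to averaging these document scores within each category.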
Create a biber document-feature matrix
First we import our libraries and some data (see the import statements at the top of this page).
Then process that data:
nlp = spacy.load("en_core_web_sm", disable=["ner"])
df_spacy = pb.spacy_parse(micusp_mini, nlp_model=nlp)
df_biber = pb.biber(df_spacy)
Using MATTR for f_43_type_token
All features normalized per 1000 tokens except:
f_43_type_token and f_44_mean_word_length
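The messages above refer to two calculations that are easy to illustrate. A moving-average type-token ratio (MATTR) averages the type-token ratio over a sliding window of fixed size, which makes it less sensitive to text length than a raw type-token ratio; the other features are rates per 1,000 tokens. The sketch below shows both under stated assumptions: the function names, the window size of 100, and the fallback for texts shorter than the window are hypothetical choices, not pybiber's documented behavior.

```python
def mattr(tokens, window=100):
    """Moving-average type-token ratio over a sliding window.

    window=100 is an assumed default; short texts fall back to plain TTR.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def per_thousand(count, n_tokens):
    """Normalize a raw feature count to a rate per 1,000 tokens."""
    return count / n_tokens * 1000


# Toy example: windows are [a, b, a], [b, a, b], [a, b, c].
toy_tokens = ["a", "b", "a", "b", "c"]
toy_mattr = mattr(toy_tokens, window=3)

# A feature that occurs 5 times in a 250-token text.
toy_rate = per_thousand(5, 250)
print(toy_mattr, toy_rate)
```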
Format categories
The MDA procedure requires a categorical variable.
The data that we’re using for this demonstration have the names of disciplines encoded into the doc_id. The three letters before the first underscore represent an academic discipline (e.g., BIO for biology, ENG for English).
df_biber.head()
doc_id | f_01_past_tense | f_02_perfect_aspect | f_03_present_tense | f_04_place_adverbials | f_05_time_adverbials | f_06_first_person_pronouns | f_07_second_person_pronouns | f_08_third_person_pronouns | f_09_pronoun_it | f_10_demonstrative_pronoun | f_11_indefinite_pronouns | f_12_proverb_do | f_13_wh_question | f_14_nominalizations | f_15_gerunds | f_16_other_nouns | f_17_agentless_passives | f_18_by_passives | f_19_be_main_verb | f_20_existential_there | f_21_that_verb_comp | f_22_that_adj_comp | f_23_wh_clause | f_24_infinitives | f_25_present_participle | f_26_past_participle | f_27_past_participle_whiz | f_28_present_participle_whiz | f_29_that_subj | f_30_that_obj | f_31_wh_subj | f_32_wh_obj | f_33_pied_piping | f_34_sentence_relatives | f_35_because | f_36_though | f_37_if | f_38_other_adv_sub | f_39_prepositions | f_40_adj_attr | f_41_adj_pred | f_42_adverbs | f_43_type_token | f_44_mean_word_length | f_45_conjuncts | f_46_downtoners | f_47_hedges | f_48_amplifiers | f_49_emphatics | f_50_discourse_particles | f_51_demonstratives | f_52_modal_possibility | f_53_modal_necessity | f_54_modal_predictive | f_55_verb_public | f_56_verb_private | f_57_verb_suasive | f_58_verb_seem | f_59_contractions | f_60_that_deletion | f_61_stranded_preposition | f_62_split_infinitive | f_63_split_auxiliary | f_64_phrasal_coordination | f_65_clausal_coordination | f_66_neg_synthetic | f_67_neg_analytic |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
"BIO_G0_02_1" | 11.574886 | 9.821115 | 61.381971 | 2.104525 | 6.313574 | 2.104525 | 0.701508 | 17.537706 | 7.716591 | 2.806033 | 0.0 | 1.753771 | 0.350754 | 37.881445 | 6.313574 | 284.110838 | 9.821115 | 3.156787 | 16.485444 | 0.701508 | 10.17187 | 0.0 | 0.701508 | 13.328657 | 4.209049 | 0.0 | 5.96282 | 2.104525 | 1.753771 | 0.350754 | 1.052262 | 0.0 | 0.350754 | 0.701508 | 0.0 | 0.701508 | 0.350754 | 4.209049 | 99.263416 | 86.285514 | 6.664328 | 62.434234 | 0.742811 | 5.3012 | 9.821115 | 3.507541 | 0.0 | 2.455279 | 5.261312 | 0.0 | 19.291477 | 7.015082 | 0.350754 | 4.209049 | 6.664328 | 22.448264 | 4.559804 | 2.806033 | 0.350754 | 0.0 | 0.0 | 0.701508 | 4.910558 | 6.664328 | 4.209049 | 1.403016 | 2.806033 |
"BIO_G0_03_1" | 20.300088 | 3.53045 | 59.13504 | 1.765225 | 0.882613 | 18.534863 | 0.0 | 3.53045 | 16.769638 | 4.413063 | 0.0 | 1.765225 | 0.0 | 43.248014 | 0.882613 | 235.657546 | 2.647838 | 1.765225 | 44.130627 | 3.53045 | 7.0609 | 3.53045 | 1.765225 | 20.300088 | 0.882613 | 0.0 | 0.882613 | 0.882613 | 2.647838 | 0.0 | 5.295675 | 0.882613 | 2.647838 | 4.413063 | 1.765225 | 1.765225 | 3.53045 | 4.413063 | 120.917917 | 85.613416 | 13.239188 | 47.661077 | 0.700499 | 5.152406 | 3.53045 | 1.765225 | 0.0 | 0.882613 | 6.178288 | 0.0 | 7.0609 | 7.0609 | 0.0 | 8.826125 | 1.765225 | 8.826125 | 3.53045 | 1.765225 | 0.0 | 0.0 | 0.882613 | 0.0 | 0.882613 | 7.943513 | 2.647838 | 0.882613 | 7.0609 |
"BIO_G0_06_1" | 9.480034 | 2.585464 | 52.5711 | 0.861821 | 0.287274 | 0.0 | 0.0 | 14.076415 | 5.745475 | 0.287274 | 0.287274 | 0.287274 | 1.149095 | 20.970985 | 1.436369 | 288.997415 | 11.490951 | 2.872738 | 37.920138 | 2.010916 | 4.883654 | 0.574548 | 0.574548 | 16.374605 | 3.160011 | 0.287274 | 0.574548 | 2.010916 | 2.872738 | 0.0 | 1.149095 | 0.0 | 0.287274 | 3.734559 | 3.734559 | 1.149095 | 5.745475 | 2.585464 | 106.291296 | 81.873025 | 15.512784 | 67.79661 | 0.665338 | 5.156845 | 8.618213 | 2.29819 | 0.574548 | 2.29819 | 6.894571 | 0.0 | 4.883654 | 16.949153 | 1.723643 | 7.181844 | 4.021833 | 8.330939 | 4.309107 | 2.29819 | 0.0 | 0.0 | 0.287274 | 1.723643 | 6.320023 | 10.054582 | 5.458202 | 0.574548 | 8.905487 |
"BIO_G0_12_1" | 36.900369 | 2.767528 | 23.98524 | 1.845018 | 1.845018 | 0.0 | 0.0 | 3.690037 | 11.99262 | 0.922509 | 0.0 | 0.922509 | 0.0 | 21.217712 | 0.0 | 298.892989 | 31.365314 | 5.535055 | 20.295203 | 2.767528 | 4.612546 | 3.690037 | 0.922509 | 11.99262 | 1.845018 | 0.0 | 12.915129 | 2.767528 | 4.612546 | 0.0 | 0.0 | 0.0 | 0.0 | 0.922509 | 0.0 | 0.922509 | 0.0 | 10.147601 | 154.059041 | 51.660517 | 9.225092 | 33.210332 | 0.625839 | 5.160681 | 6.457565 | 0.0 | 0.0 | 0.0 | 5.535055 | 0.0 | 14.760148 | 10.147601 | 0.0 | 8.302583 | 1.845018 | 15.682657 | 4.612546 | 0.0 | 0.0 | 0.0 | 0.0 | 0.922509 | 2.767528 | 0.922509 | 1.845018 | 1.845018 | 5.535055 |
"BIO_G0_21_1" | 40.050858 | 2.542912 | 26.700572 | 2.542912 | 0.635728 | 0.635728 | 0.0 | 7.628735 | 2.542912 | 4.450095 | 0.0 | 1.271456 | 0.0 | 26.064844 | 3.17864 | 336.300064 | 21.614749 | 2.542912 | 27.3363 | 3.17864 | 2.542912 | 0.635728 | 0.0 | 9.535919 | 0.635728 | 0.0 | 5.085823 | 1.907184 | 3.814367 | 0.0 | 1.271456 | 0.0 | 0.0 | 3.17864 | 0.635728 | 0.0 | 0.635728 | 3.17864 | 147.488875 | 52.765416 | 8.900191 | 40.050858 | 0.665966 | 5.129435 | 5.085823 | 1.907184 | 0.0 | 0.635728 | 8.264463 | 0.0 | 5.721551 | 3.814367 | 2.542912 | 1.907184 | 2.542912 | 10.807374 | 5.721551 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.17864 | 7.628735 | 6.993007 | 2.542912 | 2.542912 |
The data are down-sampled from the Michigan Corpus of Upper-Level Student Papers.
We can extract that string and place it into a new column called discipline:
df_biber = (
    df_biber
    .with_columns(
        pl.col("doc_id").str.extract(r"^([A-Z])+", 0)
        .alias("discipline")
    )
)
df_biber.head()
doc_id | f_01_past_tense | f_02_perfect_aspect | f_03_present_tense | f_04_place_adverbials | f_05_time_adverbials | f_06_first_person_pronouns | f_07_second_person_pronouns | f_08_third_person_pronouns | f_09_pronoun_it | f_10_demonstrative_pronoun | f_11_indefinite_pronouns | f_12_proverb_do | f_13_wh_question | f_14_nominalizations | f_15_gerunds | f_16_other_nouns | f_17_agentless_passives | f_18_by_passives | f_19_be_main_verb | f_20_existential_there | f_21_that_verb_comp | f_22_that_adj_comp | f_23_wh_clause | f_24_infinitives | f_25_present_participle | f_26_past_participle | f_27_past_participle_whiz | f_28_present_participle_whiz | f_29_that_subj | f_30_that_obj | f_31_wh_subj | f_32_wh_obj | f_33_pied_piping | f_34_sentence_relatives | f_35_because | f_36_though | f_37_if | f_38_other_adv_sub | f_39_prepositions | f_40_adj_attr | f_41_adj_pred | f_42_adverbs | f_43_type_token | f_44_mean_word_length | f_45_conjuncts | f_46_downtoners | f_47_hedges | f_48_amplifiers | f_49_emphatics | f_50_discourse_particles | f_51_demonstratives | f_52_modal_possibility | f_53_modal_necessity | f_54_modal_predictive | f_55_verb_public | f_56_verb_private | f_57_verb_suasive | f_58_verb_seem | f_59_contractions | f_60_that_deletion | f_61_stranded_preposition | f_62_split_infinitive | f_63_split_auxiliary | f_64_phrasal_coordination | f_65_clausal_coordination | f_66_neg_synthetic | f_67_neg_analytic | discipline |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | str |
"BIO_G0_02_1" | 11.574886 | 9.821115 | 61.381971 | 2.104525 | 6.313574 | 2.104525 | 0.701508 | 17.537706 | 7.716591 | 2.806033 | 0.0 | 1.753771 | 0.350754 | 37.881445 | 6.313574 | 284.110838 | 9.821115 | 3.156787 | 16.485444 | 0.701508 | 10.17187 | 0.0 | 0.701508 | 13.328657 | 4.209049 | 0.0 | 5.96282 | 2.104525 | 1.753771 | 0.350754 | 1.052262 | 0.0 | 0.350754 | 0.701508 | 0.0 | 0.701508 | 0.350754 | 4.209049 | 99.263416 | 86.285514 | 6.664328 | 62.434234 | 0.742811 | 5.3012 | 9.821115 | 3.507541 | 0.0 | 2.455279 | 5.261312 | 0.0 | 19.291477 | 7.015082 | 0.350754 | 4.209049 | 6.664328 | 22.448264 | 4.559804 | 2.806033 | 0.350754 | 0.0 | 0.0 | 0.701508 | 4.910558 | 6.664328 | 4.209049 | 1.403016 | 2.806033 | "BIO" |
"BIO_G0_03_1" | 20.300088 | 3.53045 | 59.13504 | 1.765225 | 0.882613 | 18.534863 | 0.0 | 3.53045 | 16.769638 | 4.413063 | 0.0 | 1.765225 | 0.0 | 43.248014 | 0.882613 | 235.657546 | 2.647838 | 1.765225 | 44.130627 | 3.53045 | 7.0609 | 3.53045 | 1.765225 | 20.300088 | 0.882613 | 0.0 | 0.882613 | 0.882613 | 2.647838 | 0.0 | 5.295675 | 0.882613 | 2.647838 | 4.413063 | 1.765225 | 1.765225 | 3.53045 | 4.413063 | 120.917917 | 85.613416 | 13.239188 | 47.661077 | 0.700499 | 5.152406 | 3.53045 | 1.765225 | 0.0 | 0.882613 | 6.178288 | 0.0 | 7.0609 | 7.0609 | 0.0 | 8.826125 | 1.765225 | 8.826125 | 3.53045 | 1.765225 | 0.0 | 0.0 | 0.882613 | 0.0 | 0.882613 | 7.943513 | 2.647838 | 0.882613 | 7.0609 | "BIO" |
"BIO_G0_06_1" | 9.480034 | 2.585464 | 52.5711 | 0.861821 | 0.287274 | 0.0 | 0.0 | 14.076415 | 5.745475 | 0.287274 | 0.287274 | 0.287274 | 1.149095 | 20.970985 | 1.436369 | 288.997415 | 11.490951 | 2.872738 | 37.920138 | 2.010916 | 4.883654 | 0.574548 | 0.574548 | 16.374605 | 3.160011 | 0.287274 | 0.574548 | 2.010916 | 2.872738 | 0.0 | 1.149095 | 0.0 | 0.287274 | 3.734559 | 3.734559 | 1.149095 | 5.745475 | 2.585464 | 106.291296 | 81.873025 | 15.512784 | 67.79661 | 0.665338 | 5.156845 | 8.618213 | 2.29819 | 0.574548 | 2.29819 | 6.894571 | 0.0 | 4.883654 | 16.949153 | 1.723643 | 7.181844 | 4.021833 | 8.330939 | 4.309107 | 2.29819 | 0.0 | 0.0 | 0.287274 | 1.723643 | 6.320023 | 10.054582 | 5.458202 | 0.574548 | 8.905487 | "BIO" |
"BIO_G0_12_1" | 36.900369 | 2.767528 | 23.98524 | 1.845018 | 1.845018 | 0.0 | 0.0 | 3.690037 | 11.99262 | 0.922509 | 0.0 | 0.922509 | 0.0 | 21.217712 | 0.0 | 298.892989 | 31.365314 | 5.535055 | 20.295203 | 2.767528 | 4.612546 | 3.690037 | 0.922509 | 11.99262 | 1.845018 | 0.0 | 12.915129 | 2.767528 | 4.612546 | 0.0 | 0.0 | 0.0 | 0.0 | 0.922509 | 0.0 | 0.922509 | 0.0 | 10.147601 | 154.059041 | 51.660517 | 9.225092 | 33.210332 | 0.625839 | 5.160681 | 6.457565 | 0.0 | 0.0 | 0.0 | 5.535055 | 0.0 | 14.760148 | 10.147601 | 0.0 | 8.302583 | 1.845018 | 15.682657 | 4.612546 | 0.0 | 0.0 | 0.0 | 0.0 | 0.922509 | 2.767528 | 0.922509 | 1.845018 | 1.845018 | 5.535055 | "BIO" |
"BIO_G0_21_1" | 40.050858 | 2.542912 | 26.700572 | 2.542912 | 0.635728 | 0.635728 | 0.0 | 7.628735 | 2.542912 | 4.450095 | 0.0 | 1.271456 | 0.0 | 26.064844 | 3.17864 | 336.300064 | 21.614749 | 2.542912 | 27.3363 | 3.17864 | 2.542912 | 0.635728 | 0.0 | 9.535919 | 0.635728 | 0.0 | 5.085823 | 1.907184 | 3.814367 | 0.0 | 1.271456 | 0.0 | 0.0 | 3.17864 | 0.635728 | 0.0 | 0.635728 | 3.17864 | 147.488875 | 52.765416 | 8.900191 | 40.050858 | 0.665966 | 5.129435 | 5.085823 | 1.907184 | 0.0 | 0.635728 | 8.264463 | 0.0 | 5.721551 | 3.814367 | 2.542912 | 1.907184 | 2.542912 | 10.807374 | 5.721551 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.17864 | 7.628735 | 6.993007 | 2.542912 | 2.542912 | "BIO" |
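The extraction pattern can be checked on a few IDs with Python's standard `re` module. This is only a sketch of the pattern's behavior in `re`, with a hypothetical helper name; it does not exercise polars itself. One detail worth noting: because the `+` sits outside the parentheses, group 0 (the whole match) holds the full prefix, whereas group 1 holds only the last letter matched. A pattern such as `^([A-Z]+)` would avoid that ambiguity.

```python
import re

PATTERN = re.compile(r"^([A-Z])+")


def extract_discipline(doc_id):
    """Return the leading run of capital letters, or None if there is none."""
    match = PATTERN.match(doc_id)
    return match.group(0) if match else None


ids = ["BIO_G0_02_1", "SOC_G1_08_1"]
prefixes = [extract_discipline(doc_id) for doc_id in ids]
print(prefixes)

# Group 1 captures only the final repetition of the single-letter group.
last_letter = PATTERN.match("BIO_G0_02_1").group(1)
print(last_letter)
```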
Process the data with BiberAnalyzer
Now the data can be processed using BiberAnalyzer:
df = pb.BiberAnalyzer(df_biber, id_column=True)
Determine the number of factors to extract
df.mdaviz_screeplot();
Extract factors
df.mda(n_factors=3)
Check the summary
df.mda_summary
Factor | F | df | PR(>F) | Signif | R2 |
---|---|---|---|---|---|
str | f64 | list[u32] | f64 | str | f64 |
"factor_1" | 4.391505 | [16, 153] | 4.5532e-7 | "*** p < 0.001" | 0.314713 |
"factor_2" | 10.722254 | [16, 153] | 5.6263e-18 | "*** p < 0.001" | 0.528587 |
"factor_3" | 2.900087 | [16, 153] | 0.000353 | "*** p < 0.001" | 0.232703 |
Plot factors
df.mdaviz_groupmeans(factor=2);
Check the dimension scores
df.mda_dim_scores
doc_id | doc_cat | factor_1 | factor_2 | factor_3 |
---|---|---|---|---|
str | str | f64 | f64 | f64 |
"BIO_G0_02_1" | "BIO" | -3.548914 | 3.836948 | 1.190002 |
"BIO_G0_03_1" | "BIO" | 8.772727 | 5.071433 | 0.551095 |
"BIO_G0_06_1" | "BIO" | 3.248852 | 1.24173 | 0.843617 |
"BIO_G0_12_1" | "BIO" | -4.632679 | -18.348749 | -2.314747 |
"BIO_G0_21_1" | "BIO" | -6.692054 | -11.105658 | -2.356941 |
… | … | … | … | … |
"SOC_G1_08_1" | "SOC" | -8.789114 | -6.140515 | -1.741352 |
"SOC_G1_09_1" | "SOC" | 4.709334 | -3.577475 | 1.785393 |
"SOC_G2_03_1" | "SOC" | 1.381486 | 2.315045 | 0.945825 |
"SOC_G3_07_1" | "SOC" | -7.862552 | -1.629578 | -0.290192 |
"SOC_G3_08_1" | "SOC" | -1.221231 | 5.593183 | -0.677241 |
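Placing categories on a dimension comes down to averaging document scores within each category. As a rough illustration, the sketch below averages the factor_2 scores of the ten rows visible in the table above. Because most rows are elided, these are not the group means of the full data set, and the helper name is hypothetical.

```python
from collections import defaultdict

# (doc_cat, factor_2) pairs copied from the ten rows shown above.
visible_rows = [
    ("BIO", 3.836948),
    ("BIO", 5.071433),
    ("BIO", 1.24173),
    ("BIO", -18.348749),
    ("BIO", -11.105658),
    ("SOC", -6.140515),
    ("SOC", -3.577475),
    ("SOC", 2.315045),
    ("SOC", -1.629578),
    ("SOC", 5.593183),
]


def group_means(rows):
    """Average the scores within each category."""
    by_category = defaultdict(list)
    for category, score in rows:
        by_category[category].append(score)
    return {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}


means = group_means(visible_rows)
print(means)
```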