Biber analyzer

Biber’s Multi-Dimensional Analysis (Biber 1985) is a specific implementation of exploratory factor analysis that has been used in a wide variety of studies. A representative sample of such studies can be seen in the table of contents of a tribute volume.

Multi-Dimensional Analysis (MDA) is a process made up of four main steps:

  1. Identification of relevant variables
  2. Extraction of factors from variables
  3. Functional interpretation of factors as dimensions
  4. Placement of categories on the dimensions

A description of the procedure can be found here.
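As a rough illustration of steps 3 and 4, the sketch below computes a dimension score for each document: feature frequencies are standardized, then the z-scores of features that load positively on a factor are summed and those that load negatively are subtracted. The loadings, the 0.35 salience cutoff, the toy frequencies, and the helper names are all assumptions made for illustration; none of them is output from pybiber.

```python
from statistics import mean, stdev

# Toy normalized frequencies (hypothetical): one dict of features per document.
docs = {
    "BIO_a": {"past_tense": 36.9, "present_tense": 24.0, "nominalizations": 21.2},
    "BIO_b": {"past_tense": 11.6, "present_tense": 61.4, "nominalizations": 37.9},
    "ENG_a": {"past_tense": 20.3, "present_tense": 59.1, "nominalizations": 43.2},
    "ENG_b": {"past_tense": 9.5, "present_tense": 52.6, "nominalizations": 21.0},
}

# Hypothetical loadings of each feature on a single factor.
loadings = {"past_tense": -0.62, "present_tense": 0.71, "nominalizations": 0.12}
THRESHOLD = 0.35  # assumed salience cutoff


def zscores(values):
    """Standardize a list of values using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def dimension_scores(docs, loadings, threshold=THRESHOLD):
    """Sum signed z-scores of the features that are salient on the factor."""
    ids = list(docs)
    scores = dict.fromkeys(ids, 0.0)
    for feature, loading in loadings.items():
        if abs(loading) < threshold:
            continue  # feature is not salient on this factor
        sign = 1.0 if loading > 0 else -1.0
        for doc_id, z in zip(ids, zscores([docs[d][feature] for d in ids])):
            scores[doc_id] += sign * z
    return scores


def group_means(scores):
    """Average the document scores within each category prefix."""
    groups = {}
    for doc_id, score in scores.items():
        groups.setdefault(doc_id.split("_")[0], []).append(score)
    return {g: mean(v) for g, v in groups.items()}


scores = dimension_scores(docs, loadings)
means = group_means(scores)
```

Because each feature is standardized across all documents, the scores sum to zero, so a category mean above or below zero shows which pole of the dimension that category tends toward.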

Create a biber document-feature matrix

First we will import our libraries and some data:

import spacy
import pybiber as pb
import polars as pl
from pybiber.data import micusp_mini

Then process that data:

nlp = spacy.load("en_core_web_sm", disable=["ner"])
df_spacy = pb.spacy_parse(micusp_mini, nlp_model=nlp)
df_biber = pb.biber(df_spacy)
Using MATTR for f_43_type_token

All features normalized per 1000 tokens except:
f_43_type_token and f_44_mean_word_length
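
The two notes in that output can be made concrete with a short sketch: a rate per 1,000 tokens, and a moving-average type-token ratio (MATTR), which averages the type-token ratio over a sliding window so that the value depends less on text length. The function names and the window size are assumptions for illustration and are not taken from pybiber's internals.

```python
def per_thousand(count, n_tokens):
    """Normalize a raw count to a rate per 1,000 tokens."""
    return count / n_tokens * 1000


def mattr(tokens, window=100):
    """Moving-average type-token ratio over fixed-size sliding windows."""
    tokens = [t.lower() for t in tokens]
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text no longer than the window: fall back to the plain type-token ratio.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


tokens = "the cat sat on the mat and the dog sat on the rug".split()
rate = per_thousand(count=tokens.count("the"), n_tokens=len(tokens))
ttr = mattr(tokens, window=5)
```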

Format categories

The MDA procedure requires a categorical variable.

The data that we’re using for this demonstration have the names of disciplines encoded into the doc_id. The first three letters before the underscore represent an academic discipline (e.g., BIO for biology, ENG for English, etc.).

df_biber.head()
shape: (5, 68)
doc_id f_01_past_tense f_02_perfect_aspect f_03_present_tense f_04_place_adverbials f_05_time_adverbials f_06_first_person_pronouns f_07_second_person_pronouns f_08_third_person_pronouns f_09_pronoun_it f_10_demonstrative_pronoun f_11_indefinite_pronouns f_12_proverb_do f_13_wh_question f_14_nominalizations f_15_gerunds f_16_other_nouns f_17_agentless_passives f_18_by_passives f_19_be_main_verb f_20_existential_there f_21_that_verb_comp f_22_that_adj_comp f_23_wh_clause f_24_infinitives f_25_present_participle f_26_past_participle f_27_past_participle_whiz f_28_present_participle_whiz f_29_that_subj f_30_that_obj f_31_wh_subj f_32_wh_obj f_33_pied_piping f_34_sentence_relatives f_35_because f_36_though f_37_if f_38_other_adv_sub f_39_prepositions f_40_adj_attr f_41_adj_pred f_42_adverbs f_43_type_token f_44_mean_word_length f_45_conjuncts f_46_downtoners f_47_hedges f_48_amplifiers f_49_emphatics f_50_discourse_particles f_51_demonstratives f_52_modal_possibility f_53_modal_necessity f_54_modal_predictive f_55_verb_public f_56_verb_private f_57_verb_suasive f_58_verb_seem f_59_contractions f_60_that_deletion f_61_stranded_preposition f_62_split_infinitive f_63_split_auxiliary f_64_phrasal_coordination f_65_clausal_coordination f_66_neg_synthetic f_67_neg_analytic
str f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64
"BIO_G0_02_1" 11.574886 9.821115 61.381971 2.104525 6.313574 2.104525 0.701508 17.537706 7.716591 2.806033 0.0 1.753771 0.350754 37.881445 6.313574 284.110838 9.821115 3.156787 16.485444 0.701508 10.17187 0.0 0.701508 13.328657 4.209049 0.0 5.96282 2.104525 1.753771 0.350754 1.052262 0.0 0.350754 0.701508 0.0 0.701508 0.350754 4.209049 99.263416 86.285514 6.664328 62.434234 0.742811 5.3012 9.821115 3.507541 0.0 2.455279 5.261312 0.0 19.291477 7.015082 0.350754 4.209049 6.664328 22.448264 4.559804 2.806033 0.350754 0.0 0.0 0.701508 4.910558 6.664328 4.209049 1.403016 2.806033
"BIO_G0_03_1" 20.300088 3.53045 59.13504 1.765225 0.882613 18.534863 0.0 3.53045 16.769638 4.413063 0.0 1.765225 0.0 43.248014 0.882613 235.657546 2.647838 1.765225 44.130627 3.53045 7.0609 3.53045 1.765225 20.300088 0.882613 0.0 0.882613 0.882613 2.647838 0.0 5.295675 0.882613 2.647838 4.413063 1.765225 1.765225 3.53045 4.413063 120.917917 85.613416 13.239188 47.661077 0.700499 5.152406 3.53045 1.765225 0.0 0.882613 6.178288 0.0 7.0609 7.0609 0.0 8.826125 1.765225 8.826125 3.53045 1.765225 0.0 0.0 0.882613 0.0 0.882613 7.943513 2.647838 0.882613 7.0609
"BIO_G0_06_1" 9.480034 2.585464 52.5711 0.861821 0.287274 0.0 0.0 14.076415 5.745475 0.287274 0.287274 0.287274 1.149095 20.970985 1.436369 288.997415 11.490951 2.872738 37.920138 2.010916 4.883654 0.574548 0.574548 16.374605 3.160011 0.287274 0.574548 2.010916 2.872738 0.0 1.149095 0.0 0.287274 3.734559 3.734559 1.149095 5.745475 2.585464 106.291296 81.873025 15.512784 67.79661 0.665338 5.156845 8.618213 2.29819 0.574548 2.29819 6.894571 0.0 4.883654 16.949153 1.723643 7.181844 4.021833 8.330939 4.309107 2.29819 0.0 0.0 0.287274 1.723643 6.320023 10.054582 5.458202 0.574548 8.905487
"BIO_G0_12_1" 36.900369 2.767528 23.98524 1.845018 1.845018 0.0 0.0 3.690037 11.99262 0.922509 0.0 0.922509 0.0 21.217712 0.0 298.892989 31.365314 5.535055 20.295203 2.767528 4.612546 3.690037 0.922509 11.99262 1.845018 0.0 12.915129 2.767528 4.612546 0.0 0.0 0.0 0.0 0.922509 0.0 0.922509 0.0 10.147601 154.059041 51.660517 9.225092 33.210332 0.625839 5.160681 6.457565 0.0 0.0 0.0 5.535055 0.0 14.760148 10.147601 0.0 8.302583 1.845018 15.682657 4.612546 0.0 0.0 0.0 0.0 0.922509 2.767528 0.922509 1.845018 1.845018 5.535055
"BIO_G0_21_1" 40.050858 2.542912 26.700572 2.542912 0.635728 0.635728 0.0 7.628735 2.542912 4.450095 0.0 1.271456 0.0 26.064844 3.17864 336.300064 21.614749 2.542912 27.3363 3.17864 2.542912 0.635728 0.0 9.535919 0.635728 0.0 5.085823 1.907184 3.814367 0.0 1.271456 0.0 0.0 3.17864 0.635728 0.0 0.635728 3.17864 147.488875 52.765416 8.900191 40.050858 0.665966 5.129435 5.085823 1.907184 0.0 0.635728 8.264463 0.0 5.721551 3.814367 2.542912 1.907184 2.542912 10.807374 5.721551 0.0 0.0 0.0 0.0 0.0 3.17864 7.628735 6.993007 2.542912 2.542912
Note

The data are down-sampled from the Michigan Corpus of Upper-Level Student Papers.
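
Before touching the data frame, the naming convention can be checked on a few doc_id values with Python's standard re module. This is only a sketch: the helper name is hypothetical, the ids are a handful taken from the outputs in this section, and it assumes the pattern behaves the same way in polars.

```python
import re
from collections import Counter

PREFIX = re.compile(r"^[A-Z]+")  # leading run of capital letters


def discipline_of(doc_id):
    """Return the capital-letter prefix of a doc_id, or None if there is none."""
    match = PREFIX.match(doc_id)
    return match.group(0) if match else None


doc_ids = ["BIO_G0_02_1", "BIO_G0_03_1", "SOC_G1_08_1", "SOC_G3_07_1"]
disciplines = [discipline_of(d) for d in doc_ids]
counts = Counter(disciplines)  # number of documents per discipline
```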

We can extract that string and place it into a new column called discipline:

# Capture the leading run of capital letters (e.g., "BIO") from each doc_id
df_biber = df_biber.with_columns(
    pl.col("doc_id").str.extract(r"^([A-Z]+)", 1).alias("discipline")
)

df_biber.head()
shape: (5, 69)
doc_id f_01_past_tense f_02_perfect_aspect f_03_present_tense f_04_place_adverbials f_05_time_adverbials f_06_first_person_pronouns f_07_second_person_pronouns f_08_third_person_pronouns f_09_pronoun_it f_10_demonstrative_pronoun f_11_indefinite_pronouns f_12_proverb_do f_13_wh_question f_14_nominalizations f_15_gerunds f_16_other_nouns f_17_agentless_passives f_18_by_passives f_19_be_main_verb f_20_existential_there f_21_that_verb_comp f_22_that_adj_comp f_23_wh_clause f_24_infinitives f_25_present_participle f_26_past_participle f_27_past_participle_whiz f_28_present_participle_whiz f_29_that_subj f_30_that_obj f_31_wh_subj f_32_wh_obj f_33_pied_piping f_34_sentence_relatives f_35_because f_36_though f_37_if f_38_other_adv_sub f_39_prepositions f_40_adj_attr f_41_adj_pred f_42_adverbs f_43_type_token f_44_mean_word_length f_45_conjuncts f_46_downtoners f_47_hedges f_48_amplifiers f_49_emphatics f_50_discourse_particles f_51_demonstratives f_52_modal_possibility f_53_modal_necessity f_54_modal_predictive f_55_verb_public f_56_verb_private f_57_verb_suasive f_58_verb_seem f_59_contractions f_60_that_deletion f_61_stranded_preposition f_62_split_infinitive f_63_split_auxiliary f_64_phrasal_coordination f_65_clausal_coordination f_66_neg_synthetic f_67_neg_analytic discipline
str f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 f64 str
"BIO_G0_02_1" 11.574886 9.821115 61.381971 2.104525 6.313574 2.104525 0.701508 17.537706 7.716591 2.806033 0.0 1.753771 0.350754 37.881445 6.313574 284.110838 9.821115 3.156787 16.485444 0.701508 10.17187 0.0 0.701508 13.328657 4.209049 0.0 5.96282 2.104525 1.753771 0.350754 1.052262 0.0 0.350754 0.701508 0.0 0.701508 0.350754 4.209049 99.263416 86.285514 6.664328 62.434234 0.742811 5.3012 9.821115 3.507541 0.0 2.455279 5.261312 0.0 19.291477 7.015082 0.350754 4.209049 6.664328 22.448264 4.559804 2.806033 0.350754 0.0 0.0 0.701508 4.910558 6.664328 4.209049 1.403016 2.806033 "BIO"
"BIO_G0_03_1" 20.300088 3.53045 59.13504 1.765225 0.882613 18.534863 0.0 3.53045 16.769638 4.413063 0.0 1.765225 0.0 43.248014 0.882613 235.657546 2.647838 1.765225 44.130627 3.53045 7.0609 3.53045 1.765225 20.300088 0.882613 0.0 0.882613 0.882613 2.647838 0.0 5.295675 0.882613 2.647838 4.413063 1.765225 1.765225 3.53045 4.413063 120.917917 85.613416 13.239188 47.661077 0.700499 5.152406 3.53045 1.765225 0.0 0.882613 6.178288 0.0 7.0609 7.0609 0.0 8.826125 1.765225 8.826125 3.53045 1.765225 0.0 0.0 0.882613 0.0 0.882613 7.943513 2.647838 0.882613 7.0609 "BIO"
"BIO_G0_06_1" 9.480034 2.585464 52.5711 0.861821 0.287274 0.0 0.0 14.076415 5.745475 0.287274 0.287274 0.287274 1.149095 20.970985 1.436369 288.997415 11.490951 2.872738 37.920138 2.010916 4.883654 0.574548 0.574548 16.374605 3.160011 0.287274 0.574548 2.010916 2.872738 0.0 1.149095 0.0 0.287274 3.734559 3.734559 1.149095 5.745475 2.585464 106.291296 81.873025 15.512784 67.79661 0.665338 5.156845 8.618213 2.29819 0.574548 2.29819 6.894571 0.0 4.883654 16.949153 1.723643 7.181844 4.021833 8.330939 4.309107 2.29819 0.0 0.0 0.287274 1.723643 6.320023 10.054582 5.458202 0.574548 8.905487 "BIO"
"BIO_G0_12_1" 36.900369 2.767528 23.98524 1.845018 1.845018 0.0 0.0 3.690037 11.99262 0.922509 0.0 0.922509 0.0 21.217712 0.0 298.892989 31.365314 5.535055 20.295203 2.767528 4.612546 3.690037 0.922509 11.99262 1.845018 0.0 12.915129 2.767528 4.612546 0.0 0.0 0.0 0.0 0.922509 0.0 0.922509 0.0 10.147601 154.059041 51.660517 9.225092 33.210332 0.625839 5.160681 6.457565 0.0 0.0 0.0 5.535055 0.0 14.760148 10.147601 0.0 8.302583 1.845018 15.682657 4.612546 0.0 0.0 0.0 0.0 0.922509 2.767528 0.922509 1.845018 1.845018 5.535055 "BIO"
"BIO_G0_21_1" 40.050858 2.542912 26.700572 2.542912 0.635728 0.635728 0.0 7.628735 2.542912 4.450095 0.0 1.271456 0.0 26.064844 3.17864 336.300064 21.614749 2.542912 27.3363 3.17864 2.542912 0.635728 0.0 9.535919 0.635728 0.0 5.085823 1.907184 3.814367 0.0 1.271456 0.0 0.0 3.17864 0.635728 0.0 0.635728 3.17864 147.488875 52.765416 8.900191 40.050858 0.665966 5.129435 5.085823 1.907184 0.0 0.635728 8.264463 0.0 5.721551 3.814367 2.542912 1.907184 2.542912 10.807374 5.721551 0.0 0.0 0.0 0.0 0.0 3.17864 7.628735 6.993007 2.542912 2.542912 "BIO"

Process the data with BiberAnalyzer

Now the data can be processed using BiberAnalyzer:

df = pb.BiberAnalyzer(df_biber, id_column=True)

Determine the number of factors to extract

df.mdaviz_screeplot();

Extract factors

df.mda(n_factors=3)

Check the summary

df.mda_summary
shape: (3, 6)
Factor      F          df         PR(>F)      Signif           R2
str         f64        list[u32]  f64         str              f64
"factor_1"  4.391505   [16, 153]  4.5532e-7   "*** p < 0.001"  0.314713
"factor_2"  10.722254  [16, 153]  5.6263e-18  "*** p < 0.001"  0.528587
"factor_3"  2.900087   [16, 153]  0.000353    "*** p < 0.001"  0.232703

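The degrees of freedom and R2 values in this summary are consistent with a one-way ANOVA of the factor scores by category: with 170 documents, [16, 153] would correspond to 17 groups (17 - 1 = 16 and 170 - 17 = 153). Under that reading, which is inferred from the numbers shown rather than from the library's source, R2 can be recovered from F and the degrees of freedom. The sketch below checks this; the function names and the toy scores are hypothetical.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within, R2) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand = sum(all_scores) / len(all_scores)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    r2 = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, r2


def r2_from_f(f_stat, df_between, df_within):
    """R2 implied by an F statistic and its degrees of freedom."""
    return f_stat * df_between / (f_stat * df_between + df_within)


# F values copied from the summary table; df is [16, 153] for every factor.
summary = {"factor_1": 4.391505, "factor_2": 10.722254, "factor_3": 2.900087}
implied = {name: r2_from_f(f, 16, 153) for name, f in summary.items()}

# Toy scores (hypothetical) showing that the two routes to R2 agree.
f_stat, df_b, df_w, r2 = one_way_anova(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
)
```

The implied values can be compared with the R2 column of the summary to confirm the reading.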
Plot factors

df.mdaviz_groupmeans(factor=2);
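
The group means behind such a plot are simply the average score of each category on the chosen factor. As a sketch of that calculation, the snippet below averages factor_2 by doc_cat for the ten documents whose scores appear in the mda_dim_scores output in this section. Because it uses only those ten rows, the resulting means are illustrative and will differ from the means for all 170 documents.

```python
from collections import defaultdict
from statistics import mean

# (doc_cat, factor_2) pairs copied from the displayed mda_dim_scores rows.
rows = [
    ("BIO", 3.836948), ("BIO", 5.071433), ("BIO", 1.24173),
    ("BIO", -18.348749), ("BIO", -11.105658),
    ("SOC", -6.140515), ("SOC", -3.577475), ("SOC", 2.315045),
    ("SOC", -1.629578), ("SOC", 5.593183),
]

# Collect the scores for each category, then average them.
by_cat = defaultdict(list)
for doc_cat, score in rows:
    by_cat[doc_cat].append(score)

group_means = {doc_cat: mean(scores) for doc_cat, scores in by_cat.items()}
```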

Check the dimension scores

df.mda_dim_scores
shape: (170, 5)
doc_id         doc_cat  factor_1   factor_2    factor_3
str            str      f64        f64         f64
"BIO_G0_02_1"  "BIO"    -3.548914  3.836948    1.190002
"BIO_G0_03_1"  "BIO"    8.772727   5.071433    0.551095
"BIO_G0_06_1"  "BIO"    3.248852   1.24173     0.843617
"BIO_G0_12_1"  "BIO"    -4.632679  -18.348749  -2.314747
"BIO_G0_21_1"  "BIO"    -6.692054  -11.105658  -2.356941
…              …        …          …           …
"SOC_G1_08_1"  "SOC"    -8.789114  -6.140515   -1.741352
"SOC_G1_09_1"  "SOC"    4.709334   -3.577475   1.785393
"SOC_G2_03_1"  "SOC"    1.381486   2.315045    0.945825
"SOC_G3_07_1"  "SOC"    -7.862552  -1.629578   -0.290192
"SOC_G3_08_1"  "SOC"    -1.221231  5.593183    -0.677241

References

Biber, Douglas. 1985. “Investigating Macroscopic Textual Variation Through Multifeature/Multidimensional Analyses.” https://doi.org/10.1515/ling.1985.23.2.337.