Biber analyzer
The BiberAnalyzer class provides a comprehensive toolkit for conducting Multi-Dimensional Analysis (MDA) and Principal Component Analysis (PCA) on linguistic feature data. This implementation follows Biber’s methodology (Biber 1985) for exploring the systematic co-occurrence patterns of linguistic features across text types and registers.
Overview of Multi-Dimensional Analysis
Multi-Dimensional Analysis is a specific implementation of exploratory factor analysis that has been extensively used in corpus linguistics and register analysis. A representative sample of MDA studies can be seen in the table of contents of a tribute volume.
The MDA procedure consists of four main steps:
- Identification of relevant variables - Selection of linguistic features for analysis
- Extraction of factors from variables - Statistical identification of underlying dimensions
- Functional interpretation of factors as dimensions - Linguistic interpretation of statistical patterns
- Placement of categories on the dimensions - Positioning of text types along the extracted dimensions
A detailed description of the MDA procedure can be found here.
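Steps 2 and 4 have a standard statistical core: exploratory factor analysis with an oblique rotation, followed by scoring. As an illustration only (BiberAnalyzer runs this workflow for you, as shown later), here is a minimal, self-contained sketch using the third-party factor_analyzer package on toy data:
# The factor-analytic core of MDA on a stand-in document-by-feature matrix
import numpy as np
from factor_analyzer import FactorAnalyzer  # pip install factor-analyzer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))              # 100 "documents", 10 "features"
fa = FactorAnalyzer(n_factors=3, rotation="promax", method="ml")
fa.fit(X)
print(fa.loadings_.shape)                   # (10, 3): feature loadings per factor
scores = fa.transform(X)                    # (100, 3): document scores per dimension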
Key Features of BiberAnalyzer
The BiberAnalyzer class offers several analytical and visualization capabilities:
- Exploratory Data Analysis: Scree plots for factor selection
- Multi-Dimensional Analysis: Factor extraction with Promax rotation
- Principal Component Analysis: Alternative dimensionality reduction approach
- Visualization Tools: Comprehensive plotting functions for results interpretation
- Biber Replication: Projection onto Biber’s original dimensions
- Statistical Summaries: Detailed output of loadings, scores, and group means
Setting Up the Analysis
Loading Libraries and Data
First we will import our libraries and load the sample dataset:
import spacy
import pybiber as pb
import polars as pl
from pybiber.data import micusp_mini
The sample data (micusp_mini) is a subset of the Michigan Corpus of Upper-Level Student Papers, containing academic texts from various disciplines. This dataset is ideal for demonstrating cross-disciplinary variation in linguistic features.
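Before processing, it can help to confirm the corpus format. A quick check, assuming micusp_mini is a polars DataFrame with doc_id and text columns (the layout pybiber’s processing functions expect):
# Inspect the raw corpus
print(micusp_mini.shape)
print(micusp_mini.head(3))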
Text Processing Pipeline
Next, we’ll process the raw texts through the spaCy NLP pipeline to extract linguistic features:
nlp = spacy.load("en_core_web_sm", disable=["ner"])
processor = pb.CorpusProcessor()
df_spacy = processor.process_corpus(micusp_mini, nlp_model=nlp)
Performance: Corpus processing completed in 49.67s
We explicitly disable Named Entity Recognition (ner) here for consistency with the PybiberPipeline defaults, which turn NER off to improve processing speed; NER is not required for Biber feature extraction.
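For comparison, pybiber also provides a PybiberPipeline that bundles model loading, parsing, and feature extraction into a single step. A rough sketch of that route (the constructor arguments and the run() method here are assumptions; check the pybiber documentation for the exact signature):
# Hypothetical one-step alternative (API details assumed, not verified)
pipeline = pb.PybiberPipeline()          # disables NER by default
df_biber_alt = pipeline.run(micusp_mini)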
Feature Extraction
Now we extract the 67 Biber linguistic features from the parsed texts:
df_biber = pb.biber(df_spacy)
[INFO] Using MATTR for f_43_type_token
[INFO] All features normalized per 1000 tokens except: f_43_type_token and f_44_mean_word_length
The biber() function returns normalized frequencies (per 1,000 tokens) for all features except type-token ratio and mean word length, which use scales appropriate to their nature.
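The per-1,000-token normalization itself is simple arithmetic; a minimal worked example (the counts are invented for illustration):
# rate per 1,000 tokens = raw count / total tokens * 1000
raw_count = 23        # hypothetical occurrences of a feature in one text
total_tokens = 1987   # hypothetical length of that text
rate = raw_count / total_tokens * 1000
print(f"{rate:.2f} per 1,000 tokens")  # 11.58 per 1,000 tokens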
Data Preparation for Analysis
Understanding the Dataset Structure
Let’s examine the structure of our feature matrix:
df_biber.head()
doc_id | f_01_past_tense | f_02_perfect_aspect | f_03_present_tense | f_04_place_adverbials | … | f_63_split_auxiliary | f_64_phrasal_coordination | f_65_clausal_coordination | f_66_neg_synthetic | f_67_neg_analytic |
---|---|---|---|---|---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | … | f64 | f64 | f64 | f64 | f64 |
"BIO_G0_02_1" | 11.574886 | 9.821115 | 61.381971 | 2.104525 | … | 4.910558 | 6.664328 | 4.209049 | 1.403016 | 2.806033 |
"BIO_G0_03_1" | 20.300088 | 3.53045 | 59.13504 | 1.765225 | … | 0.882613 | 7.943513 | 2.647838 | 0.882613 | 7.0609 |
"BIO_G0_06_1" | 9.480034 | 2.585464 | 52.5711 | 0.861821 | … | 6.320023 | 10.054582 | 5.458202 | 0.574548 | 8.905487 |
"BIO_G0_12_1" | 36.900369 | 2.767528 | 23.98524 | 1.845018 | … | 2.767528 | 0.922509 | 1.845018 | 1.845018 | 5.535055 |
"BIO_G0_21_1" | 40.050858 | 2.542912 | 26.700572 | 2.542912 | … | 3.17864 | 7.628735 | 6.993007 | 2.542912 | 2.542912 |
Creating Categorical Variables
For MDA, we need a categorical grouping variable to compare different text types or registers. In our MICUSP data, discipline information is encoded in the doc_id field.
The MICUSP data are down-sampled from the Michigan Corpus of Upper-Level Student Papers. Each document ID begins with a three-letter discipline code (e.g., BIO for Biology, ENG for English).
We can extract the discipline codes and create a categorical variable:
df_biber = (
    df_biber
    .with_columns(
        pl.col("doc_id")
        .str.extract(r"^([A-Z])+", 0)
        .alias("discipline")
    )
)
df_biber.head()
doc_id | f_01_past_tense | f_02_perfect_aspect | f_03_present_tense | f_04_place_adverbials | … | f_64_phrasal_coordination | f_65_clausal_coordination | f_66_neg_synthetic | f_67_neg_analytic | discipline |
---|---|---|---|---|---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | … | f64 | f64 | f64 | f64 | str |
"BIO_G0_02_1" | 11.574886 | 9.821115 | 61.381971 | 2.104525 | … | 6.664328 | 4.209049 | 1.403016 | 2.806033 | "BIO" |
"BIO_G0_03_1" | 20.300088 | 3.53045 | 59.13504 | 1.765225 | … | 7.943513 | 2.647838 | 0.882613 | 7.0609 | "BIO" |
"BIO_G0_06_1" | 9.480034 | 2.585464 | 52.5711 | 0.861821 | … | 10.054582 | 5.458202 | 0.574548 | 8.905487 | "BIO" |
"BIO_G0_12_1" | 36.900369 | 2.767528 | 23.98524 | 1.845018 | … | 0.922509 | 1.845018 | 1.845018 | 5.535055 | "BIO" |
"BIO_G0_21_1" | 40.050858 | 2.542912 | 26.700572 | 2.542912 | … | 7.628735 | 6.993007 | 2.542912 | 2.542912 | "BIO" |
Let’s also examine the distribution of texts across disciplines:
discipline_counts = df_biber.group_by("discipline").len().sort("len", descending=True)
print("Distribution of texts by discipline:")
print(discipline_counts)
Distribution of texts by discipline:
shape: (17, 2)
┌────────────┬─────┐
│ discipline ┆ len │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════════╪═════╡
│ NUR ┆ 10 │
│ EDU ┆ 10 │
│ ENG ┆ 10 │
│ CLS ┆ 10 │
│ NRE ┆ 10 │
│ MEC ┆ 10 │
│ CEE ┆ 10 │
│ SOC ┆ 10 │
│ PSY ┆ 10 │
│ BIO ┆ 10 │
│ PHY ┆ 10 │
│ PHI ┆ 10 │
│ LIN ┆ 10 │
│ IOE ┆ 10 │
│ HIS ┆ 10 │
│ POL ┆ 10 │
│ ECO ┆ 10 │
└────────────┴─────┘
Initializing the BiberAnalyzer
Creating the Analyzer Object
Now we can initialize the BiberAnalyzer with our feature matrix. The id_column=True parameter indicates that our DataFrame contains both document IDs and category labels:
analyzer = pb.BiberAnalyzer(df_biber, id_column=True)
The BiberAnalyzer automatically:
- Validates the input data structure
- Separates numeric features from categorical variables
- Computes eigenvalues for factor analysis
- Performs initial data quality checks
Data Requirements and Validation
The BiberAnalyzer expects specific data formats:
- Numeric columns: Normalized linguistic features (Float64)
- String columns: Document identifiers and/or category labels
- Valid grouping: The category variable should group multiple documents (not be unique per document)
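A quick pre-flight check along these lines can catch problems before analysis; a minimal sketch using plain polars (not a pybiber API):
# Verify column types and grouping before handing off to BiberAnalyzer
numeric_cols = [c for c, dt in df_biber.schema.items() if dt == pl.Float64]
string_cols = [c for c, dt in df_biber.schema.items() if dt == pl.Utf8]
print(f"{len(numeric_cols)} numeric feature columns; string columns: {string_cols}")
# Every category should contain more than one document
assert (df_biber.group_by("discipline").len()["len"] > 1).all()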
Exploratory Data Analysis
Determining Optimal Number of Factors
Before extracting factors, we need to determine how many factors to retain. The scree plot visualization helps identify the “elbow point” where eigenvalues level off:
analyzer.mdaviz_screeplot();
The scree plot shows two series:
- Blue line: Eigenvalues from all features
- Orange line: Eigenvalues after removing weakly correlated features (the MDA approach)
The MDA convention is to extract factors with eigenvalues > 1.0, and to look for the point where the curve begins to level off.
Understanding Feature Correlations
The MDA procedure retains only features that are sufficiently correlated (r > 0.2) with other features, removing those that fall below this threshold. This approach leverages the natural multicollinearity among linguistic features to aggregate them into meaningful dimensions. You can examine the eigenvalue data directly:
print("Eigenvalues comparison:")
print(analyzer.eigenvalues.head(10))
Eigenvalues comparison:
shape: (10, 2)
┌──────────┬──────────┐
│ ev_all ┆ ev_mda │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞══════════╪══════════╡
│ 9.662192 ┆ 9.655607 │
│ 4.54516 ┆ 4.531601 │
│ 3.00359 ┆ 2.97753 │
│ 2.565752 ┆ 2.535722 │
│ 2.448391 ┆ 2.429068 │
│ 2.337111 ┆ 2.336713 │
│ 2.197334 ┆ 2.186297 │
│ 2.106533 ┆ 2.080074 │
│ 1.935782 ┆ 1.915118 │
│ 1.809263 ┆ 1.789251 │
└──────────┴──────────┘
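You can also apply the eigenvalue > 1.0 convention programmatically to this table; a small sketch:
# Count factors passing the Kaiser criterion (eigenvalue > 1.0)
n_candidates = analyzer.eigenvalues.filter(pl.col("ev_mda") > 1.0).height
print(f"Factors with eigenvalue > 1.0: {n_candidates}")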
Conducting Multi-Dimensional Analysis
Factor Extraction
Based on the scree plot, let’s extract 3 factors (a common choice that often captures major dimensions of variation):
analyzer.mda(n_factors=3)
INFO:pybiber.biber_analyzer:Dropping 2 variable(s) with max |r| <= 0.20: ['f_15_gerunds', 'f_34_sentence_relatives']
The mda() method performs several key operations:
- Retains features that are sufficiently correlated with others (correlation threshold: 0.2)
- Extracts the specified number of factors using maximum likelihood estimation
- Applies Promax rotation for better interpretability
- Computes factor scores for each document
- Calculates group means for each category
Examining Factor Loadings and Summary Statistics
The MDA results provide comprehensive information about the extracted factors:
analyzer.mda_summary
Factor | F | df | PR(>F) | Signif | R2 |
---|---|---|---|---|---|
str | f64 | list[u32] | f64 | str | f64 |
"factor_1" | 3.425988 | [16, 153] | 0.000034 | "*** p < 0.001" | 0.263771 |
"factor_2" | 11.55233 | [16, 153] | 3.3103e-19 | "*** p < 0.001" | 0.547119 |
"factor_3" | 3.970458 | [16, 153] | 0.000003 | "*** p < 0.001" | 0.293392 |
The summary table reports, for each factor, an F test of the differences in mean factor scores across the grouping categories, along with its degrees of freedom, p-value, significance code, and R2 (the proportion of variance in factor scores explained by the grouping variable).
Interpreting the Factors
To interpret the factors, look for features with high loadings (> 0.35 or < -0.35). Features loading positively contribute to the positive pole of a dimension, while negative loadings contribute to the negative pole.
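To pull out those defining features programmatically, you can filter the loadings table; a sketch, assuming analyzer.mda_loadings is a polars DataFrame with one row per feature and a column per factor (mirroring the factor_1 … factor_3 naming used for scores):
# Features loading strongly (|loading| > 0.35) on factor 1
strong_f1 = (
    analyzer.mda_loadings
    .filter(pl.col("factor_1").abs() > 0.35)
    .sort("factor_1", descending=True)
)
print(strong_f1)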
Visualizing Factor Results
The group means plot shows how different text categories (disciplines) vary along each dimension:
analyzer.mdaviz_groupmeans(factor=2, width=2, height=5);
This visualization displays:
- X-axis: Different discipline categories
- Y-axis: Mean factor scores for each discipline
- Error bars: 95% confidence intervals around the means
Categories with higher positive scores have more features associated with the positive pole of this dimension, while negative scores indicate association with the negative pole.
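The numbers behind the plot are available directly; a sketch, assuming analyzer.mda_group_means holds one row per category with a column per factor:
# Rank disciplines along dimension 2
print(analyzer.mda_group_means.sort("factor_2", descending=True))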
Examining Individual Document Scores
You can also examine the factor scores for individual documents:
analyzer.mda_dim_scores
doc_id | doc_cat | factor_1 | factor_2 | factor_3 |
---|---|---|---|---|
str | str | f64 | f64 | f64 |
"BIO_G0_02_1" | "BIO" | -1.047555 | 2.579108 | 2.741629 |
"BIO_G0_03_1" | "BIO" | 1.89817 | 5.793548 | 2.645765 |
"BIO_G0_06_1" | "BIO" | 2.990878 | 2.540135 | 3.211185 |
"BIO_G0_12_1" | "BIO" | -3.970903 | -19.69307 | -3.853258 |
"BIO_G0_21_1" | "BIO" | -1.683797 | -14.797897 | -6.842276 |
"BIO_G0_25_1" | "BIO" | 1.904973 | 4.959696 | -2.525448 |
"BIO_G0_29_1" | "BIO" | -3.269407 | 5.220856 | 6.099175 |
"BIO_G2_02_1" | "BIO" | -2.075444 | 10.794254 | 6.572835 |
"BIO_G2_03_1" | "BIO" | -11.160595 | -10.013755 | 2.574394 |
"BIO_G3_02_1" | "BIO" | -6.392135 | -2.541001 | 1.353258 |
… | … | … | … | … |
"SOC_G0_01_1" | "SOC" | -5.21524 | 3.960617 | 0.390415 |
"SOC_G0_02_1" | "SOC" | 0.533442 | 12.483244 | 3.574479 |
"SOC_G0_07_1" | "SOC" | -10.519145 | 3.482978 | -0.420552 |
"SOC_G0_13_1" | "SOC" | 29.571642 | 16.727145 | -0.466964 |
"SOC_G1_01_1" | "SOC" | -3.961574 | 2.09907 | 5.063635 |
"SOC_G1_08_1" | "SOC" | -6.098901 | -7.156877 | -0.963255 |
"SOC_G1_09_1" | "SOC" | 5.832215 | -4.200802 | -1.617159 |
"SOC_G2_03_1" | "SOC" | -1.145246 | 5.567016 | 0.136861 |
"SOC_G3_07_1" | "SOC" | -8.482581 | -1.640527 | 1.118681 |
"SOC_G3_08_1" | "SOC" | -2.413674 | 7.861919 | 0.461343 |
These scores represent where each individual document falls along each extracted dimension. Higher positive scores indicate stronger association with features that load positively on that factor.
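Sorting these scores is a quick way to surface the most extreme documents on a dimension; for example:
# Five documents with the highest scores on factor 1
print(
    analyzer.mda_dim_scores
    .sort("factor_1", descending=True)
    .head(5)
)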
Alternative Analysis: Principal Component Analysis
PCA as an Alternative Approach
While MDA is the traditional approach in register analysis, Principal Component Analysis (PCA) offers an alternative dimensionality reduction method that may reveal different patterns:
analyzer.pca()
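If you want to sanity-check the decomposition independently, the same analysis can be run directly on the standardized feature matrix with scikit-learn (a sketch for cross-checking; not a description of pybiber’s internals):
# Independent PCA on the standardized Biber features
import polars.selectors as cs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = df_biber.select(cs.numeric()).to_numpy()
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(X_std)
print(pca.explained_variance_ratio_)  # share of variance captured by each PC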
PCA Visualizations
PCA provides its own set of visualization tools. Let’s examine how groups vary along the first principal component:
analyzer.pcaviz_groupmeans(pc=1, width=6, height=3);
We can also examine the contribution of individual features to each principal component:
analyzer.pcaviz_contrib(pc=1, width=6, height=3);
This plot shows which linguistic features contribute most strongly to the first principal component, helping interpret what linguistic patterns the component captures.
Advanced Analysis Options
Customizing the Analysis
The BiberAnalyzer offers several customization options:
Adjusting Correlation Thresholds
# Use stricter correlation threshold
analyzer.mda(n_factors=3, cor_min=0.3)
Changing Factor Loading Thresholds
# Use higher threshold for significant loadings
analyzer.mda(n_factors=3, threshold=0.4)
Multiple Scree Plot Comparisons
# Compare different correlation thresholds
analyzer.mdaviz_screeplot(mda=False);
Accessing Raw Results
All analysis results are stored as attributes of the analyzer object:
# Check available results
available_results = [attr for attr in dir(analyzer) if not attr.startswith('_') and 'mda' in attr]
print("Available MDA results:", available_results)
Available MDA results: ['mda', 'mda_biber', 'mda_dim_scores', 'mda_group_means', 'mda_loadings', 'mda_summary', 'mdaviz_groupmeans', 'mdaviz_screeplot']
Interpretation Guidelines
Reading Factor Loadings
When interpreting MDA results:
- High positive loadings (> 0.35): Features that increase together and contribute to the positive pole
- High negative loadings (< -0.35): Features that contribute to the negative pole
- Low loadings (-0.35 to 0.35): Features that don’t strongly define this dimension
- Communality: How much of each feature’s variance is explained by all factors
- Uniqueness: Unexplained variance (1 - communality)
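The last two quantities are simple to compute from a loadings matrix; a toy example (the loadings are invented):
# communality_i = sum of squared loadings across factors; uniqueness_i = 1 - communality_i
import numpy as np
loadings = np.array([
    [0.72, 0.10, -0.05],   # hypothetical feature A
    [-0.41, 0.48, 0.12],   # hypothetical feature B
])
communality = (loadings ** 2).sum(axis=1)
uniqueness = 1 - communality
print(communality, uniqueness)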
Comparing Text Types
When comparing categories on dimensions:
- Distance between groups indicates how different they are linguistically
- Confidence intervals show the statistical reliability of group differences
- Consistent patterns across multiple factors suggest robust register differences
Comparison with Biber’s Original Dimensions
Projecting onto Biber’s Factors
One powerful feature of the BiberAnalyzer is the ability to project your data onto Biber’s original dimensions, allowing for direct comparison with established research:
analyzer.mda_biber()
The mda_biber() method:
- Loads Biber’s original factor loadings from his 1988 study
- Projects your feature data onto these established dimensions
- Provides direct comparison with Biber’s dimensions:
- Factor 1: Involved vs. Informational Production
- Factor 2: Narrative vs. Non-narrative Concerns
- Factor 3: Explicit vs. Situation-dependent Reference
- Factor 4: Overt Expression of Persuasion
- Factor 5: Abstract vs. Non-abstract Information
- Factor 6: On-line Informational Elaboration
Visualizing Biber Dimension Results
Now we can visualize how our academic disciplines fall along Biber’s first dimension:
analyzer.mdaviz_groupmeans(factor=1, width=2, height=5);
This plot shows how different academic disciplines position along Biber’s first dimension (Involved vs. Informational Production). Academic texts typically fall toward the “Informational” pole (negative scores) due to their formal, expository nature.
Conclusion
The BiberAnalyzer provides a comprehensive toolkit for exploring systematic patterns of linguistic variation in text corpora. By combining traditional MDA with modern computational tools and visualizations, researchers can:
- Identify key dimensions of linguistic variation in their corpora
- Compare their findings with established research through Biber projection
- Explore alternative analytical approaches through PCA
- Create publication-ready visualizations of their results
This approach enables rigorous, quantitative investigation of register variation, genre differences, and other patterns of linguistic co-occurrence in large text collections.