Creating and Using an External Corpus File

This vignette walks you through generating an External corpus file in the corpus-tagger interface. Saving your processed corpus as an external file (in Parquet format) lets you skip time-consuming text parsing in later sessions: loading a Parquet file is nearly instantaneous, while re-processing raw text files with a spaCy model can take several minutes.


Why Use an External Corpus File?

  • Efficiency: Once your corpus is processed and saved as a Parquet file, you can reload it instantly in future sessions.
  • Reproducibility: The Parquet file preserves all token-level annotations, so you can share or archive your processed data.
  • Convenience: No need to re-upload or re-parse your original text files.

Step-by-Step Guide

1. Process Your Corpus

  • Navigate to the Load or process a target corpus page.
  • Under Process a corpus, select your source (e.g., upload plain text files).
  • Complete the processing steps as usual (select model, process files, etc.).

2. Download the Processed Corpus

  • After processing, go to the Download Corpus Files page.
  • In the download options, select Corpus file only.
  • Click the download button to save your corpus as a .parquet file.

3. What Does an External Corpus File Look Like?

The Parquet file contains a table with one row per token, including all relevant annotations:

doc_id       token       pos_tag  ds_tag                 pos_id  ds_id
BIO_G0_02_1  Ernst       NP1      Character              1       1
BIO_G0_02_1  Mayr        NP1      Character              2       1
BIO_G0_02_1  once        RR       Narrative              3       2
BIO_G0_02_1  wrote       VVD      Citation               4       3
BIO_G0_02_1  ,           Y        Citation               5       3
BIO_G0_02_1              Y        Untagged               6       4
BIO_G0_02_1  sympatric   JJ       AcademicTerms          7       5
BIO_G0_02_1  speciation  NN1      AcademicTerms          8       6
BIO_G0_02_1  is          VBZ      InformationExposition  9       7
BIO_G0_02_1  like        II       InformationExposition  10      7
BIO_G0_02_1  the         AT       InformationExposition  11      7

Column meanings:

  • doc_id: Document identifier
  • token: The word/token
  • pos_tag: Part-of-speech tag
  • ds_tag: DocuScope tag
  • pos_id: POS unit ID
  • ds_id: DocuScope unit ID
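To make the role of the unit IDs more concrete, here is a minimal Python sketch built on the first five rows of the example table. It assumes, based only on that sample, that consecutive tokens sharing a doc_id and ds_id belong to the same DocuScope unit. The function name group_ds_units is hypothetical and is not part of corpus-tagger.

```python
from itertools import groupby

# First five rows of the example table, written as plain dictionaries.
ROWS = [
    {"doc_id": "BIO_G0_02_1", "token": "Ernst", "pos_tag": "NP1", "ds_tag": "Character", "pos_id": 1, "ds_id": 1},
    {"doc_id": "BIO_G0_02_1", "token": "Mayr", "pos_tag": "NP1", "ds_tag": "Character", "pos_id": 2, "ds_id": 1},
    {"doc_id": "BIO_G0_02_1", "token": "once", "pos_tag": "RR", "ds_tag": "Narrative", "pos_id": 3, "ds_id": 2},
    {"doc_id": "BIO_G0_02_1", "token": "wrote", "pos_tag": "VVD", "ds_tag": "Citation", "pos_id": 4, "ds_id": 3},
    {"doc_id": "BIO_G0_02_1", "token": ",", "pos_tag": "Y", "ds_tag": "Citation", "pos_id": 5, "ds_id": 3},
]


def group_ds_units(rows):
    """Collect consecutive tokens that share a doc_id and ds_id into one unit.

    Assumption: rows arrive in document order, as in the example table.
    """
    units = []
    for (doc_id, ds_id), group in groupby(rows, key=lambda r: (r["doc_id"], r["ds_id"])):
        members = list(group)
        units.append(
            {
                "doc_id": doc_id,
                "ds_id": ds_id,
                "ds_tag": members[0]["ds_tag"],
                "tokens": [r["token"] for r in members],
            }
        )
    return units


for unit in group_ds_units(ROWS):
    print(unit["ds_id"], unit["ds_tag"], unit["tokens"])
```

In this sample, "Ernst" and "Mayr" end up in a single Character unit because they share ds_id 1, while each keeps its own pos_id.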


4. Re-using Your External Corpus

  • In the future, simply select External as your corpus source and upload your .parquet file.
  • The corpus will load instantly, ready for analysis or comparison.
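Because the external corpus is an ordinary Parquet file, it can also be opened outside the interface, for example to check a file before sharing it. The sketch below is one possible way to do that in Python, assuming pandas and a Parquet engine such as pyarrow are installed. The helper names and the file name my_corpus.parquet are hypothetical; the column list is taken from the example table in step 3.

```python
# Column names as shown in the example table in step 3.
EXPECTED_COLUMNS = ["doc_id", "token", "pos_tag", "ds_tag", "pos_id", "ds_id"]


def missing_columns(columns):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


def load_external_corpus(path):
    """Read a corpus Parquet file and check that it has the documented columns."""
    # Imported here so that missing_columns() can be used without pandas.
    import pandas as pd

    table = pd.read_parquet(path)
    missing = missing_columns(table.columns)
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return table


# Example usage (hypothetical file name):
# corpus = load_external_corpus("my_corpus.parquet")
# print(corpus.head())
# print(corpus["doc_id"].nunique(), "documents")
```

Checking the column names first gives a clear error when the wrong file is picked, rather than a confusing failure later in an analysis.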

Tip: Use external corpus files to streamline your workflow, especially when working with large datasets or collaborating with others!