Large Language Models 🎃

How we got here and how that history can help us evaluate what they produce

David Brown

RSA (Three Rivers)

October 31, 2024

Overview

Overview

How did we get to large language models (LLMs)?

  • Our topics
    • Review some history of natural language processing (NLP) and digital writing technologies
    • Look at the architectures of models and how they’ve changed over time
    • Share the results of some research by our team and discuss some of the potential implications

Overview

How did we get to large language models (LLMs)?

  • Our goals
    • Show how these technologies fundamentally work.
    • Highlight the affordances and limitations of LLMs and similar technologies.

History

History

Writing is inseparable from technological change (Gabrial 2007)



History

Writing is inseparable from technological change

New surfaces: papyrus, parchment, and wood-pulp paper

History

Writing is inseparable from technological change

New implements: stylus, metal-tipped pen, and mass-produced pencil

History

Writing is inseparable from technological change

New systems: libraries, postal networks, and commercial publishers

History

Writing is inseparable from technological change

  • Technological change is often met with skepticism, if not hostility and fear.

Think of the moral and intellectual training that comes to a student who writes a manuscript with the knowledge that his [sic] errors will stand out on the page as honestly confessed and openly advertised mistakes. (S. Y. G. 1908)

History

Writing is inseparable from technological change

  • Technological change is often met with skepticism, if not hostility and fear.

The eraser is an instrument of the devil because it perpetuates a culture of shame about error. It’s a way of lying to the world, which says ‘I didn’t make a mistake. I got it right first time.’ That’s what happens when you can rub it out and replace it. Instead, we need a culture where children are not afraid to make mistakes, they look at their mistakes and they learn from them, where they are continuously reflecting and improving on what they’ve done, not being enthralled to getting the right answer quickly and looking smart. (Espinoza 2015)

History

Writing is inseparable from technological change

  • On the one hand, we can acknowledge the anxieties that have accompanied new writing technologies in the past.
  • On the other hand, we can recognize that LLMs introduce novel challenges to communicative processes and practices. (Generative models are, after all, the functional opposite of erasers.)

History

Writing is inseparable from technological change

In 2009, researchers at Google published an article that coincided with the release of its N-gram Viewer and the corresponding data tables (Halevy, Norvig, and Pereira 2009).

But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions—captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks—if only we knew how to extract the model from the data.

We will return to this excerpt, but for now, let’s focus on this final claim…

History

The concept of a language model has been around for a long time…

History

The concept of a language model has been around for a long time…

  • 1960-1980: Beginnings of NLP
  • 1980-2015: Towards computation
  • 2015-present: Emergence of ML

The beginnings of NLP

The beginnings of NLP

The question of multiple meanings (or polysemy)

A memo shared with a small group of researchers at the forefront of machine translation after WWII anticipates the challenges and possibilities of the computer analysis of text. (Weaver 1949)

The beginnings of NLP

The question of multiple meanings (or polysemy)

If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of words.

The beginnings of NLP

The question of multiple meanings (or polysemy)

But if one lengthens the slit in the opaque mask, until one can see not only the central word in question, but also say [a] N[umber of] words on either side, then if [that] N[umber] is large enough one can unambiguously decide the meaning of the central word.

The beginnings of NLP

The question of multiple meanings (or polysemy)

The practical question is, what minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?

The beginnings of NLP

The question of multiple meanings (or polysemy)

  • “You shall know a word by the company it keeps.” (Firth 1957)
  • The meaning of a word can be determined by examining the contextual window or span around that word.

The beginnings of NLP

The question of multiple meanings (or polysemy)


Preceding | Word | Following
upscaling generally hold | fast | during a 4K 60FPS gaming session.
a dragster, going | fast | in a straight line is actually pretty boring
The benefits of the | fast | can be maintained long term,
adopted slowly, but comes on | fast | once it’s hit the mainstream
They simply disagree on how | fast | to go and how best to get there in superseding it.
which appeared stuck | fast | in the ground it had plowed up

The beginnings of NLP

  • The “context window” is a fundamental insight that powers the training of LLMs (from word2vec to BERT to ChatGPT).
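Weaver’s sliding “mask” is easy to make concrete in code. A minimal sketch (the sentence and window size are invented for illustration):

# extract an N-word context window around each occurrence of a target word
words = "the canary sings the song to the cat".split()
N = 2  # number of context words on either side of the target

for i, word in enumerate(words):
    if word == "the":
        left = words[max(0, i - N):i]      # up to N words before the target
        right = words[i + 1:i + 1 + N]     # up to N words after the target
        print(left, word.upper(), right)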


The beginnings of NLP

As early as the mid-twentieth century, researchers…

  • had considered the potential for a “context window” to solve word-sense disambiguation

  • were developing the statistical tools that would eventually power the training of LLMs (i.e., neural networks).

Question

Why didn’t we have LLMs sooner?

The beginnings of NLP

Question

Why didn’t we have LLMs sooner?

  • limited computational power
  • lack of training data
  • ascendance of universal grammar

The beginnings of NLP

Context-free grammar

  • To cope with these limitations (and beliefs about language structure), early models resorted to hard-coding rules:

S → NP VP

NP → the N

VP → V NP

V → sings | eats

N → cat | song | canary

– the canary sings the song

– the song eats the cat
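A minimal sketch (a toy illustration, not any historical system) of how such rules can be hard-coded and used to generate sentences:

import random

# the toy grammar above, encoded as rewrite rules
grammar = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"]],
    "V":  [["sings"], ["eats"]],
    "N":  [["cat"], ["song"], ["canary"]],
}

def generate(symbol="S"):
    # recursively expand a symbol into a sequence of terminal words
    if symbol not in grammar:
        return [symbol]
    expansion = random.choice(grammar[symbol])
    return [word for part in expansion for word in generate(part)]

print(" ".join(generate()))  # e.g., "the song eats the cat"

Note that the grammar licenses “the song eats the cat” just as readily as “the canary sings the song”: the rules capture structure, not meaning.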

The beginnings of NLP

Context-free grammar

The ALPAC Report, which was released in 1966, was highly skeptical of these kinds of approaches.

The beginnings of NLP

Context-free grammar

…we do not have useful machine translation. Furthermore, there is no immediate or predictable prospect of useful machine translation.

The beginnings of NLP

Context-free grammar

Some of the work must be done on a rather large scale, since small-scale experiments and work with miniature models of language have proved seriously deceptive in the past, and one can come to grips with real problems only above a certain scale of grammar size, dictionary size, and available corpus.

Towards computation

Towards computation

Converting words into numbers


  • The make-up, or sampling frame, of the Brown family of corpora. (Kučera and Francis 1967)
  • From the 15 categories, 500 text samples of roughly 2,000 words each were selected.
  • 2,000 × 500 ≈ 1,000,000 words
ID Category Name Number of Texts
A Press: Reportage 44
B Press: Editorial 27
C Press: Reviews 17
D Religion 17
E Skills and Hobbies 36
F Popular Lore 48
G Belles-Lettres 75
H Miscellaneous: Government & House Organs 30
J Learned 80
K Fiction: General 29
L Fiction: Mystery 24
M Fiction: Science 6
N Fiction: Adventure 29
P Fiction: Romance 29
R Humor 9

Towards computation

A document-feature matrix (or a document-term matrix)

Absolute frequency in the Brown Corpus.
doc_id the of and to a in that is was he for it with as his
A01 155 65 40 55 54 40 28 12 18 7 22 19 6 13 12
A02 134 94 33 56 51 36 11 11 15 13 24 15 6 14 6
A03 150 65 40 62 43 42 12 12 11 19 34 5 7 7 9
A04 160 68 45 52 38 58 33 14 13 12 19 13 9 17 6
A05 167 61 34 72 54 42 41 14 15 45 27 9 10 8 6
A06 150 77 37 52 54 43 18 28 12 32 31 10 15 8 15
A07 167 65 43 54 39 55 17 32 12 11 29 8 16 16 9
A08 183 69 29 58 39 49 15 25 4 9 14 22 15 11 8
A09 187 64 44 42 51 37 18 10 28 10 40 8 8 10 9
A10 134 64 36 64 38 48 21 19 10 20 29 12 12 14 9
A11 151 40 49 55 71 58 5 3 14 6 22 6 12 15 11
A12 113 38 44 29 58 54 14 7 14 35 25 16 11 11 13
A13 161 50 47 39 63 55 6 8 27 11 12 12 18 11 20
A14 172 61 41 39 76 40 15 15 20 34 22 14 18 15 17
A15 136 24 46 46 60 35 27 11 19 17 23 7 18 22 17

Towards computation

A document-feature matrix (or a document-term matrix)

  • Each row of the matrix is an observation (one document).
  • Each column is a variable (the frequency of one term).
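A minimal sketch of building such a matrix with scikit-learn (two invented documents, not the Brown Corpus pipeline):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the canary sings the song",
    "the cat eats the canary",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)       # rows = documents, columns = terms

print(vectorizer.get_feature_names_out())  # the column labels (terms)
print(dtm.toarray())                       # absolute frequencies per document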

Towards computation

Zipf’s Law

Most frequent words: the, of, and

Towards computation

Zipf’s Law

  • Zipf’s Law: the frequency of a token is inversely proportional to its rank.
  • Most tokens are infrequent.
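In its simplest form, the law says that the word of rank r occurs with a frequency proportional to 1/r, so the second most frequent word appears about half as often as the first, the third about a third as often, and so on:

\[ f(r) \propto \frac{1}{r} \]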

Emergence of machine learning

Emergence of ML

Let’s return to the excerpt from the Google researchers.

But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions—captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks—if only we knew how to extract the model from the data.

Question

What developments are taking place at this time (the early 2000s)?

Emergence of ML

Question

What developments are taking place at this time (the early 2000s)?

  • Expansion of the Internet.
  • Advances in memory and processing.
  • Oh and Jung publish “GPU Implementation of Neural Networks” in Pattern Recognition (2004).

Emergence of ML

  • Word2vec is released. (Mikolov et al. 2013)
    • Shallow (2-layer) neural network.
    • Trained using a relatively small context window (~10-12 words).
    • Introduces “embeddings”.
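A minimal sketch of training such a model with gensim (the toy corpus and hyperparameter values are illustrative only):

from gensim.models import Word2Vec

corpus = [
    ["the", "canary", "sings", "the", "song"],
    ["the", "cat", "eats", "the", "canary"],
]

model = Word2Vec(
    corpus,
    vector_size=12,  # dimensions per token, as in the table below
    window=5,        # context words on each side of the target
    min_count=1,
    sg=1,            # skip-gram variant
)

print(model.wv["canary"])               # the token's embedding vector
print(model.wv.most_similar("canary"))  # nearest tokens by cosine similarity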

Emergence of ML

Embeddings from a vector model.

Embedding space.
token X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
the -0.168845 -0.461962 -0.19786 0.274797 -0.125103 -0.706765 -0.003869 0.433402 0.346452 0.040093 0.203006 0.090101
of -0.156332 -0.205717 0.227799 0.168333 -0.124179 -0.942517 -0.006047 0.354843 0.581821 0.178699 0.115994 0.260422
and -0.35421 -0.577716 -0.264329 0.391706 -0.121148 -0.53797 -0.02144 0.292083 0.166029 0.122325 0.112165 0.363388
to -0.188663 -0.475612 -0.165088 0.557264 -0.175906 -0.470315 -0.143031 0.54875 0.024778 0.027579 0.337273 0.035197
a -0.746083 -0.16305 -0.068605 0.621801 0.339663 -0.257003 -0.076629 0.699464 0.013988 -0.189733 0.005224 0.551253
in -0.234889 -0.871305 0.186019 -0.099154 -0.243639 -0.591907 -0.049697 0.545955 -0.03968 0.095335 0.02752 0.127082
i -0.187752 -0.189067 0.54628 0.98086 -0.063788 -0.047667 -0.040399 0.858812 -0.089908 -0.080435 -0.035776 -0.223912
that -0.122229 -1.068663 0.203545 -0.139743 -0.047583 0.257451 0.241411 0.277364 0.427474 0.194155 -0.288758 -0.192211
it -0.116662 -0.151302 -0.215672 0.597628 0.053331 -0.452194 0.164938 0.62079 0.213103 -0.325304 0.023041 -0.108058
he 0.264217 -1.258926 0.16205 0.105418 0.100143 0.165211 0.135424 0.320132 0.002697 0.306683 -0.366043 0.273684
was 0.228962 -0.237645 -0.334393 0.621722 0.014987 0.193712 0.389216 -0.200413 -0.368847 0.021749 0.178661 0.271978
his -0.255955 -0.587171 -0.257692 0.766715 -0.266811 0.074266 0.100724 0.117419 -0.405978 0.719568 -0.003845 0.628617
is -0.491186 -0.487561 -0.074739 -0.115964 0.181698 -0.465943 0.267144 0.468609 0.971713 0.199609 -0.013295 0.234851
as -0.242131 -0.138365 -0.660047 0.03217 -0.077236 -0.672038 -0.807216 0.308242 0.058584 0.009522 0.121514 0.359722
with -0.114715 -0.390149 0.115776 -0.170354 0.055223 -0.112999 -0.208434 0.254655 0.562462 0.362516 -0.401538 0.353187

Emergence of ML

Embeddings from a vector model.

  • Each row is a token’s vector in the embedding space.
  • Each column (X1–X12) is one dimension of that space.

Emergence of ML

Embeddings from a vector model.

  • When treated as coordinates in space, embeddings locate words that tend to appear together or in similar contexts near each other.

Emergence of ML

Embeddings from a vector model.

  • The proximity of words can be assessed using measures like cosine similarity.

\[ \text{cosine similarity} = S_{C}(A, B) := \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} \]
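A minimal sketch with NumPy (the vectors are the first three dimensions of “the” and “of” from the table above):

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([-0.168845, -0.461962, -0.19786])   # "the"
b = np.array([-0.156332, -0.205717, 0.227799])   # "of"
print(cosine_similarity(a, b))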

Emergence of ML

Embeddings from a vector model.

Tokens closest to fast
token similarity
slow 0.448
quick 0.519
faster 0.568
slower 0.593
speed 0.602
busy 0.646
simple 0.663
food 0.676
speeds 0.688
fastest 0.688
pace 0.697
efficient 0.703
easy 0.707
small 0.710
too 0.712
straight 0.717
rapid 0.717
low 0.718
quickly 0.718
packet 0.718

Emergence of ML

  • Since the introduction of vector representations and, a short time later, the transformer architecture (Vaswani et al. 2017), language models have evolved rapidly. They can be grouped into roughly three generations.

Emergence of ML

Advances in LLMs…

  • Allowing for out-of-vocabulary words (using sub-words or word-pieces for tokenizing; see the sketch after this list).

  • Adding a sequence layer.

  • Sliding a context window both left-to-right and right-to-left.

  • Implementing self-attention architecture.

  • Training on more and more data.

  • Expanding the context window (from 512 word-pieces for BERT to 128,000 word-pieces for GPT-4 Turbo 128K).

  • Introducing reinforcement learning from human feedback (RLHF) with InstructGPT.
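A minimal sketch of the first of these, word-piece tokenization, using the Hugging Face tokenizer for BERT (the example word is arbitrary; the exact split depends on the model’s vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("unaffordability"))
# the word is split into word-pieces (continuations are marked with "##"),
# so even unseen words never fall outside the vocabulary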

Emergence of ML

An example of contextual embeddings using BERT

sentences = ["bank",
    "He eventually sold the shares back to the bank at a premium.",
    "The bank strongly resisted cutting interest rates.",
    "The bank will supply and buy back foreign currency.",
    "The bank is pressing us for repayment of the loan.",
    "The bank left its lending rates unchanged.",
    "The river flowed over the bank.",
    "Tall, luxuriant plants grew along the river bank.",
    "His soldiers were arrayed along the river bank.",
    "Wild flowers adorned the river bank.",
    "Two fox cubs romped playfully on the river bank.",
    "The jewels were kept in a bank vault.",
    "You can stow your jewelry away in the bank.",
    "Most of the money was in storage in bank vaults.",
    "The diamonds are shut away in a bank vault somewhere.",
    "Thieves broke into the bank vault.",
    "Can I bank on your support?",
    "You can bank on him to hand you a reasonable bill for your services.",
    "Don't bank on your friends to help you out of trouble.",
    "You can bank on me when you need money.",
    "I bank on your help."]

Emergence of ML

An example of contextual embeddings using BERT

from collections import OrderedDict

# Assumes `tokenizer`, `model`, and the helpers `bert_text_preparation()`
# and `get_bert_embeddings()` have been defined earlier.
context_embeddings = []
context_tokens = []
for sentence in sentences:
    tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(sentence, tokenizer)
    list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)
    # make an ordered dictionary to keep track of the position of each word
    tokens = OrderedDict()
    # loop over tokens in the sentence (skipping the [CLS] and [SEP] markers)
    for token in tokenized_text[1:-1]:
        # keep track of how many times each word has occurred so far
        if token in tokens:
            tokens[token] += 1
        else:
            tokens[token] = 1
        # compute the position of the current occurrence of the token
        token_indices = [i for i, t in enumerate(tokenized_text) if t == token]
        current_index = token_indices[tokens[token] - 1]
        # get the corresponding embedding
        token_vec = list_token_embeddings[current_index]
        # save values
        context_tokens.append(token)
        context_embeddings.append(token_vec)
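Each occurrence of “bank” now has its own vector, conditioned on the sentence it appears in. Comparing those vectors with cosine similarity can separate the financial, river-bank, and “rely on” senses, which a single static word2vec-style vector cannot do.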

Emergence of ML

LLMs have a broad range of applications…

  • Generation tasks
    • Chat bots
    • Content creation
    • Summarization
    • Translation
  • Classification tasks
    • Text classification
    • Segment classification

Emergence of ML

Just as they raise questions regarding…

Implications

Implications

  • A group at CMU was inspired by claims that were circulating when ChatGPT was first introduced. (e.g., “Wow! I asked ChatGPT to write a podcast and it looks pretty good!!!”)
  • We wondered what the text it produces looks like when the task is repeated. (e.g., “What happens if you ask it to write 100 podcasts?”)
  • We created an experiment at scale: 10,000 samples, across 6 text types, querying 6 different models.
  • Then we tagged the data using Biber’s (1991) features (which count things like passives, nominalizations, attributive adjectives, etc.).

Implications

  • The workflow for the creation of a human-AI parallel corpus.

Models were given a 500-word chunk of text and asked to continue writing in the same style, yielding different versions of that second chunk of text.

Implications

The composition of the HAP-E corpus.
Author  acad (n = 1,227)  blog (n = 1,526)  fic (n = 1,395)  news (n = 1,322)  spok (n = 1,721)  tvm (n = 1,099)  Total
Human
Chunk 1 602,878 748,901 674,437 644,896 833,699 598,157 4,102,968
Chunk 2 602,683 749,052 675,595 646,030 832,797 591,704 4,097,861
ChatGPT
GPT 4o 633,884 826,590 759,461 708,375 969,210 606,059 4,503,579
GPT 4o Mini 690,828 916,433 837,527 771,324 1,036,125 678,429 4,930,666
Llama Base
Llama 3 70B 510,942 731,889 651,557 572,423 1,098,698 502,012 4,067,521
Llama 3 8B 544,845 735,068 740,610 510,671 1,083,953 525,037 4,140,184
Llama Instruction-Tuned
Llama 3 70B Instruct 580,579 735,501 641,982 627,299 934,911 453,559 3,973,831
Llama 3 8B Instruct 550,463 679,884 599,000 568,703 836,531 448,874 3,683,455
TOTALS 4,717,102 6,123,318 5,580,169 5,049,721 7,625,924 4,403,831 33,500,065

Implications

  • It turns out that machine-authored prose and human-authored prose have some important differences. (Herbold et al. 2023; Reinhart et al. 2024)
  • There are, for example, clear patterns in the relative frequencies of lemmatized words.

Rates of word use by different LLMs (per 1,000 words) compared to the human use of each word in chunk 2 (log scale). Includes all words used more than once per million words in chunk 2. Words are lemmatized to group together inflected forms. Dashed blue lines indicate the range between 10× more and 10× less than human use. Note that the instruction-tuned models show more variation from the diagonal, indicating more deviation in vocabulary use relative to humans.

Implications

Most overrepresented words in LLM texts (rates relative to human chunk 2).
Rank  GPT-4o (Word, Rate)  GPT-4o Mini (Word, Rate)  Llama 3 70B Instruct (Word, Rate)  Llama 3 8B Instruct (Word, Rate)  Llama 3 70B (Word, Rate)  Llama 3 8B (Word, Rate)
1 camaraderie 162 camaraderie 171 unease 63 unease 101 bananas 31 deborah 52
2 tapestry 155 tapestry 147 palpable 47 continuation 52 paperback 30 rambo 22
3 intricate 119 palpable 145 continuation 29 palpable 48 bam 26 matty 20
4 underscore 107 grapple 131 shoutout 28 reminder 33 verona 25 goodnight 18
5 unspoken 102 intricate 129 intricate 27 pang 29 filth 19 ml 15
6 amidst 100 fleeting 124 pang 25 rut 29 rekall 17 merlin 13
7 palpable 95 ignite 122 camaraderie 24 waft 28 denis 14 worcester 11
8 solace 95 vibrant 92 policymaker 24 prioritize 27 darry 12 fay 10
9 fleeting 84 amidst 90 prioritize 24 grapple 24 ebook 12 missy 10
10 unravel 83 cacophony 89 reminder 24 camaraderie 23 janice 12 elisa 10

Implications

  • The over-represented words align with those noted in a study of peer reviews for a large machine-learning conference, which measured an uptick in suspiciously positive words after the release of ChatGPT. (Liang et al. 2024)

Implications

  • We also tagged the data using Biber’s feature set.
  • Binary classification (distinguishing chunk 2 of human writing from the writing produced by a model) using random forests produces high accuracy.

Binary classification accuracy when HAP-E is the training data and predicting COCA-AI Parallel corpus (CAP) data.

Model Acc. (in-sample) Acc. (out-of-sample)
ChatGPT
GPT 4o 98.22% 98.51%
GPT 4o Mini 98.73% 98.28%
Llama Instruction-Tuned
Llama 3 70B Instruct 95.33% 94.69%
Llama 3 8B Instruct 96.20% 96.17%
Llama Base
Llama 3 70B 92.43% 94.01%
Llama 3 8B 93.56% 93.45%
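A minimal sketch of this classification setup with scikit-learn (random stand-in data in place of the tagged HAP-E features):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in data: rows = texts, columns = rates of the 67 Biber features
rng = np.random.default_rng(0)
X = rng.random((200, 67))
y = np.repeat(["human", "model"], 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy (near chance on random data)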

Implications

  • In a multi-class random forest, the most important features are generally ones associated with greater information density or a “compressed” style.
The 10 features with the highest importance.
Features in human- and LLM-written text (rates per 1,000 tokens; LLM rates relative to Chunk 2).

Feature  Human (Chunk 1, Chunk 2)  GPT (GPT-4o Mini, GPT-4o)  Llama 3 Instruct (8B, 70B)  Llama 3 Base (8B, 70B)  Importance
Present participial clauses 1.7 1.7 481% 527% 224% 261% 94% 102% 1,715.1
‘That’ clauses as subject 2.1 2.1 331% 263% 180% 173% 64% 68% 1,387.2
Mean word length 4.5 4.4 114% 116% 101% 103% 99% 100% 1,070.4
Adverbs 67.9 71.8 86% 82% 73% 75% 102% 102% 916.0
Nominalizations 14.6 14.6 209% 214% 145% 151% 88% 91% 808.9
Phrasal co-ordination 6.7 6.1 144% 194% 187% 170% 92% 97% 776.0
Clausal co-ordination 11.2 12.4 63% 59% 141% 127% 120% 116% 727.3
Other nouns 254.9 240.6 97% 103% 91% 95% 91% 94% 725.5
Downtoners 1.9 1.9 155% 118% 60% 57% 68% 73% 712.8
Attributive adjectives 46.7 43.8 140% 150% 100% 104% 79% 83% 677.7

Implications

  • Carrying out principal component analysis (PCA) on chunk 2 of the human writing yields a first principal component (PC1) that strongly resembles Biber’s Dimension 1 (Involved vs. Informational Production).

Contributions of features to PC1 (based on human chunk 2).
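A minimal sketch of the PCA step with scikit-learn (again with random stand-in data in place of the human chunk 2 feature rates):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_human = rng.random((200, 67))  # stand-in: texts x 67 Biber feature rates

pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(X_human))
# `scores` are the per-text projections, as in the figures that follow

print(pca.components_[0])             # feature loadings on PC1
print(pca.explained_variance_ratio_)  # variance explained per component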

Implications

Projection of human chunk 2 writing onto the first principal component, based on the 67 Biber features.

Implications

Projection of human chunk 2 writing and Llama Base onto the first principal component.

Implications

Projection of human chunk 2 writing and Llama Instruct onto the first principal component.

Implications

Projection of human chunk 2 writing and GPT 4o onto the first principal component.

Implications

  • We found similar patterns when we examined the differences between student writing and LLM-generated writing, using examples from an introductory data science assignment. (Markey et al. 2024)

Projection of student, published, and ChatGPT-generated writing onto the first two linear discriminants, based on the 67 Biber features.

Implications

Human-generated vs. machine-generated text

Top text features discriminating between human- and machine-generated writing. Color-coded cells show the average z score of each feature. R2 and p-value correspond to one-way ANOVAs predicting each feature with text type.
Feature  ChatGPT (n = 100)  Published (n = 100)  Student (n = 100)  R2  p-value
Features indicating human-generated writing
adverbs −0.97 0.23 0.74 0.52 0.00
conjuncts −0.75 0.62 0.13 0.32 0.00
modal possibility −0.71 0.04 0.66 0.32 0.00
pronoun it −0.56 0.04 0.53 0.20 0.00
verb private −0.56 0.04 0.51 0.19 0.00
split auxiliary −0.56 0.37 0.19 0.16 0.00
that verb comp −0.48 0.48 0.00 0.15 0.00
prepositions −0.46 0.34 0.12 0.11 0.00
verb suasive −0.45 0.12 0.34 0.11 0.00
verb public −0.47 0.23 0.24 0.11 0.00

Implications

Human-generated vs. machine-generated text

Top text features discriminating between human- and machine-generated writing. Color-coded cells show the average z score of each feature. R2 and p-value correspond to one-way ANOVAs predicting each feature with text type.
Feature  ChatGPT (n = 100)  Published (n = 100)  Student (n = 100)  R2  p-value
Features indicating machine-generated writing
mean word length 1.14 −0.13 −1.01 0.78 0.00
modal predictive 1.22 −0.72 −0.51 0.76 0.00
gerunds 1.11 −0.56 −0.55 0.62 0.00
other adv sub 0.94 −0.70 −0.24 0.47 0.00
demonstratives 0.88 −0.70 −0.19 0.43 0.00
nominalizations 0.71 0.00 −0.71 0.34 0.00
phrasal coordination 0.65 −0.20 −0.45 0.22 0.00

Implications

  • ChatGPT can also be expressively restricted, producing, for example, a limited set of modal verbs, one that is different from both expert and novice writers.

Frequency of different modal verbs, often modulating the confidence of claims, in the different types of writing.

Modal verb  Absolute Frequency (ChatGPT, Published, Student)  Relative Frequency per 10⁵ words (ChatGPT, Published, Student)
Prediction
will 199 96 39 206.75 15.25 28.52
would 0 28 10 0.00 4.45 7.31
'll 0 0 1 0.00 0.00 0.73
Possibility
can 5 199 68 5.19 31.61 49.72
may 0 91 46 0.00 14.45 33.64
could 0 43 15 0.00 6.83 10.97
might 0 19 6 0.00 3.02 4.39
Necessity
should 0 20 10 0.00 3.18 7.31
must 0 16 0 0.00 2.54 0.00

Implications

Excerpts of ChatGPT-generated writing with Noun + Noun and Adj. + Noun sequences underlined in blue and other characteristic features of machine-generated writing marked in red.

Implications

Key takeaways

  1. ChatGPT produces sentences that are more informationally dense than human writing, but that information density is expressed using repetitive grammatical patterns.
  2. The tendency toward greater information density appears to be magnified by reinforcement learning (also called instruction tuning), though the reasons for that magnification are unclear.
  3. Machine-generated writing demonstrates less modulation of stress or confidence.
  4. Broadly, LLMs are not as grammatically nimble as human writers.
  5. In academic prose, LLM-generated text generally expresses limited engagement with core disciplinary concepts, neither in the way that experts do nor in the way that novice students do. (Though LLMs can arrive at something more like expert writing with iterative prompting.)

Works Cited

Bahl, Lalit R, Frederick Jelinek, and Robert L Mercer. 1983. “A Maximum Likelihood Approach to Continuous Speech Recognition.” Journal Article. IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 2: 179–90.
Biber, Douglas. 1991. Variation Across Speech and Writing. Cambridge University Press.
Brown, Peter F, John Cocke, Stephen A Della Pietra, Vincent J Della Pietra, Frederick Jelinek, John Lafferty, Robert L Mercer, and Paul S Roossin. 1990. “A Statistical Approach to Machine Translation.” Journal Article. Computational Linguistics 16 (2): 79–85.
Espinoza, Javier. 2015. “Erasers Are an ‘Instrument of the Devil’ Which Should Be Banned, Says Academic.” Newspaper Article. https://www.telegraph.co.uk/education/educationnews/11630639/Ban-erasers-from-the-classroom-says-academic.html.
Firth, John Rupert. 1957. Papers in Linguistics, 1934-1951. Book. Oxford: Oxford University Press. https://books.google.com/books?id=ilzingEACAAJ.
Gabrial, Brian. 2007. “History of Writing Technologies.” Book Section. In Handbook of Research on Writing, edited by Charles Bazerman, 23–33. New York: Routledge. https://doi.org/10.4324/9781410616470.ch2.
Halevy, Alon, Peter Norvig, and Fernando Pereira. 2009. “The Unreasonable Effectiveness of Data.” Journal Article. IEEE Intelligent Systems 24 (2): 8–12. https://doi.org/10.1109/MIS.2009.36.
Herbold, Steffen, Annette Hautli-Janisz, Ute Heuer, Zlata Kikteva, and Alexander Trautsch. 2023. “A Large-Scale Comparison of Human-Written Versus ChatGPT-Generated Essays.” Scientific Reports 13 (1): 18617.
Jelinek, Frederick. 1985. “A Real-Time, Isolated-Word, Speech Recognition System for Dictation Transcription.” Conference Proceedings. In ICASSP ’85. IEEE International Conference on Acoustics, Speech, and Signal Processing, 10:858–61. https://doi.org/10.1109/ICASSP.1985.1168313.
Ji, Ziwei, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. “Survey of Hallucination in Natural Language Generation.” Journal Article. ACM Computing Surveys 55 (12): 1–38.
Kučera, Henry, and W. Nelson Francis. 1967. Computational Analysis of Present-Day American English. Book. Providence: Brown University Press.
Liang, Weixin, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, et al. 2024. “Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews.” arXiv Preprint arXiv:2403.07183.
Markey, Ben, David West Brown, Michael Laudenbach, and Alan Kohler. 2024. “Dense and Disconnected: Analyzing the Sedimented Style of ChatGPT-Generated Text at Scale.” Written Communication, 07410883241263528.
Mays, Eric, Fred J Damerau, and Robert L Mercer. 1991. “Context Based Spelling Correction.” Journal Article. Information Processing & Management 27 (5): 517–22.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” Journal Article. arXiv Preprint arXiv:1301.3781.
Oh, Kyoung-Su, and Keechul Jung. 2004. “GPU Implementation of Neural Networks.” Journal Article. Pattern Recognition 37 (6): 1311–14. https://doi.org/10.1016/j.patcog.2004.01.013.
Perez, Ethan, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, and Saurav Kadavath. 2022. “Discovering Language Model Behaviors with Model-Written Evaluations.” Journal Article. arXiv Preprint arXiv:2212.09251.
Reinhart, Alex, David West Brown, Ben Markey, Michael Laudenbach, Kachatad Pantusen, Ronald Yurko, and Gordon Weinberg. 2024. “Do LLMs Write Like Humans? Variation in Grammatical and Rhetorical Styles.” arXiv Preprint arXiv:2410.16107.
S. Y. G. 1908. “Do Your Pupils Use Erasers?” Journal Article. Western Teacher: Devoted to Schoolroom Methods. Practical Aids and Usable Materials for Progressive Teachers 16 (5): 175–76. https://books.google.com/books?id=CKOfn_EPJNgC.
Santurkar, Shibani, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. “Whose Opinions Do Language Models Reflect?” Journal Article. arXiv Preprint arXiv:2303.17548.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Journal Article. Advances in Neural Information Processing Systems 30: 6000–6010.
Weaver, Warren. 1949. “Translation.” Unpublished Work. New York: The Rockefeller Foundation.
Zhang, Muru, Ofir Press, William Merrill, Alisa Liu, and Noah A Smith. 2023. “How Language Model Hallucinations Can Snowball.” Journal Article. arXiv Preprint arXiv:2305.13534.