Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Larger Transformer-based pre-trained language models have lower perplexity but are less predictive of human reading times.
Regression analyses show a positive log-linear relationship between perplexity and fit to reading times for certain models.
Residual errors reveal a systematic deviation in the larger variants, such as underprediction of reading times for named entities.
Larger models tend to ‘memorize’ sequences during training, making their surprisal estimates diverge from humanlike expectations.

Paper Content

Introduction

Expectation-based theories of sentence processing suggest that processing difficulty is driven by how predictable upcoming material is.
Information-theoretic surprisal has been shown to be a strong predictor of processing difficulty.
Language models are evaluated as surprisal-based cognitive models of sentence processing.
Larger variants of the pre-trained GPT-2 LM are less predictive of self-paced reading times.
There is a positive relationship between LM perplexity and predictive power of surprisal estimates.
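
As a small illustration of the quantity discussed above, surprisal is the negative log probability of a word given its preceding context. The sketch below uses made-up probabilities; the function name `surprisal_bits` is hypothetical, and the choice of base 2 (bits) is an assumption of this example rather than something stated in the summary.

```python
import math

def surprisal_bits(prob: float) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)

# Hypothetical conditional probabilities for two words in context.
predictable = surprisal_bits(0.5)      # a likely continuation
unexpected = surprisal_bits(0.03125)   # an unlikely continuation

print(predictable, unexpected)
```

Under expectation-based theories, the less probable word (higher surprisal) is the one expected to be read more slowly.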

Related work

Previous studies have compared surprisal estimates from several types of language models to behavioral measures of processing difficulty.
Transformer-based models have also been evaluated as models of processing difficulty.
Earlier studies reported a negative correlation between language model perplexity and fit to human reading times, meaning that lower-perplexity models fit better.
Oh et al. (2022) observed the opposite relationship using surprisal estimates from pre-trained GPT-2 models.
Recent work has shown that surprisal from neural LMs generally tends to underpredict human reading times.

Main experiment: predictive power of language model surprisal estimates

Oh et al. (2022) observed a positive correlation between the perplexity of Transformer-based LMs and the fit of their surprisal estimates to self-paced reading times.
To test whether this generalizes, GPT-2, GPT-Neo, and OPT LMs were evaluated on self-paced reading times and go-past eye-gaze durations.

Response data

Natural Stories Corpus contains data from 181 subjects and 10,245 tokens
Dundee Corpus contains data from 10 subjects and 51,501 tokens
Observations were filtered from both datasets to remove certain words and observations shorter than 100 ms or longer than 3000 ms
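
The duration-based part of this filtering can be sketched as follows. The reading times are invented, the helper name `keep_observation` is hypothetical, and the word-based exclusions are left out because the summary does not say which words were removed.

```python
def keep_observation(rt_ms: float, lo: float = 100.0, hi: float = 3000.0) -> bool:
    """Keep a reading time unless it is shorter than lo or longer than hi (in ms)."""
    return lo <= rt_ms <= hi

# Hypothetical per-word reading times in milliseconds.
observations = [85.0, 240.0, 512.0, 3400.0, 100.0]
filtered = [rt for rt in observations if keep_observation(rt)]

print(filtered)
```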

Predictors

Four variants of GPT-2 models used in Oh et al. (2022)
Five variants of GPT-Neo models and eight variants of OPT models evaluated
All models are decoder-only autoregressive Transformer-based models
Model capacities summarized in Table 1
Stories and articles tokenized according to BPE tokenizer
Context windows used to calculate surprisal estimates
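
Because a BPE tokenizer can split one word into several subword tokens, token-level surprisal has to be mapped back to words before it can be compared with per-word reading times. A common approach, assumed here rather than taken from the summary, is to sum the surprisal of the tokens that make up each word. The function name `sum_by_word` and the toy values are hypothetical.

```python
from typing import Dict, List

def sum_by_word(token_word_ids: List[int], token_surprisals: List[float]) -> List[float]:
    """Sum token-level surprisal over the tokens belonging to each word."""
    totals: Dict[int, float] = {}
    for word_id, s in zip(token_word_ids, token_surprisals):
        totals[word_id] = totals.get(word_id, 0.0) + s
    # Return one value per word, in word order.
    return [totals[w] for w in sorted(totals)]

# Toy example: the second word is split into two subword tokens.
word_ids = [0, 1, 1, 2]
surprisals = [2.0, 3.5, 1.5, 4.0]
print(sum_by_word(word_ids, surprisals))
```

Summing follows from the chain rule: the log probability of a word is the sum of the log probabilities of its subword tokens.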

Regression modeling

Fitted a baseline and a full LME model to self-paced reading times and go-past durations
Baseline predictors included word length and index of word position, plus saccade length and whether or not the previous word was fixated for the eye-tracking data
Random slopes and intercepts included for each subject and word type
Calculated ∆LL (the increase in regression log-likelihood of the full model over the baseline) and the perplexity of each LM variant
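
The two quantities compared in this step can be sketched in a few lines. The example assumes surprisal is measured in bits and that perplexity is computed over the same units as the surprisal values; the function names and numbers are hypothetical, and the log-likelihoods would in practice come from the fitted LME models.

```python
from typing import Sequence

def perplexity(surprisals_bits: Sequence[float]) -> float:
    """Perplexity as 2 raised to the mean surprisal (in bits)."""
    return 2.0 ** (sum(surprisals_bits) / len(surprisals_bits))

def delta_ll(full_ll: float, baseline_ll: float) -> float:
    """Gain in regression log-likelihood from adding surprisal to the baseline."""
    return full_ll - baseline_ll

# Hypothetical values for one LM variant.
print(perplexity([1.0, 3.0, 2.0]))
print(delta_ll(full_ll=-1200.5, baseline_ll=-1250.0))
```

A larger ∆LL means surprisal from that LM variant explains more of the reading-time data beyond the baseline predictors.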

Results

Subsets of Natural Stories and Dundee Corpus were determined by syntactic category
Subsets of Natural Stories were determined by syntactic structure of sentences
Subsets of Dundee Corpus were determined by syntactic category and structure
MSEs of each regression model were higher than those on entire corpus
Subsets of named entities had higher SSEs due to underprediction
These results point to a mismatch between human sentence processing and language modeling
Larger LM variants assign lower surprisal values to open-class words
The extra parameters of larger LM variants may improve predictions of such words beyond human ability

Calculation of residual errors

17 LME models used to generate predictions for self-paced reading times and go-past durations
Residual errors calculated for each model
Discrepancy between model likelihood and mean squared error
LME models with higher likelihoods achieved similar MSEs to those with lower likelihoods
By-word intercept mostly responsible for discrepancy
17 LME models fitted again to both corpora with by-word random intercepts removed
Removal of by-word random intercepts brought model likelihoods and MSEs closer
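
For reference, the residual errors and MSE mentioned above can be computed as in the sketch below. The sign convention (observed minus predicted, so a positive residual is an underprediction) is an assumption of this example, and the values are invented.

```python
from typing import List, Sequence

def residuals(observed: Sequence[float], predicted: Sequence[float]) -> List[float]:
    """Residual error per data point; positive means the model underpredicted."""
    return [o - p for o, p in zip(observed, predicted)]

def mse(errors: Sequence[float]) -> float:
    """Mean squared error over a set of residuals."""
    return sum(e * e for e in errors) / len(errors)

# Hypothetical reading times (ms) and regression model predictions.
observed = [300.0, 420.0, 250.0]
predicted = [280.0, 400.0, 270.0]
errs = residuals(observed, predicted)

print(errs, mse(errs))
```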

Annotation of data points

Part-of-speech
Named entities
Dependency Locality Theory cost
Left-corner parsing

Iterative slope-based analysis of residual errors

Identified subsets of data points that strongly drive the trend in Figure 2
Used linear relationship between log perplexity and MSEs to identify subsets
Excluded identified subset and repeated procedure to identify new subset
Considered only subsets with more than 1% of data points in each corpus
Separated data points in each subset according to underprediction or overprediction
Calculated average surprisal and sum of squared errors for each subset
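
The iterative procedure above can be sketched roughly as follows. This is a simplified reading of the steps, not the authors' implementation: it assumes the subset that "drives the trend" is the one whose MSE rises most steeply as log perplexity falls (the most negative least-squares slope), and that the 1% threshold is taken relative to the full dataset. All names and data are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

# (annotation labels, squared error under each LM variant's regression model)
Point = Tuple[Set[str], List[float]]

def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def subset_mse(errors: List[List[float]], n_models: int) -> List[float]:
    """MSE of a subset, one value per LM variant."""
    return [sum(e[m] for e in errors) / len(errors) for m in range(n_models)]

def iterative_slope_analysis(points: List[Point], log_ppls: Sequence[float],
                             min_frac: float = 0.01,
                             max_rounds: int = 3) -> List[Tuple[str, float]]:
    total = len(points)
    remaining = list(points)
    selected: List[Tuple[str, float]] = []
    for _ in range(max_rounds):
        labels = set().union(*(labs for labs, _ in remaining))
        best = None
        for label in sorted(labels):
            errors = [errs for labs, errs in remaining if label in labs]
            if len(errors) <= min_frac * total:
                continue  # skip subsets with 1% of the data or less
            slope = ols_slope(log_ppls, subset_mse(errors, len(log_ppls)))
            if best is None or slope < best[1]:
                best = (label, slope)
        if best is None:
            break
        selected.append(best)
        # Exclude the identified subset, then repeat on what is left.
        remaining = [(labs, errs) for labs, errs in remaining if best[0] not in labs]
    return selected

# Toy data: three LM variants ordered from highest to lowest log perplexity.
log_ppls = [3.0, 2.5, 2.0]
points = [
    ({"noun", "NE"}, [100.0, 200.0, 300.0]),
    ({"noun"}, [100.0, 150.0, 200.0]),
    ({"verb"}, [100.0, 110.0, 120.0]),
    ({"det"}, [100.0, 90.0, 80.0]),
]
result = iterative_slope_analysis(points, log_ppls)
print(result)
```

Once a subset is identified, its data points could then be split by the sign of the residual to separate underpredictions from overpredictions.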

Discussion and conclusion

Results from five GPT-Neo and eight OPT variants show a positive log-linear relationship between perplexity and fit to reading times
Data used to train each LM family influences the quality of surprisal estimates
Post-hoc analysis of residual errors shows the strongest effect on nouns and adjectives
Larger LM variants with more parameters and better next-word prediction performance assign lower surprisal values to these words
This leads to systematic underprediction of reading times at these words, with compensatory overprediction at function words
Smaller pre-trained LM variants are more predictive of human reading times
Neural LM surprisal underpredicts the magnitude of garden-path effects and increase in reading times at main verb of deeply embedded sentences
Attention entropy, shift in attention weights, and norm of the gradient of each input token are robust predictors of naturalistic reading times
Researchers using LM surprisal to model human reading times should not simply select the largest pre-trained LM available
