Abstract

  • Larger Transformer-based pre-trained language models have lower perplexity but are less predictive of human reading times.
  • Regression analyses show a positive log-linear relationship between perplexity and fit to reading times for the five GPT-Neo and eight OPT variants.
  • Residual errors reveal systematic deviation of larger variants, such as underpredicting reading times of named entities.
  • Larger models tend to ‘memorize’ sequences during training, making their surprisal estimates diverge from humanlike expectations.

Paper Content

Introduction

  • Expectation-based theories of sentence processing suggest that processing difficulty is driven by how predictable upcoming material is.
  • Information-theoretic surprisal has been shown to be a strong predictor of processing difficulty.
  • Language models are evaluated as surprisal-based cognitive models of sentence processing.
  • Larger variants of the pre-trained GPT-2 LM are less predictive of self-paced reading times.
  • There is a positive relationship between LM perplexity and predictive power of surprisal estimates.
  • Previous studies have compared surprisal estimates from several types of language models to behavioral measures of processing difficulty.
  • Transformer-based models have also been evaluated as models of processing difficulty.
  • Those earlier studies generally reported a negative correlation between language model perplexity and fit to human reading times: the lower the perplexity, the better the fit.
  • Oh et al. (2022) observed the opposite relationship using surprisal estimates from pre-trained GPT-2 models.
  • Recent work has shown that surprisal from neural LMs tends to underpredict human reading times.

Main experiment: predictive power of language model surprisal estimates

  • Oh et al. (2022) observed a positive correlation between the perplexity of Transformer-based LMs and the fit of their surprisal estimates to self-paced reading times.
  • GPT-2, GPT-Neo, and OPT LMs were evaluated on self-paced reading times and go-past eyegaze durations.

Response data

  • Natural Stories Corpus contains data from 181 subjects and 10,245 tokens
  • Dundee Corpus contains data from 10 subjects and 51,501 tokens
  • Both datasets were filtered to remove certain words, as well as observations shorter than 100 ms or longer than 3000 ms
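
The duration cutoff above can be sketched as a simple filter. This is a minimal illustration, not the authors' preprocessing code: the `Observation` record, its field names, and the sample values are all hypothetical, and the word-based exclusions are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    # Hypothetical record layout; the field names are assumptions.
    subject: str
    word: str
    rt_ms: float  # reading time in milliseconds


def filter_by_duration(observations, min_ms=100.0, max_ms=3000.0):
    """Drop observations shorter than min_ms or longer than max_ms."""
    return [o for o in observations if min_ms <= o.rt_ms <= max_ms]


# Made-up sample data for illustration only.
observations = [
    Observation("s1", "the", 85.0),
    Observation("s1", "dog", 312.0),
    Observation("s2", "barked", 3400.0),
    Observation("s2", "loudly", 450.0),
]
kept = filter_by_duration(observations)
```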

Predictors

  • Four variants of GPT-2 models used in Oh et al. (2022)
  • Five variants of GPT-Neo models and eight variants of OPT models evaluated
  • All models are decoder-only autoregressive Transformer-based models
  • Model capacities summarized in Table 1
  • Stories and articles tokenized with each model's byte-pair encoding (BPE) tokenizer
  • Surprisal estimates calculated within each model's context window
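
To make the surprisal predictor concrete: surprisal is the negative log probability of a token given its preceding context, and because a BPE tokenizer can split one word into several tokens, a word's surprisal is the sum of its tokens' surprisals. The sketch below assumes base-2 logs (bits) and uses made-up probabilities; the function names and the `word_ids` alignment format are illustrative assumptions, not part of any particular library.

```python
import math


def token_surprisal(prob):
    """Surprisal in bits of a token with conditional probability `prob`."""
    return -math.log2(prob)


def word_surprisals(token_probs, word_ids):
    """Sum subword-token surprisals into one surprisal value per word.

    token_probs: conditional probability of each token given its context.
    word_ids:    index of the word each token belongs to (hypothetical
                 alignment format, one entry per token).
    """
    totals = {}
    for prob, word_id in zip(token_probs, word_ids):
        totals[word_id] = totals.get(word_id, 0.0) + token_surprisal(prob)
    return [totals[word_id] for word_id in sorted(totals)]


# Made-up example: the second word is split into two BPE tokens.
token_probs = [0.5, 0.25, 0.125, 0.5]
word_ids = [0, 1, 1, 2]
surprisals = word_surprisals(token_probs, word_ids)  # [1.0, 5.0, 1.0]
```

In practice the token probabilities would come from one of the LM variants; here they are hard-coded so the arithmetic is easy to check by hand.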

Regression modeling

  • Fitted a baseline and full LME model to self-paced reading times and go-past durations
  • Baseline predictors included word length, index of word position, saccade length and whether or not the previous word was fixated
  • Random slopes and intercepts included for each subject and word type
  • Calculated ∆LL values and perplexity of each LM variant
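
The two quantities in the last bullet can be sketched as follows, assuming that ∆LL is the gain in regression log-likelihood of the full model (with surprisal) over the baseline model, and that perplexity is the exponentiated average negative log probability per token. The numbers are invented for illustration, and the log-likelihoods would normally come from fitted LME models rather than being typed in.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def delta_ll(full_ll, baseline_ll):
    """Increase in log-likelihood from adding the surprisal predictor."""
    return full_ll - baseline_ll


# Made-up values: four tokens, each assigned probability 0.25.
log_probs = [math.log(0.25)] * 4
ppl = perplexity(log_probs)  # approximately 4.0

# Hypothetical log-likelihoods of a full and a baseline regression model.
gain = delta_ll(full_ll=-1200.0, baseline_ll=-1250.0)  # 50.0
```

A larger ∆LL means surprisal from that LM variant explains more of the reading-time variance beyond the baseline predictors.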

Results

  • Subsets of Natural Stories and Dundee Corpus were determined by syntactic category
  • Subsets of Natural Stories were determined by syntactic structure of sentences
  • Subsets of Dundee Corpus were determined by syntactic category and structure
  • MSEs of each regression model were higher than those on entire corpus
  • Subsets of named entities had higher SSEs due to underprediction
  • Mismatch between human sentence processing and language modeling
  • Larger LM variants assign lower surprisal values to open-class words
  • Extra parameters of larger LM variants may improve predictions of such words beyond human ability

Calculation of residual errors

  • 17 LME models used to generate predictions for self-paced reading times and go-past durations
  • Residual errors calculated for each model
  • Discrepancy between model likelihood and mean squared error
  • LME models with higher likelihoods achieved similar MSEs to those with lower likelihoods
  • By-word intercept mostly responsible for discrepancy
  • 17 LME models fitted again to both corpora with by-word random intercepts removed
  • Removal of by-word random intercepts brought model likelihoods and MSEs closer
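
As a small illustration of the residual-error calculation, the sketch below computes residuals and mean squared error from observed and predicted reading times. The sign convention (observed minus predicted, so a positive residual means the model underpredicted) and the sample values are assumptions made for this example.

```python
def residuals(observed, predicted):
    """Residual error per data point: observed minus predicted.

    Under this (assumed) sign convention, a positive residual means
    the regression model underpredicted the reading time.
    """
    return [y - y_hat for y, y_hat in zip(observed, predicted)]


def mean_squared_error(observed, predicted):
    """Average of squared residuals."""
    errs = residuals(observed, predicted)
    return sum(e * e for e in errs) / len(errs)


# Made-up reading times in milliseconds.
observed = [300.0, 420.0, 350.0]
predicted = [310.0, 400.0, 350.0]

errs = residuals(observed, predicted)          # [-10.0, 20.0, 0.0]
mse = mean_squared_error(observed, predicted)  # (100 + 400 + 0) / 3
```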

Annotation of data points

  • Part-of-speech
  • Named entities
  • Dependency Locality Theory cost
  • Left-corner parsing

Iterative slope-based analysis of residual errors

  • Identified subsets of data points that strongly drive the trend in Figure 2
  • Used linear relationship between log perplexity and MSEs to identify subsets
  • Excluded identified subset and repeated procedure to identify new subset
  • Considered only subsets with more than 1% of data points in each corpus
  • Separated data points in each subset according to underprediction or overprediction
  • Calculated average surprisal and sum of squared errors for each subset
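
The procedure above can be sketched roughly as follows. This is one plausible reading, not the authors' implementation: it assumes that at each iteration the chosen subset is the one whose exclusion brings the ordinary-least-squares slope between log perplexity and MSE closest to zero. The function names, the data layout, and the toy numbers are all hypothetical.

```python
def ols_slope(xs, ys):
    """Slope of a simple least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def mse_per_model(sq_errors, keep):
    """MSE of each model over the data points whose indices are in `keep`."""
    return [sum(row[i] for i in keep) / len(keep) for row in sq_errors]


def iterative_slope_analysis(log_ppl, sq_errors, subsets, min_frac=0.01, rounds=2):
    """Repeatedly pick and exclude the subset that most flattens the slope.

    log_ppl:   log perplexity of each LM variant.
    sq_errors: per model, the squared residual error of every data point.
    subsets:   name -> set of data point indices (e.g. from annotations).
    """
    n_total = len(sq_errors[0])
    remaining = set(range(n_total))
    candidates = dict(subsets)
    selected = []
    for _ in range(rounds):
        best_name, best_abs_slope = None, None
        for name, indices in candidates.items():
            members = indices & remaining
            # Only consider subsets with more than min_frac of all data points.
            if len(members) <= min_frac * n_total:
                continue
            keep = sorted(remaining - members)
            if not keep:
                continue
            slope = ols_slope(log_ppl, mse_per_model(sq_errors, keep))
            if best_abs_slope is None or abs(slope) < best_abs_slope:
                best_name, best_abs_slope = name, abs(slope)
        if best_name is None:
            break
        selected.append(best_name)
        remaining -= candidates.pop(best_name)
    return selected


# Toy data: three LM variants, six data points, three annotated subsets.
log_ppl = [3.0, 2.5, 2.0]
sq_errors = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [3.0, 3.0, 1.5, 1.5, 1.0, 1.0],
    [5.0, 5.0, 2.0, 2.0, 1.0, 1.0],
]
subsets = {"named_entity": {0, 1}, "function_word": {2, 3}, "other": {4, 5}}
selected = iterative_slope_analysis(log_ppl, sq_errors, subsets)
```

In the toy data the errors on the "named_entity" points grow fastest as perplexity falls, so that subset is picked first. Splitting each selected subset into underpredicted and overpredicted points would then use the sign of the residuals.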

Discussion and conclusion

  • Results from five GPT-Neo and eight OPT variants show a positive log-linear relationship between perplexity and fit to reading times
  • Data used to train each LM family influences the quality of surprisal estimates
  • Post-hoc analysis of residual errors shows the strongest effect on nouns and adjectives
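  • Larger LM variants with more parameters and better next-word prediction performance assign lower surprisal values to open-class words such as nouns, adjectives, and named entities
  • This leads to underprediction of reading times at those words and systematic overprediction at function words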
  • Smaller pre-trained LM variants are more predictive of human reading times
  • Neural LM surprisal underpredicts the magnitude of garden-path effects and the increase in reading times at the main verb of deeply embedded sentences
  • Attention entropy, shift in attention weights, and norm of the gradient of each input token are robust predictors of naturalistic reading times
  • Researchers should not simply default to the largest pre-trained LM available when using surprisal estimates to model human reading times