Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Larger Transformer-based pre-trained language models have lower perplexity but are less predictive of human reading times.
- Regression analyses show a positive log-linear relationship between perplexity and fit to reading times for certain models.
- Residual errors reveal systematic deviation of larger variants, such as underpredicting reading times of named entities.
- Larger models tend to ‘memorize’ sequences during training, making their surprisal estimates diverge from humanlike expectations.
Paper Content
Introduction
- Expectation-based theories of sentence processing suggest that processing difficulty is driven by how predictable upcoming material is.
- Information-theoretic surprisal has been shown to be a strong predictor of processing difficulty.
- Language models are evaluated as surprisal-based cognitive models of sentence processing.
- Larger variants of the pre-trained GPT-2 LM are less predictive of self-paced reading times.
- There is a positive relationship between LM perplexity and predictive power of surprisal estimates.
Related work
- Previous studies have compared surprisal estimates from several types of language models to behavioral measures of processing difficulty
- Transformer-based models have been evaluated as models of processing difficulty
- There is a negative correlation between language model perplexity and fit to human reading times
- Oh et al. (2022) observed a directly contradictory relationship to this using surprisal estimates from pre-trained GPT-2 models
- Recent work has shown that surprisal from neural LMs generally tends to underpredict human reading times
Main experiment: predictive power of language model surprisal estimates
- Oh et al. (2022) observed a positive correlation between Transformer-based models and self-paced reading times.
- GPT-2, GPT-Neo, and OPT LMs were evaluated on self-paced reading times and go-past eyegaze durations.
Response data
- Natural Stories Corpus contains data from 181 subjects and 10,245 tokens
- Dundee Corpus contains data from 10 subjects and 51,501 tokens
- Observations were filtered from both datasets to remove certain words and observations shorter than 100 ms or longer than 3000 ms
Predictors
- Four variants of GPT-2 models used in Oh et al. (2022)
- Five variants of GPT-Neo models and eight variants of OPT models evaluated
- All models are decoder-only autoregressive Transformer-based models
- Model capacities summarized in Table 1
- Stories and articles tokenized according to BPE tokenizer
- Context windows used to calculate surprisal estimates
Regression modeling
- Fitted a baseline and full LME model to self-paced reading times and go-past durations
- Baseline predictors included word length, index of word position, saccade length and whether or not the previous word was fixated
- Random slopes and intercepts included for each subject and word type
- Calculated ∆LL values and perplexity of each LM variant
Results
- Subsets of Natural Stories and Dundee Corpus were determined by syntactic category
- Subsets of Natural Stories were determined by syntactic structure of sentences
- Subsets of Dundee Corpus were determined by syntactic category and structure
- MSEs of each regression model were higher than those on entire corpus
- Subsets of named entities had higher SSEs due to underprediction
- Mismatch between human sentence processing and language modeling
- Larger LM variants assign lower surprisal values to open-class words
- Extra parameters of larger LM variants improve predictions beyond human ability
Calculation of residual errors
- 17 LME models used to generate predictions for self-paced reading times and go-past durations
- Residual errors calculated for each model
- Discrepancy between model likelihood and mean squared error
- LME models with higher likelihoods achieved similar MSEs to those with lower likelihoods
- By-word intercept mostly responsible for discrepancy
- 17 LME models fitted again to both corpora with by-word random intercepts removed
- Removal of by-word random intercepts brought model likelihoods and MSEs closer
Annotation of data points
- Part-of-speech
- Named entities
- Dependency Locality Theory cost
- Left-corner parsing
Iterative slope-based analysis of residual errors
- Identified subsets of data points that strongly drive the trend in Figure 2
- Used linear relationship between log perplexity and MSEs to identify subsets
- Excluded identified subset and repeated procedure to identify new subset
- Considered only subsets with more than 1% of data points in each corpus
- Separated data points in each subset according to underprediction or overprediction
- Calculated average surprisal and sum of squared errors for each subset
Discussion and conclusion
- Results from five GPT-Neo and eight OPT variants show a positive log-linear relationship between perplexity and fit to reading times
- Data used to train each LM family influences the quality of surprisal estimates
- Post-hoc analysis of residual errors shows the strongest effect on nouns and adjectives
- Larger LM variants with more parameters and better next-word prediction performance assign lower surprisal values
- This leads to systematic overprediction at function words
- Smaller pre-trained LM variants are more predictive of human reading times
- Neural LM surprisal underpredicts the magnitude of garden-path effects and increase in reading times at main verb of deeply embedded sentences
- Attention entropy, shift in attention weights, and norm of the gradient of each input token are robust predictors of naturalistic reading times
- Researchers should not select the largest pretrained LM available