Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Language Models perform poorly on quantification
- ‘Few’-type quantifiers pose a particular challenge for Language Models
- 960 sentences were presented to 22 autoregressive transformer models of differing sizes
- Performance of larger models decreased, suggesting they reflect online rather than offline human processing
Paper Content
Introduction
- Quantifiers can change the meaning of an utterance
- Sentences with the same content words can have opposite meanings
- Language models struggle to predict which quantifier is used in a given context
- Language models have poor performance at generating appropriate continuations following logical quantifiers
- Large language models are being used as general systems for multiple tasks
- It is important that language models can distinguish between sentences with different meanings
- This study evaluates how well language models take into account the meaning of a quantifier when generating text
- Investigates whether there is an inverse scaling relationship with model size
- Negation is challenging for language models
- This study focuses on quantifiers indicating typicality such as most and few
- Uses stimuli from a previously published N400 study
- Tests whether language models show the same pattern of insensitivity towards the quantifiers
Language models
- Analyzed GPT-2, GPT-3, GPT-Neo, OPT, and InstructGPT language models
- Compared different training data and numbers of parameters
Evaluation
- Calculated the surprisal of the critical word in each stimulus sentence
- Considered the surprisal of the critical word given its preceding context
- Converted probability of the target word to surprisal using Equation 1
- Used single and multi-token words
- Compared which of the two possible critical words had a lower surprisal
- Calculated accuracy as fraction of stimulus pairs for which model predicted the appropriate critical word
- Analyzed model sensitivity to the quantifiers
- All code and data will be published online on acceptance
Results
- Accuracy of models increases with size for most-type quantifiers, but decreases for few-type quantifiers
- Small exceptions to this pattern exist
- Sensitivity of models varies, but is generally low
- No clear pattern in sensitivity
Discussion
Inverse scaling with quantifiers
- Models increase in size, they tend to improve at predicting words following most-type quantifiers and get worse at predicting words following few-type quantifiers
- Larger models make predictions increasingly in accordance with typicality, overwhelming any sensitivity to quantifier type
- Sensitivity analysis shows all models have a poor and largely invariant sensitivity overall
Further implications
- Models tend to perform better as they get larger and are trained on more data
- Evidence supports this idea
- Predictions of larger models and those trained on more data correlate with human incremental online predictions
- Easier for humans to process well-formed sentences with plausible semantics
- Predictions of larger models can align less with explicit human judgements
- Language models may struggle to make predictions in line with human offline judgements
- Tailoring training may be necessary to avoid specific known issues