Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Language Models perform poorly on quantification
‘Few’-type quantifiers pose a particular challenge for Language Models
960 sentences were presented to 22 autoregressive transformer models of differing sizes
Larger models performed worse on ‘few’-type quantifiers, suggesting they reflect online rather than offline human processing

Quantifiers can change the meaning of an utterance
Sentences with the same content words can have opposite meanings
Language models struggle to predict which quantifier is used in a given context
Language models perform poorly at generating appropriate continuations following logical quantifiers
Large language models are being used as general systems for multiple tasks
It is important that language models can distinguish between sentences with different meanings
This study evaluates how well language models take into account the meaning of a quantifier when generating text
Investigates whether there is an inverse scaling relationship with model size
Negation is challenging for language models
This study focuses on quantifiers indicating typicality such as most and few
Uses stimuli from a previously published N400 study
Tests whether language models show the same pattern of insensitivity towards the quantifiers

Calculated the surprisal of the critical word in each stimulus sentence
Considered the surprisal of the critical word given its preceding context
Converted probability of the target word to surprisal using Equation 1
Included both single-token and multi-token critical words
Compared which of the two possible critical words had a lower surprisal
Calculated accuracy as fraction of stimulus pairs for which model predicted the appropriate critical word
Analyzed model sensitivity to the quantifiers
All code and data will be published online on acceptance
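
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: Equation 1 is not reproduced in this summary, so the sketch assumes the standard definition of surprisal as the negative log probability of a word given its preceding context, with base 2 chosen arbitrarily (the comparison between two words does not depend on the base). It also assumes that the probability of a multi-token word is the product of its token probabilities. The function names and the token probabilities are hypothetical.

```python
import math


def word_surprisal(token_probs):
    """Surprisal (in bits) of a word given its context.

    token_probs holds the conditional probability of each token of the word.
    Assumption: a multi-token word's probability is the product of its token
    probabilities, so its surprisal is the sum of the token surprisals.
    """
    return -sum(math.log2(p) for p in token_probs)


def prefers_appropriate(appropriate_probs, inappropriate_probs):
    """True if the appropriate critical word has the lower surprisal."""
    return word_surprisal(appropriate_probs) < word_surprisal(inappropriate_probs)


def accuracy(pairs):
    """Fraction of stimulus pairs where the appropriate word is preferred."""
    correct = sum(prefers_appropriate(a, b) for a, b in pairs)
    return correct / len(pairs)


# Hypothetical token probabilities: (appropriate word, inappropriate word).
pairs = [
    ([0.20], [0.05]),        # appropriate word is more probable
    ([0.10, 0.50], [0.08]),  # two-token word: 0.10 * 0.50 = 0.05 < 0.08
    ([0.30], [0.40, 0.50]),  # 0.30 > 0.40 * 0.50 = 0.20
]

print(accuracy(pairs))
```

With these made-up numbers the appropriate word wins in two of the three pairs. In practice the token probabilities would come from the language model's output distribution at each position of the critical word.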

Accuracy of models increases with size for most-type quantifiers, but decreases for few-type quantifiers
Small exceptions to this pattern exist
Sensitivity of models varies, but is generally low
No clear pattern in sensitivity

As models increase in size, they tend to improve at predicting words following most-type quantifiers and get worse at predicting words following few-type quantifiers
Larger models make predictions increasingly in accordance with typicality, overwhelming any sensitivity to quantifier type
Sensitivity analysis shows all models have a poor and largely invariant sensitivity overall
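
To see how typicality can overwhelm sensitivity to quantifier type, consider the sketch below. It rests on an assumption: this summary does not define the sensitivity measure, so the sketch treats a model as sensitive on an item only if it prefers the typical word after the most-type quantifier and the atypical word after the few-type quantifier. The field names and surprisal values are hypothetical.

```python
def evaluate(items):
    """Accuracy per quantifier type plus an assumed sensitivity score.

    Each item holds the surprisal of the typical and atypical critical word
    after a most-type and after a few-type quantifier.
    """
    n = len(items)
    # After a most-type quantifier the typical word is the appropriate one.
    most_ok = [it["most_typical"] < it["most_atypical"] for it in items]
    # After a few-type quantifier the atypical word is the appropriate one.
    few_ok = [it["few_atypical"] < it["few_typical"] for it in items]
    return {
        "most_accuracy": sum(most_ok) / n,
        "few_accuracy": sum(few_ok) / n,
        # Assumed definition: correct under both quantifier types.
        "sensitivity": sum(m and f for m, f in zip(most_ok, few_ok)) / n,
    }


# Hypothetical model whose predictions follow typicality alone: the typical
# word always has the lower surprisal, whatever the quantifier.
typicality_driven = [
    {"most_typical": 3.0, "most_atypical": 9.0, "few_typical": 4.0, "few_atypical": 8.0},
    {"most_typical": 2.5, "most_atypical": 7.0, "few_typical": 3.5, "few_atypical": 6.5},
]

result = evaluate(typicality_driven)
print(result)
```

A model like this scores perfectly on most-type items, fails every few-type item, and has zero sensitivity under the assumed definition. That is the pattern the notes above describe as becoming more pronounced with model size.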

Models tend to perform better as they get larger and are trained on more data
Existing evidence supports this scaling trend
Predictions of larger models and those trained on more data correlate with human incremental online predictions
Easier for humans to process well-formed sentences with plausible semantics
Predictions of larger models can align less with explicit human judgements
Language models may struggle to make predictions in line with human offline judgements
Tailoring training may be necessary to avoid specific known issues