Abstract

  • Language models perform poorly on quantification
  • ‘Few’-type quantifiers pose a particular challenge for language models
  • 960 sentences were presented to 22 autoregressive transformer models of differing sizes
  • Performance on ‘few’-type quantifiers decreased as model size increased, suggesting that larger models reflect online rather than offline human processing

Paper Content

Introduction

  • Quantifiers can change the meaning of an utterance
  • Sentences with the same content words can have opposite meanings
  • Language models struggle to predict which quantifier is used in a given context
  • Language models have poor performance at generating appropriate continuations following logical quantifiers
  • Large language models are being used as general systems for multiple tasks
  • It is important that language models can distinguish between sentences with different meanings
  • This study evaluates how well language models take into account the meaning of a quantifier when generating text
  • Investigates whether there is an inverse scaling relationship with model size
  • Negation is challenging for language models
  • This study focuses on quantifiers indicating typicality such as most and few
  • Uses stimuli from a previously published N400 study
  • Tests whether language models show the same pattern of insensitivity towards the quantifiers

Language models

  • Analyzed GPT-2, GPT-3, GPT-Neo, OPT, and InstructGPT language models
  • Compared different training data and numbers of parameters

Evaluation

  • Calculated the surprisal of the critical word in each stimulus sentence
  • Considered the surprisal of the critical word given its preceding context
  • Converted the probability of the target word to surprisal, its negative log-probability (Equation 1 in the paper)
  • Included both single-token and multi-token critical words
  • Compared which of the two possible critical words had a lower surprisal
  • Calculated accuracy as fraction of stimulus pairs for which model predicted the appropriate critical word
  • Analyzed model sensitivity to the quantifiers
  • All code and data will be published online on acceptance
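
The evaluation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the base-2 logarithm, the summing of token surprisals for multi-token words, and every probability in the example are assumptions made for demonstration. In practice the token probabilities would come from a language model's output distribution.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal of one token: the negative log of its probability.

    Base 2 (bits) is an assumption; the summary does not state the base.
    """
    return -math.log2(prob)


def word_surprisal(token_probs: list[float]) -> float:
    """Surprisal of a word that may span several tokens.

    Assumption: token surprisals are summed, which equals the surprisal
    of the product of the conditional token probabilities (chain rule).
    """
    return sum(surprisal(p) for p in token_probs)


def accuracy(pairs: list[tuple[list[float], list[float]]]) -> float:
    """Fraction of pairs where the appropriate word has lower surprisal.

    Each pair is (appropriate_word_token_probs, inappropriate_word_token_probs).
    """
    correct = sum(
        word_surprisal(good) < word_surprisal(bad) for good, bad in pairs
    )
    return correct / len(pairs)


# Hypothetical probabilities, invented purely for illustration.
pairs = [
    ([0.5], [0.25]),        # 1 bit vs 2 bits: counted as correct
    ([0.5, 0.5], [0.125]),  # 2 bits (two tokens) vs 3 bits: correct
    ([0.125], [0.5]),       # 3 bits vs 1 bit: incorrect
    ([0.25, 0.5], [0.5]),   # 3 bits (two tokens) vs 1 bit: incorrect
]

print(accuracy(pairs))  # 0.5
```

Comparing surprisals within a pair is equivalent to comparing probabilities, so the choice of logarithm base does not affect the accuracy figure.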

Results

  • Accuracy of models increases with size for most-type quantifiers, but decreases for few-type quantifiers
  • Small exceptions to this pattern exist
  • Sensitivity of models varies, but is generally low
  • No clear pattern in sensitivity

Discussion

Inverse scaling with quantifiers

  • As models increase in size, they tend to improve at predicting words following most-type quantifiers and get worse at predicting words following few-type quantifiers
  • Larger models make predictions increasingly in accordance with typicality, overwhelming any sensitivity to quantifier type
  • Sensitivity analysis shows all models have a poor and largely invariant sensitivity overall

Further implications

  • Models tend to perform better as they get larger and are trained on more data, a trend that existing evidence supports
  • Predictions of larger models and those trained on more data correlate with human incremental online predictions
  • Humans find it easier to process well-formed sentences with plausible semantics
  • Predictions of larger models can align less with explicit human judgements
  • Language models may struggle to make predictions in line with human offline judgements
  • Tailoring training may be necessary to avoid specific known issues