Abstract

  • Targeted syntactic evaluations ask models to make judgements with a single context-free sentence as input.
  • This paper investigates the stability of language models’ performance on targeted syntactic evaluations when varying properties of the input context.
  • Results show that model judgements are generally robust when placed in randomly sampled linguistic contexts, but unstable for contexts containing syntactic structures matching those in the critical test content.
  • Performance is improved by providing contexts with matching syntactic structures, and worsened by unacceptable contexts with matching but violated syntactic structures.
  • Changes in model performance are not explainable by simple features matching the context and the test inputs.
  • Sensitivity to specific syntactic features of the context can only be explained by the models’ implicit in-context learning abilities.

Paper Content

Introduction

  • Alongside progress in LLMs, many evaluation methods have been developed to better understand and characterize models’ linguistic capacities
  • Minimal-pair paradigm (MPP) is a popular approach to evaluate models’ knowledge of linguistic phenomena
  • MPP presents models with datasets containing pairs of minimally differing text sequences
  • Syntactic priming literature investigates the effect of linguistic contexts
  • Interaction of context with minimal pair accuracies remains underexplored for multi-sentence contexts
  • The paper evaluates MPP with additional context placed in the model’s context window
  • This tests the sensitivity of LLMs’ acceptability preferences in a more realistic evaluation setting
  • The focus is on LLMs’ sensitivity to the length of the input sequence, the similarity of the context to the minimal pair, and ungrammatical language in the context
  • Results show robustness to unrelated, out-of-domain Wikipedia sentences in context
  • Strong sensitivity to in-domain context manipulations and ungrammatical context
  • Explored linguistic features to explain results and found trends cannot be explained by low-level overlap features

Background

  • LLMs’ performance on acceptability judgements can be affected by mismatching sequence lengths between pre-training and testing scenarios.
  • Recent work has explored the effects of providing additional linguistic context to LLMs by “priming” or prepending their inputs with words/sentences.
  • LLMs can recognize and represent structural similarities between sentences.
  • Priming can be used to elicit learning behavior in LLMs.
  • LLMs can extract higher-level information from their context when processing a new test example of a supervised task.

Approach

  • We use two datasets to evaluate how large language models’ acceptability judgements change with the length of the input
  • BLiMP is a large-scale MPP dataset consisting of 67 different subsets of 1000 English sentence pairs each
  • SyntaxGym is a syntactic evaluation benchmark designed with more stringent evaluation criteria
  • We compute the log-likelihood on the given input using the minicons library
  • We aim to re-evaluate the acceptability accuracy by steadily increasing the token length of the input
  • We analyze how prepending the test examples with additional context affects a given model’s acceptability judgements
  • We construct a context by prepending the same sequence to each target and gradually increasing the length of the context
  • We sample from several possible sources (acceptable sentences, unacceptable sentences, and control sentences)
  • We investigate the effect of length on acceptability across autoregressive language models of varying scales
  • We use Wikipedia data as a control case to simulate out-of-distribution context
  • We define and test our claims about the effect of length on acceptability with a mixed-effects logistic regression
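
The prefix-based evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `build_prefix`, `evaluate_with_prefix`, `accuracy_by_length`, and `toy_log_likelihood` are hypothetical, and the toy scorer only stands in for a real language-model log-likelihood (which the paper obtains through the minicons library). The sketch assumes a pair counts as correct when the acceptable sentence receives a higher log-likelihood than the unacceptable one under the same prefix. The numbers produced by the toy scorer carry no empirical meaning.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (prefix, sentence) to the log-likelihood of `sentence`
# given `prefix`. A real model-backed scorer would replace the toy one below.
Scorer = Callable[[str, str], float]


def build_prefix(context_sentences: Sequence[str], n: int) -> str:
    """Concatenate the first n context sentences into one prefix string."""
    return " ".join(context_sentences[:n])


def evaluate_with_prefix(
    pairs: Sequence[Tuple[str, str]], prefix: str, score: Scorer
) -> float:
    """Fraction of (acceptable, unacceptable) pairs for which the acceptable
    sentence gets the strictly higher log-likelihood given the same prefix."""
    correct = sum(score(prefix, good) > score(prefix, bad) for good, bad in pairs)
    return correct / len(pairs)


def accuracy_by_length(
    pairs: Sequence[Tuple[str, str]],
    context_sentences: Sequence[str],
    score: Scorer,
) -> List[float]:
    """Accuracy for prefixes of 0, 1, ..., len(context_sentences) sentences."""
    return [
        evaluate_with_prefix(pairs, build_prefix(context_sentences, n), score)
        for n in range(len(context_sentences) + 1)
    ]


def toy_log_likelihood(prefix: str, sentence: str) -> float:
    """Hypothetical stand-in for a language model, NOT a real LM: words
    already seen in the prefix cost -1.0, all other words cost -2.0."""
    seen = set(prefix.lower().split())
    return sum(-1.0 if w in seen else -2.0 for w in sentence.lower().split())


# Illustrative data only (not taken from BLiMP or SyntaxGym).
pairs = [("the cats sleep", "the cats sleeps"), ("the dog barks", "the dog bark")]
context = ["the cats sleep", "the dog barks"]
print(accuracy_by_length(pairs, context, toy_log_likelihood))
```

Keeping the scorer behind a plain callable means the same loop can be reused with acceptable, unacceptable, or control (for example Wikipedia) prefixes by changing only `context_sentences`.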

Main results

  • Models’ accuracy increases with increasing length of grammatical prefixes
  • Unacceptable prefixes reduce models’ accuracy sharply
  • In-domain context has a larger impact on acceptability than out-of-domain context
  • Increase in model size amplifies the negative effect of ungrammatical prefixes
  • Irrelevant context has a negligible priming effect
  • Prefix length has no effect when the prefix consists of Wikipedia sentences

Prefix similarity analysis

  • Length effects on acceptability judgements are conditional on the similarity of the prefix to the test sentence
  • Two candidate explanations are considered: syntactic similarity (syntactic priming) and lexical similarity (lexical overlap)
  • Correlations of accuracy with syntactic similarity and with lexical overlap were tested
  • Correlations were low and non-significant
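
A check of this kind can be sketched by computing an overlap score between each prefix and its test sentence and correlating that score with accuracy. The sketch below is an illustration under assumptions: it uses Jaccard token overlap and a hand-written Pearson correlation, neither of which is confirmed to be the measure used in the paper, and the function names `jaccard_overlap` and `pearson` are hypothetical.

```python
import math
from typing import Sequence


def jaccard_overlap(prefix: str, sentence: str) -> float:
    """Share of distinct lowercase tokens common to prefix and sentence
    (an assumed overlap measure, chosen here only for illustration)."""
    a = set(prefix.lower().split())
    b = set(sentence.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

In use, one would collect an overlap value and an accuracy value per prefix condition, pass both lists to `pearson`, and read a coefficient near zero as consistent with the low correlations reported above.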

Discussion

  • Language models’ acceptability judgements are sensitive to the domain and acceptability of the input.
  • Short inputs may not be representative of models’ true abilities.
  • Performance is only sensitive to length when using in-domain examples.
  • Models’ behavior is consistent with the acceptability of their prefix: acceptable prefixes improve judgements and unacceptable prefixes degrade them.
  • Models are sensitive to individual prompts, ordering of in-context examples, and choice of prompt and output verbalizer.
  • Length effects on performance may be more significant with in-distribution training examples.

Conclusion