Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Targeted syntactic evaluations ask models to make judgements with a single context-free sentence as input.
- This paper investigates the stability of language models’ performance on targeted syntactic evaluations when varying properties of the input context.
- Results show that model judgements are generally robust when placed in randomly sampled linguistic contexts, but unstable for contexts containing syntactic structures matching those in the critical test content.
- Performance is improved by providing contexts with matching syntactic structures, and worsened by unacceptable contexts with matching but violated syntactic structures.
- Changes in model performance are not explainable by simple features matching the context and the test inputs.
- Sensitivity to specific syntactic features of the context can only be explained by the models’ implicit in-context learning abilities.
Paper Content
Introduction
- Alongside the development of LLMs, many methods have been proposed to better understand and characterize models’ linguistic capacities
- Minimal-pair paradigm (MPP) is a popular approach to evaluate models’ knowledge of linguistic phenomena
- MPP presents models with datasets containing pairs of minimally differing text sequences
- Syntactic priming literature investigates the effect of linguistic contexts
- Interaction of context with minimal pair accuracies remains underexplored for multi-sentence contexts
- The paper evaluates MPP with inputs that make use of the models’ context window, by prepending additional context to each minimal pair
- It evaluates the sensitivity of LLMs’ acceptability preferences in a more realistic evaluation setting
- The focus is on LLMs’ sensitivity to the length of the input sequence, the similarity of the context to the minimal pair, and ungrammatical language in the context
- Results show robustness to unrelated, out-of-domain Wikipedia sentences in context
- Strong sensitivity to in-domain context manipulations and ungrammatical context
- Explored linguistic features to explain results and found trends cannot be explained by low-level overlap features
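The minimal-pair paradigm described above can be sketched in a few lines. This is an illustrative sketch only, not the paper's code: the add-one-smoothed bigram model, the toy corpus, and the names `train_bigram`, `log_likelihood`, and `mpp_accuracy` are all assumptions standing in for a real language model scorer. A pair counts as correct when the acceptable sentence receives the higher log-likelihood.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count bigrams in a toy corpus (hypothetical stand-in for a real LM)."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)


def log_likelihood(sentence, model):
    """Add-one-smoothed bigram log-likelihood of a whitespace-tokenized sentence."""
    bigrams, contexts, vocab_size = model
    tokens = ["<s>"] + sentence.split()
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )


def mpp_accuracy(pairs, score):
    """Fraction of (acceptable, unacceptable) pairs where the acceptable one scores higher."""
    pairs = list(pairs)
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)


# Toy data invented for illustration; not taken from BLiMP or SyntaxGym.
TOY_CORPUS = ["the cat sleeps", "the cats sleep", "the dog barks", "the dogs bark"]
PAIRS = [
    ("the cat sleeps", "the cat sleep"),
    ("the dogs bark", "the dogs barks"),
]

MODEL = train_bigram(TOY_CORPUS)
accuracy = mpp_accuracy(PAIRS, lambda s: log_likelihood(s, MODEL))
print(accuracy)
```

In a real evaluation, the `score` argument would wrap a neural language model's sentence log-likelihood; the accuracy computation itself stays the same.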
Background
- LLMs’ performance on acceptability judgements can be affected by mismatching sequence lengths between pre-training and testing scenarios.
- Recent work has explored the effects of providing additional linguistic context to LLMs by “priming” or prepending their inputs with words/sentences.
- LLMs can recognize and represent structural similarities between sentences.
- Priming can be used to elicit learning behavior in LLMs.
- LLMs can extract higher-level information from their context when processing a new test example of a supervised task.
Approach
- We use two datasets to evaluate how large language models’ acceptability judgements change with the length of the input
- BLiMP is a large-scale MPP dataset consisting of 67 different subsets of 1000 English sentence pairs each
- SyntaxGym is a syntactic evaluation benchmark designed with more stringent evaluation criteria
- We compute the log-likelihood on the given input using the minicons library
- We aim to re-evaluate the acceptability accuracy by steadily increasing the token length of the input
- We analyze how prepending the test examples with additional context affects a given model’s acceptability judgements
- We construct a context by prepending the same sequence to each target and gradually increasing the length of the context
- We sample from several possible sources (acceptable sentences, unacceptable sentences, and control sentences)
- We investigate the length acceptability effect over autoregressive language models with varying scales
- We use Wikipedia data as a control case to simulate out-of-distribution context
- We define and test our claims about the effect of length on acceptability with a mixed-effects logistic regression
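The prefixing procedure outlined above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_conditional_score` is a hypothetical unigram "cache" scorer standing in for a language model's log P(target | prefix) (which the paper obtains via the minicons library), `VOCAB_SIZE` is an arbitrary smoothing constant, and the sentences are invented. The same sampled prefix is prepended to both members of a pair, and accuracy is recorded for each prefix length.

```python
import math
import random
from collections import Counter

VOCAB_SIZE = 1000  # hypothetical smoothing constant for the toy scorer


def toy_conditional_score(prefix, target):
    """Toy stand-in for log P(target | prefix): an add-one-smoothed
    unigram model estimated only from the prefix tokens."""
    counts = Counter(prefix.split())
    total = sum(counts.values())
    return sum(
        math.log((counts[tok] + 1) / (total + VOCAB_SIZE))
        for tok in target.split()
    )


def build_prefix(pool, n_sentences, rng):
    """Sample n_sentences distinct sentences from the pool and join them."""
    return " ".join(rng.sample(pool, n_sentences))


def accuracy_by_prefix_length(pairs, pool, lengths, cond_score, seed=0):
    """Map each prefix length (in sentences) to minimal-pair accuracy."""
    rng = random.Random(seed)
    results = {}
    for n in lengths:
        correct = 0
        for good, bad in pairs:
            # The same prefix is used for both members of the pair.
            prefix = build_prefix(pool, n, rng)
            correct += cond_score(prefix, good) > cond_score(prefix, bad)
        results[n] = correct / len(pairs)
    return results


# Invented examples for illustration only.
PAIRS = [("the cat sleeps", "the cat sleep")]
ACCEPTABLE_POOL = ["the dog sleeps", "the bird sleeps", "a cat sleeps"]

results = accuracy_by_prefix_length(
    PAIRS, ACCEPTABLE_POOL, lengths=[0, 1, 2], cond_score=toy_conditional_score
)
print(results)
```

Swapping `ACCEPTABLE_POOL` for a pool of unacceptable sentences or unrelated Wikipedia sentences would give the other prefix conditions; only the pool changes, not the evaluation loop.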
Main results
- Models’ accuracy increases with increasing length of grammatical prefixes
- Unacceptable prefixes reduce models’ accuracy sharply
- In-domain context has a larger impact on acceptability than out-of-domain context
- Increase in model size amplifies the negative effect of ungrammatical prefixes
- Irrelevant context has a negligible priming effect on acceptability judgements
- No effect of prefix length for Wikipedia sentences
Prefix similarity analysis
- Length effects on acceptability judgements are conditional on the similarity of the prefix to the test sentence
- Syntactic or lexical similarity may explain this phenomenon
- Syntactic priming may be responsible
- Lexical overlap may be responsible
- Correlations of both syntactic similarity and lexical overlap with accuracy were tested
- Correlations were low and non-significant
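One way to picture the lexical-overlap part of this analysis is the sketch below. It is an assumption-laden illustration: the summary does not state which overlap metric or correlation statistic the paper uses, so Jaccard overlap over whitespace tokens and Pearson's r are chosen here as plausible stand-ins, and the prefixes and outcomes are made-up placeholders rather than results from the paper. Syntactic similarity is omitted because it would require a parser.

```python
import math


def jaccard_overlap(prefix, target):
    """Jaccard similarity between the token sets of prefix and target."""
    a, b = set(prefix.split()), set(target.split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Made-up placeholder values, NOT results from the paper.
prefixes = ["the dog sleeps", "a bird sings loudly", "the cat sleeps often"]
target = "the cat sleeps"
placeholder_correct = [1, 0, 1]  # hypothetical per-item outcomes (1 = correct)

overlaps = [jaccard_overlap(p, target) for p in prefixes]
r = pearson(overlaps, placeholder_correct)
print(overlaps, r)
```

A low, non-significant value of `r` on real data would correspond to the finding above that simple overlap features do not account for the accuracy changes.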
Discussion
- Language models’ acceptability judgements are sensitive to the domain and acceptability of the input.
- Short inputs may not be representative of models’ true abilities.
- Performance is only sensitive to length when using in-domain examples.
- Models show consistent behavior with the acceptability of their prefix.
- Models are sensitive to individual prompts, ordering of in-context examples, and choice of prompt and output verbalizer.
- Length effects on performance may be more significant with in-distribution training examples.