Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Most targeted syntactic evaluations ask models to make judgements with a single sentence as input, presented without any surrounding context.
This paper investigates the stability of language models’ performance on targeted syntactic evaluations when varying properties of the input context.
Results show that model judgements are generally robust when placed in randomly sampled linguistic contexts, but unstable for contexts containing syntactic structures matching those in the critical test content.
Performance is improved by providing contexts with matching syntactic structures, and worsened by unacceptable contexts with matching but violated syntactic structures.
Changes in model performance are not explainable by simple features matching the context and the test inputs.
Sensitivity to specific syntactic features of the context can only be explained by the models’ implicit in-context learning abilities.

Many evaluation methods have been developed to better understand and characterize LLMs’ linguistic capacities
The minimal-pair paradigm (MPP) is a popular approach to evaluate models’ knowledge of linguistic phenomena
MPP presents models with pairs of minimally differing text sequences, one acceptable and one unacceptable, and checks whether the model assigns higher probability to the acceptable one
Syntactic priming literature investigates the effect of linguistic contexts
Interaction of context with minimal pair accuracies remains underexplored for multi-sentence contexts
This paper evaluates minimal pairs with additional context prepended, making use of the models’ context window
This tests the sensitivity of LLMs’ acceptability preferences in a more realistic evaluation setting
The focus is on LLMs’ sensitivity to the length of the input sequence, the similarity of the context to the minimal pair, and ungrammatical language in the context
Results show robustness to unrelated, out-of-domain Wikipedia sentences in context
Strong sensitivity to in-domain context manipulations and ungrammatical context
Explored linguistic features to explain results and found trends cannot be explained by low-level overlap features
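
As a minimal sketch of how MPP accuracy is scored, the example below counts the pairs for which a model assigns a higher log-probability to the acceptable sentence. The add-alpha bigram model, the tiny corpus, and the names train_bigram and mpp_accuracy are illustrative assumptions standing in for a real LLM scorer; they are not taken from the paper.

```python
import math
from collections import Counter


def train_bigram(corpus, alpha=0.1):
    """Fit an add-alpha smoothed bigram model (a toy stand-in for an LLM).

    Returns a function mapping a sentence to its log-probability.
    """
    contexts, bigrams = Counter(), Counter()
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])              # how often each token is a left context
        bigrams.update(zip(tokens, tokens[1:]))   # adjacent token pairs
    vocab_size = len(vocab) + 1                   # one extra slot for unseen tokens

    def log_prob(sentence):
        tokens = ["<s>"] + sentence.split()
        return sum(
            math.log((bigrams[(a, b)] + alpha) / (contexts[a] + alpha * vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )

    return log_prob


def mpp_accuracy(pairs, log_prob):
    """Fraction of (acceptable, unacceptable) pairs ranked correctly."""
    correct = sum(log_prob(good) > log_prob(bad) for good, bad in pairs)
    return correct / len(pairs)


# Hypothetical training sentences and minimal pairs (subject-verb agreement).
corpus = ["the cat sleeps", "the cats sleep", "the dog barks", "the dogs bark"]
pairs = [
    ("the cats sleep", "the cats sleeps"),
    ("the dog barks", "the dog bark"),
]
log_prob = train_bigram(corpus)
accuracy = mpp_accuracy(pairs, log_prob)
print(accuracy)
```

Because both members of a pair differ in only one token, comparing whole-sentence log-probabilities isolates the model's preference at that token.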

LLMs’ performance on acceptability judgements can be affected by mismatching sequence lengths between pre-training and testing scenarios.
Recent work has explored the effects of providing additional linguistic context to LLMs by “priming” or prepending their inputs with words/sentences.
LLMs can recognize and represent structural similarities between sentences.
Priming can be used to elicit learning behavior in LLMs.
LLMs can extract higher-level information from their context when processing a new test example of a supervised task.

We use two datasets to evaluate how large language models’ acceptability judgements change with input length
BLiMP is a large-scale MPP dataset consisting of 67 different subsets of 1000 English sentence pairs each
SyntaxGym is a syntactic evaluation benchmark designed with more stringent evaluation criteria
We compute the log-likelihood on the given input using the minicons library
We aim to re-evaluate the acceptability accuracy by steadily increasing the token length of the input
We analyze how prepending the test examples with additional context affects a given model’s acceptability judgements
We construct a context by prepending the same sequence to each target and gradually increasing the length of the context
We sample from several possible sources (acceptable sentences, unacceptable sentences, and control sentences)
We investigate the effect of length on acceptability across autoregressive language models of varying scale
We use Wikipedia data as a control case to simulate out-of-distribution context
We define and test our claims about the effect of length on acceptability with a mixed-effects logistic regression
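
The prefixing procedure above can be sketched roughly as follows: the same sampled prefix is prepended to both members of a minimal pair, and each target is scored conditionally on that prefix as log P(prefix + target) minus log P(prefix). Everything here is an assumption for illustration: toy_log_prob is a hypothetical stand-in for a real scorer (the paper computes log-likelihoods with minicons), and the function names, prefix pool, and sentence-count lengths are invented. The toy scorer ignores the prefix content, so this only exercises the bookkeeping, not the priming effect itself.

```python
import random


def toy_log_prob(text):
    """Hypothetical scorer: -1 per token, plus a penalty for one agreement error."""
    tokens = text.split()
    score = -1.0 * len(tokens)
    for left, right in zip(tokens, tokens[1:]):
        if (left, right) == ("cats", "sleeps"):
            score -= 5.0
    return score


def build_prefix(pool, n_sentences, rng):
    """Sample n_sentences from the pool without replacement and join them."""
    return " ".join(rng.sample(pool, n_sentences))


def conditional_score(log_prob, prefix, target):
    """log P(target | prefix) = log P(prefix + target) - log P(prefix)."""
    if not prefix:
        return log_prob(target)
    return log_prob(prefix + " " + target) - log_prob(prefix)


def accuracy_by_prefix_length(pairs, pool, lengths, log_prob, seed=0):
    """MPP accuracy for each prefix length (measured in sentences)."""
    rng = random.Random(seed)
    results = {}
    for n in lengths:
        correct = 0
        for good, bad in pairs:
            # The same prefix is used for both members of the pair.
            prefix = build_prefix(pool, n, rng)
            good_score = conditional_score(log_prob, prefix, good)
            bad_score = conditional_score(log_prob, prefix, bad)
            correct += good_score > bad_score
        results[n] = correct / len(pairs)
    return results


# Hypothetical pool of acceptable context sentences and one minimal pair.
pool = ["the dog barks", "the dogs bark", "the cat sleeps"]
pairs = [("the cats sleep", "the cats sleeps")]
results = accuracy_by_prefix_length(pairs, pool, lengths=[0, 1, 2], log_prob=toy_log_prob)
print(results)
```

Swapping the pool for unacceptable sentences or for unrelated Wikipedia sentences would give the other context conditions described above, with the scoring code unchanged.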

Models’ accuracy increases with increasing length of grammatical prefixes
Unacceptable prefixes reduce models’ accuracy sharply
In-domain context has a larger impact on acceptability than out-of-domain context
Increasing model size amplifies the negative effect of ungrammatical prefixes
Irrelevant context has a negligible effect on models’ judgements
No effect of prefix length for Wikipedia sentences

Length effects on acceptability judgements are conditional on the similarity of the prefix to the test sentence
Syntactic or lexical similarity may explain this phenomenon
Syntactic priming may be responsible
Lexical overlap may be responsible
Correlations of accuracy with syntactic similarity and with lexical overlap were tested
Correlations were low and non-significant
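
A correlation check of this kind could look roughly like the sketch below, which measures lexical overlap between a prefix and a test sentence and correlates it with accuracy. The choice of Jaccard overlap over lowercased whitespace tokens and of Pearson correlation are assumptions made for illustration; the summary does not say which overlap measure or correlation statistic the paper uses, and no significance test is included here.

```python
import math


def jaccard_overlap(prefix, target):
    """Share of distinct tokens that the prefix and the target have in common."""
    a = set(prefix.lower().split())
    b = set(target.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# One overlap value per (prefix, test sentence) item.
overlap = jaccard_overlap("the cat sleeps", "the dog sleeps")
print(overlap)
```

In use, pearson would be called with one overlap value and one accuracy value per item; a coefficient near zero would match the low correlations reported above.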

Language models’ acceptability judgements are sensitive to the domain and acceptability of the input.
Short inputs may not be representative of models’ true abilities.
Performance is only sensitive to length when using in-domain examples.
Models behave consistently with the acceptability of their prefix: acceptable prefixes raise accuracy and unacceptable prefixes lower it.
Models are sensitive to individual prompts, ordering of in-context examples, and choice of prompt and output verbalizer.
Length effects on performance may be more significant with in-distribution training examples.