Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Knowledge of syntax includes knowledge of rare, idiosyncratic constructions.
LLMs must overcome frequency biases to master such constructions.
Prompted GPT-3 to give acceptability judgments on the English-language Article + Adjective + Numeral + Noun construction.
Validated prompt using CoLA corpus of acceptability judgments.
Compared GPT-3’s judgments to crowdsourced human judgments on a subset of sentences.
GPT-3’s judgments are broadly similar to human judgments and generally align with constraints proposed in the literature.
In some cases, GPT-3’s judgments and human judgments diverge from literature and from each other.

The AANN (Article + Adjective + Numeral + Noun) construction is a type of English phrase, e.g. “a beautiful five days”
It is made up of an article, adjective, numeral, and noun
Usually the numeral comes before the adjective, but in this case the adjective comes first
The indefinite article normally introduces a singular noun, but in this case it is followed by a plural noun phrase
A dozen or so papers have been written on the construction
The presence of the modifier is crucial
The type of modifier is also crucial
Prior work has focused on characterizing the semantic and syntactic constraints
LLMs have access to construction information
LLMs capture verb argument construction biases
LLMs have fine-grained lexical semantic information
Sentences with similar constructions cluster in embedding space
LLMs must overcome statistical regularities to get the AANN construction right
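
The word order described at the start of this section can be made concrete with a small sketch. The helper names `aann_phrase` and `default_phrase`, the a/an heuristic, and the example words are illustrative assumptions, not material taken from the paper.

```python
def aann_phrase(adjective: str, numeral: str, noun: str) -> str:
    """Article + Adjective + Numeral + Noun, with the adjective before the numeral."""
    # Rough a/an choice based on the first letter only (an assumption for this sketch).
    article = "an" if adjective[0].lower() in "aeiou" else "a"
    return " ".join([article, adjective, numeral, noun])


def default_phrase(adjective: str, numeral: str, noun: str) -> str:
    """Ordinary English order: numeral before adjective, no singular article."""
    return " ".join([numeral, adjective, noun])


if __name__ == "__main__":
    print(aann_phrase("beautiful", "five", "days"))     # a beautiful five days
    print(default_phrase("beautiful", "five", "days"))  # five beautiful days
```

The contrast between the two outputs is the point: the AANN form pairs a singular article with a plural noun and reverses the usual numeral-adjective order, which is exactly the pattern a frequency-driven model would be expected to disprefer.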

Obtaining acceptability judgments from language models is difficult
Used prompting paradigm to elicit acceptability judgments
Prompt created by combining CoLA training sentences and handcrafted sentences
Prompt tested on the CoLA dev set: accuracy of 84%, Matthews correlation coefficient of 0.63
Used prompting technique to test AANN construction
Varied main sentence template, adjective, numeral, and nominal
Nouns chosen to work with some templates and not others
Adjectives behave differently depending on whether they are quantitative, qualitative, or ambiguous
Numerals focused on “three” and “five”
Generated semantically plausible sentences
Obtained acceptability judgments from GPT-3 and human raters on Mechanical Turk
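
The pipeline above (few-shot prompt, template-based stimuli, validation by accuracy and Matthews correlation coefficient) can be sketched as follows. This is a minimal illustration under assumptions: the prompt layout, the function names, and the example template are hypothetical and are not the paper's actual materials. The full cross product also ignores the step of restricting nouns to the templates they fit, which a real pipeline would need.

```python
import math
from itertools import product


def build_prompt(examples, sentence):
    """Few-shot prompt: labeled examples followed by the sentence to judge.

    The 'Sentence:/Acceptable:' layout is a hypothetical format for this sketch.
    """
    blocks = [f"Sentence: {s}\nAcceptable: {'yes' if ok else 'no'}" for s, ok in examples]
    blocks.append(f"Sentence: {sentence}\nAcceptable:")
    return "\n\n".join(blocks)


def generate_stimuli(templates, adjectives, numerals, nouns):
    """Cross every template with every adjective, numeral, and noun."""
    return [
        template.format(aann=f"a {adj} {num} {noun}")
        for template, adj, num, noun in product(templates, adjectives, numerals, nouns)
    ]


def accuracy(gold, pred):
    """Fraction of binary judgments that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def matthews_corrcoef(gold, pred):
    """Matthews correlation coefficient for binary labels (1 = acceptable)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g and p)
    tn = sum(1 for g, p in pairs if not g and not p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    stimuli = generate_stimuli(
        ["The family spent {aann} in London."],  # hypothetical template
        ["beautiful"], ["three", "five"], ["days"],
    )
    print(build_prompt([("The cat sat on the mat.", True)], stimuli[0]))
    print(accuracy([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
    print(matthews_corrcoef([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```

Reporting the Matthews correlation coefficient alongside accuracy is useful for a corpus like CoLA because the coefficient stays informative when acceptable and unacceptable sentences are not equally frequent.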

Tested AANN construction as laid out in literature
Used GPT-3 and human raters to rate AANN construction vs. default and 4 degenerate conditions
126 raters rating 3 sentences each, for 378 total ratings
Results show the AANN construction rated as highly as the default, with lower ratings for the 4 degenerate versions
Mixed-effects regression shows no significant difference between AANN and default, but a significant difference between AANN and the 4 degenerate conditions
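
One way to write down a regression of this kind is shown below. It is a sketch under assumptions: the summary does not give the model specification, so the choice of AANN as the reference level, the random intercepts for rater and item, and all symbols are illustrative.

```latex
\[
y_{ij} = \beta_0 + \sum_{c \in C} \beta_c \,\mathbb{1}[\mathrm{cond}_{ij} = c] + u_i + v_j + \varepsilon_{ij},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
v_j \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\]
```

Here $y_{ij}$ is the rating given by rater $i$ to item $j$, $\beta_0$ is the mean rating for the AANN condition, and $C$ contains the default condition and the 4 degenerate conditions. Under this coding, the reported pattern would correspond to a default coefficient that is not reliably different from zero and reliably negative coefficients for the degenerate conditions.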

Experiment focused on AANN construction
Varied kinds of adjectives and nouns
190 raters rated 3,420 sentences
Tested whether measure-like nouns, qualitative adjectives, and stubbornly distributive adjectives are acceptable in AANN
Results showed significant differences in how nouns interact with adjectives
Qualitative adjectives scored lowest for art and object nouns
Colors and other stubbornly distributive adjectives showed lowest acceptability

According to the literature, qualitative adjectives must appear before quantitative ones in AANN
Experiment ran with 3 templates, 5 adjectives, noun “days” and numeral “three” or “five”
GPT-3 prefers the order dispreferred in the literature
Humans showed no clear preference according to the model

GPT-3 can recognize and use the form of the AANN construction in a human-like way
Future work should study not just the form of the construction but its construal
LLMs can recognize the construction but fail at tests of understanding its meaning
GPT-3’s performance on the AANN construction demands a significant amount of constructional knowledge
Future work should explore how LLMs override heuristics
The AANN construction is sensitive to context
GPT-3 and human preference for adjective order was studied
Mean GPT-3 acceptability ratings in the AANN construction for plural and singular verb agreement were studied
Fixed effect coefficients for GPT-3 and human annotators comparing the adjective class × noun class manipulation were studied