Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Knowledge of syntax includes knowledge of rare, idiosyncratic constructions.
- LLMs must overcome frequency biases to master such constructions.
- Prompted GPT-3 to give acceptability judgments on the English-language Article + Adjective + Numeral + Noun construction.
- Validated prompt using CoLA corpus of acceptability judgments.
- Compared GPT-3’s judgments to crowdsourced human judgments on a subset of sentences.
- GPT-3’s judgments are broadly similar to human judgments and generally align with constraints proposed in the literature.
- In some cases, GPT-3’s judgments and human judgments diverge from the literature and from each other.
Paper Content
Introduction
- The AANN construction is a type of English noun phrase, for example "a lovely five days"
- It is made up of an article, an adjective, a numeral, and a noun, in that order
- Usually the numeral comes before the adjective ("five lovely days"), but in this case the adjective comes first
- The singular article "a" normally requires a singular noun, but in this case it is followed by a plural noun phrase
- A dozen or so papers have been written on the construction
- The presence of the modifier is crucial
- The type of modifier is also crucial
- Prior work has focused on characterizing the semantic and syntactic constraints
- LLMs have access to construction information
- LLMs capture verb argument construction biases
- LLMs have fine-grained lexical semantic information
- Sentences with similar constructions cluster in embedding space
- LLMs must overcome statistical regularities to get the AANN construction right
Methods
- Obtaining acceptability judgments from language models is difficult
- Used prompting paradigm to elicit acceptability judgments
- Prompt created by combining CoLA training sentences and handcrafted sentences
- Prompt tested on CoLA dev set: accuracy of 84%, Matthews correlation coefficient of 0.63
- Used prompting technique to test AANN construction
- Varied main sentence template, adjective, numeral, and nominal
- Nouns chosen to work with some templates and not others
- Adjectives behave differently depending on whether they are quantitative, qualitative, or ambiguous
- Numerals focused on “three” and “five”
- Generated semantically plausible sentences
- Obtained acceptability judgments from GPT-3 and human raters on Mechanical Turk
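The prompt validation above reports accuracy and the Matthews correlation coefficient (MCC). As a reminder of how these two numbers are computed from binary acceptability labels, here is a minimal Python sketch. The labels in `gold` and `pred` are made-up toy values, not CoLA data, and the function names are illustrative rather than taken from the paper.

```python
import math

def confusion_counts(gold, pred):
    """Count true/false positives and negatives for binary labels (1 = acceptable)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(gold, pred):
    """Fraction of sentences where the predicted label matches the gold label."""
    tp, tn, fp, fn = confusion_counts(gold, pred)
    return (tp + tn) / len(gold)

def matthews_corrcoef(gold, pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp, tn, fp, fn = confusion_counts(gold, pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy labels for illustration only (assumption, not real data).
gold = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 1, 0, 0, 1, 1, 0, 1]

print(accuracy(gold, pred))           # 0.75
print(matthews_corrcoef(gold, pred))  # approximately 0.467
```

MCC is often reported alongside accuracy because it stays informative when one class is more frequent than the other.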
Exp. 1: AANN fundamentals
- Tested AANN construction as laid out in literature
- Used GPT-3 and human raters to rate AANN construction vs. default and 4 degenerate conditions
- 126 raters rating 3 sentences each, for 378 total ratings
- Results show the AANN construction rated as highly as the default, with lower ratings for the 4 degenerate versions
- Mixed-effects regression shows no significant difference between AANN and default, but a significant difference between AANN and the 4 degenerate conditions
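To make the comparison between conditions concrete, the sketch below builds stimuli by crossing templates, adjectives, numerals, and nouns, as described in the Methods. It is a sketch under stated assumptions: the templates and word lists are hypothetical, and only two illustrative degenerate variants are shown (a missing modifier, and the numeral placed before the adjective), since the four degenerate conditions used in the paper are not listed in this summary.

```python
from itertools import product

# Hypothetical templates and word lists, chosen for illustration only.
TEMPLATES = ["The family spent {np} in London.", "The trip lasted {np}."]
ADJECTIVES = ["lovely", "long"]   # all consonant-initial, so the article stays "a"
NUMERALS = ["three", "five"]
NOUNS = ["days", "weeks"]

def noun_phrase_variants(adjective, numeral, noun):
    """Return the AANN phrase, the default phrase, and two illustrative degenerate phrases."""
    return {
        "aann": f"a {adjective} {numeral} {noun}",           # article + adjective + numeral + noun
        "default": f"{numeral} {adjective} {noun}",          # ordinary numeral + adjective + noun
        "no_modifier": f"a {numeral} {noun}",                # degenerate: adjective removed
        "numeral_first": f"a {numeral} {adjective} {noun}",  # degenerate: numeral before adjective
    }

def build_items():
    """Cross all templates and lexical choices; each item maps condition name to sentence."""
    items = []
    for template, adjective, numeral, noun in product(TEMPLATES, ADJECTIVES, NUMERALS, NOUNS):
        variants = noun_phrase_variants(adjective, numeral, noun)
        items.append({name: template.format(np=np) for name, np in variants.items()})
    return items

items = build_items()
print(len(items))            # 16 items, each with 4 condition variants
print(items[0]["aann"])      # The family spent a lovely three days in London.
print(items[0]["default"])   # The family spent three lovely days in London.
```

Keeping all variants of an item together makes it straightforward to compare ratings within an item, which is the kind of contrast the regression above tests.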
Exp. 2: adjectives and nouns
- Experiment focused on AANN construction
- Varied kinds of adjectives and nouns
- 190 raters rated 3,420 sentences
- Tested whether measure-like nouns, qualitative adjectives, and stubbornly distributive adjectives are acceptable in AANN
- Results showed significant differences in how nouns interact with adjectives
- Qualitative adjectives scored lowest for art and object nouns
- Colors and other stubbornly distributive adjectives showed lowest acceptability
Exp. 3: adjective order
- According to the literature, qualitative adjectives must appear before quantitative ones in AANN
- Experiment ran with 3 templates, 5 adjectives, noun “days” and numeral “three” or “five”
- GPT-3 prefers the order dispreferred in the literature
- Humans showed no clear preference according to the regression model
Exp. 4: verb agreement
- AANN construction challenges number agreement
- Subjects sometimes take singular verbs
- Subjects sometimes take plural verbs
- Subjects sometimes take either singular or plural verbs
Conclusion
- GPT-3 can recognize and use the form of the AANN construction in a human-like way
- Future work should study not just the form of the construction but its construal
- LLMs can recognize the construction but fail at tests of understanding its meaning
- GPT-3’s performance on the AANN construction demands a significant amount of constructional knowledge
- Future work should explore how LLMs override heuristics
- The AANN construction is sensitive to context
- GPT-3 and human preferences for adjective order were studied
- Mean GPT-3 acceptability ratings in the AANN construction for plural and singular verb agreement were studied
- Fixed-effect coefficients for GPT-3 and human annotators on the adjective class x noun class manipulation were compared