Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Knowledge of syntax includes knowledge of rare, idiosyncratic constructions.
  • LLMs must overcome frequency biases to master such constructions.
  • Prompted GPT-3 to give acceptability judgments on the English-language Article + Adjective + Numeral + Noun construction.
  • Validated prompt using CoLA corpus of acceptability judgments.
  • Compared GPT-3’s judgments to crowdsourced human judgments on a subset of sentences.
  • GPT-3’s judgments broadly similar to human judgments and generally align with proposed constraints in literature.
  • In some cases, GPT-3’s judgments and human judgments diverge from literature and from each other.

Paper Content

Introduction

  • AANN construction is a type of English phrase
  • It is made up of an article, adjective, numeral, and noun
  • Usually the numeral comes before the adjective, but in this case the adjective comes first
  • The article is usually singular, but in this case it is followed by a plural noun phrase
  • A dozen or so papers have been written on the construction
  • The presence of the modifier is crucial
  • The type of modifier is also crucial
  • Prior work has focused on characterizing the semantic and syntactic constraints
  • LLMs have access to construction information
  • LLMs capture verb argument construction biases
  • LLMs have fine-grained lexical semantic information
  • Sentences with similar constructions cluster in embedding space
  • LLMs must overcome statistical regularities to get the AANN construction right

Methods

  • Attaining acceptability judgments from language models is difficult
  • Used prompting paradigm to elicit acceptability judgments
  • Prompt created by combining CoLA training sentences and handcrafted sentences
  • Prompt tested on CoLA dev set, accuracy of 84%, Matthew’s correlation coefficient of 0.63
  • Used prompting technique to test AANN construction
  • Varied main sentence template, adjective, numeral, and nominal
  • Nouns chosen to work with some templates and not others
  • Adjectives behave differently depending on whether they are quantitative, qualitative, or ambiguous
  • Numerals focused on “three” and “five”
  • Generated semantically plausible sentences
  • Obtained acceptability judgments from GPT-3 and human raters on Mechanical Turk

Exp. 1: aann fundamentals

  • Tested AANN construction as laid out in literature
  • Used GPT-3 and human raters to rate AANN construction vs. default and 4 degenerate conditions
  • 126 raters rating 3 sentences each, for 378 total ratings
  • Results show AANN construction rated as good as default and lower ratings for 4 degenerate versions
  • Mixed effect regression shows no significant difference between AANN and default, but significant difference between AANN and 4 degenerate conditions

Exp. 2: adjectives and nouns

  • Experiment focused on AANN construction
  • Varied kinds of adjectives and nouns
  • 190 raters rated 3,420 sentences
  • Tested whether measure-like nouns, qualitative adjectives, and stubbornly distributive adjectives are acceptable in AANN
  • Results showed significant differences in how nouns interact with adjectives
  • Qualitative adjectives scored lowest for art and object nouns
  • Colors and other stubbornly distributive adjectives showed lowest acceptability

Exp. 3: adjective order

  • Qualitative adjectives must appear before quantitative ones in AANN
  • Experiment ran with 3 templates, 5 adjectives, noun “days” and numeral “three” or “five”
  • GPT-3 prefers the order dispreferred in the literature
  • Humans showed no clear preference according to the model

Exp. 4: verb agreement

  • AANN construction challenges number agreement
  • Subjects sometimes take singular verbs
  • Subjects sometimes take plural verbs
  • Subjects sometimes take either singular or plural verbs

Conclusion

  • GPT-3 can recognize and use the form of the AANN construction in a human-like way
  • Future work should study not just the form of the construction but its construal
  • LLMs can recognize the construction but fail at tests of understanding its meaning
  • GPT-3’s performance on the AANN construction demands a significant amount of constructional knowledge
  • Future work should explore how LLMs override heuristics
  • The AANN construction is sensitive to context
  • GPT-3 and human preference for adjective order was studied
  • Mean GPT-3 acceptability ratings in the AANN construction for plural and singular verb agreement was studied
  • Fixed effect coefficients for GPT-3 and human annotators comparing the adjective class x noun class manipulation was studied