Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs can generate coherent, grammatical and seemingly meaningful text.
LLMs are capable of performing tasks that require abstract knowledge and reasoning.
LLMs show impressive performance on tasks requiring formal linguistic competence, but fail on tasks requiring functional competence.

Paper Content

Introduction

Alan Turing proposed the Turing test to determine if an agent is a human or a machine.
The Turing test has shaped the way society thinks of machine intelligence.
Two fallacies related to the language-thought relationship exist: “good at language -> good at thought” and “bad at thought -> bad at language”.
LLMs have been successful in developing linguistic knowledge from large corpora of language data.
LLMs do not, in and of themselves, model human thought.
Formal linguistic competence involves knowledge of rules and statistical regularities of language.
Functional linguistic competence involves the ability to use language in the real world.
Human language and thought are dissociable.
LLMs have promise as scientific models of formal language processing.
LLMs fail to model human thought in several domains.
LLMs can be combined with other architectures to learn from more than just language data.

Motivation for the distinction between formal vs. functional linguistic competence

Language is robustly dissociated from other high-level cognition, perception and action
Human language processing draws on a set of interconnected brain areas in the frontal and temporal lobes
Language network responds to stimulus features rather than task demands
Language network is sensitive to linguistic regularities at all levels
Language network is remarkably selective for language alone
Individuals with severe aphasia have intact non-linguistic cognitive abilities
Brain imaging studies show language network is selective for language processing
Language models are surprisingly successful at mastering formal linguistic competence
Statistical language models are trained on a word prediction task
LLMs use vectors in a high-dimensional space and neural networks
LLMs stand in contrast to models that use explicit, structured hierarchical representations
N-grams and word embedding models achieved some success in various domains
LLMs have succeeded on tests of general language understanding and linguistic competence
Transformer models learn a lot about the structure of language
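The word prediction task and the n-gram models mentioned above can be illustrated with a minimal bigram sketch. This is not the paper's method or any model it evaluates; the toy corpus and the names `train_bigram` and `next_word_probs` are assumptions made up for this example. It only shows what "predicting the next word from co-occurrence counts" means in the simplest case.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        # Sentence boundary markers let the model predict starts and ends.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Maximum-likelihood distribution over the word following `prev`."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # unseen context: a bare bigram model has nothing to say
    return {word: n / total for word, n in followers.items()}


# Toy corpus (hypothetical data, for illustration only).
toy_corpus = [
    "the dog barks",
    "the dog sleeps",
    "the cat sleeps",
]

model = train_bigram(toy_corpus)
probs = next_word_probs(model, "the")
print(probs)  # "dog" gets 2/3 of the probability mass, "cat" gets 1/3
```

A model like this only sees one word of context, which is why it cannot capture the long-distance dependencies that the summary credits transformer models with learning.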

What large language models can do: a case study

GPT-3 can use complex linguistic features
Uses pronouns correctly
Uses passive voice correctly
Uses prepositions correctly
Maintains coherence
Uses discourse relationships
Can be prompted to perform tasks
Lacks underlying meaning
Current versions of GPT-3 are different from the original version

Large language models learn core aspects of human language processing

LLMs must encode abstract phonological, morphological, syntactic, and semantic rules to be useful models of language processing in humans
LLMs learn hierarchical structure and abstraction
Hierarchical structure manifests in many ways, such as non-local feature agreement
LLMs can handle grammatical tasks that require operating over hierarchical structure
LLMs can handle other structure-sensitive constructions, like filler-gap dependencies and negative polarity
LLMs learn abstract rule knowledge, such as part-of-speech categories, parses, named entities, and semantic roles
LLMs can apply morphosyntactic rules to novel words
LLMs rely on lexical-semantic cues to some extent
Humans use diverse cues in language learning and processing that sometimes override or conflict with strict hierarchical syntactic processing
Humans rely on memorizing previously seen input, as opposed to purely learning abstract rules
LLMs show evidence of representing hierarchical structure and abstract linguistic patterns
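One common way to probe non-local agreement is a minimal-pair test: a model should score the grammatical sentence of a pair higher than its ungrammatical twin. The sketch below is an illustration under stated assumptions, not the evaluation used in the paper. The sentence pairs, `minimal_pair_accuracy`, and the deliberately naive `nearest_word_scorer` are all hypothetical. The naive scorer agrees the verb with the word immediately before it, so it passes the local cases and fails when an intervening noun has a different number.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    score: Callable[[str], float],
    pairs: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the
    grammatical sentence receives the strictly higher score."""
    pairs = list(pairs)
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical test items: (grammatical, ungrammatical).
PAIRS = [
    ("the keys to the cabinet are on the table",
     "the keys to the cabinet is on the table"),
    ("the key to the cabinets is on the table",
     "the key to the cabinets are on the table"),
    ("the dogs are asleep", "the dogs is asleep"),
    ("the dog is asleep", "the dog are asleep"),
]


def nearest_word_scorer(sentence: str) -> float:
    """Toy linear baseline (an assumption for this example): reward
    agreement between the verb and the word right before it,
    ignoring hierarchical structure entirely."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok in ("is", "are") and i > 0:
            prev_is_plural = tokens[i - 1].endswith("s")
            return 1.0 if prev_is_plural == (tok == "are") else 0.0
    return 0.0


print(minimal_pair_accuracy(nearest_word_scorer, PAIRS))  # 0.5
```

To test an actual language model in this setup, `score` would be replaced by a function returning the model's log-probability of the sentence. A model that tracks the true subject rather than the nearest noun should get all four pairs right.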

LLMs resemble the human language-selective network

LLMs can predict activity in the language brain network
LLMs and the language brain network likely have the same objective: next-word prediction
LLMs and the language brain network have similar functional response properties
LLMs and the language brain network are sensitive to abstract hierarchical rules, isolated phrases/sentences, naturalistic narratives, and jabberwocky stimuli
LLMs and the language brain network are sensitive to specific word co-occurrences
LLMs and the language brain network have similar internal architectures

Limitations of LLMs as human-like language learners and processors

LLMs learn some aspects of hierarchical structure and abstraction, but not fully human-like
LLMs can pick up on statistical regularities to achieve good performance without learning relevant linguistic information
LLMs can be misled by simple frequency effects
LLMs generate output based on a combination of word co-occurrence knowledge and abstract morphosyntactic rules
LLMs require vastly more data than a child is exposed to
LLMs may be biased towards English and other European languages
Evidence of strong performance in a variety of languages is growing

Interim conclusions

LLMs generate highly coherent, grammatical texts that can be indistinguishable from human output
LLMs exhibit knowledge of hierarchical structure and abstract linguistic categories
LLMs have overturned claims about the fundamental impossibility of acquiring certain linguistic knowledge
LLMs have substantial value in the scientific study of language learning and processing
LLMs acquire large amounts of factual knowledge
LLMs succeed at some types of mathematical reasoning
LLMs reproduce many stereotypes and social biases
LLMs struggle with non-language-specific capabilities

How LLMs fail

LLMs can use word co-occurrence patterns to “hack” tasks.
Researchers can construct unusual prompts to prevent LLMs from “hacking”.
GPT-3 struggles with out-of-distribution problems.
LLMs have limitations when it comes to non-shallow reasoning tasks.
GPT-3 has high formal linguistic competence.

Limitations of LLMs as real-life language users

LLMs are thought by some to be precursors to AGI.
Real-life language use requires non-linguistic cognitive skills.
Four key capacities are needed for language use: formal reasoning, world knowledge, situation modeling, and social reasoning.
Language and formal reasoning rely on distinct cognitive and neural systems in humans.
LLMs can appear to solve math problems but actually rely on heuristics and fail on more complex problems.
LLMs can be tricked by distractors and generate inconsistent outputs.
LLMs have impaired knowledge of domains that are underreported in text.
LLMs fail on commonsense reasoning tasks.
Language and semantic knowledge rely on distinct neural circuits in humans.
Situation modeling is not a language-specific skill.

Interim conclusions

Language use requires integrating language into a broader cognitive framework.
Models that master many syntactic and distributional properties of human language still cannot use language in human-like ways.
LLMs struggle with formal reasoning, acquiring comprehensive and consistent world knowledge, tracking objects, relations and events in long inputs, and generating utterances intentionally or inferring communicative intent from linguistic input.
LLMs succeed at general pattern completion, style transfer, and long- and short-term memory.

Building models that talk and think like humans

Modularity is required to build models that talk and think like humans
Curated data combined with diverse objective functions is needed
Separate benchmarks for formal and functional competence are necessary

Modularity

Functional competence and formal linguistic competence are distinct capabilities.
Biological intelligent systems are highly modular.
Future language models can master both formal and functional linguistic competence by establishing a division of labor.
Two ways to implement this division of labor: Architectural Modularity and Emergent Modularity.
Modular models are capable of achieving high task performance, are more efficient, and show high generalizability.
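The idea of architectural modularity, a division of labor between a language component and separate components for other capacities, can be sketched as a simple router. This is only an illustration under assumptions, not an architecture proposed in the paper. `language_module` is a placeholder standing in for a language model, and `arithmetic_module`, `route`, and the routing rule are hypothetical names and choices made for this example.

```python
import ast
import operator
import re

# Operators the formal-reasoning module is allowed to evaluate.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def arithmetic_module(expression: str):
    """Evaluate a plain arithmetic expression exactly, without eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression.strip(), mode="eval").body)


def language_module(prompt: str) -> str:
    """Placeholder for a language model (an assumption of this sketch)."""
    return f"[language module reply to: {prompt}]"


# Queries made only of digits, arithmetic operators, and parentheses,
# containing at least one digit, go to the arithmetic module.
_ARITHMETIC = re.compile(r"^[\d\s.+\-*/()]*\d[\d\s.+\-*/()]*$")


def route(query: str) -> str:
    """Send each query to the module suited to it."""
    if _ARITHMETIC.match(query):
        try:
            return str(arithmetic_module(query))
        except (SyntaxError, ValueError, ZeroDivisionError):
            pass  # malformed arithmetic falls through to the language module
    return language_module(query)


print(route("12 * (3 + 4)"))                   # 84
print(route("Who proposed the Turing test?"))  # handled by the language stub
```

Here the split between modules is fixed by hand, which corresponds to the architectural variant. In the emergent variant named above, the division of labor would instead be learned during training.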

Curated data and diverse objective functions

Training LLMs on large “naturalistic” text corpora from the web is insufficient to induce the emergence of functional linguistic competence
This approach is biased towards low-level input properties
Text corpora do not faithfully reflect the world
LLMs have difficulty generalizing out-of-distribution
A large amount of naturalistic data is required for non-linguistic capacities to emerge
Adjusting training data and/or objective function yields improved results

Separate benchmarks for formal and functional competence

It is important to develop benchmarks to evaluate formal and functional linguistic competence.
Existing benchmarks evaluate formal linguistic competence.
No single benchmark exists for evaluating functional linguistic competence.
It is possible to disentangle word-co-occurrence-based hacks and true reasoning capabilities.
It is important to target particular skills known to be separable in humans.

General conclusion

Discourse around language models consists of overclaiming and underclaiming
LLMs are successful on tasks that require structural and statistical linguistic competence
LLMs are underused in linguistics and cognitive science
LLMs fail on tasks that reflect real-life language use
LLMs demonstrate the possibility of learning complex syntactic features from linguistic input
LLMs need to be combined with models that represent abstract knowledge and support complex reasoning to reach AGI
