Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs can generate coherent, grammatical and seemingly meaningful text.
- LLMs are capable of performing tasks that require abstract knowledge and reasoning.
- LLMs perform impressively on tasks requiring formal linguistic competence, but fail on many tasks requiring functional competence.
Paper Content
Introduction
- Alan Turing proposed the Turing test to determine if an agent is a human or a machine.
- The Turing test has shaped the way society thinks of machine intelligence.
- Two fallacies related to the language-thought relationship exist: “good at language -> good at thought” and “bad at thought -> bad at language”.
- LLMs have been successful in developing linguistic knowledge from large corpora of language data.
- LLMs do not, in and of themselves, model human thought.
- Formal linguistic competence involves knowledge of rules and statistical regularities of language.
- Functional linguistic competence involves the ability to use language in the real world.
- Human language and thought are dissociable.
- LLMs have promise as scientific models of formal language processing.
- LLMs fail to model human thought in several domains.
- LLMs can be combined with other architectures to learn from more than just language data.
Motivation for the distinction between formal vs. functional linguistic competence
- Language is robustly dissociated from other high-level cognition, perception and action
- Human language processing draws on a set of interconnected brain areas in the frontal and temporal lobes
- Language network responds to stimulus features rather than task demands
- Language network is sensitive to linguistic regularities at all levels
- Language network is remarkably selective for language alone
- Individuals with severe aphasia have intact non-linguistic cognitive abilities
- Brain imaging studies show language network is selective for language processing
- Language models are surprisingly successful at mastering formal linguistic competence
- Statistical language models are trained on a word prediction task
- LLMs represent words as vectors in a high-dimensional space and process them with neural networks
- LLMs stand in contrast to models that use explicit, structured hierarchical representations
- N-grams and word embedding models achieved some success in various domains
- LLMs have succeeded on tests of general language understanding and linguistic competence
- Transformer models learn a lot about the structure of language
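The word prediction task mentioned above can be illustrated with the simplest statistical language model, a bigram model. This is a minimal sketch for intuition only: the toy corpus and the function names `train_bigram` and `next_word_probs` are assumptions made for this example, and an LLM replaces the count table with a neural network over high-dimensional vectors.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Relative-frequency estimate of P(next word | previous word)."""
    following = counts.get(prev)
    if not following:
        return {}  # the word was never seen with a successor
    total = sum(following.values())
    return {word: count / total for word, count in following.items()}


# Toy corpus (hypothetical); real models train on far larger text collections.
corpus = "the dog chased the cat and the dog slept".split()
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)
```

In this corpus "the" is followed by "dog" twice and "cat" once, so the model assigns "dog" twice the probability of "cat". A bigram model conditions on a single preceding word, which is why it cannot capture the long-range structure that the summary attributes to transformer models.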
What large language models can do: a case study
- GPT-3 can use complex linguistic features
- Uses pronouns correctly
- Uses passive voice correctly
- Uses prepositions correctly
- Maintains coherence
- Uses discourse relationships
- Can be prompted to perform tasks
- Lacks underlying meaning
- Current versions of GPT-3 are different from the original version
Large language models learn core aspects of human language processing
- LLMs must encode abstract phonological, morphological, syntactic, and semantic rules to be useful models of language processing in humans
- LLMs learn hierarchical structure and abstraction
- Hierarchical structure manifests in many ways, such as non-local feature agreement
- LLMs can handle grammatical tasks that require operating over hierarchical structure
- LLMs can handle other structure-sensitive constructions, like filler-gap dependencies and negative polarity
- LLMs learn abstract rule knowledge, such as part-of-speech categories, parses, named entities, and semantic roles
- LLMs can apply morphosyntactic rules to novel words
- LLMs rely on lexical-semantic cues to some extent
- Humans use diverse cues in language learning and processing that sometimes override or conflict with strict hierarchical syntactic processing
- Humans also rely on memorizing previously seen input, as opposed to purely learning abstract rules
- LLMs show evidence of representing hierarchical structure and abstract linguistic patterns
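One common way to probe non-local feature agreement is with minimal pairs: two sentences that differ only in the critical word, where a model passes if it prefers the grammatical one. The sketch below is an assumption-laden illustration, not the paper's method. The two scorers are hand-written stand-ins (in practice `score` would be a language model's sentence probability), and the lexicon, sentences, and function names are all hypothetical.

```python
# Tiny hand-built lexicon (hypothetical): grammatical number of each word.
NUMBER = {"key": "sg", "keys": "pl", "cabinet": "sg", "is": "sg", "are": "pl"}
VERBS = {"is", "are"}


def agreement_score(sentence, pick_noun):
    """Return 1.0 if the verb agrees with the noun chosen by pick_noun, else 0.0."""
    words = sentence.split()
    verb_idx = next(i for i, w in enumerate(words) if w in VERBS)
    nouns = [w for w in words[:verb_idx] if w in NUMBER and w not in VERBS]
    controller = pick_noun(nouns)
    return 1.0 if NUMBER[controller] == NUMBER[words[verb_idx]] else 0.0


def minimal_pair_accuracy(pairs, score):
    """Fraction of pairs where the grammatical sentence scores strictly higher."""
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)


def linear_scorer(sentence):
    # Heuristic that agrees with the noun closest to the verb.
    return agreement_score(sentence, lambda nouns: nouns[-1])


def head_scorer(sentence):
    # Heuristic that agrees with the first noun, the head of the subject
    # phrase in these particular sentences.
    return agreement_score(sentence, lambda nouns: nouns[0])


# (grammatical, ungrammatical) pairs; the second contains an intervening noun.
pairs = [
    ("the key is here", "the key are here"),
    ("the keys to the cabinet are here", "the keys to the cabinet is here"),
]

linear_acc = minimal_pair_accuracy(pairs, linear_scorer)
head_acc = minimal_pair_accuracy(pairs, head_scorer)
print(linear_acc, head_acc)
```

The linear heuristic is fooled by the intervening noun "cabinet" and gets only one of the two pairs right, while the structure-aware heuristic gets both. Sentences with such intervening nouns are what make this kind of test informative about hierarchical structure rather than surface word order.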
LLMs resemble the human language-selective network
- LLMs can predict activity in the brain's language network
- LLMs and the brain's language network likely have the same objective: next-word prediction
- LLMs and the brain's language network have similar functional response properties
- LLMs and the brain's language network are sensitive to abstract hierarchical rules, isolated phrases/sentences, naturalistic narratives, and jabberwocky stimuli
- LLMs and the brain's language network are sensitive to specific word co-occurrences
- LLMs and the brain's language network have similar internal architectures
Limitations of LLMs as human-like language learners and processors
- LLMs learn some aspects of hierarchical structure and abstraction, but not in a fully human-like way
- LLMs can pick up on statistical regularities to achieve good performance without learning relevant linguistic information
- LLMs can be misled by simple frequency effects
- LLMs generate output based on a combination of word co-occurrence knowledge and abstract morphosyntactic rules
- LLMs require vastly more data than a child is exposed to
- LLMs may be biased towards English and other European languages
- Evidence of strong performance in a variety of languages is growing
Interim conclusions
- LLMs generate highly coherent, grammatical texts that can be indistinguishable from human output
- LLMs exhibit knowledge of hierarchical structure and abstract linguistic categories
- LLMs have overturned claims about the fundamental impossibility of acquiring certain linguistic knowledge
- LLMs have substantial value in the scientific study of language learning and processing
- LLMs acquire large amounts of factual knowledge
- LLMs succeed at some types of mathematical reasoning
- LLMs reproduce many stereotypes and social biases
- LLMs struggle with non-language-specific capabilities
How LLMs fail
- LLMs can use word co-occurrence patterns to “hack” tasks.
- Researchers can construct unusual prompts to prevent LLMs from “hacking”.
- GPT-3 struggles with out-of-distribution problems.
- LLMs have limitations when it comes to non-shallow reasoning tasks.
- GPT-3 has high formal linguistic competence.
Limitations of LLMs as real-life language users
- LLMs are sometimes regarded as precursors to artificial general intelligence (AGI).
- Real-life language use requires non-linguistic cognitive skills.
- Four key capacities are needed for language use: formal reasoning, world knowledge, situation modeling, and social reasoning.
- Language and formal reasoning rely on distinct cognitive and neural systems in humans.
- LLMs can appear to solve math problems but actually rely on heuristics and fail on more complex problems.
- LLMs can be tricked by distractors and generate inconsistent outputs.
- LLMs have impaired knowledge of domains that are underreported in text.
- LLMs fail on commonsense reasoning tasks.
- Language and semantic knowledge rely on distinct neural circuits in humans.
- Situation modeling is not a language-specific skill.
Interim conclusions
- Language use requires integrating language into a broader cognitive framework.
- Models that master many syntactic and distributional properties of human language still cannot use language in human-like ways.
- LLMs struggle with formal reasoning, acquiring comprehensive and consistent world knowledge, tracking objects, relations and events in long inputs, and generating utterances intentionally or inferring communicative intent from linguistic input.
- LLMs succeed at general pattern completion, style transfer, and long- and short-term memory.
Building models that talk and think like humans
- Modularity is required to build models that talk and think like humans
- Curated data combined with diverse objective functions is needed
- Separate benchmarks for formal and functional competence are necessary
Modularity
- Functional competence and formal linguistic competence are distinct capabilities.
- Biological intelligent systems are highly modular.
- Future language models can master both formal and functional linguistic competence by establishing a division of labor.
- Two ways to implement this division of labor: Architectural Modularity and Emergent Modularity.
- Modular models are capable of achieving high task performance, are more efficient, and show high generalizability.
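Architectural modularity, as described above, can be sketched as a system in which a dedicated reasoning component handles what it is built for and a language component handles the rest. The example below is a toy illustration under stated assumptions: `reasoning_module`, `language_module`, and `respond` are hypothetical names, the language module is a stub rather than an LLM, and the routing rule (try to parse the query as arithmetic) is chosen only for simplicity.

```python
import ast
import operator

# Arithmetic operators the reasoning module supports.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def reasoning_module(expression):
    """Exact arithmetic over +, -, *, / by walking a parsed expression tree."""
    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)


def language_module(text):
    """Stand-in for an LLM (assumption): returns a fixed-form fluent reply."""
    return f"I heard: {text}"


def respond(query):
    """Route the query: use the reasoning module if it applies, else language."""
    try:
        return f"The answer is {reasoning_module(query)}."
    except (ValueError, SyntaxError):
        return language_module(query)


print(respond("12 * (3 + 4)"))
print(respond("hello there"))
```

The point of the division of labor is that the arithmetic result is computed exactly rather than approximated from word co-occurrence patterns. Emergent modularity, the second option named above, would instead let such specialization arise inside a single trained model, which this sketch does not attempt to show.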
Curated data and diverse objective functions
- Training LLMs on large “naturalistic” text corpora from the web is insufficient to induce the emergence of functional linguistic competence
- This approach is biased towards low-level input properties
- Text corpora do not faithfully reflect the world
- LLMs have difficulty generalizing out-of-distribution
- A very large amount of naturalistic data would be required for non-linguistic capacities to emerge
- Adjusting training data and/or objective function yields improved results
Separate benchmarks for formal and functional competence
- It is important to develop benchmarks to evaluate formal and functional linguistic competence.
- Existing benchmarks evaluate formal linguistic competence.
- No single benchmark exists for evaluating functional linguistic competence.
- It is possible to disentangle word-co-occurrence-based hacks and true reasoning capabilities.
- It is important to target particular skills known to be separable in humans.
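The case for separate benchmarks can be made concrete with a small scoring sketch: if every test item is tagged with the competence it targets, results can be reported per competence instead of as one pooled number. The function name `scores_by_competence` is hypothetical, and the outcomes below are placeholder values invented for illustration, not measurements from the paper or from any model.

```python
from collections import defaultdict


def scores_by_competence(results):
    """Compute accuracy separately for each competence label."""
    totals = defaultdict(lambda: [0, 0])  # label -> [passed, attempted]
    for competence, passed in results:
        totals[competence][0] += int(passed)
        totals[competence][1] += 1
    return {label: passed / attempted for label, (passed, attempted) in totals.items()}


# Placeholder outcomes (invented): (competence targeted by the item, passed?)
results = [
    ("formal", True), ("formal", True), ("formal", True), ("formal", False),
    ("functional", True), ("functional", False),
    ("functional", False), ("functional", False),
]

per_competence = scores_by_competence(results)
pooled = sum(passed for _, passed in results) / len(results)
print(per_competence, pooled)
```

With these placeholder values the pooled accuracy is 0.5, which hides the gap between 0.75 on formal items and 0.25 on functional items. That masking effect is the reason to keep the two kinds of score apart.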
General conclusion
- Discourse around language models consists of overclaiming and underclaiming
- LLMs are successful on tasks that require structural and statistical linguistic competence
- LLMs are underused in linguistics and cognitive science
- LLMs fail on tasks that reflect real-life language use
- LLMs demonstrate the possibility of learning complex syntactic features from linguistic input
- LLMs need to be combined with models that represent abstract knowledge and support complex reasoning to reach AGI