Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.
- Trained on trillions of tokens using publicly available datasets.
- LLaMA-13B outperforms GPT-3 (175B) on most benchmarks.
- LLaMA-65B is competitive with the best models.
- Release all models to research community.
Paper Content
Introduction
- LLMs trained on large corpora of texts can perform new tasks from instructions or examples
- Scaling models to a sufficient size results in few-shot properties
- More parameters does not always lead to better performance
- Objective of scaling laws is to determine how to best scale dataset and model sizes for a particular training compute budget
- Focus of this work is to train a series of language models that achieve best possible performance at various inference budgets
- Models range from 7B to 65B parameters with competitive performance compared to best existing LLMs
- Training approach is similar to methods described in previous work and is inspired by Chinchilla scaling laws
Pre-training data
- Training dataset is a mixture of several sources
- Data sources are publicly available and compatible with open sourcing
- 67% of dataset is English CommonCrawl
- Preprocessed with CCNet pipeline
- 4.5% of dataset is Gutenberg and Books3
- 2% of dataset is Stack Exchange
- Tokenized with bytepair encoding algorithm
- 1.4T tokens after tokenization
Architecture
- Based on transformer architecture
- Leveraged various improvements
- Normalize input of each sub-layer
- Use RMSNorm normalizing function
- Replace ReLU with SwiGLU activation
- Use 2 3 4d instead of 4d
- Add rotary positional embeddings
Optimizer
- Trained using AdamW optimizer
- Hyper-parameters: β 1 = 0.9, β 2 = 0.95
- Cosine learning rate schedule, final learning rate 10% of maximal learning rate
- Weight decay of 0.1, gradient clipping of 1.0
- 2,000 warmup steps, learning rate and batch size vary with model size
Efficient implementation
- Used an efficient implementation of the causal multi-head attention to reduce memory usage and runtime
- Reduced the amount of activations that are recomputed during the backward pass with checkpointing
- Used model and sequence parallelism to reduce memory usage
- Overlapped the computation of activations and the communication between GPUs over the network
- Training a 65B-parameter model processes around 380 tokens/sec/GPU
Main results
- Zero-shot tasks: provide a textual description and a test example
- Few-shot tasks: provide a few examples (1-64) and a test example
- Compare LLaMA with other foundation models
- Evaluate LLaMA on free-form generation and multiple choice tasks
- Select completion with highest likelihood given context
Common sense reasoning
- 8 standard common sense reasoning benchmarks are considered
- Evaluation is done in the zero-shot setting
- LLaMA-65B outperforms Chinchilla-70B on all benchmarks except BoolQ
- LLaMA-13B outperforms GPT-3 on most benchmarks despite being 10x smaller
Closed-book question answering
- LLaMA is compared to existing large language models on two closed-book question answering benchmarks.
- Performance is reported in a closed book setting, meaning models do not have access to documents.
- LLaMA-65B achieves state-of-the-art performance in zero-shot and few-shot settings.
- LLaMA-13B is competitive with GPT-3 and Chinchilla despite being 5-10x smaller.
Reading comprehension
- Evaluated models on RACE reading comprehension benchmark
- Dataset collected from English reading comprehension exams for Chinese students
- Followed evaluation setup from Brown et al. (2020)
- LLaMA-65B competitive with PaLM-540B
- LLaMA-13B outperforms GPT-3 by a few percents
Mathematical reasoning
- Evaluated models on two mathematical reasoning benchmarks: MATH and GSM8k
- Compared with PaLM and Minerva
- Minerva is a series of PaLM models finetuned on 38.5B tokens from ArXiv and Math Web Pages
- PaLM and LLaMA not finetuned on mathematical data
- LLaMA-65B outperforms Minerva-62B without being finetuned on mathematical data
Code generation
- Evaluated ability of models to write code from natural language description on two benchmarks: HumanEval and MBPP
- Model receives description of program in sentences and input-output examples
- HumanEval also receives function signature
- Compared pass@1 scores of models with existing language models not finetuned on code
- LLaMA outperforms other general models such as LaMDA and PaLM
- Finetuning on code tokens can improve performance on code tasks
Massive multitask language understanding
- MMLU is a benchmark consisting of multiple choice questions from various domains
- Results are reported in Table 9
- LLaMA-65B is behind Chinchilla-70B and PaLM-540B by a few percent
- Limited amount of pre-training data (177GB) may explain why LLaMA-65B is behind
- Gopher outperforms GPT-3 on this benchmark due to larger amount of books used in pre-training (2TB)
Evolution of performance during training
- Tracked performance of models on question answering and common sense benchmarks
- Performance improves steadily and correlates with training perplexity
- Reported results of LLaMA-I on MMLU and compared with existing instruction finetuned models
- Despite simplicity of instruction finetuning approach, reached 68.9% on MMLU
- Still far from state-of-the-art of 77.4% on MMLU
Bias, toxicity and misinformation
- Large language models can reproduce and amplify existing biases in training data.
- Large language models can generate toxic or offensive content.
- It is important to understand the potential harm of language models.
Realtoxicityprompts
- Language models can generate toxic language
- RealToxicityPrompts benchmark is used to measure toxicity
- Greedy decoder is used to generate 100k prompts
- Toxicity increases with size of model
Crows-pairs
- Evaluated biases in model on CrowS-Pairs dataset
- Measured biases in 9 categories
- Compared model to GPT-3 and OPT-175B
- Model slightly better than both on average
- Model particularly biased in religion, age and gender
Winogender
- Investigated biases of model on gender category using WinoGender benchmark
- WinoGender made of Winograd schema, biases evaluated by determining if model co-reference resolution performance impacted by pronoun gender
- Sentences have 3 mentions: occupation, participant, pronoun
- Model compared perplexity of continuations of nurse and patient to perform co-reference resolution
- Model significantly better at performing co-reference resolution for “their/them/someone” pronouns than “her/her/she” and “his/him/he” pronouns
- Model likely using majority gender of occupation to perform co-reference resolution instead of using evidence of sentence
- Model makes more errors on “gotcha” examples, showing it captures societal biases related to gender and occupation
Truthfulqa
- TruthfulQA measures the truthfulness of a model
- Questions are written in diverse style, cover 38 categories and are designed to be adversarial
- Co-reference resolution accuracy for the LLaMA models is higher for “their/them/someone” pronouns than “her/her/she” and “his/him/he”
- Performance of GPT-3 and our model is reported in Table 14
- Our model scores higher than GPT-3 in both categories, but rate of correct answers is still low
Carbon footprint
- Training of models consumed a lot of energy and caused carbon dioxide emissions
- Estimate Watt-hour and tons of carbon emissions using a formula
- Carbon emission depends on location of data center
- Used US national average carbon intensity factor of 0.385 kg CO 2 eq/KWh
- Estimated 2,638 MWh and 1,015 tCO 2 eq emissions for developing models
Related work
- Language models are probability distributions over sequences of words, tokens or characters
- Task is often framed as next token prediction
- Language modeling has been proposed as a benchmark to measure progress toward artificial intelligence
- Traditionally, language models were based on n-gram count statistics
- Neural networks have been applied to language modelling task
- Transformer networks have led to important improvements
- Scaling for language models has been studied for both model and dataset sizes
- Power laws exist between model and dataset sizes and performance of system
Conclusion
- Presented a series of language models that are competitive with state-of-the-art models
- LLaMA-13B outperforms GPT-3 while being more than 10x smaller
- LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B
- Achieved state-of-the-art performance by training on publicly available data
- Released models to research community to accelerate development of large language models
- Finetuning models on instructions lead to promising results
- Plan to release larger models trained on larger pretraining corpora
- Evaluated LLaMA on Natural Questions and TriviaQA