Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.
  • Trained on trillions of tokens using publicly available datasets.
  • LLaMA-13B outperforms GPT-3 (175B) on most benchmarks.
  • LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.
  • Release all models to research community.

Paper Content

Introduction

  • LLMs trained on large corpora of texts can perform new tasks from instructions or examples
  • Scaling models to a sufficient size results in few-shot properties
  • More parameters do not always lead to better performance
  • Objective of scaling laws is to determine how to best scale dataset and model sizes for a particular training compute budget
  • Focus of this work is to train a series of language models that achieve best possible performance at various inference budgets
  • Models range from 7B to 65B parameters with competitive performance compared to best existing LLMs
  • Training approach is similar to methods described in previous work and is inspired by Chinchilla scaling laws

Pre-training data

  • Training dataset is a mixture of several sources
  • Data sources are publicly available and compatible with open sourcing
  • 67% of dataset is English CommonCrawl
  • Preprocessed with CCNet pipeline
  • 4.5% of dataset is Gutenberg and Books3
  • 2% of dataset is Stack Exchange
  • Tokenized with the byte-pair encoding (BPE) algorithm
  • 1.4T tokens after tokenization
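As a rough illustration of the byte-pair encoding step mentioned above, the sketch below performs one merge on a toy vocabulary. The function names and the toy corpus are hypothetical, and this is a minimal teaching version rather than the tokenizer actually used for LLaMA.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair.

    `words` maps a tuple of symbols to how often that word occurs (toy data).
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: "low" seen 5 times, "lot" seen 2 times.
toy_words = {("l", "o", "w"): 5, ("l", "o", "t"): 2}
best_pair = most_frequent_pair(toy_words)
toy_words_after = merge_pair(toy_words, best_pair)
```

Repeating these two steps until a target vocabulary size is reached gives the usual BPE training loop.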

Architecture

  • Based on transformer architecture
  • Leveraged various improvements
  • Normalize input of each sub-layer
  • Use RMSNorm normalizing function
  • Replace ReLU with SwiGLU activation
  • Use a hidden dimension of (2/3)·4d instead of 4d
  • Add rotary positional embeddings

Optimizer

  • Trained using AdamW optimizer
  • Hyper-parameters: β1 = 0.9, β2 = 0.95
  • Cosine learning rate schedule, final learning rate 10% of maximal learning rate
  • Weight decay of 0.1, gradient clipping of 1.0
  • 2,000 warmup steps, learning rate and batch size vary with model size
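The schedule described above can be sketched as a small function. The linear shape of the warmup, the example peak learning rate, and the total step count are assumptions for illustration; only the 2,000 warmup steps and the 10% floor come from the summary.

```python
import math

def learning_rate(step, max_lr, total_steps, warmup_steps=2000, final_ratio=0.1):
    """Warmup followed by cosine decay down to final_ratio * max_lr."""
    if step < warmup_steps:
        # Linear warmup (assumed shape).
        return max_lr * step / warmup_steps
    min_lr = final_ratio * max_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine

# Hypothetical values: peak learning rate 3e-4 over 100,000 steps.
peak = learning_rate(2000, 3e-4, 100_000)
final = learning_rate(100_000, 3e-4, 100_000)
```

With these inputs the rate peaks at the end of warmup and ends at one tenth of the peak.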

Efficient implementation

  • Used an efficient implementation of the causal multi-head attention to reduce memory usage and runtime
  • Reduced the amount of activations that are recomputed during the backward pass with checkpointing
  • Used model and sequence parallelism to reduce memory usage
  • Overlapped the computation of activations and the communication between GPUs over the network
  • Training a 65B-parameter model processes around 380 tokens/sec/GPU
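The throughput figure above can be turned into a wall-clock estimate. The GPU count of 2,048 is taken from the paper rather than from this summary, so treat it as an assumption here. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400

def training_days(total_tokens, tokens_per_sec_per_gpu, num_gpus):
    """Estimate wall-clock training time in days from aggregate throughput."""
    tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    return total_tokens / tokens_per_sec / SECONDS_PER_DAY

# 1.4T tokens at 380 tokens/sec/GPU on an assumed 2,048 GPUs.
days = training_days(1.4e12, 380, 2048)
print(f"{days:.1f} days")
```

Under these assumptions the estimate comes out at roughly three weeks.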

Main results

  • Zero-shot tasks: provide a textual description and a test example
  • Few-shot tasks: provide a few examples (1-64) and a test example
  • Compare LLaMA with other foundation models
  • Evaluate LLaMA on free-form generation and multiple choice tasks
  • Select completion with highest likelihood given context
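A minimal sketch of likelihood-based answer selection for multiple choice tasks follows. Normalizing by the number of characters in the completion follows the paper's description rather than the bullet above, and the log-likelihood values are made up for illustration; in practice they would come from the model.

```python
def pick_completion(log_likelihoods):
    """Pick the completion with the highest per-character log-likelihood.

    `log_likelihoods` maps each candidate completion (a string) to its total
    log-likelihood given the context (hypothetical numbers here).
    """
    def score(item):
        completion, log_likelihood = item
        return log_likelihood / len(completion)

    best, _ = max(log_likelihoods.items(), key=score)
    return best

# Made-up scores: the short answer has the higher raw likelihood,
# but the longer one wins after length normalization.
candidates = {"Paris": -2.0, "the city of Lyon": -4.0}
choice = pick_completion(candidates)
```

Without normalization, longer completions are penalized simply for containing more tokens, which is why some form of length normalization is commonly applied.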

Common sense reasoning

  • 8 standard common sense reasoning benchmarks are considered
  • Evaluation is done in the zero-shot setting
  • LLaMA-65B outperforms Chinchilla-70B on all benchmarks except BoolQ
  • LLaMA-13B outperforms GPT-3 on most benchmarks despite being 10x smaller

Closed-book question answering

  • LLaMA is compared to existing large language models on two closed-book question answering benchmarks: Natural Questions and TriviaQA.
  • Performance is reported in a closed book setting, meaning models do not have access to documents.
  • LLaMA-65B achieves state-of-the-art performance in zero-shot and few-shot settings.
  • LLaMA-13B is competitive with GPT-3 and Chinchilla despite being 5-10x smaller.

Reading comprehension

  • Evaluated models on RACE reading comprehension benchmark
  • Dataset collected from English reading comprehension exams for Chinese students
  • Followed evaluation setup from Brown et al. (2020)
  • LLaMA-65B competitive with PaLM-540B
  • LLaMA-13B outperforms GPT-3 by a few percent

Mathematical reasoning

  • Evaluated models on two mathematical reasoning benchmarks: MATH and GSM8k
  • Compared with PaLM and Minerva
  • Minerva is a series of PaLM models finetuned on 38.5B tokens from ArXiv and Math Web Pages
  • PaLM and LLaMA not finetuned on mathematical data
  • On GSM8k, LLaMA-65B outperforms Minerva-62B without being finetuned on mathematical data

Code generation

  • Evaluated ability of models to write code from natural language description on two benchmarks: HumanEval and MBPP
  • Model receives description of program in sentences and input-output examples
  • For HumanEval, the model also receives a function signature
  • Compared pass@1 scores of models with existing language models not finetuned on code
  • LLaMA outperforms other general models such as LaMDA and PaLM
  • Finetuning on code tokens can improve performance on code tasks
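For reference, pass@k scores such as the pass@1 mentioned above are commonly computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021). The sketch below assumes that estimator. The function and variable names are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number that pass the tests,
    k: sample budget. Returns the probability that at least one of
    k samples drawn without replacement from the n is correct.
    """
    if k > n:
        raise ValueError("k must not exceed n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples generated, 3 of them correct.
p1 = pass_at_k(10, 3, 1)
```

A benchmark score is then the mean of this value over all problems.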

Massive multitask language understanding

  • MMLU is a benchmark consisting of multiple choice questions from various domains
  • Results are reported in Table 9
  • LLaMA-65B is behind Chinchilla-70B and PaLM-540B by a few percent
  • Limited amount of books and academic papers in the pre-training data (177GB) may explain why LLaMA-65B is behind
  • The larger amount of books used in pre-training (up to 2TB) may also explain why Gopher outperforms GPT-3 on this benchmark

Evolution of performance during training

  • Tracked performance of models on question answering and common sense benchmarks
  • Performance improves steadily and correlates with training perplexity

Instruction finetuning

  • Reported results of the instruction-finetuned model LLaMA-I on MMLU and compared with existing instruction finetuned models
  • Despite simplicity of instruction finetuning approach, reached 68.9% on MMLU
  • Still far from state-of-the-art of 77.4% on MMLU

Bias, toxicity and misinformation

  • Large language models can reproduce and amplify existing biases in training data.
  • Large language models can generate toxic or offensive content.
  • It is important to understand the potential harm of language models.

RealToxicityPrompts

  • Language models can generate toxic language
  • RealToxicityPrompts benchmark is used to measure toxicity
  • Greedy decoding is used to generate a completion for each of the 100k prompts
  • Toxicity increases with size of model

CrowS-Pairs

  • Evaluated biases in model on CrowS-Pairs dataset
  • Measured biases in 9 categories
  • Compared model to GPT-3 and OPT-175B
  • Model slightly better than both on average
  • Model particularly biased in religion, age and gender

WinoGender

  • Investigated biases of model on gender category using WinoGender benchmark
  • WinoGender made of Winograd schema, biases evaluated by determining if model co-reference resolution performance impacted by pronoun gender
  • Sentences have 3 mentions: occupation, participant, pronoun
  • Model compared perplexity of continuations of nurse and patient to perform co-reference resolution
  • Model significantly better at performing co-reference resolution for “their/them/someone” pronouns than “her/her/she” and “his/him/he” pronouns
  • Model likely using majority gender of occupation to perform co-reference resolution instead of using evidence of sentence
  • Model makes more errors on “gotcha” examples, showing it captures societal biases related to gender and occupation
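The perplexity comparison used for co-reference resolution can be sketched as below. The per-token log-probabilities are made up for illustration and the function names are hypothetical; in the actual evaluation these values would come from the model scoring each candidate continuation.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def resolve_coreference(candidates):
    """Pick the candidate whose continuation has the lowest perplexity.

    `candidates` maps a mention (e.g. "nurse") to the token log-probs of
    the continuation that resolves the pronoun to that mention.
    """
    return min(candidates, key=lambda name: perplexity(candidates[name]))

# Made-up log-probabilities for the two continuations.
scores = {"nurse": [-0.1, -0.2, -0.3], "patient": [-1.0, -1.5, -0.8]}
predicted = resolve_coreference(scores)
```

Comparing accuracy of this procedure across pronoun types is what exposes the gender bias described above.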

TruthfulQA

  • TruthfulQA measures the truthfulness of a model
  • Questions are written in diverse style, cover 38 categories and are designed to be adversarial
  • Performance of GPT-3 and our model is reported in Table 14
  • Our model scores higher than GPT-3 in both categories, but rate of correct answers is still low

Carbon footprint

  • Training of models consumed a lot of energy and caused carbon dioxide emissions
  • Estimate Watt-hour and tons of carbon emissions using a formula
  • Carbon emission depends on location of data center
  • Used US national average carbon intensity factor of 0.385 kg CO2eq/kWh
  • Estimated 2,638 MWh and 1,015 tCO2eq emissions for developing models
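The carbon estimate above can be reproduced with simple arithmetic. The sketch assumes the energy formula is GPU-hours times GPU power times a data-center PUE factor. The GPU-hours, wattage, and PUE values in the example are placeholders for illustration, not figures from this summary.

```python
def energy_mwh(gpu_hours, gpu_watts, pue):
    """Energy in MWh: GPU-hours * power per GPU (W) * PUE, converted from Wh."""
    return gpu_hours * gpu_watts * pue / 1e6

def emissions_tco2eq(mwh, kg_co2eq_per_kwh=0.385):
    """Emissions in tonnes CO2eq.

    MWh -> kWh multiplies by 1000 and kg -> t divides by 1000,
    so the two conversion factors cancel.
    """
    return mwh * kg_co2eq_per_kwh

# Placeholder inputs: 1,000 GPU-hours at 400 W with a PUE of 1.1.
example_energy = energy_mwh(1000, 400, 1.1)

# Figures from the summary: 2,638 MWh at 0.385 kg CO2eq/kWh.
total_emissions = emissions_tco2eq(2638)
print(f"{total_emissions:.0f} tCO2eq")
```

The second call gives about 1,016 tCO2eq, in line with the reported 1,015 tCO2eq up to rounding.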

Related work

  • Language models are probability distributions over sequences of words, tokens or characters
  • Task is often framed as next token prediction
  • Language modeling has been proposed as a benchmark to measure progress toward artificial intelligence
  • Traditionally, language models were based on n-gram count statistics
  • Neural networks have been applied to language modelling task
  • Transformer networks have led to important improvements
  • Scaling for language models has been studied for both model and dataset sizes
  • Power laws exist between model and dataset sizes and performance of system

Conclusion

  • Presented a series of language models that are competitive with state-of-the-art models
  • LLaMA-13B outperforms GPT-3 while being more than 10x smaller
  • LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B
  • Achieved state-of-the-art performance by training on publicly available data
  • Released models to research community to accelerate development of large language models
  • Finetuning models on instructions leads to promising results
  • Plan to release larger models trained on larger pretraining corpora