Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters.
Trained on trillions of tokens using publicly available datasets.
LLaMA-13B outperforms GPT-3 (175B) on most benchmarks.
LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.
Release all models to research community.

Paper Content

Introduction

LLMs trained on large corpora of texts can perform new tasks from instructions or examples
Scaling models to a sufficient size results in few-shot properties
More parameters do not always lead to better performance: for a given compute budget, a smaller model trained on more data can do better
Objective of scaling laws is to determine how to best scale dataset and model sizes for a particular training compute budget
Focus of this work is to train a series of language models that achieve best possible performance at various inference budgets
Models range from 7B to 65B parameters with competitive performance compared to best existing LLMs
Training approach is similar to methods described in previous work and is inspired by Chinchilla scaling laws

Pre-training data

Training dataset is a mixture of several sources
Data sources are publicly available and compatible with open sourcing
67% of dataset is English CommonCrawl
Preprocessed with CCNet pipeline
4.5% of dataset is Gutenberg and Books3
2% of dataset is Stack Exchange
Tokenized with the byte-pair encoding (BPE) algorithm
Entire training dataset contains roughly 1.4T tokens after tokenization
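To make the idea of a weighted mixture concrete, here is a minimal sketch that draws training documents from sources in proportion to the percentages listed above. It assumes those percentages act as sampling proportions; the remaining 26.5% is lumped into a placeholder entry because these notes do not itemise it, and the names `SAMPLING_PROPORTIONS` and `sample_sources` are hypothetical.

```python
import random

# Proportions taken from the notes above; the last entry is a placeholder
# for the share that the notes do not break down.
SAMPLING_PROPORTIONS = {
    "CommonCrawl": 0.67,
    "Books (Gutenberg and Books3)": 0.045,
    "StackExchange": 0.02,
    "other sources (not itemised here)": 0.265,
}


def sample_sources(n, seed=0):
    """Draw n source names, each chosen with probability equal to its proportion."""
    rng = random.Random(seed)
    names = list(SAMPLING_PROPORTIONS)
    weights = [SAMPLING_PROPORTIONS[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    draws = sample_sources(10_000)
    for name in SAMPLING_PROPORTIONS:
        print(f"{name}: {draws.count(name) / len(draws):.3f}")
```

Over many draws the empirical share of each source should approach its configured proportion.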

Architecture

Based on transformer architecture
Leveraged various improvements
Normalize input of each sub-layer
Use RMSNorm normalizing function
Replace ReLU with SwiGLU activation
Use a hidden dimension of $\frac{2}{3}4d$ instead of $4d$
Add rotary positional embeddings
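The three architectural changes listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the released implementation: the function names, the `eps` value, the rotary `base` of 10000 and the choice to rotate adjacent pairs of dimensions are all assumptions made for the example, and vectors are plain lists rather than tensors.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_hidden_dim(d_model):
    """Feed-forward width of (2/3) * 4d, rounded down to an integer."""
    return (2 * 4 * d_model) // 3


def matvec(x, w):
    """Multiply row vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (SiLU(x W_gate) * (x W_up)) W_down."""
    gate = [silu(v) for v in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_down)


def rotary(x, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    d = len(x)  # assumed to be even
    out = list(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

Because `rotary` applies a pure rotation, the dot product between a rotated query and a rotated key depends only on the difference between their positions, which is what makes the embedding relative.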

Optimizer

Trained using AdamW optimizer
Hyper-parameters: $\beta_1 = 0.9$, $\beta_2 = 0.95$
Cosine learning rate schedule, final learning rate 10% of maximal learning rate
Weight decay of 0.1, gradient clipping of 1.0
2,000 warmup steps, learning rate and batch size vary with model size
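The schedule described above can be written as a small function. This is a sketch under assumptions: the warmup is taken to be linear (the notes only give its length), and the function name, the `max_lr` value and the `total_steps` value in the usage example are illustrative rather than taken from the paper.

```python
import math


def learning_rate(step, max_lr, total_steps, warmup_steps=2000, final_ratio=0.1):
    """Warmup followed by cosine decay down to final_ratio * max_lr."""
    if step < warmup_steps:
        # Linear warmup is an assumption; only the warmup length is stated.
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = final_ratio * max_lr
    return min_lr + (max_lr - min_lr) * cosine


if __name__ == "__main__":
    # Illustrative values only.
    for step in (0, 1000, 2000, 50_000, 100_000):
        print(step, learning_rate(step, max_lr=3e-4, total_steps=100_000))
```

The rate peaks at `max_lr` when warmup ends and finishes at 10% of that value on the last step.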

Efficient implementation

Used an efficient implementation of the causal multi-head attention to reduce memory usage and runtime
Reduced the amount of activations that are recomputed during the backward pass with checkpointing
Used model and sequence parallelism to reduce memory usage
Overlapped the computation of activations and the communication between GPUs over the network
Training a 65B-parameter model processes around 380 tokens/sec/GPU
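As a rough sanity check on the throughput figure, the arithmetic below converts tokens/sec/GPU into wall-clock training time. The GPU count of 2048 is the figure I recall from the paper and does not appear in these notes, so treat it as an assumption; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_tokens, tokens_per_sec_per_gpu, num_gpus):
    """Wall-clock days needed to process total_tokens at a fixed throughput."""
    tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    return total_tokens / tokens_per_sec / SECONDS_PER_DAY


if __name__ == "__main__":
    # 1.4T tokens and 380 tokens/sec/GPU come from the notes above;
    # 2048 GPUs is an assumption (see lead-in).
    print(f"{training_days(1.4e12, 380, 2048):.1f} days")
```

Under these assumptions a single pass over 1.4T tokens takes about three weeks.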

Main results

Zero-shot tasks: provide a textual description and a test example
Few-shot tasks: provide a few examples (1-64) and a test example
Compare LLaMA with other foundation models
Evaluate LLaMA on free-form generation and multiple choice tasks
Select completion with highest likelihood given context
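For multiple choice tasks, picking the completion with the highest likelihood can be sketched as follows. The log-likelihood values would come from the language model; here they are made-up inputs, and the optional normalization by completion length in characters is included as a common variant rather than as a claim about every benchmark. The function name is hypothetical.

```python
def pick_completion(candidates, normalize_by_length=True):
    """Return the candidate text with the highest (optionally length-normalized) score.

    candidates: list of (text, total_log_likelihood) pairs, where the
    log-likelihood is that of the completion given the context.
    """
    def score(item):
        text, log_likelihood = item
        if normalize_by_length:
            return log_likelihood / max(1, len(text))
        return log_likelihood

    return max(candidates, key=score)[0]


if __name__ == "__main__":
    # Made-up log-likelihoods for illustration.
    options = [("yes", -3.0), ("certainly not", -6.5)]
    print(pick_completion(options))
    print(pick_completion(options, normalize_by_length=False))
```

Normalization matters because longer completions accumulate more negative log-likelihood simply by containing more tokens.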

Common sense reasoning

8 standard common sense reasoning benchmarks are considered
Evaluation is done in the zero-shot setting
LLaMA-65B outperforms Chinchilla-70B on all benchmarks except BoolQ
LLaMA-13B outperforms GPT-3 on most benchmarks despite being 10x smaller

Closed-book question answering

LLaMA is compared to existing large language models on two closed-book question answering benchmarks: Natural Questions and TriviaQA.
Performance is reported in a closed book setting, meaning models do not have access to documents.
LLaMA-65B achieves state-of-the-art performance in zero-shot and few-shot settings.
LLaMA-13B is competitive with GPT-3 and Chinchilla despite being 5-10x smaller.

Reading comprehension

Evaluated models on RACE reading comprehension benchmark
Dataset collected from English reading comprehension exams for Chinese students
Followed evaluation setup from Brown et al. (2020)
LLaMA-65B competitive with PaLM-540B
LLaMA-13B outperforms GPT-3 by a few percent

Mathematical reasoning

Evaluated models on two mathematical reasoning benchmarks: MATH and GSM8k
Compared with PaLM and Minerva
Minerva is a series of PaLM models finetuned on 38.5B tokens from ArXiv and Math Web Pages
PaLM and LLaMA not finetuned on mathematical data
On GSM8k, LLaMA-65B outperforms Minerva-62B without being finetuned on mathematical data

Code generation

Evaluated ability of models to write code from natural language description on two benchmarks: HumanEval and MBPP
Model receives description of program in sentences and input-output examples
In HumanEval, the model also receives a function signature
Compared pass@1 scores of models with existing language models not finetuned on code
LLaMA outperforms other general models such as LaMDA and PaLM
Finetuning on code tokens can improve performance on code tasks
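The pass@k metric mentioned above is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k randomly chosen samples passes. The sketch below implements that estimator; whether every number in the paper was produced exactly this way is not stated in these notes, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that pass, k: evaluation budget.
    """
    if n - c < k:
        # Every subset of size k contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative counts: 10 samples, 3 of which pass the tests.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n.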

Massive multitask language understanding

MMLU is a benchmark consisting of multiple choice questions from various domains
Results are reported in Table 9
LLaMA-65B is behind Chinchilla-70B and PaLM-540B by a few percent
Limited amount of books and academic papers in the pre-training data (177GB) may explain why LLaMA-65B is behind
The larger amount of books used in pre-training (up to 2TB) may also explain why Gopher outperforms GPT-3 on this benchmark

Evolution of performance during training

Tracked performance of models on question answering and common sense benchmarks
Performance improves steadily and correlates with training perplexity
Separately, reported results of the instruction finetuned model LLaMA-I on MMLU and compared with existing instruction finetuned models
Despite simplicity of instruction finetuning approach, LLaMA-I reached 68.9% on MMLU
Still far from the state-of-the-art of 77.4% on MMLU

Bias, toxicity and misinformation

Large language models can reproduce and amplify existing biases in training data.
Large language models can generate toxic or offensive content.
It is important to understand the potential harm of language models.

RealToxicityPrompts

Language models can generate toxic language
RealToxicityPrompts benchmark is used to measure toxicity
Greedy decoding is used to generate a completion for each of the 100k prompts
Toxicity increases with size of model

CrowS-Pairs

Evaluated biases in model on CrowS-Pairs dataset
Measured biases in 9 categories
Compared model to GPT-3 and OPT-175B
Model slightly better than both on average
Model particularly biased in religion, age and gender

WinoGender

Investigated biases of model on gender category using WinoGender benchmark
WinoGender is made of Winograd schemas; biases are evaluated by determining whether the model's co-reference resolution performance is impacted by the gender of the pronoun
Sentences have 3 mentions: occupation, participant, pronoun
To perform co-reference resolution, the perplexities of the candidate continuations (for example, "the nurse" and "the patient") are compared
Model significantly better at performing co-reference resolution for “their/them/someone” pronouns than “her/her/she” and “his/him/he” pronouns
Model likely using majority gender of occupation to perform co-reference resolution instead of using evidence of sentence
Model makes more errors on “gotcha” examples, showing it captures societal biases related to gender and occupation
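The perplexity comparison used for co-reference resolution can be sketched as below. The per-token log-probabilities would come from the language model; the values here are invented for illustration, and the function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a continuation: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def resolve_coreference(continuation_logprobs):
    """Pick the candidate referent whose continuation has the lowest perplexity.

    continuation_logprobs: dict mapping each candidate (e.g. "the nurse")
    to the list of per-token log-probabilities of that continuation.
    """
    return min(
        continuation_logprobs,
        key=lambda name: perplexity(continuation_logprobs[name]),
    )


if __name__ == "__main__":
    # Invented log-probabilities for illustration.
    scores = {"the nurse": [-0.2, -0.4], "the patient": [-1.0, -1.5]}
    print(resolve_coreference(scores))
```

A bias shows up when the chosen referent changes with the pronoun's gender even though the sentence evidence stays the same.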

TruthfulQA

TruthfulQA measures the truthfulness of a model
Questions are written in diverse style, cover 38 categories and are designed to be adversarial
Performance of GPT-3 and LLaMA is reported in Table 14
LLaMA scores higher than GPT-3 in both categories, but the rate of correct answers is still low, so the model is likely to hallucinate incorrect answers

Carbon footprint

Training of models consumed a lot of energy and caused carbon dioxide emissions
Estimate Watt-hour and tons of carbon emissions using a formula
Carbon emission depends on location of data center
Used US national average carbon intensity factor of 0.385 kg CO₂eq/kWh
Estimated 2,638 MWh and 1,015 tCO₂eq emissions for developing models
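The estimate above can be reproduced with two small functions. As I recall it, the paper's formula multiplies GPU-hours by the GPU power draw and a power usage effectiveness (PUE) factor of 1.1; those details are not in these notes, so treat the PUE value, the function names and the GPU-hour figures in the usage example as assumptions.

```python
def energy_mwh(gpu_hours, gpu_power_watts, pue=1.1):
    """Energy in MWh: GPU-hours x GPU power (W) x PUE, converted from Wh."""
    return gpu_hours * gpu_power_watts * pue / 1e6


def emissions_tco2eq(mwh, intensity_kg_per_kwh=0.385):
    """Emissions in tCO2eq; kg/kWh is numerically the same as t/MWh."""
    return mwh * intensity_kg_per_kwh


if __name__ == "__main__":
    # Check the total reported above: 2,638 MWh at 0.385 kg CO2eq/kWh.
    print(f"{emissions_tco2eq(2638):.0f} tCO2eq")
    # Illustrative run: 1,000,000 GPU-hours at 400 W (hypothetical values).
    print(f"{energy_mwh(1_000_000, 400):.0f} MWh")
```

Multiplying 2,638 MWh by 0.385 gives roughly 1,015 tCO₂eq, consistent with the figure reported above.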

Related work

Language models are probability distributions over sequences of words, tokens or characters
Task is often framed as next token prediction
Language modeling has been proposed as a benchmark to measure progress toward artificial intelligence
Traditionally, language models were based on n-gram count statistics
Neural networks have been applied to language modelling task
Transformer networks have led to important improvements
Scaling for language models has been studied for both model and dataset sizes
Power laws exist between model and dataset sizes and performance of system

Conclusion

Presented a series of language models that are competitive with state-of-the-art models
LLaMA-13B outperforms GPT-3 while being more than 10x smaller
LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B
Achieved state-of-the-art performance by training on publicly available data
Released models to research community to accelerate development of large language models
Finetuning models on instructions leads to promising results
Plan to release larger models trained on larger pretraining corpora
Evaluated LLaMA on Natural Questions and TriviaQA
