
Abstract
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| arithmetic-reasoning-on-gsm8k | LLaMA 7B | Accuracy: 11.0, Parameters (B): 7 |
| arithmetic-reasoning-on-gsm8k | LLaMA 7B (maj1@k) | Accuracy: 18.1, Parameters (B): 7 |
| arithmetic-reasoning-on-gsm8k | LLaMA 13B | Accuracy: 17.8, Parameters (B): 13 |
| arithmetic-reasoning-on-gsm8k | LLaMA 13B (maj1@k) | Accuracy: 29.3, Parameters (B): 13 |
| arithmetic-reasoning-on-gsm8k | LLaMA 33B | Accuracy: 35.6, Parameters (B): 33 |
| arithmetic-reasoning-on-gsm8k | LLaMA 33B (maj1@k) | Accuracy: 53.1, Parameters (B): 33 |
| arithmetic-reasoning-on-gsm8k | LLaMA 65B | Accuracy: 50.9, Parameters (B): 65 |
| arithmetic-reasoning-on-gsm8k | LLaMA 65B (maj1@k) | Accuracy: 69.7, Parameters (B): 65 |
| code-generation-on-mbpp | LLaMA 7B (0-shot) | Accuracy: 17.7 |
| code-generation-on-mbpp | LLaMA 13B (0-shot) | Accuracy: 22.0 |
| code-generation-on-mbpp | LLaMA 33B (0-shot) | Accuracy: 30.2 |
| code-generation-on-mbpp | LLaMA 65B (0-shot) | Accuracy: 37.7 |
| common-sense-reasoning-on-arc-challenge | LLaMA 7B (0-shot) | Accuracy: 47.6 |
| common-sense-reasoning-on-arc-challenge | LLaMA 13B (0-shot) | Accuracy: 52.7 |
| common-sense-reasoning-on-arc-challenge | LLaMA 33B (0-shot) | Accuracy: 57.8 |
| common-sense-reasoning-on-arc-challenge | LLaMA 65B (0-shot) | Accuracy: 56.0 |
| common-sense-reasoning-on-arc-easy | LLaMA 7B (0-shot) | Accuracy: 72.8 |
| common-sense-reasoning-on-arc-easy | LLaMA 13B (0-shot) | Accuracy: 74.8 |
| common-sense-reasoning-on-arc-easy | LLaMA 33B (0-shot) | Accuracy: 80.0 |
| common-sense-reasoning-on-arc-easy | LLaMA 65B (0-shot) | Accuracy: 78.9 |
| common-sense-reasoning-on-winogrande | LLaMA 7B (0-shot) | Accuracy: 70.1 |
| common-sense-reasoning-on-winogrande | LLaMA 13B (0-shot) | Accuracy: 73.0 |
| common-sense-reasoning-on-winogrande | LLaMA 33B (0-shot) | Accuracy: 76.0 |
| common-sense-reasoning-on-winogrande | LLaMA 65B (0-shot) | Accuracy: 77.0 |
| few-shot-learning-on-medconceptsqa | meta-llama/Meta-Llama-3-8B-Instruct | Accuracy: 25.653 |
| math-word-problem-solving-on-math | LLaMA 7B | Accuracy: 2.9, Parameters (B): 7 |
| math-word-problem-solving-on-math | LLaMA 7B (maj1@k) | Accuracy: 6.9, Parameters (B): 7 |
| math-word-problem-solving-on-math | LLaMA 13B | Accuracy: 3.9, Parameters (B): 13 |
| math-word-problem-solving-on-math | LLaMA 13B (maj1@k) | Accuracy: 8.8, Parameters (B): 13 |
| math-word-problem-solving-on-math | LLaMA 33B | Accuracy: 7.1, Parameters (B): 33 |
| math-word-problem-solving-on-math | LLaMA 33B (maj1@k) | Accuracy: 15.2, Parameters (B): 33 |
| math-word-problem-solving-on-math | LLaMA 65B | Accuracy: 10.6, Parameters (B): 65 |
| math-word-problem-solving-on-math | LLaMA 65B (maj1@k) | Accuracy: 20.5, Parameters (B): 65 |
| multi-task-language-understanding-on-mmlu | LLaMA 33B (5-shot) | Average (%): 57.8 |
| multi-task-language-understanding-on-mmlu | LLaMA 65B (5-shot) | Average (%): 63.4 |
| multi-task-language-understanding-on-mmlu | LLaMA 65B (fine-tuned) | Average (%): 68.9 |
| question-answering-on-boolq | LLaMA 7B (0-shot) | Accuracy: 76.5 |
| question-answering-on-boolq | LLaMA 13B (0-shot) | Accuracy: 78.1 |
| question-answering-on-boolq | LLaMA 33B (0-shot) | Accuracy: 83.1 |
| question-answering-on-boolq | LLaMA 65B (0-shot) | Accuracy: 85.3 |
| question-answering-on-natural-questions | LLaMA 33B (0-shot) | EM: 24.9 |
| question-answering-on-natural-questions | LLaMA 65B (1-shot) | EM: 31.0 |
| question-answering-on-natural-questions | LLaMA 65B (few-shot, k=5) | EM: 35.0 |
| question-answering-on-natural-questions | LLaMA 65B (few-shot, k=64) | EM: 39.9 |
| question-answering-on-obqa | LLaMA 7B (0-shot) | Accuracy: 57.2 |
| question-answering-on-obqa | LLaMA 13B (0-shot) | Accuracy: 56.4 |
| question-answering-on-obqa | LLaMA 33B (0-shot) | Accuracy: 58.6 |
| question-answering-on-obqa | LLaMA 65B (0-shot) | Accuracy: 60.2 |
| question-answering-on-piqa | LLaMA 7B (0-shot) | Accuracy: 79.8 |
| question-answering-on-piqa | LLaMA 13B (0-shot) | Accuracy: 80.1 |
| question-answering-on-piqa | LLaMA 33B (0-shot) | Accuracy: 82.3 |
| question-answering-on-piqa | LLaMA 65B (0-shot) | Accuracy: 82.8 |
| question-answering-on-social-iqa | LLaMA 7B (0-shot) | Accuracy: 48.9 |
| question-answering-on-social-iqa | LLaMA 13B (0-shot) | Accuracy: 50.4 |
| question-answering-on-social-iqa | LLaMA 33B (0-shot) | Accuracy: 50.4 |
| question-answering-on-social-iqa | LLaMA 65B (0-shot) | Accuracy: 52.3 |
| question-answering-on-timequestions | Llama3 | P@1: 17.8 |
| question-answering-on-triviaqa | LLaMA 65B (0-shot) | EM: 68.2 |
| question-answering-on-triviaqa | LLaMA 65B (1-shot) | EM: 71.6 |
| question-answering-on-triviaqa | LLaMA 65B (few-shot, k=5) | EM: 72.6 |
| question-answering-on-triviaqa | LLaMA 65B (few-shot, k=64) | EM: 73.0 |
| question-answering-on-truthfulqa | LLaMA 7B | % info: 29, % true: 33 |
| question-answering-on-truthfulqa | LLaMA 13B | % info: 41, % true: 47 |
| question-answering-on-truthfulqa | LLaMA 33B | % info: 48, % true: 52 |
| question-answering-on-truthfulqa | LLaMA 65B | % info: 53, % true: 57 |
| reading-comprehension-on-race | LLaMA 7B (0-shot) | Accuracy (High): 46.9, Accuracy (Middle): 61.1 |
| reading-comprehension-on-race | LLaMA 13B (0-shot) | Accuracy (High): 47.2, Accuracy (Middle): 61.6 |
| reading-comprehension-on-race | LLaMA 33B (0-shot) | Accuracy (High): 48.3, Accuracy (Middle): 64.1 |
| reading-comprehension-on-race | LLaMA 65B (0-shot) | Accuracy (High): 51.6, Accuracy (Middle): 67.9 |
| stereotypical-bias-analysis-on-crows-pairs | LLaMA 65B | Age: 70.1, Disability: 66.7, Gender: 70.6, Nationality: 64.2, Overall: 66.6, Physical Appearance: 77.8, Race/Color: 57.0, Religion: 70.6, Sexual Orientation: 81.0, Socioeconomic status: 71.5 |
| zero-shot-learning-on-medconceptsqa | meta-llama/Meta-Llama-3-8B-Instruct | Accuracy: 25.840 |
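Two scoring conventions recur in the table above: maj1@k (GSM8K, MATH), where k solutions are sampled per problem and the most frequent final answer is scored, and EM (Natural Questions, TriviaQA), exact match between the predicted and gold answer strings after normalization. The sketch below illustrates both; the function names and the normalization details (lowercasing, whitespace collapsing) are illustrative assumptions, not the paper's exact evaluation code.

```python
from collections import Counter

def majority_vote(answers):
    # maj1@k (sketch): pick the most frequent final answer
    # among k sampled solutions, then score that single answer.
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

def exact_match(prediction, gold):
    # EM (sketch): normalized string equality; real harnesses may
    # also strip articles and punctuation (assumption here).
    norm = lambda s: " ".join(s.lower().split())
    return norm(prediction) == norm(gold)

# Hypothetical k=5 samples for one GSM8K problem:
samples = ["42", "41", "42", "42", "17"]
print(majority_vote(samples))          # -> 42
print(exact_match(" Paris ", "paris"))  # -> True
```

This is why the maj1@k rows dominate their single-sample counterparts (e.g. 50.9 vs 69.7 for LLaMA 65B on GSM8K): voting over k samples filters out low-frequency arithmetic slips.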