Tom B. Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell; Sandhini Agarwal; Ariel Herbert-Voss; Gretchen Krueger; Tom Henighan; Rewon Child; Aditya Ramesh; Daniel M. Ziegler; Jeffrey Wu; Clemens Winter; Christopher Hesse; Mark Chen; Eric Sigler; Mateusz Litwin; Scott Gray; Benjamin Chess; Jack Clark; Christopher Berner; Sam McCandlish; Alec Radford; Ilya Sutskever; Dario Amodei

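Abstract
Recent work has demonstrated substantial gains on many natural language processing (NLP) tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions, something that current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10 times more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles that human evaluators have difficulty distinguishing from articles written by humans. We discuss the broader societal impacts of this finding and of GPT-3 in general.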
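The few-shot setting described in the abstract, where the task and its demonstrations are given to the model purely as text with no gradient updates, can be illustrated with a small sketch. The function name `build_few_shot_prompt`, the `=>` separator, and the example word pairs are assumptions chosen for illustration; the listing does not specify a prompt format, and no model is called here.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    task_description: str,
    demonstrations: List[Tuple[str, str]],
    query: str,
    k: int,
) -> str:
    """Assemble a plain-text prompt with k demonstrations (hypothetical helper).

    k = 0 gives a zero-shot prompt, k = 1 a one-shot prompt, and larger k a
    few-shot prompt. The model would be conditioned on this text only; its
    weights are never updated.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if k > len(demonstrations):
        raise ValueError("not enough demonstrations for the requested k")

    lines = [task_description]
    # Each demonstration is an (input, output) pair rendered on one line.
    # The "=>" separator is an assumption made for this sketch.
    for source, target in demonstrations[:k]:
        lines.append(f"{source} => {target}")
    # The query is left open so the model's continuation is the answer.
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Illustrative data only.
demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = build_few_shot_prompt("Translate English to French:", demos, "cheese", k=0)
few_shot = build_few_shot_prompt("Translate English to French:", demos, "cheese", k=2)
print(few_shot)
```

Varying `k` while keeping the task description and query fixed is what separates the zero-shot, one-shot, and few-shot rows in the benchmark table below.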
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| answerability-prediction-on-peerqa | GPT-3.5-Turbo-0613-16k | Macro F1: 0.3304 |
| common-sense-reasoning-on-arc-challenge | GPT-3 175B (0-shot) | Accuracy: 51.4 |
| common-sense-reasoning-on-arc-challenge | GPT-3 175B (1 shot) | Accuracy: 53.2 |
| common-sense-reasoning-on-arc-easy | GPT-3 175B (1 shot) | Accuracy: 71.2 |
| common-sense-reasoning-on-arc-easy | GPT-3 175B (0-shot) | Accuracy: 68.8 |
| common-sense-reasoning-on-record | GPT-3 Large 760M (0-shot) | EM: 82.1 |
| common-sense-reasoning-on-winogrande | GPT-3 Large 760M (0-shot) | Accuracy: 57.4 |
| common-sense-reasoning-on-winogrande | GPT-3 175B (0-shot) | Accuracy: 70.2 |
| coreference-resolution-on-winograd-schema | GPT-3 175B (few-shot) | Accuracy: 80.1 |
| few-shot-learning-on-medconceptsqa | gpt-3.5-turbo | Accuracy: 41.476 |
| language-modelling-on-lambada | GPT-3 175B (Few-Shot) | Accuracy: 86.4 Perplexity: 1.92 |
| language-modelling-on-lambada | GPT-3 13B (Zero-Shot) | Accuracy: 72.5 Perplexity: 3.56 |
| language-modelling-on-lambada | GPT-3 2.7B (Zero-Shot) | Accuracy: 67.1 Perplexity: 4.60 |
| language-modelling-on-lambada | GPT-3 6.7B (Zero-Shot) | Accuracy: 70.3 Perplexity: 4.00 |
| language-modelling-on-lambada | GPT-3 175B (Zero-Shot) | Accuracy: 76.2 Perplexity: 3.00 |
| language-modelling-on-penn-treebank-word | GPT-3 (Zero-Shot) | Params: 175000M Test perplexity: 20.5 |
| multi-task-language-understanding-on-mmlu | GPT-3 175B (5-shot) | Average (%): 43.9 |
| natural-language-inference-on-anli-test | GPT-3 | A1: 36.8 A2: 34 A3: 40.2 |
| natural-language-inference-on-commitmentbank | GPT-3 175B (Few-Shot) | Accuracy: 75.6 |
| natural-language-inference-on-commitmentbank | GPT-3 175B (few-shot, k=32) | F1: 52 |
| natural-language-inference-on-rte | GPT-3 175B (few-shot, k=32) | Accuracy: 69 |
| question-answering-on-boolq | GPT-3 175B (few-shot, k=32) | Accuracy: 76.4 |
| question-answering-on-boolq | GPT-3 175B (0-shot) | Accuracy: 60.5 |
| question-answering-on-copa | GPT-3 175B (few-shot, k=32) | Accuracy: 92 |
| question-answering-on-copa | GPT-3 Large 760M (0-shot) | Accuracy: 73.0 |
| question-answering-on-copa | GPT-3 13B (few-shot, k=32) | Accuracy: 86 |
| question-answering-on-copa | GPT-3 175B (0-shot) | Accuracy: 91 |
| question-answering-on-copa | GPT-3 175B (1-shot) | Accuracy: 87 |
| question-answering-on-coqa | GPT-3 175B (few-shot, k=32) | Overall: 85 |
| question-answering-on-drop-test | GPT-3 175B (few-shot, k=32) | F1: 36.5 |
| question-answering-on-multirc | GPT-3 175B (Few-Shot) | F1: 75.4 |
| question-answering-on-natural-questions | GPT-3 175B (Few-Shot, k=64) | EM: 29.9 |
| question-answering-on-obqa | GPT-3 175B (zero-shot) | Accuracy: 57.6 |
| question-answering-on-openbookqa | GPT-3 175B (few-shot, k=32) | Accuracy: 65.4 |
| question-answering-on-peerqa | GPT-3.5-Turbo-0613-16k | AlignScore: 0.1378 Prometheus-2 Answer Correctness: 3.0408 Rouge-L: 0.2414 |
| question-answering-on-piqa | GPT-3 175B (0-shot) | Accuracy: 81.0 |
| question-answering-on-piqa | GPT-3 Large 760M (0-shot) | Accuracy: 72.9 |
| question-answering-on-quac | GPT-3 175B (few-shot, k=32) | F1: 44.3 |
| question-answering-on-race | GPT-3 175B (few-shot, k=32) | RACE-m: 58.1 |
| question-answering-on-race | GPT-3 175B (Few-Shot) | RACE-h: 46.8 |
| question-answering-on-story-cloze | GPT-3 175B (Few-Shot) | Accuracy: 87.7 |
| question-answering-on-storycloze | GPT-3 Large 760M (zero-shot) | Accuracy: 72.4 |
| question-answering-on-triviaqa | GPT-3 175B (Few-Shot) | EM: 71.2 |
| question-answering-on-webquestions | GPT-3-175B (Few-Shot) | EM: 41.5 |
| question-answering-on-webquestions | GPT-3-175B (Zero-Shot) | EM: 14.4 |
| question-answering-on-webquestions | GPT-3-175B (One-Shot) | EM: 25.3 |
| question-answering-on-webquestions | Few-shot | EM: 44.7 |
| reading-comprehension-on-race | GPT-3 175B (zero-shot) | Accuracy (High): 45.5 |
| reading-comprehension-on-race | GPT-3 175B (0-shot) | Accuracy (Middle): 58.4 |
| unsupervised-machine-translation-on-wmt2014-1 | GPT-3 175B (Few-Shot) | BLEU: 39.2 |
| unsupervised-machine-translation-on-wmt2014-2 | GPT-3 175B (Few-Shot) | BLEU: 32.6 |
| unsupervised-machine-translation-on-wmt2016 | GPT-3 175B (Few-Shot) | BLEU: 29.7 |
| unsupervised-machine-translation-on-wmt2016-1 | GPT-3 175B (Few-Shot) | BLEU: 40.6 |
| unsupervised-machine-translation-on-wmt2016-2 | GPT-3 175B (Few-Shot) | BLEU: 21 |
| unsupervised-machine-translation-on-wmt2016-3 | GPT-3 175B (Few-Shot) | BLEU: 39.5 |
| word-sense-disambiguation-on-words-in-context | GPT-3 175B (few-shot, k=32) | Accuracy: 49.4 |
| zero-shot-learning-on-medconceptsqa | gpt-3.5-turbo | Accuracy: 37.058 |