6 months ago

Language Models are Few-Shot Learners

Tom B. Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell; Sandhini Agarwal; Ariel Herbert-Voss; Gretchen Krueger; Tom Henighan; Rewon Child; Aditya Ramesh; Daniel M. Ziegler; Jeffrey Wu; Clemens Winter; Christopher Hesse; Mark Chen; Eric Sigler; Mateusz Litwin; Scott Gray; Benjamin Chess; Jack Clark; Christopher Berner; Sam McCandlish; Alec Radford; Ilya Sutskever; Dario Amodei

Abstract

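Recent work has demonstrated substantial gains on many natural language processing (NLP) tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions, something that current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss the broader societal impacts of this finding and of GPT-3 in general.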
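To make the few-shot setting concrete, the following is a minimal sketch of how a task and a handful of demonstrations might be packed into a single plain-text prompt, with no gradient updates involved. The function name `build_few_shot_prompt`, the `Q:`/`A:` layout, and the example questions are illustrative assumptions, not the prompt format used by the authors.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    task_description: str,
    demonstrations: List[Tuple[str, str]],
    query: str,
) -> str:
    """Format K demonstrations plus one unanswered query as plain text.

    Hypothetical helper: the "Q:"/"A:" layout is an assumption made for
    illustration, not a format taken from the paper.
    """
    lines = [task_description, ""]
    for question, answer in demonstrations:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")  # blank line separates demonstrations
    # The final query is left unanswered so the model completes it.
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)


# Two demonstrations (K = 2) for a 3-digit addition task.
demos = [
    ("What is 123 plus 456?", "579"),
    ("What is 250 plus 317?", "567"),
]
prompt = build_few_shot_prompt(
    "Add the two numbers.", demos, "What is 412 plus 309?"
)
print(prompt)
```

The resulting string would be passed to an autoregressive language model as its context; the model's continuation after the final `A:` is read as the answer. Setting `demonstrations` to an empty list gives the zero-shot case, and a single pair gives the one-shot case.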

Code Repositories

Repository	Framework	Notes
Samyu0304/thought-propagation	-	Mentioned in GitHub
mindspore-ai/models/tree/master/official/nlp/gpt	mindspore	-
ai21labs/lm-evaluation	-	Mentioned in GitHub
juletx/lm-evaluation-harness	pytorch	Mentioned in GitHub
um-arm-lab/efficient-eng-2-ltl	pytorch	Mentioned in GitHub
abhaskumarsinha/Corpus2GPT	pytorch	-
haiyang-w/git	pytorch	Mentioned in GitHub
neuralmagic/lm-evaluation-harness	pytorch	Mentioned in GitHub
ltruncel/Microsoft_Azure_50daysofudacity	-	Mentioned in GitHub
shreyashankar/gpt3-sandbox	-	Mentioned in GitHub
EightRice/atn_GPT-3	-	Mentioned in GitHub
EleutherAI/gpt-neo	-	Mentioned in GitHub
abhaskumarsinha/MinimalGPT	-	-
fywalter/label-bias	pytorch	Mentioned in GitHub
hazyresearch/ama_prompting	-	Mentioned in GitHub
hojjat-mokhtarabadi/promptsource	-	Mentioned in GitHub
openai/gpt-3	-	Official; Mentioned in GitHub
RUCAIBox/LLMBox	-	Mentioned in GitHub
allenai/macaw	pytorch	Mentioned in GitHub
PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/gpt-3	paddle	-
crazydigger/Callibration-of-GPT	pytorch	Mentioned in GitHub
smile-data/smile	pytorch	Mentioned in GitHub
karpathy/build-nanogpt	pytorch	Mentioned in GitHub
volcengine/vegiantmodel	pytorch	Mentioned in GitHub
national-center-for-ai-saudi-arabia/lm-evaluation-harness	jax	Mentioned in GitHub
asahi417/relbert	-	Mentioned in GitHub
openbiolink/promptsource	-	Mentioned in GitHub
facebookresearch/anli	pytorch	Mentioned in GitHub
ramanakshay/nanogpt	pytorch	Mentioned in GitHub
vilm-ai/viet-llm-eval	jax	Mentioned in GitHub
lambert-x/prolab	pytorch	Mentioned in GitHub
NVIDIA/NeMo-Curator	-	Mentioned in GitHub
roberttwomey/machine-imagination-workshop	-	Mentioned in GitHub
scrayish/ML_NLP	pytorch	Mentioned in GitHub
EleutherAI/lm_evaluation_harness	jax	Mentioned in GitHub
smarton-empower/smarton-ai	-	Mentioned in GitHub
ncoop57/gpt-code-clippy	jax	Mentioned in GitHub
insait-institute/lm-evaluation-harness-bg	jax	Mentioned in GitHub
kyegomez/GPT3	pytorch	-
VachanVY/gpt.jax	jax	Mentioned in GitHub
nlx-group/overlapy	-	Mentioned in GitHub
mbzuai-paris/lm-evaluation-harness-atlas-chat	pytorch	Mentioned in GitHub
ggml-org/llama.cpp	pytorch	Mentioned in GitHub
ggerganov/llama.cpp	pytorch	Mentioned in GitHub
bigscience-workshop/promptsource	-	Mentioned in GitHub
sambanova/lm-evaluation-harness	jax	Mentioned in GitHub
codedotal/gpt-code-clippy	jax	Mentioned in GitHub
grantslatton/llama.cpp	-	Mentioned in GitHub
postech-ami/smile-dataset	pytorch	Mentioned in GitHub
Sypherd/lm-evaluation-harness	pytorch	Mentioned in GitHub
x-lance/neusym-rag	-	Mentioned in GitHub
hilberthit/gpt-3	-	Mentioned in GitHub
tonyzhaozh/few-shot-learning	pytorch	Mentioned in GitHub
gmum/dl-mo-2021	-	Mentioned in GitHub
zphang/lm_evaluation_harness	-	Mentioned in GitHub
contextlab/abstract2paper	-	Mentioned in GitHub
turkunlp/megatron-deepspeed	pytorch	Mentioned in GitHub
Mind23-2/MindCode-138	mindspore	-
roberttwomey/machine-imagination-isea	-	Mentioned in GitHub
karpathy/llm.c	pytorch	Mentioned in GitHub
ethanjperez/true_few_shot	pytorch	Mentioned in GitHub
longhao-chen/aicas2024	pytorch	Mentioned in GitHub
EleutherAI/lm-evaluation-harness	jax	Mentioned in GitHub
milmor/GPT	-	-
opengptx/lm-evaluation-harness	pytorch	Mentioned in GitHub
bigscience-workshop/Megatron-DeepSpeed	pytorch	Mentioned in GitHub
asahi417/lmppl	-	Mentioned in GitHub

Benchmarks

Benchmark	Method	Metrics
answerability-prediction-on-peerqa	GPT-3.5-Turbo-0613-16k	Macro F1: 0.3304
common-sense-reasoning-on-arc-challenge	GPT-3 175B (0-shot)	Accuracy: 51.4
common-sense-reasoning-on-arc-challenge	GPT-3 175B (1 shot)	Accuracy: 53.2
common-sense-reasoning-on-arc-easy	GPT-3 175B (1 shot)	Accuracy: 71.2
common-sense-reasoning-on-arc-easy	GPT-3 175B (0-shot)	Accuracy: 68.8
common-sense-reasoning-on-record	GPT-3 Large 760M (0-shot)	EM: 82.1
common-sense-reasoning-on-winogrande	GPT-3 Large 760M (0-shot)	Accuracy: 57.4
common-sense-reasoning-on-winogrande	GPT-3 175B (0-shot)	Accuracy: 70.2
coreference-resolution-on-winograd-schema	GPT-3 175B (few-shot)	Accuracy: 80.1
few-shot-learning-on-medconceptsqa	gpt-3.5-turbo	Accuracy: 41.476
language-modelling-on-lambada	GPT-3 175B (Few-Shot)	Accuracy: 86.4 Perplexity: 1.92
language-modelling-on-lambada	GPT-3 13B (Zero-Shot)	Accuracy: 72.5 Perplexity: 3.56
language-modelling-on-lambada	GPT-3 2.7B (Zero-Shot)	Accuracy: 67.1 Perplexity: 4.60
language-modelling-on-lambada	GPT-3 6.7B (Zero-Shot)	Accuracy: 70.3 Perplexity: 4.00
language-modelling-on-lambada	GPT-3 175B (Zero-Shot)	Accuracy: 76.2 Perplexity: 3.00
language-modelling-on-penn-treebank-word	GPT-3 (Zero-Shot)	Params: 175000M Test perplexity: 20.5
multi-task-language-understanding-on-mmlu	GPT-3 175B (5-shot)	Average (%): 43.9
natural-language-inference-on-anli-test	GPT-3	A1: 36.8 A2: 34 A3: 40.2
natural-language-inference-on-commitmentbank	GPT-3 175B (Few-Shot)	Accuracy: 75.6
natural-language-inference-on-commitmentbank	GPT-3 175B (few-shot, k=32)	F1: 52
natural-language-inference-on-rte	GPT-3 175B (few-shot, k=32)	Accuracy: 69
question-answering-on-boolq	GPT-3 175B (few-shot, k=32)	Accuracy: 76.4
question-answering-on-boolq	GPT-3 175B (0-shot)	Accuracy: 60.5
question-answering-on-copa	GPT-3 175B (few-shot, k=32)	Accuracy: 92
question-answering-on-copa	GPT-3 Large 760M (0-shot)	Accuracy: 73.0
question-answering-on-copa	GPT-3 13B (few-shot, k=32)	Accuracy: 86
question-answering-on-copa	GPT-3 175B (0-shot)	Accuracy: 91
question-answering-on-copa	GPT-3 175B (1-shot)	Accuracy: 87
question-answering-on-coqa	GPT-3 175B (few-shot, k=32)	Overall: 85
question-answering-on-drop-test	GPT-3 175B (few-shot, k=32)	F1: 36.5
question-answering-on-multirc	GPT-3 175B (Few-Shot)	F1: 75.4
question-answering-on-natural-questions	GPT-3 175B (Few-Shot, k=64)	EM: 29.9
question-answering-on-obqa	GPT-3 175B (zero-shot)	Accuracy: 57.6
question-answering-on-openbookqa	GPT-3 175B (few-shot, k=32)	Accuracy: 65.4
question-answering-on-peerqa	GPT-3.5-Turbo-0613-16k	AlignScore: 0.1378 Prometheus-2 Answer Correctness: 3.0408 Rouge-L: 0.2414
question-answering-on-piqa	GPT-3 175B (0-shot)	Accuracy: 81.0
question-answering-on-piqa	GPT-3 Large 760M (0-shot)	Accuracy: 72.9
question-answering-on-quac	GPT-3 175B (few-shot, k=32)	F1: 44.3
question-answering-on-race	GPT-3 175B (few-shot, k=32)	RACE-m: 58.1
question-answering-on-race	GPT-3 175B (Few-Shot)	RACE-h: 46.8
question-answering-on-story-cloze	GPT-3 175B (Few-Shot)	Accuracy: 87.7
question-answering-on-storycloze	GPT-3 Large 760M (zero-shot)	Accuracy: 72.4
question-answering-on-triviaqa	GPT-3 175B (Few-Shot)	EM: 71.2
question-answering-on-webquestions	GPT-3-175B (Few-Shot)	EM: 41.5
question-answering-on-webquestions	GPT-3-175B (Zero-Shot)	EM: 14.4
question-answering-on-webquestions	GPT-3-175B (One-Shot)	EM: 25.3
question-answering-on-webquestions	Few-shot	EM: 44.7
reading-comprehension-on-race	GPT-3 175B (zero-shot)	Accuracy (High): 45.5
reading-comprehension-on-race	GPT-3 175B (0-shot)	Accuracy (Middle): 58.4
unsupervised-machine-translation-on-wmt2014-1	GPT-3 175B (Few-Shot)	BLEU: 39.2
unsupervised-machine-translation-on-wmt2014-2	GPT-3 175B (Few-Shot)	BLEU: 32.6
unsupervised-machine-translation-on-wmt2016	GPT-3 175B (Few-Shot)	BLEU: 29.7
unsupervised-machine-translation-on-wmt2016-1	GPT-3 175B (Few-Shot)	BLEU: 40.6
unsupervised-machine-translation-on-wmt2016-2	GPT-3 175B (Few-Shot)	BLEU: 21
unsupervised-machine-translation-on-wmt2016-3	GPT-3 175B (Few-Shot)	BLEU: 39.5
word-sense-disambiguation-on-words-in-context	GPT-3 175B (few-shot, k=32)	Accuracy: 49.4
zero-shot-learning-on-medconceptsqa	gpt-3.5-turbo	Accuracy: 37.058

