4 months ago

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe Charles Foster Jason Phang Horace He Anish Thite Noa Nabeshima

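Abstract

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse, high-quality subsets, some existing and some newly constructed, many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC (raw Common Crawl) and CC-100 on all components of the Pile, while also improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.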

Code repositories

Repository	Framework	Source
ai21labs/lm-evaluation	-	Mentioned on GitHub
conceptofmind/lamda-rlhf-pytorch	pytorch	Mentioned on GitHub
suu990901/LLaMA-InfoEntropy-Loss	jax	Mentioned on GitHub
conceptofmind/LaMDA-pytorch	pytorch	Mentioned on GitHub
EleutherAI/gpt-neo	-	Mentioned on GitHub
ftramer/lm-extraction-benchmark	-	Mentioned on GitHub
Wikidepia/indonesia_dataset	-	Mentioned on GitHub
RossNordby/SoftPromptsForEvaluation	pytorch	Mentioned on GitHub
suu990901/InfoEntropy-Loss	jax	Mentioned on GitHub
alrope123/prompt-waywardness	pytorch	Mentioned on GitHub
google-research/lm-extraction-benchmark	-	Mentioned on GitHub
EleutherAI/GPTNeo	-	Mentioned on GitHub
thoppe/personal_cv	-	Mentioned on GitHub
ncoop57/gpt-code-clippy	jax	Mentioned on GitHub
neutralzz/billa	pytorch	Mentioned on GitHub
THUDM/GLM	pytorch	Mentioned on GitHub
EleutherAI/The-Pile	-	Official
jackbandy/bookcorpus-datasheet	-	Mentioned on GitHub
codedotal/gpt-code-clippy	jax	Mentioned on GitHub
yuchuantian/dijiang	pytorch	Mentioned on GitHub
nlpodyssey/verbaflow	-	Mentioned on GitHub
glassroom/heinsen_attention	pytorch	Mentioned on GitHub

https://pile.eleuther.ai

Benchmarks

Benchmark	Method	Metric
language-modelling-on-the-pile	GPT-3 Davinci 175B (pre-trained)	Bits per byte: 0.7177
language-modelling-on-the-pile	GPT-2 Medium 355M (pre-trained)	Bits per byte: 1.0928
language-modelling-on-the-pile	GPT-2 XL 1.5B (pre-trained)	Bits per byte: 1.0468
language-modelling-on-the-pile	GPT-2 Large 774M (pre-trained)	Bits per byte: 1.0828
language-modelling-on-the-pile	GPT-3 Curie 6.7B (pre-trained)	Bits per byte: 0.7980
language-modelling-on-the-pile	GPT-2 Small 124M (pre-trained)	Bits per byte: 1.2253
language-modelling-on-the-pile	GPT-3 Ada 350M (pre-trained)	Bits per byte: 0.9631
language-modelling-on-the-pile	GPT-3 Babbage 1.3B (pre-trained)	Bits per byte: 0.8718
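
The metric column reports bits per byte, which normalizes a model's loss by the byte length of the text rather than by its token count, so that models with different tokenizers can be compared. A minimal sketch of the conversion follows, assuming the loss is a negative log-likelihood measured in nats. The function names and the numbers in the usage lines are illustrative assumptions and do not come from the paper or the benchmark.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per UTF-8 byte."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # makes the value independent of the tokenizer.
    return total_nll_nats / (num_bytes * math.log(2))


def bits_per_byte_from_token_loss(
    mean_token_loss_nats: float, num_tokens: int, num_bytes: int
) -> float:
    """Same conversion, starting from a mean per-token loss in nats."""
    return bits_per_byte(mean_token_loss_nats * num_tokens, num_bytes)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    text = "The Pile is a diverse text corpus."
    num_bytes = len(text.encode("utf-8"))  # byte length of the evaluated text
    num_tokens = 9                         # assumed token count for this text
    mean_loss = 3.0                        # assumed mean per-token loss in nats
    print(bits_per_byte_from_token_loss(mean_loss, num_tokens, num_bytes))
```

Lower is better: a model with 1.0 bits per byte needs, on average, one bit to encode each byte of the evaluation text.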
