5 months ago

Generating Long Sequences with Sparse Transformers

Rewon Child; Scott Gray; Alec Radford; Ilya Sutskever

Abstract

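Transformers are powerful sequence models, but the time and memory they require grow quadratically with sequence length. In this paper we introduce sparse factorizations of the attention matrix that reduce this cost to $O(n \sqrt{n})$. We also introduce: a) a variation on architecture and initialization to train deeper networks; b) recomputation of attention matrices to save memory; and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show that they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling on Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show that it is possible in principle to use self-attention to model sequences of length one million or more.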
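To make the $O(n \sqrt{n})$ claim concrete, here is a minimal sketch of one possible sparse attention layout. It is an illustration under stated assumptions, not the paper's kernels: it assumes a causal "strided" pattern in which each position attends to a local window of the previous `stride` positions plus every `stride`-th earlier position, with `stride` chosen near $\sqrt{n}$. The function names `strided_attention_sets` and `count_connections` are hypothetical and do not come from any of the listed repositories.

```python
import math


def strided_attention_sets(n, stride=None):
    """For each position i, return the set of positions j <= i it may attend to.

    Assumption (illustrative only): the layout merges two causal patterns,
    a local window of the last `stride` positions and every `stride`-th
    earlier position counted back from i.
    """
    if stride is None:
        stride = max(1, math.isqrt(n))  # stride close to sqrt(n)
    sets = []
    for i in range(n):
        local = set(range(max(0, i - stride + 1), i + 1))
        strided = set(range(i, -1, -stride))  # i, i - stride, i - 2*stride, ...
        sets.append(local | strided)
    return sets


def count_connections(sets):
    """Total number of (query, key) pairs that are actually computed."""
    return sum(len(s) for s in sets)


n = 1024
sets = strided_attention_sets(n)
sparse = count_connections(sets)
dense = n * (n + 1) // 2  # full causal attention touches every pair j <= i

print(f"sparse pairs: {sparse}, dense pairs: {dense}")
```

Under this assumed layout each position touches at most about $2\sqrt{n}$ keys, so the total work grows on the order of $n \sqrt{n}$ rather than $n^2$.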

Code Repositories

mistralai/mistral-src

pytorch

Mentioned in GitHub

ptillet/torch-blocksparse

pytorch

Mentioned in GitHub

wilson1yan/VideoGPT

pytorch

Mentioned in GitHub

MindCode-4/code-11/tree/main/factorized-attention

mindspore

jonahwinninghoff/Text-Summarization

Mentioned in GitHub

openai/sparse_attention

Official

Mentioned in GitHub

han-shi/SparseBERT

pytorch

Mentioned in GitHub

Benchmarks

Benchmark	Method	Metric
audio-generation-on-classical-music-5-seconds	Sparse Transformer 152M (strided)	Bits per byte: 1.97
image-generation-on-imagenet-64x64	Sparse Transformer 59M (strided)	Bits per dim: 3.44
language-modelling-on-enwiki8	Sparse Transformer (30 layers, fixed attn)	Bits per character (BPC): 0.99; Number of params: 95M
open-domain-question-answering-on-searchqa	Sparse Attention	EM: 64.7
question-answering-on-natural-questions-long	Sparse Attention	F1: 74.5
question-answering-on-quasart-t	Sparse Attention	EM: 52.1


