MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models
Abstract
Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.
One-sentence Summary
MOSI.AI researchers propose CAT, a fully end-to-end, Transformer-based audio tokenizer that enables high-fidelity reconstruction and scalable audio foundation models; their 1.6B-parameter MOSS-Audio-Tokenizer outperforms prior codecs across speech, music, and sound, and powers a state-of-the-art purely autoregressive TTS system as well as competitive ASR without auxiliary encoders.
Key Contributions
- We introduce CAT, a fully end-to-end, homogeneous Transformer-based architecture for discrete audio tokenization that jointly optimizes encoder, quantizer, and decoder from scratch, eliminating reliance on pretrained components or heterogeneous CNN designs to improve reconstruction fidelity and scalability.
- We scale CAT into MOSS-Audio-Tokenizer, a 1.6B-parameter model trained on 3 million hours of diverse audio, which achieves state-of-the-art reconstruction across speech, sound, and music at all bitrates and exhibits predictable performance gains with scale.
- Leveraging MOSS-Audio-Tokenizer’s tokens, we build the first purely autoregressive TTS system that outperforms non-autoregressive and cascaded baselines, and demonstrate competitive ASR performance without auxiliary encoders, validating CAT as a unified interface for audio foundation models.
Introduction
The authors leverage a fully end-to-end, homogeneous Transformer architecture—CAT—to build MOSS-Audio-Tokenizer, a scalable 1.6B-parameter audio tokenizer trained on 3 million hours of diverse audio. Prior tokenizers often rely on pretrained encoders, hybrid CNN-Transformer designs, or multi-stage training, which introduce fixed inductive biases that limit reconstruction fidelity and hinder scaling. MOSS-Audio-Tokenizer overcomes these by jointly optimizing encoder, quantizer, and decoder from scratch under a causal, streaming-friendly framework, achieving state-of-the-art reconstruction across speech, sound, and music at all bitrates. Its discrete tokens enable the first purely autoregressive TTS system to outperform non-autoregressive and cascaded baselines, and support competitive ASR without auxiliary encoders—positioning CAT as a unified foundation for scalable, native audio language models.
Dataset

The authors evaluate against a suite of baseline audio tokenizers, each sourced from its official release and configured for monophonic audio at either 16 kHz or 24 kHz. Key details:
- Encodec: Official causal model at 24 kHz; ~14M parameters. Bitrate controlled by truncating RVQ layers during eval.
- DAC: 24 kHz monophonic model; ~74M parameters. Uses engineered discriminators and improved VQ.
- SpeechTokenizer: Trained on 16 kHz speech; ~103.67M parameters. Distills HuBERT via first RVQ layer for speech disentanglement.
- Mimi: 24 kHz; outputs tokens at 12.5 Hz. Supports streaming encoding/decoding.
- BigCodec: 16 kHz; single VQ codebook (size 8,192); 80 Hz frame rate; ~159M parameters.
- Stable Codec: 16 kHz speech; uses RFSQ bottleneck; ~953M parameters. Eval uses 1x46656_400bps and 2x15625_700bps presets.
- XCodec2.0: 16 kHz; integrates pre-trained speech encoder; 50 Hz frame rate; ~822M parameters.
- XY-Tokenizer: 16 kHz; jointly models semantic/acoustic info via dual encoders; 12.5 Hz, 8-layer RVQ (codebook 1,024); ~519M parameters. Quantizer dropout disabled.
- Higgs Audio Tokenizer: 24 kHz; ~201M parameters.
- MiMo Audio Tokenizer: Trained on >11M hours; supports waveform reconstruction and language modeling; ~1.2B parameters.
- Qwen3 TTS Tokenizer: 24 kHz; 12.5 Hz frame rate; ~170M parameters. Designed for streaming TTS.
For bitrate control of their own model during training, the authors apply Progressive Sequence Dropout to randomly truncate the active RVQ layers. At inference, decoding uses only the first k RVQ tokens per timestep, and the Depth Transformer autoregressively predicts only those k tokens, omitting the finer layers. All baselines are evaluated under their default or recommended configurations, without further filtering or changes to their original training data composition.
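For intuition, the resulting bitrate is simple arithmetic: token frame rate × number of active RVQ layers × bits per code index. The snippet below illustrates this; the 12.5 Hz frame rate comes from the text, while the 1,024-entry codebook is an assumed value for illustration only.

```python
# Hypothetical helper relating the number of active RVQ layers to bitrate.
# Assumes a 12.5 Hz token rate (from the text) and a 1,024-entry codebook
# (illustrative; the actual codebook size is not stated here).
import math


def bitrate_bps(frame_rate_hz: float, n_active_layers: int, codebook_size: int) -> float:
    """Bits per second = frames/s * active layers * bits per code index."""
    return frame_rate_hz * n_active_layers * math.log2(codebook_size)


if __name__ == "__main__":
    for k in (1, 4, 8, 16, 32):  # decode with only the first k RVQ streams
        print(f"k={k:2d}: {bitrate_bps(12.5, k, 1024):.0f} bps")
```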
Method
The authors leverage a purely Transformer-based architecture, CAT (Causal Audio Tokenizer with Transformer), to achieve scalable, high-fidelity discrete audio tokenization without relying on convolutional inductive biases. The framework operates directly on raw waveforms and is designed for end-to-end training, supporting both semantic alignment and acoustic reconstruction. Refer to the framework diagram for an overview of the encoder-decoder structure and auxiliary components.

The encoder processes raw 24 kHz audio by first patchifying the waveform into fixed-dimensional vectors and then applying a stack of causal Transformer blocks. To progressively compress the temporal resolution, patchify operations are inserted after specific Transformer layers, reducing the sequence length and ultimately mapping the input to a discrete token sequence at 12.5 Hz. The decoder mirrors this structure in reverse, reconstructing the waveform from discrete tokens in a fully causal manner. Discretization is handled by a 32-layer residual vector quantizer (RVQ), which supports variable-bitrate tokenization via quantizer dropout during training. Each quantization layer employs factorized vector quantization with L2-normalized codebooks, and the codebook entries are optimized directly via gradient descent.
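To make the data flow concrete, below is a minimal, illustrative PyTorch sketch of this encoder path: waveform patchification, causal Transformer blocks, progressive temporal downsampling to 12.5 Hz, and residual quantization with factorized, L2-normalized codebooks. Layer counts, widths, patch sizes, code dimensions, and the number of RVQ stages are assumptions chosen for readability rather than the authors' configuration, and straight-through gradients plus the commitment/codebook losses are omitted.

```python
# A minimal sketch (not the authors' code) of a CAT-style causal encoder:
# raw waveform -> patchify -> causal Transformer blocks -> further patchify
# (temporal downsampling) -> RVQ with factorized, L2-normalized codebooks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Patchify(nn.Module):
    """Fold `patch` consecutive frames into one vector and project to `dim`."""
    def __init__(self, in_dim: int, patch: int, dim: int):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(in_dim * patch, dim)

    def forward(self, x):                       # x: (B, T, in_dim)
        B, T, D = x.shape
        x = x[:, : T - T % self.patch]          # drop remainder frames
        x = x.reshape(B, -1, self.patch * D)    # (B, T // patch, patch * in_dim)
        return self.proj(x)


class CausalBlock(nn.Module):
    """Pre-norm Transformer block with a causal self-attention mask."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class FactorizedVQ(nn.Module):
    """One RVQ stage: project into a low-dim space, match L2-normalized codes."""
    def __init__(self, dim: int, code_dim: int = 8, codebook_size: int = 1024):
        super().__init__()
        self.down = nn.Linear(dim, code_dim)
        self.up = nn.Linear(code_dim, dim)
        self.codebook = nn.Parameter(torch.randn(codebook_size, code_dim))

    def forward(self, r):                       # r: (B, T, dim) current residual
        z = F.normalize(self.down(r), dim=-1)   # factorized, normalized lookup space
        codes = F.normalize(self.codebook, dim=-1)
        idx = (z @ codes.t()).argmax(-1)        # nearest code under cosine similarity
        return self.up(codes[idx]), idx         # straight-through/commitment omitted


class CATEncoderSketch(nn.Module):
    def __init__(self, dim: int = 256, n_rvq: int = 4):
        super().__init__()
        # Total downsampling 40 * 4 * 3 * 4 = 1920, so 24_000 Hz -> 12.5 Hz tokens.
        self.stem = Patchify(1, 40, dim)
        self.stage1 = nn.Sequential(*[CausalBlock(dim) for _ in range(2)], Patchify(dim, 4, dim))
        self.stage2 = nn.Sequential(*[CausalBlock(dim) for _ in range(2)], Patchify(dim, 3, dim))
        self.stage3 = nn.Sequential(*[CausalBlock(dim) for _ in range(2)], Patchify(dim, 4, dim))
        self.rvq = nn.ModuleList(FactorizedVQ(dim) for _ in range(n_rvq))

    def forward(self, wav):                     # wav: (B, samples) at 24 kHz
        x = self.stage3(self.stage2(self.stage1(self.stem(wav.unsqueeze(-1)))))
        residual, quantized, indices = x, torch.zeros_like(x), []
        for vq in self.rvq:                     # residual vector quantization
            q, idx = vq(residual)
            quantized, residual = quantized + q, residual - q
            indices.append(idx)
        return quantized, torch.stack(indices, dim=-1)  # codes: (B, T_12.5Hz, n_rvq)
```

The decoder would mirror this structure with unpatchify (upsampling) operations and causal blocks; the full model instead uses a 32-layer RVQ and substantially larger Transformer stacks.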
To encourage semantically rich representations, the authors attach a 0.5B-parameter decoder-only causal language model (LLM) that conditions on the quantizer’s hidden states. The LLM is trained on diverse audio-to-text tasks—including ASR, multi-speaker ASR, and audio captioning—using a task-specific prompt token prepended to the input. The semantic loss is computed as:
$$\mathcal{L}_{\text{sem}} = -\sum_{t=1}^{|s|} \log p_{\theta_{\text{LLM}}}\left(s_t \mid T, q, s_{<t}\right),$$
where $s$ is the target text sequence, $q$ is the quantized audio representation, and $T$ is the task tag.
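Concretely, the semantic loss is ordinary next-token cross-entropy over the transcript, conditioned on the task tag and the quantized audio states. A schematic is sketched below; the wiring (a generic decoder-only `llm` that maps input embeddings to vocabulary logits, plus a text embedding table) is an assumption rather than the authors' exact implementation, and padding is ignored for brevity.

```python
# Schematic of L_sem: cross-entropy on the target text, conditioned on a task
# tag and the quantized audio hidden states. `llm`, `task_embed`, and
# `text_embed_table` are placeholders, not the authors' actual modules.
import torch
import torch.nn.functional as F


def semantic_loss(llm, task_embed, audio_hidden, text_ids, text_embed_table):
    """task_embed: (B, 1, D); audio_hidden: (B, Ta, D); text_ids: (B, Ts)."""
    text_embeds = text_embed_table(text_ids)                    # (B, Ts, D)
    inputs = torch.cat([task_embed, audio_hidden, text_embeds], dim=1)
    logits = llm(inputs)                                        # (B, 1 + Ta + Ts, V)

    # Each text token s_t is predicted from everything before it, so compare
    # the logits at the position just before s_t against s_t itself.
    prefix = 1 + audio_hidden.size(1)
    text_logits = logits[:, prefix - 1 : prefix - 1 + text_ids.size(1)]
    return F.cross_entropy(text_logits.transpose(1, 2), text_ids)
```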
Acoustic fidelity is ensured through a multi-scale mel-spectrogram loss:
$$\mathcal{L}_{\text{rec}} = \sum_{i=5}^{11} \left\| S_{2^i}(x) - S_{2^i}(\hat{x}) \right\|_1,$$
where $S_{2^i}(\cdot)$ denotes the mel-spectrogram computed with window size $2^i$ and hop size $2^{i-2}$. Adversarial training with multiple discriminators further enhances perceptual quality, following the loss formulations from XY-Tokenizer. The overall generator objective combines the semantic, reconstruction, commitment, codebook, adversarial, and feature matching losses with learnable weights:
$$\mathcal{L}_G = \lambda_{\text{sem}}\mathcal{L}_{\text{sem}} + \lambda_{\text{rec}}\mathcal{L}_{\text{rec}} + \lambda_{\text{cmt}}\mathcal{L}_{\text{cmt}} + \lambda_{\text{code}}\mathcal{L}_{\text{code}} + \lambda_{\text{adv}}\mathcal{L}_{\text{adv}} + \lambda_{\text{feat}}\mathcal{L}_{\text{feat}}.$$
All components (encoder, quantizer, decoder, LLM, and discriminators) are optimized jointly in an end-to-end fashion.
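Below is a minimal sketch of the multi-scale mel term $\mathcal{L}_{\text{rec}}$ using torchaudio. Window and hop sizes follow the formula above; the number of mel bins per scale is an assumption, since it is not specified here.

```python
# Multi-scale mel-spectrogram reconstruction loss: L1 distance between mel
# spectrograms of x and x_hat at window sizes 2^5 ... 2^11 with hop 2^(i-2).
# Mel-bin counts are assumed; the mean is used as a normalized L1 norm.
import torch
import torchaudio


class MultiScaleMelLoss(torch.nn.Module):
    def __init__(self, sample_rate: int = 24_000):
        super().__init__()
        self.mels = torch.nn.ModuleList(
            torchaudio.transforms.MelSpectrogram(
                sample_rate=sample_rate,
                n_fft=2**i,
                win_length=2**i,
                hop_length=2 ** (i - 2),
                n_mels=min(80, 2**i // 8),   # assumed; kept small enough for valid filters
            )
            for i in range(5, 12)            # window sizes 32 ... 2048 samples
        )

    def forward(self, x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
        """x, x_hat: (B, samples) waveforms; returns the summed per-scale L1 distance."""
        return sum((mel(x) - mel(x_hat)).abs().mean() for mel in self.mels)
```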
For end-to-end autoregressive speech generation, the authors build CAT-TTS, which directly predicts CAT’s RVQ tokens from text and speaker prompts. The model employs a Temporal Transformer to capture long-range dependencies across time and a Depth Transformer to model the coarse-to-fine structure within each time step. The Depth Transformer autoregressively predicts RVQ tokens conditioned only on previous time steps and preceding layers at the current step, preserving strict causality.
To enable variable-bitrate synthesis within a single model, the authors introduce Progressive Sequence Dropout. During training, with probability $p$, a random prefix length $K \in \{1, \ldots, N_q - 1\}$ is sampled, and RVQ tokens from layers $K+1$ to $N_q$ are dropped. The effective number of active layers is defined as:
$$\hat{K} = (1 - z)\,N_q + z\,K,$$
where $z \sim \mathrm{Bernoulli}(p)$. The Temporal Transformer receives aggregated embeddings from the first $\hat{K}$ layers at each time step:
$$\tilde{e}_t = \sum_{k=1}^{\hat{K}} \mathrm{Emb}_k(q_{t,k}).$$
The training loss is computed only over the retained prefix:
$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{k=1}^{\hat{K}} \log p_{\theta}\left(q_{t,k} \mid x, q_{<t}, q_{t,<k}\right).$$
At inference, the synthesis bitrate is controlled by selecting an inference depth $K_{\text{infer}}$. The Temporal Transformer processes the first $K_{\text{infer}}$ RVQ streams, and the Depth Transformer predicts only those layers. The resulting tokens are decoded into waveforms by the CAT decoder, which is inherently robust to varying bitrates thanks to quantizer dropout during training.
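The sketch below illustrates the Progressive Sequence Dropout logic just described: sample $z \sim \mathrm{Bernoulli}(p)$ and a prefix length $K$, aggregate embeddings over the first $\hat{K}$ RVQ layers, and restrict the loss to that prefix. The Temporal and Depth Transformers are abstracted away, and tensor shapes and module names are illustrative assumptions.

```python
# Schematic of Progressive Sequence Dropout during CAT-TTS training.
# q_codes: (B, T, Nq) RVQ indices; embed_tables: one nn.Embedding per layer.
import random
import torch
import torch.nn.functional as F


def aggregate_with_dropout(q_codes, embed_tables, p: float):
    """Sum embeddings over the first K_hat RVQ layers; return (embeds, K_hat)."""
    n_q = q_codes.size(-1)
    z = random.random() < p                             # z ~ Bernoulli(p)
    k_hat = random.randint(1, n_q - 1) if z else n_q    # K_hat = (1 - z) Nq + z K
    embeds = torch.stack(                               # e_t = sum_k Emb_k(q_{t,k})
        [embed_tables[j](q_codes[..., j]) for j in range(k_hat)], dim=0
    ).sum(0)                                            # (B, T, D)
    return embeds, k_hat


def prefix_loss(layer_logits, q_codes, k_hat: int):
    """Cross-entropy only over the retained prefix of RVQ layers (1 .. K_hat).

    layer_logits: list of (B, T, V) tensors, one per RVQ layer, e.g. produced
    by the Depth Transformer; layers deeper than K_hat are simply ignored.
    """
    losses = [
        F.cross_entropy(layer_logits[j].flatten(0, 1), q_codes[..., j].flatten())
        for j in range(k_hat)
    ]
    return torch.stack(losses).sum()
```

At inference the same machinery applies, with the fixed depth $K_{\text{infer}}$ taking the place of the sampled $\hat{K}$.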
As shown in the accompanying figure, the Temporal Transformer processes text and aggregated audio token embeddings over time, while the Depth Transformer autoregressively predicts RVQ tokens across layers, with dropout enabling variable-depth generation.

Experiment
- MOSS-Audio-Tokenizer excels in speech reconstruction across all bitrates, outperforming prior methods at low bitrates and achieving state-of-the-art results at medium and high bitrates; it maintains competitive performance on general audio and music, with quality improving as bitrate increases.
- Progressive Sequence Dropout significantly enhances robustness of the TTS system under reduced bitrates, enabling stable performance across varying dropout rates while reducing training memory usage; p=1.0 is adopted for optimal efficiency without quality loss.
- CAT-TTS surpasses prior discrete autoregressive TTS models in speaker similarity and matches top systems like IndexTTS2 and VoxCPM in word error rate, achieving the highest speaker similarity scores on Seed-TTS-Eval for both English and Chinese, validating its effectiveness for zero-shot generation.
- End-to-end optimization is critical for CAT’s scalability, enabling continuous improvement in reconstruction quality with increased training, unlike partial optimization which plateaus early due to frozen components.
- Model parameters and quantization capacity must scale together; larger models benefit at high bitrates but underperform at low bitrates, revealing bitrate as the primary bottleneck—optimal scaling requires synchronized expansion of both.
- Reconstruction fidelity consistently improves with increased training batch size, showing predictable scaling behavior where higher throughput directly translates to higher quality within the same training steps.
- Subjective evaluations confirm MOSS-Audio-Tokenizer delivers high perceptual quality across bitrates, outperforming variable-bitrate tokenizers at low bitrates and matching specialized tokenizers at their target bitrates.
- CAT tokens support effective speech understanding: CAT-ASR achieves competitive WER and CER on English and Chinese benchmarks when fed directly into an LLM, demonstrating strong alignment with text and sufficient linguistic content preservation.
The authors use CAT-TTS, a fully autoregressive discrete token-based system, to achieve state-of-the-art speaker similarity scores on both English and Chinese benchmarks while maintaining low word error rates. Results show that CAT-TTS outperforms prior discrete autoregressive models and matches or exceeds performance of recent cascaded and non-autoregressive systems, demonstrating the effectiveness of its unified discrete interface for high-quality zero-shot speech generation. The system also supports variable bitrate control, enabling flexible synthesis without compromising fidelity.

The authors use MOSS-Audio-Tokenizer to achieve strong reconstruction across speech, sound, and music at multiple bitrates, outperforming prior methods especially at low and medium bitrates while maintaining scalability through end-to-end optimization. Results show that its transformer-based architecture, variable bitrate support, and semantic richness contribute to consistent high-quality reconstruction without relying on pretrained encoders. The model’s design enables robust performance across domains and bitrates, distinguishing it from other tokenizers that either lack end-to-end training or fail to scale effectively.

The authors use CAT tokens as direct inputs to a large language model for automatic speech recognition, achieving competitive word and character error rates on English and Chinese benchmarks. Results show that CAT tokens preserve sufficient linguistic content and align well with text, enabling effective speech understanding without additional alignment or auxiliary supervision.

MOSS-Audio-Tokenizer consistently outperforms or matches other open-source audio tokenizers across speech, general audio, and music benchmarks at low, medium, and high bitrates, with performance improving as bitrate increases. The model demonstrates strong scalability through end-to-end optimization, maintaining high reconstruction fidelity even under variable bitrate conditions. Its design enables robust performance across diverse audio types without requiring separate models for different bitrates or domains.
