
Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis

Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, Yijin Xing

Abstract

Text-to-Speech (TTS) systems face ongoing challenges in processing complex linguistic features, handling polyphonic expressions, and producing natural-sounding multilingual speech, capabilities that are crucial for future AI applications. In this paper, we present Fish-Speech, a novel framework that implements a serial fast-slow Dual Autoregressive (Dual-AR) architecture to enhance the stability of Grouped Finite Scalar Vector Quantization (GFSQ) in sequence generation tasks. This architecture improves codebook processing efficiency while maintaining high-fidelity outputs, making it particularly effective for AI interactions and voice cloning. Fish-Speech leverages Large Language Models (LLMs) for linguistic feature extraction, eliminating the need for traditional grapheme-to-phoneme (G2P) conversion and thereby streamlining the synthesis pipeline and enhancing multilingual support. Additionally, we developed FF-GAN through GFSQ to achieve superior compression ratios and near 100% codebook utilization. Our approach addresses key limitations of current TTS systems while providing a foundation for more sophisticated, context-aware speech synthesis. Experimental results show that Fish-Speech significantly outperforms baseline models in handling complex linguistic scenarios and voice cloning tasks, demonstrating its potential to advance TTS technology in AI applications.

One-sentence Summary

The authors propose Fish-Speech, a novel multilingual text-to-speech framework that utilizes a serial fast-slow Dual-AR architecture to enhance Grouped Finite Scalar Vector Quantization stability and leverages Large Language Models for linguistic feature extraction to eliminate traditional G2P conversion, achieving high-fidelity voice cloning and superior compression through FF-GAN.

Key Contributions

  • The paper introduces Fish-Speech, a framework featuring a serial fast-slow Dual Autoregressive (Dual-AR) architecture designed to improve the stability of Grouped Finite Scalar Vector Quantization (GFSQ) during sequence generation.
  • This work implements a non-grapheme-to-phoneme (non-G2P) structure that utilizes Large Language Models (LLMs) for linguistic feature extraction, which streamlines the synthesis pipeline and enhances multilingual support.
  • The researchers developed FF-GAN through GFSQ to achieve superior compression ratios and near 100% codebook utilization, resulting in a system that outperforms baseline models in complex linguistic scenarios and voice cloning tasks.

Introduction

High-quality Text-to-Speech (TTS) systems are essential for advancing AI interactions, voice cloning, and multilingual virtual assistants. Traditional architectures often rely on grapheme-to-phoneme (G2P) conversion, which struggles with context-dependent polyphonic words and requires complex, language-specific phonetic rules that hinder scalability. Furthermore, many existing solutions must trade off semantic understanding for acoustic stability, which can limit their effectiveness in voice cloning. The authors present Fish-Speech, a novel framework that uses Large Language Models (LLMs) for direct linguistic feature extraction, eliminating the need for G2P conversion. Their main contribution is a serial fast-slow Dual Autoregressive (Dual-AR) architecture paired with a new Firefly-GAN (FF-GAN) vocoder, which achieves high-fidelity synthesis, near 100% codebook utilization, and significantly lower latency than traditional DiT or Flow-based structures.

Dataset

The authors constructed a massive multilingual speech dataset totaling approximately 720,000 hours, sourced from a combination of public repositories and proprietary data collection.

  • Dataset Composition and Sources: The collection is built from diverse public sources and internal data collection processes to ensure high variety.
  • Subset Details:
    • English and Mandarin Chinese: These serve as the primary components, with 300,000 hours allocated to each language.
    • Other Language Families: The authors included 20,000 hours each for Germanic (German), Romance (French, Italian), East Asian (Japanese, Korean), and Semitic (Arabic) language families.
  • Data Processing and Strategy: The authors implemented a careful balancing strategy across languages. This distribution is designed to facilitate simultaneous multilingual learning and improve the model's ability to generate natural mixed-language content, as illustrated in the sketch after this list.
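
For concreteness, here is a hypothetical view of the hour budget. Reading "20,000 hours each" as per language makes the subtotals sum exactly to the reported 720,000 hours; that reading, and the sampling-weight calculation, are our assumptions rather than details from the paper:

```python
# Hypothetical per-language hour budget implied by the reported totals.
hours = {
    "en": 300_000, "zh": 300_000,   # primary languages
    "de": 20_000,                   # Germanic
    "fr": 20_000, "it": 20_000,     # Romance
    "ja": 20_000, "ko": 20_000,     # East Asian
    "ar": 20_000,                   # Semitic
}

total = sum(hours.values())         # 720,000 hours, matching the reported size
weights = {lang: h / total for lang, h in hours.items()}
print(total, round(weights["en"], 3))  # 720000 0.417
```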

Method

The Fish-Speech framework is designed as a high-performance Text-to-Speech (TTS) system capable of handling multi-emotional and multilingual speech synthesis. The authors leverage a hierarchical approach that combines a dual autoregressive architecture with an advanced vector quantization technique to ensure both stability and efficiency.

The core of the system is the Dual Autoregressive (Dual-AR) architecture, which processes speech information through two sequential transformer modules: a Slow Transformer followed by a Fast Transformer (the paper provides an architectural overview of the Dual-AR framework).

The Slow Transformer operates at a high level of abstraction, taking input text embeddings to capture global linguistic structures and semantic content. It generates intermediate hidden states $\mathbf{h}$ and predicts semantic token logits $\mathbf{z}$ via the following transformations:

$$
\begin{aligned}
\mathbf{h} &= \mathrm{SlowTransformer}(\mathbf{x}), \\
\mathbf{z} &= \mathbf{W}_{\mathrm{tok}} \cdot \mathrm{Norm}(\mathbf{h}),
\end{aligned}
$$

where $\mathrm{Norm}(\cdot)$ represents layer normalization and $\mathbf{W}_{\mathrm{tok}}$ denotes the learnable parameters of the token prediction layer. Following this, the Fast Transformer refines the output by processing detailed acoustic features. It takes a concatenated sequence of the hidden states $\mathbf{h}$ and codebook embeddings $\mathbf{c}$ as input:

$$
\begin{aligned}
\tilde{\mathbf{h}} &= [\mathbf{h}; \mathbf{c}], \\
\mathbf{h}^{\mathrm{fast}} &= \mathrm{FastTransformer}(\tilde{\mathbf{h}}), \\
\mathbf{y} &= \mathbf{W}_{\mathrm{cbk}} \cdot \mathrm{Norm}(\mathbf{h}^{\mathrm{fast}}),
\end{aligned}
$$

where $\mathbf{W}_{\mathrm{cbk}}$ contains the learnable parameters for codebook prediction, and $\mathbf{y}$ represents the resulting codebook logits. This hierarchical design enhances sequence generation stability and optimizes codebook processing.
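
To make the data flow concrete, here is a minimal PyTorch sketch of the Dual-AR forward pass implied by the equations above. The layer counts, dimensions, `nn.TransformerEncoder` stand-ins, and the omission of causal masking are our simplifications, not the authors' implementation:

```python
import torch
import torch.nn as nn

class DualARSketch(nn.Module):
    """Illustrative fast-slow Dual-AR forward pass (causal masking omitted)."""

    def __init__(self, d_model=512, n_tokens=1024, n_codes=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.slow = nn.TransformerEncoder(layer, num_layers=6)  # global semantics
        self.fast = nn.TransformerEncoder(layer, num_layers=2)  # acoustic detail
        self.norm_tok = nn.LayerNorm(d_model)
        self.norm_cbk = nn.LayerNorm(d_model)
        self.W_tok = nn.Linear(d_model, n_tokens)  # token prediction head
        self.W_cbk = nn.Linear(d_model, n_codes)   # codebook prediction head

    def forward(self, x, c):
        h = self.slow(x)                       # h = SlowTransformer(x)
        z = self.W_tok(self.norm_tok(h))       # z = W_tok . Norm(h)
        h_tilde = torch.cat([h, c], dim=1)     # h~ = [h; c] along the sequence
        h_fast = self.fast(h_tilde)            # h_fast = FastTransformer(h~)
        y = self.W_cbk(self.norm_cbk(h_fast))  # y = W_cbk . Norm(h_fast)
        return z, y
```

Here `x` is the text embedding sequence and `c` the codebook embedding sequence, both shaped `(batch, length, d_model)`.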

To facilitate efficient quantization, the authors introduce Grouped Finite Scalar Vector Quantization (GFSQ). This method divides the input feature matrix $\mathbf{Z}$ into $G$ groups:

$$
\mathbf{Z} = \left[ \mathbf{Z}^{(1)}, \mathbf{Z}^{(2)}, \ldots, \mathbf{Z}^{(G)} \right]
$$

Each scalar within a group undergoes quantization, and the resulting indices are used to decode the quantized vectors. The quantized vectors from all groups are then concatenated along the channel dimension to reconstruct the quantized downsampled tensor $\mathbf{z}_{q_d}$:

$$
\mathbf{z}_{q_d}(b, :, l) = \left[ \mathbf{z}_{q_d}^{(1)}(b, :, l);\ \mathbf{z}_{q_d}^{(2)}(b, :, l);\ \ldots;\ \mathbf{z}_{q_d}^{(G)}(b, :, l) \right]
$$

Finally, an upsampling function $f_{\mathrm{up}}$ restores the tensor to its original temporal resolution.
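
A minimal sketch of the grouping and channel-wise concatenation described above follows. The straight-through rounding is a common FSQ implementation choice and an assumption on our part, as are the group count, level count, and tensor shapes:

```python
import torch

def gfsq_sketch(z, num_groups=4, levels=8):
    """Grouped finite scalar quantization sketch (illustrative, not official).

    z: (batch, channels, length) downsampled features;
    channels must be divisible by num_groups.
    """
    groups = z.chunk(num_groups, dim=1)  # Z = [Z^(1), ..., Z^(G)]
    quantized = []
    half = levels / 2
    for g in groups:
        g = torch.tanh(g)                       # bound each scalar to (-1, 1)
        q = torch.round(g * half) / half        # snap to a small fixed grid
        quantized.append(g + (q - g).detach())  # straight-through gradient
    # Concatenate groups along the channel dimension to rebuild z_qd.
    return torch.cat(quantized, dim=1)

z = torch.randn(2, 32, 100)  # (batch, channels, frames)
z_qd = gfsq_sketch(z)
assert z_qd.shape == z.shape
```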

The speech synthesis is completed by the Firefly-GAN (FF-GAN) vocoder. This component improves upon traditional architectures by replacing the Multi-Receptive Field (MRF) module with a ParallelBlock, which uses a stack-and-average mechanism over branches with configurable convolution kernel sizes (the paper illustrates the Firefly Generator's internal structure).
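
The stack-and-average idea can be sketched as follows: parallel convolution branches with different receptive fields process the same input, and their outputs are stacked and averaged rather than residually stacked as in MRF. Channel counts and kernel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ParallelBlockSketch(nn.Module):
    """Stack-and-average over parallel conv branches (illustrative)."""

    def __init__(self, channels=256, kernel_sizes=(3, 7, 11)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        # Stack branch outputs along a new leading dim, then average.
        return torch.stack([b(x) for b in self.branches], dim=0).mean(dim=0)
```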

The overall training process follows a three-stage pipeline: initial pre-training on large-scale standard data, Supervised Fine-Tuning (SFT) on high-quality datasets, and Direct Preference Optimization (DPO) using manually labeled positive and negative sample pairs. The training infrastructure is divided into separate components for the AR model and the vocoder to optimize resource utilization.
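
The paper does not publish its training code, but the third stage presumably optimizes the standard DPO preference objective over the labeled pairs; a sketch, with the β value and tensor names as our assumptions:

```python
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO objective: raise the policy's likelihood margin on
    preferred samples relative to a frozen reference model (illustrative)."""
    pos_margin = logp_pos - ref_logp_pos  # policy vs. reference, positive sample
    neg_margin = logp_neg - ref_logp_neg  # policy vs. reference, negative sample
    return -F.logsigmoid(beta * (pos_margin - neg_margin)).mean()
```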

Experiment

The evaluation framework utilizes objective metrics, subjective listening tests, and inference speed benchmarks to validate the model's performance in voice cloning and real-time processing. Results demonstrate that the GFSQ and typo-codebook strategies significantly enhance codebook utilization, speaker similarity, and content fidelity compared to baseline models. Ultimately, the system achieves high speech naturalness and efficient real-time inference, making it a robust architecture for high-quality speech synthesis and AI agent applications.

The authors conduct a subjective evaluation of synthesized audio quality using Mean Opinion Score (MOS) ratings. The results demonstrate that the proposed Fish-Speech model achieves higher perceptual quality scores than the tested baselines:

  • Fish-Speech outperforms baseline models such as CosyVoice, F5-TTS, and reecho in subjective quality ratings.
  • The proposed method shows superior performance in terms of speech naturalness and speaker similarity.
  • The results indicate that the model better captures and reproduces natural human speech characteristics.

The authors evaluate the performance of the Fish-Speech model in voice cloning tasks by measuring the Word Error Rate (WER). The results demonstrate that the proposed model achieves higher intelligibility than several baselines:

  • Fish-Speech achieves a lower word error rate than all listed baseline models.
  • Its content fidelity even surpasses that of the ground-truth recordings.
  • The proposed methodology shows a significant improvement in synthesis stability over competing models.
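As an illustration of how such an intelligibility check is typically computed (the summary does not name the paper's exact tooling), WER compares an ASR transcript of the synthesized audio against the reference text. The `jiwer` library and the toy transcripts below are our choices:

```python
from jiwer import wer  # pip install jiwer

reference = "the quick brown fox jumps over the lazy dog"    # ground-truth text
hypothesis = "the quick brown fox jumped over the lazy dog"  # ASR output for synthesized audio

print(f"WER: {wer(reference, hypothesis):.3f}")  # 1 substitution / 9 words = 0.111
```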

The authors evaluate speaker similarity across different models using Resemblyzer and SpeechBrain metrics. The results demonstrate that the proposed Fish-Speech model closely approximates ground-truth performance in both evaluation frameworks:

  • Fish-Speech achieves speaker similarity scores that are remarkably close to the ground truth.
  • The proposed method outperforms baseline models such as F5-TTS and reecho in capturing speaker characteristics.
  • Consistent performance across both evaluation metrics validates the model's ability to preserve speaker identity.
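The Resemblyzer half of this comparison can be reproduced with a short script along the following lines; the file paths are placeholders, and using cosine similarity between utterance embeddings is our assumption about the exact metric:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav  # pip install resemblyzer

encoder = VoiceEncoder()

# Embed the reference speaker and the cloned utterance.
ref_embed = encoder.embed_utterance(preprocess_wav("reference.wav"))
clone_embed = encoder.embed_utterance(preprocess_wav("cloned.wav"))

# embed_utterance returns L2-normalized vectors, so the dot
# product equals cosine similarity.
print(f"speaker similarity: {float(np.dot(ref_embed, clone_embed)):.3f}")
```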

The authors evaluate the Fish-Speech model through subjective quality assessments, voice cloning intelligibility tests, and speaker similarity measurements. The results demonstrate that the proposed model captures natural human speech characteristics more faithfully and maintains higher content fidelity than several baseline models. Furthermore, the model shows a remarkable ability to preserve speaker identity and provides enhanced synthesis stability across different evaluation frameworks.

