Thai Binh Nguyen
Abstract
Our models are pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled speech from the VLSP ASR dataset, sampled at 16 kHz. We use the wav2vec 2.0 architecture for the pre-trained model. In the fine-tuning phase, wav2vec 2.0 is trained with Connectionist Temporal Classification (CTC), an algorithm for training neural networks on sequence-to-sequence problems, used mainly in automatic speech recognition and handwriting recognition. On the Vivos dataset, we achieve a WER of 6.15.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| speech-recognition-on-common-voice-vi | Vietnamese end-to-end speech recognition using wav2vec 2.0 by VietAI | Test WER: 11.52 |
| speech-recognition-on-vivos | Vietnamese end-to-end speech recognition using wav2vec 2.0 by VietAI | Test WER: 6.15 |