The IBM 2016 English Conversational Telephone Speech Recognition System
George Saon Tom Sercu Steven Rennie Hong-Kwang J. Kuo

Abstract
We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation test set. On the acoustic side, we use a score fusion of three strong models: recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidirectional long short-term memory nets, all of which operate on FMLLR and i-vector features. On the language modeling side, we use an updated model M and hierarchical neural network LMs.
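The abstract mentions score fusion of the three acoustic models. One common way to realize this is a per-frame weighted combination of each model's log-posteriors over HMM states, with the weights tuned on held-out data. The sketch below illustrates that idea only; the function name, the equal weights, and the log-linear form are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def fuse_scores(log_posteriors, weights):
    """Frame-level score fusion: weighted sum of per-model log-posteriors,
    renormalized so each frame's fused scores form a distribution.

    log_posteriors: list of (frames, states) arrays, one per acoustic model.
    weights: one fusion weight per model (typically tuned on held-out data).
    """
    fused = sum(w * lp for w, lp in zip(weights, log_posteriors))
    # log-sum-exp normalization per frame so exp(fused) sums to 1
    fused -= np.logaddexp.reduce(fused, axis=1, keepdims=True)
    return fused

# Toy example: two models, 3 frames, 4 HMM states
rng = np.random.default_rng(0)
lps = [np.log(rng.dirichlet(np.ones(4), size=3)) for _ in range(2)]
fused = fuse_scores(lps, weights=[0.5, 0.5])
```

The fused per-frame scores would then replace a single model's posteriors during decoding.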
Benchmarks
| Benchmark | Methodology | WER (%) |
|---|---|---|
| speech-recognition-on-swb_hub_500-wer | RNN + VGG + LSTM acoustic model trained on SWB+Fisher+CH, N-gram + model M + NNLM language model | 12.2 |
| speech-recognition-on-switchboard-hub500 | RNN + VGG + LSTM acoustic model trained on SWB+Fisher+CH, N-gram + model M + NNLM language model | 6.6 |
| speech-recognition-on-switchboard-hub500 | IBM 2016 | 6.9 |