
Acoustic Model Discussions

New 160k words 1080 hours english models released
User: guenter
Date: 10/30/2017 5:01 pm

After working on the German VoxForge model for some time, I have now applied the scripts I developed for those models to a combination of the English LibriSpeech and VoxForge corpora. The resulting models can be downloaded from:

The scripts I am using to build my models can be found on GitHub here:

While I took a largely manual approach for the German models, I decided to try a more or less fully automated approach for the English ones, mostly because a lot of speech model resources are already available for English (whereas I had to start almost from scratch for the German models).

The lexicon is based on CMUdict, to which I added missing entries using Sequitur G2P (trained on CMUdict).

The audio recordings consist of:

  •  the "good" LibriSpeech recordings
  •  those same recordings with noise and reverb added to them at random

I trained a first Kaldi nnet3 model on these recordings, then used that model to decode all the recordings from the English VoxForge corpus. Recordings whose decoding results matched their transcripts were added to my corpus. I iterated this process once more and plan to do more iterations in the future, along with manual reviews.
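The auto-review step described above can be sketched roughly as follows. This is only an illustration and not the code from my scripts: the function names, the normalization rule, and the sample data are assumptions made for the example. It accepts a recording only when the decoder output equals the prompt after light normalization.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation so trivial differences do not block a match."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())

def auto_review(transcripts, hypotheses):
    """Return the IDs of utterances whose decoding matches the transcript exactly.

    transcripts: dict of utterance ID -> prompt text (hypothetical layout)
    hypotheses:  dict of utterance ID -> decoder output (hypothetical layout)
    """
    accepted = []
    for utt_id, reference in transcripts.items():
        hypothesis = hypotheses.get(utt_id)
        if hypothesis is not None and normalize(hypothesis) == normalize(reference):
            accepted.append(utt_id)
    return accepted

# Made-up sample data, for illustration only.
transcripts = {
    "utt1": "I cannot follow you, she said.",
    "utt2": "The quick brown fox",
    "utt3": "Hello world",
}
hypotheses = {
    "utt1": "i cannot follow you she said",
    "utt2": "the quick brown box",
}

accepted = auto_review(transcripts, hypotheses)
print(accepted)
```

Requiring an exact match is deliberately strict: it rejects some good recordings, but it keeps badly read or mistranscribed submissions out of the training set, and each retrained model can recover more of them in the next round.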


159373 lexicon entries.
total duration of all good submissions: 1038:59:40
%WER 7.30 [ 36196 / 496128, 2226 ins, 16007 del, 17963 sub ] exp/nnet3/nnet_tdnn_a/decode/wer_8_0.0
CMU Sphinx models:
cmusphinx cont model: SENTENCE ERROR: 85.5% (12906/15093)   WORD ERROR RATE: 18.0% (89407/496158)
cmusphinx ptm model: SENTENCE ERROR: 89.2% (13467/15093)   WORD ERROR RATE: 24.2% (120169/496158)
sequitur g2p model:
    total: 13147 strings, 99753 symbols
    successfully translated: 13146 (99.99%) strings, 99746 (99.99%) symbols
        string errors:       4881 (37.13%)
        symbol errors:       9557 (9.58%)
            insertions:      2190 (2.20%)
            deletions:       2422 (2.43%)
            substitutions:   4945 (4.96%)
    translation failed:      1 (0.01%) strings, 7 (0.01%) symbols
    total string errors:     4882 (37.13%)
    total symbol errors:     9564 (9.59%)
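For anyone who wants to sanity-check the Kaldi %WER line in the figures above: the error count is the sum of insertions, deletions and substitutions, and the rate is that count divided by the number of reference words. The small parser below is a sketch written for this post (the regular expression and the names are my own assumptions, not part of Kaldi).

```python
import re

# Matches lines such as:
# %WER 7.30 [ 36196 / 496128, 2226 ins, 16007 del, 17963 sub ] <path>
WER_LINE = re.compile(
    r"%WER\s+(?P<wer>[\d.]+)\s+\[\s*(?P<errors>\d+)\s*/\s*(?P<words>\d+),\s*"
    r"(?P<ins>\d+)\s+ins,\s*(?P<dels>\d+)\s+del,\s*(?P<subs>\d+)\s+sub\s*\]"
)

def parse_wer(line):
    """Parse one %WER line and recompute the rate from its raw counts."""
    m = WER_LINE.search(line)
    if m is None:
        raise ValueError("not a %WER line: " + line)
    errors = int(m.group("errors"))
    words = int(m.group("words"))
    total = int(m.group("ins")) + int(m.group("dels")) + int(m.group("subs"))
    return {
        "reported": float(m.group("wer")),
        "errors": errors,
        "words": words,
        "sum": total,                          # ins + del + sub
        "computed": 100.0 * errors / words,    # WER in percent
    }

line = ("%WER 7.30 [ 36196 / 496128, 2226 ins, 16007 del, 17963 sub ] "
        "exp/nnet3/nnet_tdnn_a/decode/wer_8_0.0")
result = parse_wer(line)
print(result["sum"] == result["errors"], round(result["computed"], 2))
```

The counts in the line are consistent with each other: 2226 + 16007 + 17963 = 36196, and 36196 / 496128 rounds to the reported 7.30%.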

--- (Edited on 10/30/2017 5:01 pm [GMT-0500] by guenter) ---

Re: New 160k words 1080 hours english models released
User: kmaclean
Date: 10/31/2017 7:44 am

>I have now applied the scripts I developed for those models to a combination of the english librispeech and voxforge corpora

Very impressive!

Thank you,


--- (Edited on 10/31/2017 8:44 am [GMT-0400] by kmaclean) ---

Re: New 160k words 1080 hours english models released
User: guenter
Date: 11/29/2017 6:04 pm

I have done another auto-review round using the latest model. This time, I also upgraded to Kaldi 5.2 and used it to train TDNN chain models, with quite encouraging results:

%WER 2.48 [ 12525 / 504653, 737 ins, 2720 del, 9068 sub ] exp/nnet3_chain/tdnn_sp/decode_test/wer_10_0.0
%WER 3.03 [ 15269 / 504653, 948 ins, 3260 del, 11061 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_9_0.0
The smaller (tdnn_250) model is targeted at embedded platforms like the Raspberry Pi 3, where it achieves near-realtime performance:
[bofh@donald py-kaldi-asr]$ python examples/ 
tdnn_250 loading model...
tdnn_250 loading model... done, took 23.394126s.
tdnn_250 creating decoder...
tdnn_250 creating decoder... done, took 14.411979s.
decoding data/dw961.wav...
 0.087s:  4000 frames ( 0.250s) decoded.
 0.400s:  8000 frames ( 0.500s) decoded.
 0.742s: 12000 frames ( 0.750s) decoded.
 1.021s: 16000 frames ( 1.000s) decoded.
 1.263s: 20000 frames ( 1.250s) decoded.
 1.497s: 24000 frames ( 1.500s) decoded.
 1.714s: 28000 frames ( 1.750s) decoded.
 1.992s: 32000 frames ( 2.000s) decoded.
 2.370s: 36000 frames ( 2.250s) decoded.
 2.642s: 40000 frames ( 2.500s) decoded.
 2.873s: 44000 frames ( 2.750s) decoded.
 3.112s: 48000 frames ( 3.000s) decoded.
 3.333s: 52000 frames ( 3.250s) decoded.
 3.668s: 56000 frames ( 3.500s) decoded.
 3.876s: 60000 frames ( 3.750s) decoded.
 4.092s: 64000 frames ( 4.000s) decoded.
 4.305s: 68000 frames ( 4.250s) decoded.
 4.517s: 72000 frames ( 4.500s) decoded.
 4.951s: 74000 frames ( 4.625s) decoded.
** data/dw961.wav
** i cannot follow you she said 
** tdnn_250 likelihood: 1.99656772614
tdnn_250 decoding took     4.95s
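To put a number on "near realtime", the real-time factor (decoding time divided by audio duration) can be read off the progress lines in the log above. The sketch below is only an illustration: the regular expression and the names are assumptions of mine, and it copies three of the log lines rather than reading a file. It also assumes that the "frames" in the log are audio samples, which is what the ratio of frames to seconds suggests.

```python
import re

# Matches progress lines such as:
#  4.951s: 74000 frames ( 4.625s) decoded.
PROGRESS = re.compile(r"^\s*([\d.]+)s:\s*(\d+) frames \(\s*([\d.]+)s\) decoded\.")

def parse_progress(lines):
    """Return (wall_clock_seconds, frames, audio_seconds) for each progress line."""
    points = []
    for line in lines:
        m = PROGRESS.match(line)
        if m:
            points.append((float(m.group(1)), int(m.group(2)), float(m.group(3))))
    return points

# Three lines copied from the log above; the last one covers the whole file.
log = [
    " 0.087s:  4000 frames ( 0.250s) decoded.",
    " 2.642s: 40000 frames ( 2.500s) decoded.",
    " 4.951s: 74000 frames ( 4.625s) decoded.",
]

points = parse_progress(log)
wall, frames, audio = points[-1]
rtf = wall / audio      # real-time factor; 1.0 means exactly realtime
rate = frames / audio   # frames per second of audio
print("RTF %.2f, %d frames per second of audio" % (rtf, rate))
```

For this file the factor comes out at about 1.07, so decoding takes slightly longer than the audio lasts. The model loading and decoder creation times (roughly 23 s and 14 s in the log) are one-off costs and are not included.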
The new models are available for download here:
And, as always, the scripts used to produce these models are available as free and open source software on my GitHub:

--- (Edited on 11/29/2017 6:04 pm [GMT-0600] by guenter) ---