VoxForge
After working on the German VoxForge model for some time, I have now applied the scripts I developed for those models to a combination of the English LibriSpeech and VoxForge corpora. The resulting models can be downloaded from:
http://goofy.zamia.org/voxforge/en/
The scripts I am using to build my models can be found on GitHub:
https://github.com/gooofy/speech
While I took a largely manual approach for the German models, I decided to try a more or less fully automated approach for the English ones, mostly because many speech model resources are available for English (whereas I had to start pretty much from scratch for German).
The lexicon is based on CMUdict, to which I added missing entries using sequitur g2p (trained on CMUdict).
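The first half of that step, finding which transcript words have no CMUdict entry, can be sketched in a few lines of Python; the resulting word list would then be handed to sequitur g2p for pronunciation generation. File names, formats and helper names below are illustrative assumptions, not the actual scripts from the repository:

```python
# Sketch: collect words from training transcripts that are missing from
# a CMUdict-style base lexicon, so they can be fed to a g2p tool.

def load_lexicon(path):
    """Read a CMUdict-style lexicon: 'WORD  PH ON EM ES', one entry per line."""
    words = set()
    with open(path, encoding="latin-1") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;;"):   # CMUdict comment lines
                continue
            word = line.split()[0]
            # strip alternate-pronunciation suffixes like WORD(2)
            words.add(word.split("(")[0].upper())
    return words

def missing_words(transcript_paths, lexicon_words):
    """Return the sorted set of transcript words with no lexicon entry."""
    missing = set()
    for path in transcript_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                for token in line.upper().split():
                    if token.isalpha() and token not in lexicon_words:
                        missing.add(token)
    return sorted(missing)
```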
The audio recordings consist of
I trained a first Kaldi nnet3 model on these recordings, then used it to decode all the recordings from the English VoxForge corpus and added to my corpus those recordings whose decoding results matched the transcripts. I have iterated this process once more (and plan further iterations in the future, along with manual reviews).
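The auto-review filter described above can be sketched as follows. `decode_fn` stands in for whatever wrapper drives the trained model, and the normalization is a deliberately simple assumption; the actual scripts in the repository may differ:

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^a-z0-9' ]", " ", text.lower())
    return " ".join(text.split())

def auto_review(utterances, decode_fn):
    """utterances: list of (wav_path, reference_transcript) pairs.
    decode_fn:  callable mapping a wav path to a hypothesis string,
                e.g. a wrapper around the trained kaldi model.
    Returns only those pairs whose hypothesis matches the reference."""
    accepted = []
    for wav, ref in utterances:
        hyp = decode_fn(wav)
        if normalize(hyp) == normalize(ref):
            accepted.append((wav, ref))
    return accepted
```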
Stats:
159373 lexicon entries.
total duration of all good submissions: 1038:59:40
Kaldi:
%WER 7.30 [ 36196 / 496128, 2226 ins, 16007 del, 17963 sub ] exp/nnet3/nnet_tdnn_a/decode/wer_8_0.0
CMU Sphinx models:
cmusphinx cont model: SENTENCE ERROR: 85.5% (12906/15093) WORD ERROR RATE: 18.0% (89407/496158)
cmusphinx ptm model: SENTENCE ERROR: 89.2% (13467/15093) WORD ERROR RATE: 24.2% (120169/496158)
sequitur g2p model:
total: 13147 strings, 99753 symbols
successfully translated: 13146 (99.99%) strings, 99746 (99.99%) symbols
string errors: 4881 (37.13%)
symbol errors: 9557 (9.58%)
insertions: 2190 (2.20%)
deletions: 2422 (2.43%)
substitutions: 4945 (4.96%)
translation failed: 1 (0.01%) strings, 7 (0.01%) symbols
total string errors: 4882 (37.13%)
total symbol errors: 9564 (9.59%)
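For reference, the bracketed numbers in the %WER lines above add up: for the nnet3 model, 2226 insertions + 16007 deletions + 17963 substitutions = 36196 errors over 496128 reference words, i.e. 7.30% WER. A minimal, unoptimized reimplementation of word error rate via Levenshtein distance, purely for illustration (Kaldi's compute-wer does the real work):

```python
def wer(ref, hyp):
    """Word error rate: (ins + del + sub) / number of reference words."""
    ref, hyp = ref.split(), hyp.split()
    R, H = len(ref), len(hyp)
    # dp[i][j] = minimal edit cost aligning ref[:i] to hyp[:j]
    dp = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        dp[i][0] = i                     # i deletions
    for j in range(1, H + 1):
        dp[0][j] = j                     # j insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = dp[i-1][j-1] + (ref[i-1] != hyp[j-1])
            dp[i][j] = min(sub,          # substitution or match
                           dp[i-1][j] + 1,   # deletion
                           dp[i][j-1] + 1)   # insertion
    return dp[R][H] / len(ref)
```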
--- (Edited on 10/30/2017 5:01 pm [GMT-0500] by guenter) ---
>I have now applied the scripts I developed for those models to a
>combination of the english librispeech and voxforge corpora
Very impressive!
Thank you,
ken
--- (Edited on 10/31/2017 8:44 am [GMT-0400] by kmaclean) ---
I have done another auto-review round using the latest model. This time I also upgraded to Kaldi 5.2 and used it to train TDNN chain models, with quite encouraging results:
%WER 2.48 [ 12525 / 504653, 737 ins, 2720 del, 9068 sub ] exp/nnet3_chain/tdnn_sp/decode_test/wer_10_0.0
%WER 3.03 [ 15269 / 504653, 948 ins, 3260 del, 11061 sub ] exp/nnet3_chain/tdnn_250/decode_test/wer_9_0.0
[bofh@donald py-kaldi-asr]$ python examples/chain_incremental.py
tdnn_250 loading model...
tdnn_250 loading model... done, took 23.394126s.
tdnn_250 creating decoder...
tdnn_250 creating decoder... done, took 14.411979s.
decoding data/dw961.wav...
 0.087s:  4000 frames ( 0.250s) decoded.
 0.400s:  8000 frames ( 0.500s) decoded.
 0.742s: 12000 frames ( 0.750s) decoded.
 1.021s: 16000 frames ( 1.000s) decoded.
 1.263s: 20000 frames ( 1.250s) decoded.
 1.497s: 24000 frames ( 1.500s) decoded.
 1.714s: 28000 frames ( 1.750s) decoded.
 1.992s: 32000 frames ( 2.000s) decoded.
 2.370s: 36000 frames ( 2.250s) decoded.
 2.642s: 40000 frames ( 2.500s) decoded.
 2.873s: 44000 frames ( 2.750s) decoded.
 3.112s: 48000 frames ( 3.000s) decoded.
 3.333s: 52000 frames ( 3.250s) decoded.
 3.668s: 56000 frames ( 3.500s) decoded.
 3.876s: 60000 frames ( 3.750s) decoded.
 4.092s: 64000 frames ( 4.000s) decoded.
 4.305s: 68000 frames ( 4.250s) decoded.
 4.517s: 72000 frames ( 4.500s) decoded.
 4.951s: 74000 frames ( 4.625s) decoded.
*****************************************************************
** data/dw961.wav
** i cannot follow you she said
** tdnn_250 likelihood: 1.99656772614
*****************************************************************
tdnn_250 decoding took 4.95s
--- (Edited on 11/29/2017 6:04 pm [GMT-0600] by guenter) ---
Thanks for your work. I've got a question, though: the directory voxforge.cd_cont_6000 already contains means/mdef files, so I tried to use it directly with PocketSphinx via the -hmm parameter, and it seems to be working, although loading the means and variances takes quite some time (on a Raspberry Pi). Is there any way I could speed up the process, or am I doing something completely wrong?
Michaela
--- (Edited on 5/5/2018 4:55 pm [GMT-0500] by ) ---
Hi Michaela,
Currently we are quite focused on Kaldi, and we do have models tuned for use on the RPi3. For now this is work in progress, but we are planning to make an "official" announcement soon.
If you want to give it a try, here are Raspbian packages of Kaldi, the models, and the Python wrapper:
http://goofy.zamia.org/raspbian-ai/
Once you have them installed, you can use a Python script to try them (this one is adapted from the example scripts that come with https://github.com/gooofy/py-kaldi-asr ):
import sys
import os
import wave
import struct
import numpy as np

from time import time

from kaldiasr.nnet3 import KaldiNNet3OnlineModel, KaldiNNet3OnlineDecoder

# this is useful for benchmarking purposes
NUM_DECODER_RUNS = 1

MODELDIR = '/opt/kaldi/model/kaldi-chain-voxforge-de'
MODEL    = 'tdnn_250'
WAVFILE  = 'data/gsp1.wav'

print '%s loading model...' % MODEL
time_start = time()
kaldi_model = KaldiNNet3OnlineModel (MODELDIR, MODEL, acoustic_scale=1.0, beam=7.0, frame_subsampling_factor=3)
print '%s loading model... done, took %fs.' % (MODEL, time()-time_start)

print '%s creating decoder...' % MODEL
time_start = time()
decoder = KaldiNNet3OnlineDecoder (kaldi_model)
print '%s creating decoder... done, took %fs.' % (MODEL, time()-time_start)

for i in range(NUM_DECODER_RUNS):

    time_start = time()
    print 'decoding %s...' % WAVFILE

    wavf = wave.open(WAVFILE, 'rb')

    # check format
    assert wavf.getnchannels()==1
    assert wavf.getsampwidth()==2

    # process file in 250ms chunks
    chunk_frames = 250 * wavf.getframerate() / 1000
    tot_frames   = wavf.getnframes()

    num_frames = 0
    while num_frames < tot_frames:

        finalize = False
        if (num_frames + chunk_frames) < tot_frames:
            nframes = chunk_frames
        else:
            nframes = tot_frames - num_frames
            finalize = True

        frames = wavf.readframes(nframes)
        num_frames += nframes
        samples = struct.unpack_from('<%dh' % nframes, frames)

        decoder.decode(wavf.getframerate(), np.array(samples, dtype=np.float32), finalize)

        s, l = decoder.get_decoded_string()
        print "%6.3fs: %5d frames (%6.3fs) decoded. %s" % (time()-time_start, num_frames, float(num_frames) / float(wavf.getframerate()), s)

    wavf.close()

    s, l = decoder.get_decoded_string()

    print
    print "*****************************************************************"
    print "**", WAVFILE
    print "**", s
    print "** %s likelihood:" % MODEL, l
    print "*****************************************************************"
    print
    print "%s decoding took %8.2fs" % (MODEL, time() - time_start)
--- (Edited on 5/8/2018 2:17 pm [GMT-0500] by guenter) ---