Speech Recognition Engines

HTK training & Julius Decoding
User: enigma1987
Date: 3/15/2011 8:23 am
Views: 11502
Rating: 3

Hi all,

I downloaded the VoxForge English speech corpus and followed the VoxForge HTK training tutorial. I successfully obtained the acoustic model files (hmm15, tiedlist, stats).

Then I tried to decode a small test set with Julius, but the accuracy is only about 5%. With HVite, the accuracy is 10%.

With CMU Sphinx, I trained on the same training corpus (66 hours). Then, on the same test set (0.4 hours), I got 92% accuracy.

What is the main reason for that accuracy difference? Which part am I doing wrong?

Also, can anyone give me the decoding configuration parameters for Julius?

I used the following configuration files for training. Everything else is the same as in the VoxForge training tutorial.

The sampling rate of the training files is 16000 Hz, the same as for the test files.

I have also copied part of the Julius output at the end.

wav_config:

SOURCEFORMAT = WAV
TARGETKIND = MFCC_0
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12

******************

config:

SOURCEFORMAT = HTK
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
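
A note on units may help when comparing these HTK configs with the Julius options further down: HTK expresses TARGETRATE and WINDOWSIZE in units of 100 ns, whereas the commented-out Julius options -fsize and -fshift are given in samples. The following is a minimal Python sketch of that conversion, assuming the 16000 Hz sampling rate stated above; the function names are made up for illustration and are not part of HTK or Julius.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK time values are counted in 100 ns units


def htk_to_samples(htk_time, sample_rate_hz):
    """Convert an HTK time value (100 ns units) to a whole number of samples."""
    return round(htk_time * sample_rate_hz / HTK_UNITS_PER_SECOND)


def sample_period(sample_rate_hz):
    """Return the sampling period in 100 ns units for a given sampling rate."""
    return HTK_UNITS_PER_SECOND / sample_rate_hz


sample_rate = 16000                                  # Hz, as stated in the post
frame_shift = htk_to_samples(100000.0, sample_rate)  # TARGETRATE
frame_size = htk_to_samples(250000.0, sample_rate)   # WINDOWSIZE
period = sample_period(sample_rate)

print(frame_shift, frame_size, period)  # 160 400 625.0
```

Under these assumptions the HTK front end and the values in the jconf comments (a 400-sample window, a 160-sample shift and a sampling period of 625) describe the same framing.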

Julius configuration file:
# support ascii hmmdefs or binary format (converted by "mkbinhmm")
# format (ascii/binary) will be automatically detected
-h ./hmmdefs
## triphone model needs HMMList that maps logical triphone to physical ones.
-hlist ./tiedlist
## word insertion penalty
##
-penalty1 5.0        # first pass
-penalty2 20.0        # second pass
-d ./jlm
## do not give up startup on error words
##
-forcedict
-v ./dict
-m 5000
-iwcd1 avg    # assign average likelihood of the same context (default)
-gprune safe        # safe pruning, accurate but slow
-b 10000                # beam width on 1st pass (#nodes) for triphone,PTM,engine=v2.1
-b2 200                 # beam width on 2nd pass (#words)
-n 10                  #   (default for 'standard' configuration)
-spmodel "sp"        # HMM model name
-iwsp                # append a skippable sp model at all word ends
-iwsppenalty -70.0    # transition penalty for the appended sp models
#-input mfcfile         # MFCC file in HTK parameter file format
-input rawfile         # raw wavefile (auto-detect format)
                        # WAV(16bit) or
                        # RAW(16bit(signed short),mono,big-endian)
                        # AIFF,AU (with libsndfile extension)
            # other than 16kHz, sampling rate should be specified
            # by "-smpFreq" option
#-input mic             # direct microphone input
            # device name can be specified via env. val. "AUDIODEV"
#-input netaudio -NA host:0    # direct input from DatLink(NetAudio) host
#-input adinnet -adport portnum # via adinnet network client
#-input stdin        # from standard tty input (pipe)
-filelist wavlistNew    # specify file list to be recognized in batch mode
#-zmean            # enable DC offset removal (invalid for mfcfile input)
######################################################################
#### Recording
######################################################################
#-record directory    # auto-save recognized speech data into the dir
######################################################################
#### GMM-based Input Verification and Rejection
######################################################################
#-gmm gmmdefs        # specify GMM definition file in HTK format
#-gmmnum 10        # num of Gaussians to be computed per mixture
#-gmmreject "noise,laugh,cough" # list of GMM names to be rejected
######################################################################
#### Too Short Input Rejection
######################################################################
#-rejectshort 200    # reject input shorter than specified millisecond
######################################################################
#### Speech Detection
######################################################################
#-pausesegment        # turn on speech detection by level and zero-cross
-nopausesegment    # turn off speech detection by level and zero-cross
            # (default: on for mic or adinnet, off for file)
#-lv 1000        # threshold of input level (0-32767)
#-headmargin 500    # head margin of input segment (msec)
#-tailmargin 2000    # tail margin of input segment (msec)
#-zc 60            # threshold of number of zero-cross in a second
######################################################################
#### Acoustic Analysis
######################################################################
-smpFreq 16000        # sampling rate (Hz)
-smpPeriod 625        # sampling period in units of 100 ns (= 10000000 / smpFreq)
#-fsize 400        # window size (samples)
#-fshift 160        # frame shift (samples)
#-delwin 2        # delta window (frames)
#-hifreq 4000        # cut-off hi frequency (Hz) (-1: disable)
#-lofreq 10        # cut-off low frequency (Hz) (-1: disable)
#-cmnsave filename    # save CMN param to file (update per input)
#-cmnload filename    # load initial CMN param from file on startup
######################################################################
#### Spectral Subtraction (SS)
######################################################################
#-sscalc        # do SS using head silence (file input only)
#-sscalclen 300        # length of head silence for SS (msec)
#-ssload filename       # load constant noise spectrum from file for SS
#-ssalpha 2.0        # alpha coef. for SS
#-ssfloor 0.5        # spectral floor for SS
######################################################################
#### Forced alignment
######################################################################
#-walign        # do forced alignment with result per word
#-palign        # do forced alignment with result per phoneme
#-salign        # do forced alignment with result per HMM state
######################################################################
#### Word Confidence Scoring
######################################################################
#-cmalpha 0.05        # smoothing coef. alpha
######################################################################
#### Output
######################################################################
#-separatescore        # output language and acoustic score separately
-progout        # output partial result per a time interval
-proginterval 300    # time interval for "-progout" (msec)
#-quiet            # output minimal result
#-demo            # = "-progout -quiet", suitable for dictation demo
#-debug            # output full message for debug
#-charconv from to    # output character set conversion (see manual for
            # available code set name)
######################################################################
#### Server module mode
######################################################################
#-module        # Run Julius on "Server module mode"
#-module 5530        # (when using another port number for connection)
#-outcode WLPSC        # select output message toward module (WLPSCwlps)
######################################################################
#### Misc.
######################################################################
#-help            # output help and exit
#-setting        # output engine configuration and exit
#-C jconffile        # expand other jconf file in its place

#################################################################
-silhead "SENT-START"
-siltail "SENT-END"

 

********************

Part of the Julius output for the test set:

STAT: include config: sample-julius-conf
WARNING: m_chkparam: "-penalty1" only for grammar, ignored
WARNING: m_chkparam: "-penalty2" only for grammar, ignored
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: rdhmmdef: ascii format HMM definition
Stat: rdhmmdef: limit check passed
Stat: check_hmm_restriction: an HMM with several arcs from initial state found: "sp"
Stat: rdhmmdef: this HMM requires multipath handling at decoding
Stat: init_phmm: defined HMMs:  8696
Stat: init_phmm: loading ascii hmmlist
Stat: init_phmm: logical names:  9526 in HMMList
Stat: init_phmm: base phones:    44 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: making pseudo bi/mono-phone for IW-triphone
Stat: hmm_lookup: 1058 pseudo phones are added to logical HMM list
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default

Error: voca_load_htkdict: line 867: triphone "aa-f+l" not found
Error: voca_load_htkdict: the line content was: AWFULLY aa f l iy
(... many more of these errors ...)

Stat: init_ngram: mapping dictonary words to n-gram entries
Warning: ngram_lookup: "ABALON" not exist in N-gram, treat as unknown
Warning: ngram_lookup: "ABDOMINALS" not exist in N-gram, treat as unknown
Warning: ngram_lookup: "ABIES" not exist in N-gram, treat as unknown
Warning: ngram_lookup: "ABIOGENESIS" not exist in N-gram, treat as unknown
Warning: ngram_lookup: "ABOMINATE" not exist in N-gram, treat as unknown
Warning: ngram_lookup: "ABRADING" not exist in N-gram, treat as unknown
(... many more of these warnings ...)

 ### read waveform input
Stat: adin_file: input speechfile: an4test_clstk/419.wav
STAT: 36864 samples (2.30 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
pass1_best:
pass1_best: SENT-START
pass1_best: SENT-START SHEWED PIER
pass1_best: SENT-START KERRY BARED
pass1_best: SENT-START KERRY BARED WHETHER
pass1_best: SENT-START INDIANS APPETITE
pass1_best: SENT-START KERRY BARED PROPELLED
pass1_best: SENT-START KERRY BARED PROPELLED SENT-END
pass1_best: SENT-START KERRY BARED PROPELLED SENT-END
pass1_best_wordseq: SENT-START KERRY BARED PROPELLED SENT-END
pass1_best_phonemeseq: sil | k eh r iy | b eh r d | p r ax p eh l d | sil
pass1_best_score: -7916.125000
### Recognition: 2nd pass (RL heuristic best-first)
WARNING: IW-triphone for word head "sil-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "sil-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "sil-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "sil-er+t" not found, fallback to pseudo {er+t}
WARNING: IW-triphone for word head "ow-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "ow-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "iy-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "ow-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "aa-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "er-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "ow-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "ow-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "aa-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "ow-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "ow-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "er-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "ow-er+t" not found, fallback to pseudo {er+t}
WARNING: IW-triphone for word head "ow-er+t" not found, fallback to pseudo {er+t}
WARNING: IW-triphone for word head "iy-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "ow-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "iy-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "aa-er+t" not found, fallback to pseudo {er+t}
WARNING: IW-triphone for word head "aa-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "iy-er+t" not found, fallback to pseudo {er+t}
WARNING: IW-triphone for word head "ow-er+t" not found, fallback to pseudo {er+t}
WARNING: IW-triphone for word head "ow-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "er-er+t" not found, fallback to pseudo {er+t}
WARNING: IW-triphone for word head "er-ow+f" not found, fallback to pseudo {ow+f}
WARNING: IW-triphone for word head "sil-aa+sh" not found, fallback to pseudo {aa+sh}
WARNING: IW-triphone for word head "sil-er+f" not found, fallback to pseudo {er+f}
WARNING: 00 _default: hypothesis stack exhausted, terminate search now
STAT: 00 _default: 7 sentences have been found
STAT: 00 _default: 2042 generated, 1516 pushed, 685 nodes popped in 228
sentence1: SENT-START TEACHER BARED PROPELLE SENT-END
wseq1: SENT-START TEACHER BARED PROPELLE SENT-END
phseq1: sil | t iy ch er | b eh r d | p r ax p eh l | sil
cmscore1: 0.887 0.013 0.339 0.138 1.000
score1: -8005.568359
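
The `triphone "aa-f+l" not found` errors in the log above mean that a dictionary pronunciation expands to a word-internal triphone that has no entry in the tiedlist. A rough Python sketch for listing such entries ahead of decoding follows. It assumes a dictionary line has the form `WORD [OUTPUT] phone phone ...` with the bracketed field optional and free of spaces, and that the first column of each tiedlist line is the logical model name; the function names are hypothetical.

```python
def word_internal_triphones(phones):
    """Expand a phone sequence into HTK-style word-internal triphone names."""
    names = []
    last = len(phones) - 1
    for i, phone in enumerate(phones):
        left = f"{phones[i - 1]}-" if i > 0 else ""
        right = f"+{phones[i + 1]}" if i < last else ""
        names.append(f"{left}{phone}{right}")
    return names


def missing_triphones(dict_lines, tiedlist_lines):
    """Map each word to the triphones it needs that the tiedlist does not list."""
    known = {line.split()[0] for line in tiedlist_lines if line.strip()}
    missing = {}
    for line in dict_lines:
        # Drop the optional [OUTPUT] field (assumed to contain no spaces).
        fields = [f for f in line.split() if not f.startswith("[")]
        if len(fields) < 2:
            continue
        word, phones = fields[0], fields[1:]
        absent = [t for t in word_internal_triphones(phones) if t not in known]
        if absent:
            missing.setdefault(word, []).extend(absent)
    return missing


# Toy data modelled on the error in the log; real input would be read from
# the dict and tiedlist files.
dict_lines = ["AWFULLY [AWFULLY] aa f l iy"]
tiedlist_lines = ["aa+f", "f-l+iy", "l-iy"]
print(missing_triphones(dict_lines, tiedlist_lines))  # {'AWFULLY': ['aa-f+l']}
```

A gap like this may indicate that the decoding dictionary contains words or pronunciations the model list was not built for, although the thread itself does not confirm the cause.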

 

--- (Edited on 3/15/2011 8:23 am [GMT-0500] by enigma1987) ---

Re: HTK training & Julius Decoding
User: kmaclean
Date: 4/15/2011 9:54 am
Views: 133
Rating: 3

>Then I tried to decode a small test set with Julius, but the accuracy is only about 5%. With HVite, the accuracy is 10%.

It should be higher than that.

Is the speech you're recognizing at the same sampling rate as the speech you trained with?

Are you using grammar-based recognition or a statistical language model?

Are all the words you are trying to recognize in the same pronunciation dictionary as the one you trained with?

--- (Edited on 4/15/2011 10:54 am [GMT-0400] by kmaclean) ---

Re: HTK training & Julius Decoding
User: enigma1987
Date: 4/18/2011 4:49 am
Views: 279
Rating: 2

Hi,

Yes, the sampling rate is the same, I am using a statistical language model, and the words to be recognized are in the same pronunciation dictionary.

I found my error: it was the target kind. In my last experiment I used MFCC_0_D_N_Z with a vector size of 25 and got an accuracy of 68.54%.
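
The vector size of 25 follows from the target kind. As a sketch of the arithmetic in Python (the function name is made up, only the MFCC qualifiers _0, _E, _D, _A, _T, _N and _Z are handled, and NUMCEPS = 12 is taken from the configs in the first post):

```python
def feature_vector_size(target_kind, num_ceps=12):
    """Return the observation vector length implied by an HTK MFCC target kind.

    Assumption: _N removes each static energy-like term (_0 or _E) while
    keeping its delta, which is how MFCC_0_D_N_Z ends up with 25 components.
    """
    base, *qualifiers = target_kind.split("_")
    if base != "MFCC":
        raise ValueError(f"only MFCC target kinds are handled, got {base!r}")
    quals = set(qualifiers)

    energy_terms = ("0" in quals) + ("E" in quals)  # c0 and/or log energy
    static = num_ceps + energy_terms
    # One block of coefficients for the statics plus one per derivative order.
    blocks = 1 + ("D" in quals) + ("A" in quals) + ("T" in quals)
    size = static * blocks

    if "N" in quals:
        size -= energy_terms  # absolute energy suppressed
    # _Z (cepstral mean normalisation) does not change the vector length.
    return size


print(feature_vector_size("MFCC_0_D_A"))    # 39
print(feature_vector_size("MFCC_0_D_N_Z"))  # 25
```

So the config from the first post (MFCC_0_D_A) yields 39-dimensional vectors, while MFCC_0_D_N_Z yields 25. A model and a front end that disagree on this layout cannot score frames meaningfully, which is consistent with the jump in accuracy reported here.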

But this is still lower than what I got from CMU Sphinx. However, for this result I used an LM trained only on the training sentences, whereas with CMU Sphinx I used a much bigger English LM that I downloaded from the web.

For HTK training I followed only this tutorial, step by step: http://voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial

Am I supposed to do some more steps to get better accuracy?

One more thing: I downloaded Keith Vertanen's HTK acoustic model, tested it with my test set, and got 73.98% accuracy.

I ran all tests with the Julius decoder. Maybe there were some mistakes in my testing process?

Best regards,

Berker

--- (Edited on 4/18/2011 4:49 am [GMT-0500] by enigma1987) ---

Re: HTK training & Julius Decoding
User: kmaclean
Date: 4/25/2011 11:50 am
Views: 724
Rating: 4

>Am I supposed to do some more steps to get better accuracy?

The VoxForge tutorial creates acoustic models for grammar-based recognition - and works quite well for that.

I don't have much experience with dictation-based recognition.

Ken

--- (Edited on 4/25/2011 12:50 pm [GMT-0400] by kmaclean) ---

Re: HTK training & Julius Decoding
User: spy1
Date: 8/31/2011 6:33 am
Views: 3262
Rating: 3

Dear enigma1987,

I have a problem with the Julius decoder. Could you help me with your dict file and language models?

- Does your dict file include "SENT-START" and "SENT-END" entries?

- Do all entries in your dict file end with "sp"?

- How did you build your language models?

Thank you

--- (Edited on 8/31/2011 6:33 am [GMT-0500] by Visitor) ---

WARNING: IW-triphone for word head not found, fallback to pseudo
User: kmaclean
Date: 6/24/2015 11:34 am
Views: 601
Rating: 2

>WARNING: IW-triphone for word head not found, fallback to pseudo

Make sure Julius is set to run in multipath mode.

The VoxForge acoustic models are trained in multipath, so Julius needs to be configured to run in multipath as well (set it in your jconf file or on the Julius command line).

--- (Edited on 6/24/2015 12:34 pm [GMT-0400] by kmaclean) ---

error[+9999]: cannot find model for label 'sil'
User: Yasin Mhmd
Date: 3/2/2016 9:13 pm
Views: 1829
Rating: 1

Hi,
When I use the HDecode tool, I get the following error:
error[+9999]: cannot find model for label 'sil'

I run the HDecode command as follows:

HDecode -H hmm15/hmmdefs -H hmm15/macros -S test.scp -t 220.0 220.0 -C config -i recout.mlf -w bg_lm -p 0.0 -s 5.0 dict tiedlist

where bg_lm is the language model.

Thanks for your help.
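
The thread does not record a resolution for this error. One thing that could be checked, as a sketch only, is whether every label the decoder asks for resolves through the model list to an HMM actually defined in the MMF. The Python below assumes the usual HTK conventions that HMMs are introduced by `~h "name"` in hmmdefs and that each tiedlist line reads `logical [physical]`; the function names and the toy data are invented for illustration.

```python
import re


def defined_hmms(hmmdefs_text):
    """Collect the names of all HMMs defined by ~h macros in an MMF."""
    return set(re.findall(r'~h\s+"([^"]+)"', hmmdefs_text))


def unresolved_labels(labels, tiedlist_lines, hmmdefs_text):
    """Return the labels that do not map to an HMM defined in the MMF."""
    physical = defined_hmms(hmmdefs_text)
    logical_to_physical = {}
    for line in tiedlist_lines:
        fields = line.split()
        if fields:
            # "logical physical" or just "name" (logical == physical)
            logical_to_physical[fields[0]] = fields[-1]
    return [label for label in labels
            if logical_to_physical.get(label) not in physical]


# Toy data: 'sil' is requested but no ~h "sil" definition exists.
hmmdefs_text = '~h "sp"\n<BEGINHMM>\n<ENDHMM>\n~h "aa+f"\n<BEGINHMM>\n<ENDHMM>\n'
tiedlist_lines = ["sp", "aa+f", "f+aa aa+f", "sil"]
print(unresolved_labels(["sil", "sp", "f+aa"], tiedlist_lines, hmmdefs_text))
```

With real data, the text of hmm15/hmmdefs and the lines of tiedlist would be passed in, together with the set of phone labels used in dict. An empty result would suggest the cause lies elsewhere, for example in how the model set is passed on the command line.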

--- (Edited on 3/2/2016 9:13 pm [GMT-0600] by Visitor) ---
