Speech Recognition Engines

Strange Overall Likelihood Per Frame with SphinxTrain
User: brina
Date: 3/3/2009 3:51 am
Views: 5525
Rating: 14

Hi,

I am using SphinxTrain and sphinx3. I followed the Robust Group tutorial for the rm1 and an4 databases, and it worked fine for both of them.

Now I am trying to train and test on another database (Aurora2, which contains only digits). I have feature vectors computed with HTK, and I converted them to the format used by Sphinx (13 components per vector, big-endian, with a 4-byte int header containing the number of data points). I am quite sure that this conversion works properly, because I get reasonable output with cepview.
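
For readers who want to check such a conversion, here is a minimal sketch of the file layout described above. It assumes the 4-byte header counts the total number of float values (frames × 13) and that the values are 32-bit floats; both are assumptions, not something confirmed in this thread. The function names are made up for illustration.

```python
import struct

N_CEP = 13  # components per vector, as stated in the post


def write_sphinx_mfc(path, frames):
    """Write frames (lists of N_CEP floats) as big-endian float32 values.

    Assumption: the int32 header holds the total number of float values.
    """
    values = [v for frame in frames for v in frame]
    with open(path, "wb") as f:
        f.write(struct.pack(">i", len(values)))
        f.write(struct.pack(">%df" % len(values), *values))


def read_sphinx_mfc(path):
    """Read the file back and check that the header matches the file size."""
    with open(path, "rb") as f:
        data = f.read()
    (n_values,) = struct.unpack(">i", data[:4])
    if 4 + 4 * n_values != len(data):
        raise ValueError("header count does not match file size")
    if n_values % N_CEP != 0:
        raise ValueError("value count is not a multiple of %d" % N_CEP)
    values = struct.unpack(">%df" % n_values, data[4:])
    return [list(values[i:i + N_CEP]) for i in range(0, n_values, N_CEP)]
```

The size check in `read_sphinx_mfc` catches the most common conversion mistakes (wrong byte order in the header, or a header that counts frames instead of values) before training starts.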

As it is a small vocabulary, I don't use phonemes but simply the whole word in the dictionary:
EIGHT    EIGHT
FIVE    FIVE
FOUR    FOUR
NINE    NINE
OH    OH
ONE    ONE
SEVEN    SEVEN
SIX    SIX
THREE    THREE
TWO    TWO
ZERO    ZERO

The training with SphinxTrain runs, but it produces strange results. After the last iteration of "MODULE: 50 Training Context dependent models", it reports Current Overall Likelihood Per Frame = -116.562247763689. Is this a reasonable value? Compared to the likelihoods from "an1", it looks quite strange to me.

In addition, I don't get any decoding results with sphinx3 (probably because the training already failed?): for every test file only "sil" is recognized.

Does anybody have a hint for me?

Thank you,
Brina

--- (Edited on 3/3/2009 3:51 am [GMT-0600] by brina) ---

Re: Strange Overall Likelihood Per Frame with SphinxTrain
User: nsh
Date: 3/3/2009 11:47 am
Views: 97
Rating: 13

Well, digits training is a bit different from the an4 tutorial. First of all, it's true that you need another dictionary, but not the one you used. Unlike HTK, the Sphinx decoders support only 3- and 5-state HMMs, which is not enough to model numbers reliably. To overcome this issue, use the following hack on the dictionary:

eight                EY_eight T_eight
five                 F_five AY_five V_five
four                 F_four OW_four R_four
nine                 N_nine AY_nine N_nine_2
oh                   OW_oh
one                  W_one AX_one N_one
seven                S_seven EH_seven V_seven E_seven N_seven
six                  S_six I_six K_six S_six_2
three                TH_three R_three II_three
two                  T_two OO_two
zero                 Z_zero II_zero R_zero OW_zero
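
Since the phone list has to match this dictionary, here is a small sketch that derives it from the entries above. The filler phone name `SIL` and the function name are assumptions for illustration; check your own setup for the exact filler names it expects.

```python
DICTIONARY = """\
eight  EY_eight T_eight
five   F_five AY_five V_five
four   F_four OW_four R_four
nine   N_nine AY_nine N_nine_2
oh     OW_oh
one    W_one AX_one N_one
seven  S_seven EH_seven V_seven E_seven N_seven
six    S_six I_six K_six S_six_2
three  TH_three R_three II_three
two    T_two OO_two
zero   Z_zero II_zero R_zero OW_zero
"""


def phones_from_dictionary(text, filler="SIL"):
    """Collect the unique phones used in the dictionary, plus one filler phone."""
    phones = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        phones.update(fields[1:])  # first field is the word itself
    phones.add(filler)  # assumed silence/filler phone name
    return sorted(phones)


phone_list = phones_from_dictionary(DICTIONARY)
print("\n".join(phone_list))
```

Every phone is word-specific, so each digit effectively gets its own chain of short HMMs, which is the point of the hack.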


Next, it makes no sense to use a context-dependent model with a large number of senones. I suggest using 500 tied states. Actually, for a more precise setup you can just look in SphinxTrain/templates/tidigits.


As for Aurora, there are issues related to silence detection. I would recommend using the forced alignment step to make sure the data is consistent.

--- (Edited on 3/3/2009 11:47 am [GMT-0600] by nsh) ---

Re: Strange Overall Likelihood Per Frame with SphinxTrain
User: brina
Date: 3/3/2009 2:53 pm
Views: 82
Rating: 12

Thank you for your fast reply!


I changed the dictionary and phone-file according to your suggestion.

Then I changed my sphinx_train.cfg as you suggested and set

$CFG_FORCEDALIGN = 'yes';

Now training runs with forced alignment and creates several output files (e.g. new transcription files with inserted "sil").

Unfortunately, the Likelihood Per Frame still has this strange value of about -116 after each iteration (in both CI and CD training). In fact, I don't understand what this likelihood actually means. If it is log10() of a probability, it should never become a positive number (as it did in training with an4 and rm1); otherwise the likelihood would be bigger than 1.


Nevertheless, I started a new decoding run, and again almost only "sil" is detected (sometimes it even recognizes one single digit). Maybe there is something wrong with my language model file? As every digit can follow every digit, I only use unigrams:

\data\
ngram 1=14
ngram 2=1

\1-grams:
-1.0791 </s> -99.0000
-99.0000 <s> 0.0000
-1.0791 <sil> 0.0000
-1.0791 EIGHT 0.0000
-1.0791 FIVE 0.0000
-1.0791 FOUR 0.0000
-1.0791 NINE 0.0000
-1.0791 OH 0.0000
-1.0791 ONE 0.0000
-1.0791 SEVEN 0.0000
-1.0791 SIX 0.0000
-1.0791 THREE 0.0000
-1.0791 TWO 0.0000
-1.0791 ZERO 0.0000

\2-grams:
0.0000 <s> </s>
\end\

Regards,

Brina

--- (Edited on 3/3/2009 2:53 pm [GMT-0600] by brina) ---

Re: Strange Overall Likelihood Per Frame with SphinxTrain
User: nsh
Date: 3/3/2009 4:14 pm
Views: 1816
Rating: 11

> But unfortunately  the Likelihood Per Frame still has this strange value of about -116 after each iteration (in ci- and cd- training). In fact I don't unterstand, what this likelihood actually means. If it is log10(), it should never become a positive number (as it did in training with an4 and rm1). Otherwise the likelihood is bigger than 1.


The likelihood is an average of the logarithm of the Gaussian density, not of a probability, so it is not normalized and can be positive as well as negative. -116 is quite a strange value indeed.
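
To make the point about densities concrete, here is a small sketch that evaluates the log density of a 13-dimensional diagonal Gaussian at its mean. It uses the natural logarithm and made-up variances purely for illustration; it does not reproduce SphinxTrain's own computation or log base.

```python
import math


def diag_gaussian_log_density(x, mean, var):
    """Natural log of a diagonal-covariance Gaussian density evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


dim = 13
mean = [0.0] * dim
x = [0.0] * dim

# Small variances give a sharply peaked density whose value exceeds 1.
narrow = diag_gaussian_log_density(x, mean, [0.01] * dim)
# Large variances spread the density out, so its value is far below 1.
wide = diag_gaussian_log_density(x, mean, [100.0] * dim)

print("narrow: %.2f, wide: %.2f" % (narrow, wide))
```

The first value is positive and the second is negative, so the sign of the per-frame figure depends on the scale of the features and the fitted variances, not on whether training succeeded.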

How did you extract the cepstra? Can you provide samples? Can you try Aurora 5, which is available to me, so that I can test your setup:

http://aurora.hsnr.de/download.html

Also, when you talk about decoding, are you decoding wav files or extracted features as in the tutorial? I'm not sure whether Aurora2 is at 8 kHz; did you adjust make_feats accordingly?

> As every digit can follow after  every digit, I only use unigrams:

There is no need to have <sil>; it's a filler that is inserted automatically. Second, it's probably easier to use JSGF instead of an LM, but this LM also works fine.
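
As a rough sketch, the same uniform LM without the <sil> entry could be generated like this. It simply mirrors the layout of the file posted above (including the single placeholder bigram and the backoff weights); whether sphinx3 needs that bigram is not established in this thread, and the function name is made up for illustration.

```python
import math

DIGITS = ["EIGHT", "FIVE", "FOUR", "NINE", "OH", "ONE",
          "SEVEN", "SIX", "THREE", "TWO", "ZERO"]


def uniform_unigram_arpa(words):
    """Return ARPA text giving each word and </s> the same unigram probability."""
    logp = math.log10(1.0 / (len(words) + 1))  # words plus </s> share the mass
    lines = [
        "\\data\\",
        "ngram 1=%d" % (len(words) + 2),  # words plus <s> and </s>
        "ngram 2=1",
        "",
        "\\1-grams:",
        "%.4f </s> -99.0000" % logp,
        "-99.0000 <s> 0.0000",
    ]
    lines += ["%.4f %s 0.0000" % (logp, w) for w in words]
    lines += ["", "\\2-grams:", "0.0000 <s> </s>", "\\end\\"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(uniform_unigram_arpa(DIGITS), end="")
```

With 11 digits plus </s>, each entry gets log10(1/12), which rounds to -1.0792; the posted file shows -1.0791, so the last digit may differ depending on rounding.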

 

--- (Edited on 3/3/2009 4:14 pm [GMT-0600] by nsh) ---
