VoxForge
Hi,
I am using SphinxTrain and sphinx3. I followed the Robust Group tutorial for the databases rm1 and an4, and it worked fine for both of them.
Now I am trying to train and test another database (Aurora2, containing only digits). I've got feature vectors computed with HTK, and I converted them to the format used by Sphinx (13 components per vector in big-endian, with a 4-byte int header containing the number of data points). I am quite sure that this conversion is working properly, because I get reasonable output with cepview.
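For reference, the conversion is roughly like the following sketch (my own minimal Python version, assuming the standard 12-byte HTK header and the big-endian Sphinx layout described above; the helper name htk_to_sphinx is mine):

```python
import struct

def htk_to_sphinx(htk_data: bytes) -> bytes:
    """Convert one big-endian HTK parameter file to Sphinx cepstral format.

    HTK header (12 bytes): nSamples (int32), sampPeriod (int32),
    sampSize (int16, bytes per frame), parmKind (int16).
    Sphinx header: a single int32 holding the total number of
    float values that follow.
    """
    n_samples, _period, samp_size, _kind = struct.unpack(">iihh", htk_data[:12])
    n_floats = n_samples * (samp_size // 4)        # 4 bytes per float32
    body = htk_data[12:12 + 4 * n_floats]          # the float32 vectors, unchanged
    return struct.pack(">i", n_floats) + body

# Synthetic example: 2 frames of 13 MFCC coefficients each.
frames = [float(i) for i in range(26)]
htk = struct.pack(">iihh", 2, 100000, 13 * 4, 6) + struct.pack(">26f", *frames)
sphinx = htk_to_sphinx(htk)
print(struct.unpack(">i", sphinx[:4])[0])   # header holds 26 data points
```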
As it is a small vocabulary, I don't use phonemes but simply the whole word in the dictionary:
EIGHT EIGHT
FIVE FIVE
FOUR FOUR
NINE NINE
OH OH
ONE ONE
SEVEN SEVEN
SIX SIX
THREE THREE
TWO TWO
ZERO ZERO
The training with SphinxTrain is working, but it produces strange results. After the last iteration in MODULE: 50 Training Context dependent models, there is a Current Overall Likelihood Per Frame = -116.562247763689. Is this a reasonable value? Compared to the likelihoods from an4, it looks quite strange to me.
In addition, I don't get any decoding results with sphinx3 (probably because the training already failed?), i.e. for every test file only "sil" is recognized.
Does anybody have a hint for me?
Thank you,
Brina
--- (Edited on 3/3/2009 3:51 am [GMT-0600] by brina) ---
Well, digits training is a bit different from the an4 tutorial. First of all, it's true you need another dictionary, but not the one you used. Unlike HTK, Sphinx decoders support only 3- and 5-state HMMs, which is not enough to model whole digits reliably. To overcome this issue, use the following hack in the dictionary:
eight EY_eight T_eight
five F_five AY_five V_five
four F_four OW_four R_four
nine N_nine AY_nine N_nine_2
oh OW_oh
one W_one AX_one N_one
seven S_seven EH_seven V_seven E_seven N_seven
six S_six I_six K_six S_six_2
three TH_three R_three II_three
two T_two OO_two
zero Z_zero II_zero R_zero OW_zero
Next, there is no sense in using a context-dependent model with a large number of senones. I suggest using 500 tied states. For a more precise setup, you can just look at SphinxTrain/templates/tidigits.
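In sphinx_train.cfg, that corresponds roughly to the following (the variable name follows the standard SphinxTrain templates; the value is a suggestion, not a verified setup):

```perl
# Number of tied states (senones) for decision-tree clustering
$CFG_N_TIED_STATES = 500;
```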
About Aurora, there are issues related to silence detection. I would recommend using a forced-alignment step to make sure the data is consistent.
--- (Edited on 3/3/2009 11:47 am [GMT-0600] by nsh) ---
Thank you for your fast reply!
I changed the dictionary and phone-file according to your suggestion.
Then I changed my sphinx_train.cfg as you suggested and set
$CFG_FORCEDALIGN = 'yes';
Now training runs with forced alignment and creates several output files (e.g. new transcription files with inserted "sil").
But unfortunately the Likelihood Per Frame still has this strange value of about -116 after each iteration (in CI and CD training). In fact I don't understand what this likelihood actually means. If it is log10(), it should never become a positive number (as it did in training with an4 and rm1). Otherwise the likelihood would be bigger than 1.
Nevertheless, I started a new run of decoding, and again almost only "sil" is detected (sometimes it even recognizes a single digit). Maybe there is something wrong with my language model file? As every digit can follow every digit, I only use unigrams:
\data\
ngram 1=14
ngram 2=1
\1-grams:
-1.0791 </s> -99.0000
-99.0000 <s> 0.0000
-1.0791 <sil> 0.0000
-1.0791 EIGHT 0.0000
-1.0791 FIVE 0.0000
-1.0791 FOUR 0.0000
-1.0791 NINE 0.0000
-1.0791 OH 0.0000
-1.0791 ONE 0.0000
-1.0791 SEVEN 0.0000
-1.0791 SIX 0.0000
-1.0791 THREE 0.0000
-1.0791 TWO 0.0000
-1.0791 ZERO 0.0000
\2-grams:
0.0000 <s> </s>
\end\
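As a sanity check on those numbers (my own arithmetic, not part of the thread): the -1.0791 entries are consistent with twelve equiprobable outcomes (the 11 digits plus the end-of-sentence token), since log10(1/12) ≈ -1.0792:

```python
import math

# Unigram log10 probability for 12 equiprobable outcomes,
# roughly matching the -1.0791 entries in the ARPA file above
logprob = math.log10(1.0 / 12.0)
print(round(logprob, 4))   # -1.0792
```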
Regards,
Brina
--- (Edited on 3/3/2009 2:53 pm [GMT-0600] by brina) ---
> But unfortunately the Likelihood Per Frame still has this strange value of about -116 after each iteration (in CI and CD training). In fact I don't understand what this likelihood actually means. If it is log10(), it should never become a positive number (as it did in training with an4 and rm1). Otherwise the likelihood would be bigger than 1.
Likelihood is an average of the logarithm of the Gaussian density, not a probability, so it's not normalized and can be either positive or negative. -116 is quite a strange value indeed.
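To illustrate the sign point (a small numerical sketch of my own, not SphinxTrain output): the log of a Gaussian density goes positive whenever the density exceeds 1, which happens for small variances:

```python
import math

def log10_gauss(x: float, mu: float, var: float) -> float:
    """log10 of the univariate Gaussian density N(x; mu, var)."""
    return (math.log10(1.0 / math.sqrt(2.0 * math.pi * var))
            - ((x - mu) ** 2 / (2.0 * var)) * math.log10(math.e))

print(log10_gauss(0.0, 0.0, 1.0))    # negative: peak density < 1
print(log10_gauss(0.0, 0.0, 0.01))   # positive: peak density > 1
```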
How did you extract the cepstra? Can you provide samples? Can you try Aurora 5, which is freely available, so that I can test your setup:
http://aurora.hsnr.de/download.html
Also, when you talk about decoding, are you decoding wav files or extracted features as in the tutorial? I'm not sure whether Aurora2 is at 8 kHz; did you adjust make_feats accordingly?
> As every digit can follow after every digit, I only use unigrams:
There is no need to have <sil>; it's a filler that is inserted automatically. Second, it's probably easier to use a JSGF grammar instead of an LM, but this LM also works fine.
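A digits grammar in JSGF would look roughly like this (a sketch; the grammar and rule names are arbitrary):

```jsgf
#JSGF V1.0;

grammar digits;

public <digits> = ( oh | zero | one | two | three | four
                  | five | six | seven | eight | nine )+;
```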
--- (Edited on 3/3/2009 4:14 pm [GMT-0600] by nsh) ---