Hi tpavelka,
>With grammar based recognition I would expect that the results from HTK
>and Julius are exactly the same (even up to the acoustic scores). They
>may differ in recognition speed though.
Very interesting... that would explain a lot... I just followed the HTK Tutorial, and "Step 11 - Recognising the Test Data" says to use HVite with the monophone dict file created in "Step 2 - Dictionary". I never got much past the tutorial when reading the HTK manual... I'm a Perl hacker, not a speech rec specialist like you and nsh, and got lost rather quickly :)
Another thought with respect to the problem you have been having with the VoxForge corpus: because noisy and clean speech are intermixed in the corpus, I am wondering if your poor results might be due to the silence model being improperly trained. What I am getting at is that when the VoxForge AM is trained on a corpus that includes noisy submissions (with line hum/noise/scratching/pops...), HTK will try to identify the phonemes, and anything that is not a phoneme gets identified as sil. If the submission is noisy, HTK will try to incorporate the noise data into sil, causing mis-recognitions later.
One way to address this might be to flag the entire submission as noisy and programmatically add a new phoneme for "noisy silence" (nsil), so that a transcription that would normally read "Bob walked around the tree" would, in a noisy submission, look like "Bob nsil walked nsil around nsil the nsil tree". That way only "quiet" submissions would be used for training the "sil" model. But we would still need some way to disable "sil" training for the noisy submissions (even if we add nsil between each word, HTK will think this is a regular phoneme and still try to insert a silence after each phone, even after the "nsil").
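For what it's worth, the transcription rewrite could be as simple as something like this (a rough sketch, not working VoxForge code; the prompt format and the utterance ID below are just illustrative):

# Rough sketch: rewrite a prompt line from a submission flagged as noisy
# so that "nsil" appears between the words. Assumes a prompts-style format
# of "<utterance-id> WORD1 WORD2 ..." (the ID below is made up).
def add_nsil(prompt_line):
    parts = prompt_line.split()
    utt_id, words = parts[0], parts[1:]
    return utt_id + " " + " nsil ".join(words)

print(add_nsil("speaker-20090313-xyz/a0001 BOB WALKED AROUND THE TREE"))
# -> speaker-20090313-xyz/a0001 BOB nsil WALKED nsil AROUND nsil THE nsil TREE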
Another way would be to select a clean subset of the corpus, train the "sil" model on it, and use it to replace the current sil model in the VoxForge acoustic model. This would be the best way to test this theory.
I guess this is why most people would simply create two separate acoustic models: one for clean speech and one for noisy speech.
Any thoughts?
thanks,
Ken
--- (Edited on 3/13/2009 9:23 pm [GMT-0400] by kmaclean) ---
Hi,
I have checked the tutorial and now see that I was wrong: HTK actually supports automatic expansion of monophones into triphones. But there is a problem if you have both monophones and triphones in the model set. According to the tutorial, you need to set the following configuration variables:
FORCECXTEXP = T
ALLOWXWRDEXP = F
Add these two lines to this file:
http://www.dev.voxforge.org/projects/Main/browser/Trunk/Scripts/Testing_scripts/NightlyTest/wav_config
and see if the results on the HTK test increase.
As for the silence model, I do not think that is the problem. In both training and testing, silence can only be at the beginning or end of the utterance. The only way for silence to cause problems is if it "eats up" some of the phonemes at the beginning and end of the utterance. This would lead to word errors at the beginnings and endings of the utterances, which I do not see (judging by the test results; I did not run any statistical tests to check this).
If we were talking about live recognition that would be a different story because there the silence model is used to decide when to end the recording. But that is not the case in the tests we are doing right now.
As for the noisy silence model between words, you are right that it would be treated just like any other phoneme and would require the silence to actually be present between the words (in my experience this is rarely the case).
I have read somewhere on these forums (can't find it now) that if you have too much noise in the data, the trained Gaussians have too large a variance, which decreases the resulting accuracy. What I am planning to do now is to check whether my VoxForge models have visibly larger variance than the models trained on my Czech corpus (on which I get much higher accuracy even though it is quite small). If that is the case, we can try to figure out where this variance comes from.
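If it helps, here is roughly how I plan to eyeball the variances (just a sketch; it assumes the models are in HTK's text MMF format, where each <VARIANCE> tag is followed by a line of floats, and the file names below are made up):

def mean_variance(mmf_path):
    # Average over all variance components found in a text-format MMF (sketch).
    values = []
    with open(mmf_path) as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("<VARIANCE>"):
            # in the usual layout the vector itself is on the following line
            values.extend(float(x) for x in lines[i + 1].split())
    return sum(values) / len(values)

print("VoxForge:", mean_variance("voxforge_hmmdefs"))  # hypothetical file names
print("Czech:   ", mean_variance("czech_hmmdefs"))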
I have done another experiment to try to increase my accuracy: since PLP is said to work better on noisy data than MFCC, I have tried to train acoustic models on PLP-parametrized speech. Here are the results:
WORD: %Corr=54.95, Acc=49.01 [H=611, D=100, S=401, I=66, N=1112]
If you compare it with what I have written earlier, it is pretty much the same result... no luck here.
On a side note: while doing the training I found out that if you feed HTK's HCompV the 40k+ files of VoxForge, you can get negative variances in the initial Gaussian estimates. Figuring this out, asking a guy in the mathematics department to design a more numerically stable algorithm for variance computation, and writing my own version took me about two days. With a corpus of this size everything takes much more time than I was used to...
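In case anyone runs into the same thing: the usual textbook fix is a one-pass online update (Welford's algorithm) instead of the naive sum-of-squares formula, which cancels catastrophically when the values are large and nearly identical. A minimal sketch, just to illustrate the idea (not necessarily the exact algorithm I ended up with):

def online_mean_variance(samples):
    # Welford's online algorithm: numerically stable mean/variance in one pass.
    n, mean, m2 = 0, 0.0, 0.0  # m2 = sum of squared deviations from the current mean
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / n if n else 0.0  # cannot go negative, unlike E[x^2] - E[x]^2
    return mean, variance

print(online_mean_variance([1e8 + 1.0, 1e8 + 2.0, 1e8 + 3.0]))  # mean 100000002.0, variance ~0.667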
--- (Edited on 3/17/2009 8:47 am [GMT-0500] by tpavelka) ---
Hi tpavelka,
>FORCECXTEXP = T
>ALLOWXWRDEXP = F
>add these two lines to this:
>http://www.dev.voxforge.org/projects/Main/browser/Trunk/Scripts/Testing_scripts/NightlyTest/wav_config
>and see if the results on the HTK test increase.
Done... it will run tonight
>What I am planning to do now is to check whether my VoxForge models
>have visibly larger variance than the models trained on my Czech corpus
>(on which have much larger accuracy even though it is quite small). If that
>is the case we can try to figure out where does this variance come from.
Cool, thank you very much for your help on this!
>With a corpus of this size everything takes much more time than what I was used to...
There is a resource for splitting up acoustic model training that you might be interested in: Parallel Processing of the HTK Commands. It might speed things up if you have a multi-core computer; a rough sketch of the idea is below.
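The usual trick, as I understand it, is HERest's parallel mode: split the training list into chunks, run one HERest per core with -p <n> so that each job only dumps an accumulator file, then combine the accumulators in a final -p 0 pass. A rough Python sketch of that flow (all file names and the extra options are illustrative, so check them against your setup):

import subprocess
from multiprocessing import cpu_count

def parallel_herest(scp_lines, hmmdefs, out_dir, config, mlf, phone_list):
    # Sketch of HERest's accumulator-based parallel re-estimation.
    n = cpu_count()
    chunks = [scp_lines[i::n] for i in range(n)]
    jobs = []
    for k, chunk in enumerate(chunks, start=1):
        scp = "train.%d.scp" % k
        with open(scp, "w") as f:
            f.writelines(chunk)
        # each job writes an accumulator (HER<k>.acc) into out_dir instead of new models
        jobs.append(subprocess.Popen(
            ["HERest", "-C", config, "-I", mlf, "-S", scp,
             "-H", hmmdefs, "-M", out_dir, "-p", str(k), phone_list]))
    for j in jobs:
        j.wait()
    # final pass: -p 0 merges the accumulators and writes the re-estimated models
    accs = ["%s/HER%d.acc" % (out_dir, k) for k in range(1, n + 1)]
    subprocess.check_call(
        ["HERest", "-C", config, "-I", mlf,
         "-H", hmmdefs, "-M", out_dir, "-p", "0", phone_list] + accs)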
Ken
--- (Edited on 3/17/2009 10:27 am [GMT-0400] by kmaclean) ---
Hi tpavelka,
>According to the tutorial, you need to set the following configuration variables:
>FORCECXTEXP = T
>ALLOWXWRDEXP = F
New test results are in, and HTK is looking much better:
Testing Acoustic Models created in: /data/svn-mirror/Nightly_Builds/AcousticModel-2009-03-18
HTK 16kHz_16bit
---------------
Parameters:
word insertion penalty: 0.0
grammar scale factor: 1.0
====================== Results Analysis =======================
Date: Wed Mar 18 13:59:42 2009
Ref : testref.mlf
Rec : recout.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=58.00 [H=29, S=21, N=50]
WORD: %Corr=98.41, Acc=76.72 [H=186, D=0, S=3, I=41, N=189]
===================================================================
Whereas Monday night's run was (note that 40+ new submissions were added last night):
Testing Acoustic Models created in: /data/svn-mirror/Nightly_Builds/AcousticModel-2009-03-17
HTK 16kHz_16bit
---------------
Parameters:
word insertion penalty: 0.0
grammar scale factor: 1.0
====================== Results Analysis =======================
Date: Tue Mar 17 05:42:27 2009
Ref : testref.mlf
Rec : recout.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=44.00 [H=22, S=28, N=50]
WORD: %Corr=88.89, Acc=58.73 [H=168, D=2, S=19, I=57, N=189]
===================================================================
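For what it's worth, the two word-level percentages follow directly from the counts in the brackets: %Corr = H/N and Acc = (H - I)/N (HResults counts insertions against accuracy but not against correctness). A quick check that just reproduces the numbers above:

# HResults word-level figures: %Corr = H / N, Acc = (H - I) / N
for label, (H, D, S, I, N) in {"2009-03-18": (186, 0, 3, 41, 189),
                               "2009-03-17": (168, 2, 19, 57, 189)}.items():
    print(label, "Corr=%.2f Acc=%.2f" % (100.0 * H / N, 100.0 * (H - I) / N))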
With some tweaking of parameters, we should be able to get something close to the Julius results.
Very cool!
thanks!
Ken
--- (Edited on 3/18/2009 6:19 pm [GMT-0400] by kmaclean) ---
Hi, thanks to this post by nsh I have tried the new HDecode and I think I have finally figured out what the problem was with all my experiments with language models.
When I ran HDecode the first time, recognition speed was like 8-40x RT and the results were pretty poor:
SENT: %Correct=9.09 [H=2, S=20, N=22]
WORD: %Corr=63.74, Acc=51.10 [H=116, D=4, S=62, I=23, N=182]
(first 20 sentences only as it takes ages)
But I have found this in the tutorial:
Note: the grammar scale factors used in this section, and the next section on discriminative training, are consistent with the values used in the previous tutorial sections. However for large vocabulary speech recognition systems grammar scale factors in the range 12-15 are commonly used.
After changing the scale factor to 15, the speed increased about tenfold and the resulting accuracy is:
SENT: %Correct=38.00 [H=38, S=62, N=100]
WORD: %Corr=83.99, Acc=80.59 [H=766, D=22, S=124, I=31, N=912]
I think this result is consistent with what nsh reported on the Sphinx test since he used several mixtures whereas I have only one. I think that if I add mixtures and tweak the parameters a bit I can get over 90%.
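In case it is not obvious why one number makes such a difference: during decoding the language model log probabilities are scaled before being combined with the acoustic log likelihoods, so with a scale of 1.0 the LM barely separates the hypotheses and far more of them survive the beam, which would also explain the slow decoding. Roughly (a simplified sketch of the standard score combination, not HDecode's actual internals):

def hypothesis_score(acoustic_logprob, lm_logprob, n_words,
                     grammar_scale=15.0, word_insertion_penalty=0.0):
    # Simplified log-domain combination used in HTK-style decoders (sketch).
    return (acoustic_logprob
            + grammar_scale * lm_logprob
            + word_insertion_penalty * n_words)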
@Ken: I did the test with my own acoustic models. To do the tests on the official VoxForge model, I need the file with the decision trees that were created during clustering. Unfortunately, it is not in the distribution archive and I was not able to find it anywhere else. I think the decision trees should be included in the distribution so that people who download it can synthesize unseen triphones that are not present in the tiedlist.
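For reference, once the trees are available, the standard recipe from the HTK book is to load them in HHEd and let it synthesize the missing triphones, something like the sketch below (file names are made up, and the edit commands should be double-checked against the manual):

import subprocess

# Hypothetical file names; "trees" is the decision-tree file produced by the
# TB clustering step, which is exactly the file I am missing.
with open("mkunseen.hed", "w") as f:
    f.write('LT "trees"\n')         # load the decision trees
    f.write('AU "fulllist"\n')      # synthesize all unseen triphones in fulllist
    f.write('CO "tiedlist.new"\n')  # write out the compacted model list

subprocess.check_call(["HHEd", "-H", "macros", "-H", "hmmdefs",
                       "-M", "hmm-new", "mkunseen.hed", "tiedlist"])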
Tomas
--- (Edited on 3/19/2009 10:41 am [GMT-0500] by tpavelka) ---
Hi Tomas,
>To do the tests on the official VoxForge model I need the file with the
>decision trees that were created after clustering.
I set up a couple of new packages in the nightly build that include all the files used in the creation of the acoustic model:
XXX_HTK_AcousticModel-2009-03-19_16kHz_16bit_MFCC_O_D_devel.tgz 19-Mar-2009 20:37 38.0M
XXX_HTK_AcousticModel-2009-03-19_8kHz_16bit_MFCC_O_D_devel.tgz 19-Mar-2009 20:37 38.1M
They are a bit big to generate nightly, so I won't necessarily be running them every day... maybe with every new addition of speech submissions to the corpus.
It probably makes more sense to just include the decision trees in the distro, as you originally requested, but this should suit your needs for now.
Ken
--- (Edited on 3/20/2009 6:49 am [GMT-0400] by kmaclean) ---
Hi,
I have run the experiment with the 16kHz_16bit models and here are the results:
SENT: %Correct=77.00 [H=77, S=23, N=100]
WORD: %Corr=94.85, Acc=94.19 [H=865, D=9, S=38, I=6, N=912]
Actually, it is surprisingly high considering that it is a single-mixture HMM. Again, I don't know what I am doing wrong in my training ;-) One thing is that the models were trained on all the data, including the files used in testing, but I do not think that can explain the difference. With a corpus of this size I do not believe there was overtraining, which would explain high results on training data.
The whole experiment including my models can be downloaded here (it is kind of a mess though, I did not have time to clean it up...).
Tomas
--- (Edited on 3/20/2009 8:05 am [GMT-0500] by tpavelka) ---
Ok, here's another one: the whole testing set with the official acoustic model (this test with HDecode took almost 24 hours).
SENT: %Correct=70.72 [H=2111, S=874, N=2985]
WORD: %Corr=94.66, Acc=93.38 [H=26839, D=461, S=1053, I=362, N=28353]
Since we cannot be sure whether the good results are due to the presence of the testing data in the training process, I propose that these files (the whole list can be found here) be taken out of the training set. That way we'll have (at least until a proper testing corpus is created) a reference point with which anyone can compare their results.
Tomas
--- (Edited on 3/21/2009 8:02 am [GMT-0500] by tpavelka) ---
Hi Tomas,
Sorry for the delay in getting back to you, I was travelling all last week...
>Since we cannot be sure whether the good results are due to the
>presence of the testing data in the training process I propose that these
>files (the whole list can be found here) should be taken out of the training
>set.
The MasterPrompts file currently points to all the speech audio that is to be trained on in a given nightly build. Could we dynamically create a test corpus from the MasterPrompts file (say, every 100th prompt is used for testing), or does the training corpus need to be static to get better results? Or should we create a testing corpus from people whose submissions are not included in the training corpus (i.e. from people who have only submitted once or twice)?
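To make the idea concrete, the every-100th-prompt split could be something as simple as the sketch below (assuming MasterPrompts has one prompt per line; the output file names are made up):

def split_master_prompts(path, test_every=100):
    # Sketch: send every test_every-th prompt to the test set, the rest to training.
    train, test = [], []
    with open(path) as f:
        for i, line in enumerate(f):
            (test if i % test_every == 0 else train).append(line)
    with open("train_prompts", "w") as f:
        f.writelines(train)
    with open("test_prompts", "w") as f:
        f.writelines(test)

split_master_prompts("MasterPrompts")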
I am also thinking that for a release of the VoxForge acoustic model, we would train using the *entire* corpus (not just the training corpus) in order to ensure that it includes as much speech as possible - is this reasonable?
thanks,
Ken
--- (Edited on 3/30/2009 1:45 pm [GMT-0400] by kmaclean) ---