VoxForge
Hi,
I'm trying to build a speech recognition system for hands-free speech, i.e. I'm using a high-performance microphone placed 1 to 4 meters from the speaker. Since my vocabulary is very limited (approx. 20 words), I decided to use an isolated word recognition system.
Now my questions are:
- How do I know which HMM prototype is best for my requirements (how many states should I use, do I need mixture density models, ...)? So far I have used 6-state HMMs (4 active states) without mixture densities!
- Which conditions make sense when recording the training data? (Should I train my models on speech recorded close to the microphone, or should I record under the same conditions in which I want to use the recognition system?)
I hope someone has some good advice and can maybe answer some of my questions!
Regards, Nick
--- (Edited on 4/6/2009 8:47 am [GMT-0500] by Visitor) ---
> How do I know which HMM prototype is best for my requirements (how many states should I use, do I need mixture density models, ...)? So far I have used 6-state HMMs (4 active states) without mixture densities!
Up to 12 states, depending on the word's length. It's even better to use a different number of states for different words: 7 for short ones, 12 for long ones. Train your models with:
HCompV -> HERest -> HVite (get labels) -> HInit -> HRest -> HERest
Increase the number of mixtures, but only if you have enough data (200-300 speakers). Create a test set and track the error rate.
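A minimal sketch of what "track the error rate" could look like for an isolated word recognizer. This is plain Python, not an HTK tool: `word_error_rate` and `best_mixture_count` are hypothetical helper names, the labels are assumed to be one word per test utterance, and the numbers in the usage lines are invented for illustration.

```python
# Hypothetical helpers, not part of HTK. Reference and recognised labels
# are plain lists of word strings, one entry per test utterance.
def word_error_rate(reference, recognised):
    """Fraction of test utterances whose recognised word is wrong."""
    if len(reference) != len(recognised):
        raise ValueError("need one recognised label per reference label")
    if not reference:
        raise ValueError("test set is empty")
    errors = sum(1 for ref, hyp in zip(reference, recognised) if ref != hyp)
    return errors / len(reference)


def best_mixture_count(error_by_mixtures):
    """Pick the smallest mixture count that reaches the lowest error rate."""
    lowest = min(error_by_mixtures.values())
    return min(m for m, err in error_by_mixtures.items() if err == lowest)


# Invented example data: four test utterances, one of them misrecognised.
reference = ["ja", "nein", "Stop", "neu"]
recognised = ["ja", "nein", "neu", "neu"]
rate = word_error_rate(reference, recognised)

# Invented error rates measured after each increase in mixture count.
chosen = best_mixture_count({1: 0.20, 2: 0.15, 4: 0.15, 8: 0.18})
```

Preferring the smallest mixture count among ties is a design choice here: with limited data, fewer parameters are less likely to overfit.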
> Which conditions make sense when recording the training data? (Should I train my models on speech recorded close to the microphone, or should I record under the same conditions in which I want to use the recognition system?)
The same conditions that will be used during recognition.
--- (Edited on 4/6/2009 7:46 pm [GMT-0500] by nsh) ---
Thanks for your good reply!
There is one thing which I don't understand: Why do you use HERest after HRest in your training "chain" as the last step?
Using HVite for labeling is a very good idea! I hadn't thought about that before! So far I've used my own tool to generate my label files. But maybe HVite is better!
And I have one last question: Do you know if HTK and Julius/Julian are level dependent? My knowledge of MFCC files tells me that they should not depend on the level, but I found that one can significantly increase performance by amplifying the audio files before feeding them into Julius (if the sound levels were very low before)! So maybe it is also better to amplify the sound files before using them for training with HTK (if the sound levels are very low)?
Thanks so far! Regards!
--- (Edited on 4/7/2009 3:04 am [GMT-0500] by Visitor) ---
> There is one thing which I don't understand: Why do you use HERest after HRest in your training "chain" as the last step?
Why not? It re-estimates all models with Baum-Welch (BW), thus giving you the accuracy you need. But HERest is not the last step for me; I use HMMIRest as the last step.
--- (Edited on 4/7/2009 3:20 am [GMT-0500] by nsh) ---
> And I have one last question: Do you know if HTK and Julius/Julian are level dependent? My knowledge of MFCC files tells me that they should not depend on the level, but I found that one can significantly increase performance by amplifying the audio files before feeding them into Julius (if the sound levels were very low before)! So maybe it is also better to amplify the sound files before using them for training with HTK (if the sound levels are very low)?
It may be an endpointer issue with silence detection. Basically it's just a bug. As a workaround, it makes sense to normalize the level.
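A minimal sketch of such a level normalization, assuming the audio is already available as a list of signed 16-bit PCM sample values. `normalize_peak` and its `target_peak` parameter are hypothetical names for illustration, not part of HTK or Julius, and simple peak normalization is only one of several ways to normalize level.

```python
def normalize_peak(samples, target_peak=0.9, full_scale=32767):
    """Scale 16-bit PCM samples so the loudest one sits at target_peak of full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # pure digital silence: nothing to scale
    gain = target_peak * full_scale / peak
    scaled = (int(round(s * gain)) for s in samples)
    # Clamp in case rounding pushes a value outside the 16-bit range.
    return [max(-full_scale - 1, min(full_scale, s)) for s in scaled]


# Invented example: a very quiet recording is brought up to 90% of full scale.
quiet = [100, -200, 50, 0]
louder = normalize_peak(quiet)
```

Applying the same normalization to both the training recordings and the audio fed to the recognizer keeps the two conditions matched.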
--- (Edited on 4/7/2009 3:22 am [GMT-0500] by nsh) ---
Unfortunately, I think that I can't use the VoxForge models, since I'm not trying to set up my speech recognition system on a phoneme basis! Or maybe I haven't found the right data on this page?
At first my system should just recognise some simple commands, like the numbers from 0 to 12 and the words "neu", "Termin", "falsch", "Stop", "ja" and "nein" (in German)!
Regards!
--- (Edited on 4/7/2009 3:24 am [GMT-0500] by Visitor) ---
Hi Nick,
I think there is an advantage in using phoneme HMMs rather than whole word HMMs. If you want to add a new word to a whole word system, you have to make a sufficient number of recordings, create a new model and train it. But if you have phoneme HMMs, all you need to do is concatenate the phoneme HMMs and you have a new word.
The downside to phoneme HMMs is that you need more training data than for whole word HMMs, but you can use the VoxForge recordings for that. If you want to improve accuracy you can make your own recordings and combine them with the VoxForge corpus.
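A rough sketch of the concatenation idea, in plain Python rather than HTK. Everything here is illustrative: the phoneme symbols in `PRONUNCIATIONS` are made up and not taken from any VoxForge dictionary, each phoneme model is reduced to the names of its emitting states, and three states per phoneme is an assumption.

```python
# Hypothetical pronunciation dictionary; the phoneme symbols are invented
# for illustration only.
PRONUNCIATIONS = {
    "ja": ["j", "a:"],
    "nein": ["n", "aI", "n"],
}


def phoneme_states(phoneme, states_per_phoneme=3):
    """Stand-in for a trained phoneme HMM: just the names of its emitting states."""
    return [f"{phoneme}_{i}" for i in range(1, states_per_phoneme + 1)]


def word_model(word, pronunciations=PRONUNCIATIONS):
    """Build a word-level state sequence by chaining the phoneme models."""
    states = []
    for phoneme in pronunciations[word]:
        states.extend(phoneme_states(phoneme))
    return states


# Adding a new word needs only a dictionary entry, no new recordings,
# as long as all of its phonemes already have trained models.
PRONUNCIATIONS["nee"] = ["n", "e:"]
new_word = word_model("nee")
```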
--- (Edited on 4/8/2009 2:49 am [GMT-0500] by tpavelka) ---
Hi,
I also think that it is more elegant to use phoneme HMMs, but as you said, one needs many more recordings! That's why I thought it would be better to keep my vocabulary small and use whole word HMMs! However, I'm still testing and thinking about what the smartest solution for my non-close-talk requirement will be!
Since I'm not using a close-talk microphone, I'm not sure whether it would be smart to use the VoxForge data, because my recordings have very strong reverberation. But maybe I should just try it out, or I should try to compute the room impulse response and then preprocess the VoxForge data with it... yeah, still lots of questions and challenging tasks!
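A minimal sketch of that preprocessing idea, assuming the room impulse response has already been measured and that both it and the clean recording are plain lists of sample values. `convolve` is a hypothetical helper and the numbers are invented; for real recordings an FFT-based convolution would be far faster than this direct form.

```python
def convolve(signal, impulse_response):
    """Direct-form discrete convolution: y[n] = sum over k of h[k] * x[n - k]."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


# Invented toy data: a single click and a two-tap "room" with one echo
# at half the amplitude.
clean = [1.0, 0.0, 0.0]
room = [1.0, 0.5]
reverberant = convolve(clean, room)
```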
@nsh: How many states do you use for your "sil" model?
Thanks again for all replies and discussions!
--- (Edited on 4/8/2009 5:37 am [GMT-0500] by Visitor) ---