VoxForge
Hi!
I am planning a specific project: a model that will recognize just one speaker.
Basically the software will be trained by me, and I will be the only person using it.
In such a case, what kind of training is needed?
Thanks in advance, Alex
--- (Edited on 10/12/2010 6:44 am [GMT-0500] by Visitor) ---
Hello Alex
Training a single-speaker model is no different from training a multi-speaker model. I recommend you train a model for CMUSphinx using SphinxTrain. See the training tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorialam
The only thing you need to take care of is setting the number of tied states appropriately for the amount of your data. If you have 2-3 hours of training data, 2000 tied states will work for you.
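In SphinxTrain the tied-state count lives in the generated training config. As a sketch (the config file path and surrounding settings depend on how you set up your training directory):

```
# etc/sphinx_train.cfg -- Perl-syntax config generated by SphinxTrain.
# Number of tied states (senones); 2000 suits roughly 2-3 hours of data,
# as suggested above. Too many for too little data leads to undertrained models.
$CFG_N_TIED_STATES = 2000;
```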
--- (Edited on 10/13/2010 16:32 [GMT+0400] by nsh) ---
Hi!
Actually, what I am asking is: if you are creating an application for a SINGLE user with a limited vocabulary (let's say 500 words), can the training work in such a way that:
1) You record EVERY word you want recognized later by saying it once during the training process.
2) The recognizer will be able to recognize only those words that were added to the database in step one.
Thanks in advance!
--- (Edited on 10/14/2010 10:26 am [GMT-0500] by Visitor) ---
> 1) Record EVERY word you want to be recognized
> later, by saying it once during the training process.
Heh, your poor user. Do you care about him? Do you think he needs to spend ages recording all 500 words you've invented for him? That's crazy. I certainly wouldn't want to be your user. FYI, for reliable training the number of samples of each context in the db needs to be around 50-100.
Common practice is to have the user read 2-3 paragraphs of interesting text and adapt a generic model to the user's speech. This way you can reach even superior accuracy and make the user's life enjoyable. Dragon NaturallySpeaking, for example, has users read the funny "Dogbert's Top Secret Management Handbook". It even has some jokes related to your question, like:
    If you don't know what to do, ask for the weekly report
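The adaptation workflow described above can be sketched with the CMUSphinx adaptation tools (`sphinx_fe`, `bw`, `map_adapt`). The model directory and file names below (`hub4wsj_sc_8k`, `adapt.fileids`, `adapt.transcription`) are placeholders; check the CMUSphinx adaptation tutorial for the authoritative flags for your model type:

```shell
# Sketch: MAP-adapt a generic acoustic model to one speaker's recordings.

# 1. Extract MFCC features from the user's recorded sentences
sphinx_fe -argfile hub4wsj_sc_8k/feat.params -samprate 16000 \
    -c adapt.fileids -di wav -do mfc -ei wav -eo mfc -mswav yes

# 2. Accumulate observation statistics from the adaptation data
bw -hmmdir hub4wsj_sc_8k -moddeffn hub4wsj_sc_8k/mdef \
    -ts2cbfn .semi. -feat 1s_c_d_dd -cmn current -agc none \
    -dictfn cmudict.dict -ctlfn adapt.fileids \
    -lsnfn adapt.transcription -cepdir mfc -accumdir bwaccum

# 3. MAP-update the model parameters with those statistics
map_adapt -moddeffn hub4wsj_sc_8k/mdef -ts2cbfn .semi. \
    -meanfn hub4wsj_sc_8k/means -varfn hub4wsj_sc_8k/variances \
    -mixwfn hub4wsj_sc_8k/mixture_weights \
    -tmatfn hub4wsj_sc_8k/transition_matrices \
    -accumdir bwaccum \
    -mapmeanfn adapted/means -mapvarfn adapted/variances \
    -mapmixwfn adapted/mixture_weights -maptmatfn adapted/transition_matrices
```

The point is that a couple of minutes of read speech feeds all three steps at once; the user never records isolated words.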
> 2) The recognizer will be able to recognize only those words that were added to the database in step one.
The recognizer will recognize the words specified in the language model. That's unrelated to the words used during training/adaptation.
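For a small fixed vocabulary like your 500 words, a JSGF grammar is the usual way to restrict what the decoder can output. The grammar name and words below are just illustrative:

```
#JSGF V1.0;

grammar commands;

// Only words listed here can ever be recognized, regardless of
// which words happened to appear in the training/adaptation data.
public <command> = hello | open | close | turn <device> on | turn <device> off;
<device> = light | radio;
```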
--- (Edited on 10/14/2010 19:43 [GMT+0400] by nsh) ---
hmm...
I didn't realize that the number of samples of each context in the db needs to be around 50-100...
Thank you for clarifying that.
My assumption was that since the program should recognize just one user, it would be enough to say each word once.
I thought that, for example, the same person saying the word "hello" would look similar (the waveform) every time they say it...
Am I wrong with this assumption?
Alex
--- (Edited on 10/14/2010 11:24 am [GMT-0500] by Visitor) ---
> Am I wrong with this assumption?
Yes, there are a thousand ways to say hello so that it sounds very different :)
Check this:
http://www.youtube.com/watch?v=99jVPJUeqr4
I wonder how many you'll count.
--- (Edited on 10/14/2010 20:36 [GMT+0400] by nsh) ---