VoxForge
Hi,
I'm working on speech recognition software for speech improvement. We were previously using the Windows Speech API, but we decided to move to an open source solution.
Our system is quite simple: it should be able to recognize any word and give a recognition accuracy score for it. The whole program will be based on that score.
Technically, this means the text will be split into sentences, each sentence becoming a separate language model. The only words the recognizer should be able to recognize are the words in the current sentence.
The user reads the word, and the recognizer, having only one word to recognize, provides an accuracy score for it.
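To picture the per-word setup: assuming a recognizer that accepts JSGF grammars (as PocketSphinx does; Sphinx2's FSG format differs, but the idea is the same), each target word could be wrapped in a one-rule grammar. The function and grammar names below are made up for illustration:

```python
def make_grammar(word):
    """Build a minimal JSGF grammar that accepts exactly one word."""
    return (
        "#JSGF V1.0;\n"
        "grammar single_word;\n"
        f"public <target> = {word.lower()};\n"
    )

print(make_grammar("elephant"))
```

Each time the user moves to the next word, a new grammar like this would be loaded into the recognizer.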
The solution I'm using is FSG with Sphinx2. Unfortunately, I have a few issues:
1) The dictionary is the dictionary of the story, so I have to load it when we open the story. We cannot change the dictionary at runtime and therefore have to re-initialize the recognizer all the time.
2) I need to generate the dictionaries with the online LM Tool on the CMU website, because there are of course words not included in the basic English dictionary, and pronunciations have to be generated for them. Do you know if Sphinx offers that kind of tool offline?
3) Performance is poor; even with a restricted context, the recognizer doesn't provide an accurate score. I'm using the CMU acoustic model.
4) It crashes a lot...
I was thinking about moving to another recognizer, but I'm not sure I'll find my solution there. First of all, do you think I should move to PocketSphinx, and if so, why? Maybe the approach I'm using isn't good either.
My last question concerns the type of grammar I should use. Is an LM a lot more accurate and better implemented than FSG? Generating an LM for Sphinx is a pain, but if I really have to, I will move to an LM.
Thanks for your help,
It will be appreciated a lot,
Boris
--- (Edited on 11/16/2009 5:15 am [GMT-0600] by ) ---
Hi Boris,
> I'm working on a Speech Recognition software for speech
>improvement.
You might want to check out Ivela's approach (Internet Voice e-learning application using Julius for speech recognition) in this post: My Java Application, please enter and test!
> My last question concerns the type of grammar I should use. Is an LM a lot
> more accurate and better implemented than FSG? Generating an LM for
> Sphinx is a pain, but if I really have to, I will move to an LM.
If you want to recognize a small, predefined list of words or phrases, then a grammar-based recognizer is the way to go.
I don't know about Sphinx, but Julius can switch grammars on-the-fly.
Ken
--- (Edited on 11/16/2009 12:17 pm [GMT-0500] by kmaclean) ---
Thanks a lot Ken! I just had a look, unfortunately I cannot test their applet, the link seems broken.
I have a question concerning Julius. My main issue is the recognition of unknown words. A user should be able to copy-paste any kind of text, and the recognizer should be able to decode it. How does the recognizer handle that?
I also saw that they have a "SAPI" implementation, but they say it's Japanese-only. Have you heard of such a thing in English?
Thanks, Boris.
--- (Edited on 11/16/2009 12:06 pm [GMT-0600] by ) ---
> I just had a look, unfortunately I cannot test their applet, the link
>seems broken.
The source code is on the site.
>So my main issue is the recognition of unknown words.
The recognizer needs to know the words in advance, whether from a language model or a grammar file. A language model just has *a lot* of words in it, so it covers most words it might encounter.
>A user could be able to copy paste any kind of text and the recognizer
>will be able to decode it. So, how does the recognizer handle that?
It doesn't recognize words it does not know... this is the same with commercial speech recognition engines... it can guess, but that is not useful if the word or phrase is not in its language model.
>Have you heard of such a thing in English?
The VoxForge acoustic models are in English and can be used with HTK or Julius.
You need to study how to create acoustic models in HTK for use with the Julius speech recognition engine - the VoxForge tutorial is a good place to start :)
But before you switch to Julius, you should look at PocketSphinx in more detail, since it has a more mature acoustic model and you are already somewhat familiar with the Sphinx family of recognizers.
Ken
--- (Edited on 11/16/2009 1:35 pm [GMT-0500] by kmaclean) ---
Thanks for your help Ken.
A few clarifications: First of all, I need the largest acoustic model possible. As I said, the user should be able to pronounce any word. The VoxForge one looks great, but I'm not sure it is as good as the Wall Street Journal acoustic model provided by CMU; correct me if I'm wrong.
One other reason I need to port my app to another recognizer is that there is no way to get the VoxForge or WSJ models compiled for Sphinx2. Maybe I could do it with SphinxTrain, but I'm not sure.
Concerning the out-of-vocabulary problem, this is something I really need to solve. As I told you, the user should be able to read any word, even "testwordtest". If this word is not in the dictionary, it will of course not work. So what I need is a text-to-pronunciation tool for the missing words. Flite (http://www.speech.cs.cmu.edu/flite/) does that very well, but it's Linux-only, and I couldn't get it compiled on a win32 architecture without Cygwin. Another issue: it uses stress markers, meaning you need a dictionary with stressed pronunciations.
I moved to PocketSphinx (at least I'm trying :)), but now I'm having difficulty seeing how to get confidence scores for the hypothesis words. The results come with other data called pprob (almost always 1), ascr (acoustic score), and others. Do you know what these pprob and acoustic scores are?
Best,
Boris
--- (Edited on 11/20/2009 11:42 am [GMT-0600] by mrshoki) ---
To be honest, I think we first need to get to a common background in order to be able to answer your question in detail. For example, you should understand that in order to have a reliable confidence score instead of 1.0, you need a language model and dictionary that are representative enough to cover all possible variations.
Computer-assisted language learning is a large area with a huge number of publications. I suggest you go through the following sources, optionally reading a textbook on speech technologies.
Rabiner's tutorial
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.131.2084
Confidence measures for speech recognition: A survey
Hui Jiang
www.cse.yorku.ca/~hj/mypubs/Jiang_sc05.pdf
A method for measuring the intelligibility and nonnativeness of phone quality
in foreign language pronunciation training
Goh Kawai and Keikichi Hirose
http://www.shlrc.mq.edu.au/proceedings/icslp98/PDF/AUTHOR/SL980782.PDF
The SRI EduSpeak(TM) System: Recognition and Pronunciation Scoring
Franco et al.
http://www.speech.sri.com/people/hef/papers/EduSpeak.p
--- (Edited on 11/21/2009 01:31 [GMT+0300] by nsh) ---
Thanks a lot, nsh. I checked the different links you provided, and I understand confidence scoring a bit better.
So apparently Sphinx2 and PocketSphinx already use some of these methods to compute confidence scores.
Sphinx2, for example, uses the following:
P(W) = [ Π_i P(phone_i in pronunciation sequence) ]^(1/n)
i.e. the n-th root (geometric mean) of the product of the per-phone probabilities.
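That formula is just the geometric mean of the per-phone probabilities, which can be checked with a few lines of Python (the posterior values here are made-up toy numbers, not real decoder output):

```python
import math

def word_confidence(phone_posteriors):
    """Geometric mean of per-phone posterior probabilities,
    as in the Sphinx2-style formula above (illustration only)."""
    n = len(phone_posteriors)
    product = math.prod(phone_posteriors)
    return product ** (1.0 / n)

# e.g. a three-phone word with per-phone posteriors 0.9, 0.8, 0.7
score = word_confidence([0.9, 0.8, 0.7])  # ≈ 0.796
```

The n-th root keeps the score comparable across words of different lengths: without it, longer words would always score lower simply because they multiply more probabilities together.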
PocketSphinx uses a different method; that's why I get confidence scores 0 < cs < 1 on Sphinx2 and posterior probabilities = 1 on PocketSphinx.
I've no idea why they changed the computation method, but anyway, why do I get a relevant score on Sphinx2 even though my LM is restricted to a single word? Is it because I'm using a different acoustic model?
I could also use some of the methods explained by Hui in his paper, with crossed pronunciations between languages, but I'm not sure I need that, since I need feedback on words and not on pronunciation units. I mean, I could, but it seems complicated for what I need.
When I used Microsoft SAPI, I used an FSG and got relevant confidence scores even when there was only one possibility at a time (and therefore pprob = 1). So if I understand correctly, everything depends on the acoustic model used?
So a solution would be to create my own confidence scoring method, and for that I need access to the phoneme probabilities to see what I obtain.
EDIT: Actually, a more relevant question would be: do "posterior probability" and "confidence score" mean the same thing? I understand that having only one possible word at a time means a probability of 1.0, but the confidence score should be something different, reflecting the quality of the result. And if they are different, why don't Sphinx versions > 2 have confidence scores?
--- (Edited on 11/26/2009 7:52 am [GMT-0600] by mrshoki) ---
--- (Edited on 11/26/2009 9:37 am [GMT-0600] by mrshoki) ---
Hi
Sorry for the late reply.
> I've no idea why they changed the computation method, but anyway, why do I get a relevant score on Sphinx2 even though my LM is restricted to a single word? Is it because I'm using a different acoustic model?
The fact that you get a score less than 1 doesn't mean that this score is relevant. I don't think it is, unless you have a representative language model. In Sphinx2, to calculate
P(phone) = per-frame-acoustic-score(phone) / sum[per-frame-acoustic-score(all phones)]
you still need a representative language model. Otherwise you only divide by the phones present in this word, and instead of "all phones" = "all phones in the language" you get "all phones" = "all phones in the word".
> When I used Microsoft SAPI, I used an FSG and got relevant confidence scores even when there was only one possibility at a time (and therefore pprob = 1). So if I understand correctly, everything depends on the acoustic model used?
Sorry, I can't suggest anything about Microsoft speech recognition internals.
> Actually, a more relevant question would be: do "posterior probability" and "confidence score" mean the same thing? I understand that having only one possible word at a time means a probability of 1.0, but the confidence score should be something different, reflecting the quality of the result. And if they are different, why don't Sphinx versions > 2 have confidence scores?
Posterior probability is only one of many confidence measures: there is phone-based posterior probability, as in Sphinx2, and word-based posterior probability, as in PocketSphinx. In theory, word-based could be more precise because it captures the language structure better (with a large vocabulary, of course). For a small vocabulary, phone-based is indeed better, but that doesn't mean the vocabulary should be a single word.
There are more advanced methods using machine learning for confidence estimation.
http://jth2008.ehu.es/cd/pdfs/articulo/art_64.pdf
Confidence measures can also be duration-based, and so on. For more information, see Hui's article linked above.
--- (Edited on 11/30/2009 04:55 [GMT+0300] by nsh) ---
> So what I need is a text-to-pronunciation tool for the missing words.
Try Sequitur G2P, or search for "grapheme-to-phoneme converter" in Google.
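To show the shape of the problem, here is a deliberately naive letter-to-phone sketch. A real G2P tool such as Sequitur learns context-dependent rules from a pronunciation dictionary; this one-letter-one-phone table is purely illustrative and will mispronounce most English words:

```python
# Naive one-to-one letter-to-ARPAbet mapping (illustration only).
LETTER_TO_PHONE = {
    "a": "AH", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "OW", "p": "P", "q": "K", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K S",
    "y": "Y", "z": "Z",
}

def naive_g2p(word):
    """Map each letter to a rough phone; skips characters it doesn't know."""
    return " ".join(LETTER_TO_PHONE[c] for c in word.lower() if c in LETTER_TO_PHONE)

print(naive_g2p("test"))  # T EH S T
```

An output line like `testwordtest T EH S T W OW R D T EH S T` is exactly the dictionary-entry format the recognizer needs; the hard part, which Sequitur handles, is getting the phones right for real English spelling.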
--- (Edited on 11/24/2009 1:45 pm [GMT-0500] by kmaclean) ---