VoxForge
Hi list,
I have been studying the Sphinx system for a project where I need to validate a single sentence, pronounced by a multitude of users, against a prototype one.
My first approach has been using sphinx_align with a dictionary generated with the LM tool containing just the sentence to be validated, together with the hub4 acoustic model. This approach has proven unreliable in terms of false positives: under the hood, the engine assigns a high probability to dubious phones even when they are wrong.
To accomplish my goal I see two viable solutions. The first is using a large dictionary like WSJ and working around the result to build myself a confidence score from sphinx_decode. I think the execution time would grow a lot in this case, and it seems a bit overkill to me.
The second approach I can think of is using the library to skip the language modeling step entirely, and building a postprocessing stage that, for every detected phoneme, takes into account a set of similar phonemes and checks whether the sentence I am looking for can be reconstructed. I haven't looked into the sphinx3 library yet, so I don't know whether phoneme recognition can be separated from language modeling.
What do you suggest? Looking forward to your comments.
Thank you
Alessandro
--- (Edited on 12/31/2012 9:38 am [GMT-0600] by Visitor) ---
> I have been studying and understanding the sphinx system for a project where I need to validate a single sentence pronounced by a multitude of users against a prototype one.
This task is called "utterance verification". It's not supported by CMUSphinx toolkit right now.
You need a specific algorithm for utterance verification; speech recognition algorithms, in particular phoneme decoding and large-vocabulary decoding, do not work for utterance verification. Before you start implementing something, I suggest you read a few papers on utterance verification theory.
A simple Google search will give you a lot of papers on the subject. As an introduction to utterance verification you can check:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.6890
Essentially, any algorithm for utterance verification must estimate the probabilities of two distinct hypotheses: one, that the utterance is correct; the other, that the utterance is wrong. The second hypothesis can be checked with a phone loop.
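As a rough illustration of that likelihood-ratio idea (a sketch only; the function, score inputs, and threshold are hypothetical, not part of any Sphinx API): the total acoustic log-likelihood of forcing an alignment against the expected sentence is compared, per frame, with the log-likelihood of an unconstrained phone loop over the same audio.

```python
def verify_utterance(align_logprob, phoneloop_logprob, n_frames, threshold=-0.5):
    """Decide whether an utterance matches the expected sentence.

    align_logprob     -- total acoustic log-likelihood of the forced alignment
                         against the expected sentence (hypothesis: correct)
    phoneloop_logprob -- total acoustic log-likelihood of a free phone loop
                         over the same audio (hypothesis: anything was said)
    n_frames          -- number of acoustic frames, used to normalize
    threshold         -- per-frame log-likelihood-ratio acceptance threshold
                         (would have to be tuned on held-out data)

    Returns the per-frame log-likelihood ratio and an accept/reject flag.
    """
    llr = (align_logprob - phoneloop_logprob) / n_frames
    return llr, llr >= threshold

# A correct utterance aligns almost as well as the free phone loop,
# so the per-frame ratio stays near zero; a wrong one aligns much worse.
print(verify_utterance(-10000.0, -10050.0, 500))   # accepted
print(verify_utterance(-12000.0, -10000.0, 500))   # rejected
```

The key point, matching the paper above, is that the phone-loop score acts as the competing hypothesis; neither raw alignment score alone is meaningful as a confidence.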
> My first approach has been using sphinx_align with a dictionary generated with the LM tool containing just the sentence to be validated, using the hub4 acoustic model. This approach has proven unreliable in terms of false positives, as the engine, under the hood, is giving a high probability to dubious phones to be the right ones, when they are not.
Yes, this approach is not going to work.
> To accomplish my goal I see two viable solutions then: the first one is using a large dictionary like WSJ and work around the result to build myself a confidence score from sphinx_decode. I think that the execution time would grow a lot in this case, and to me seems a bit overkill.
> The second approach I can think of is using the library to completely skip the step of language modeling, and build myself a postprocessing that takes into account, for every detected phoneme, a set of similar phonemes and see if I can reconstruct the sentence I am looking for. I haven't looked in the sphinx3 library yet, so I don't know if it's possible to separate phoneme recognition from language modeling.
Those two approaches are better, but they are not efficient.
--- (Edited on 12/31/2012 18:57 [GMT+0300] by nsh) ---
Why do you think that the second approach I suggested would perform badly? Given the list of decoded phonemes, it would be trivial to score it against the sentence using an algorithm like this:
http://en.wikipedia.org/wiki/Levenshtein_distance
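For reference, the computation I have in mind is just the standard dynamic-programming edit distance, applied to phoneme sequences instead of characters (a minimal sketch, independent of Sphinx):

```python
def levenshtein(a, b):
    """Edit distance between two sequences, e.g. lists of phoneme symbols."""
    prev = list(range(len(b) + 1))          # distances from empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (x != y)))     # substitution (or match)
        prev = cur
    return prev[-1]

# Works the same on strings and on phoneme lists:
print(levenshtein("kitten", "sitting"))                     # 3
print(levenshtein("W ER L D".split(), "W ER D".split()))    # 1
```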
It's probably a very DIY solution, but my problem is really specific too, no?
--- (Edited on 12/31/2012 12:42 pm [GMT-0600] by Visitor) ---
> Why do you think that the second approach I suggested would perform badly? Given the list of decoded phonemes, it would be trivial to score it against the sentence using an algorithm like this
This score (edit distance) will not help you distinguish the utterance from the alternatives. Once you get a phone string, you have no way to tell whether the utterance was decoded incorrectly or pronounced incorrectly. For example, you cannot distinguish "hello world" decoded as "hello word" from "hello word" decoded as "hello word", so you will not be able to tell "word" and "world" apart.
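To make the problem concrete, a small demonstration using Python's standard difflib (the phone strings are illustrative CMUdict-style transcriptions, not decoder output): the decoded phones match the "hello word" reference perfectly and the "hello world" reference almost perfectly, and this score is identical whether the speaker mispronounced "world" or the decoder simply dropped the L.

```python
import difflib

# Two possible reference sentences, as phoneme sequences
ref_world = "HH AH L OW W ER L D".split()   # "hello world"
ref_word  = "HH AH L OW W ER D".split()     # "hello word"

# The phone string the decoder returned.  Scenario A: the speaker said
# "hello word" and the decoder was right.  Scenario B: the speaker said
# "hello world" and the decoder dropped the L.  Both scenarios produce
# this exact same string, so any score computed from it is the same too.
decoded = "HH AH L OW W ER D".split()

for name, ref in [("world", ref_world), ("word", ref_word)]:
    ratio = difflib.SequenceMatcher(None, decoded, ref).ratio()
    print(name, round(ratio, 2))
```

Both references score near 1.0 against the same decoded string, so no threshold on the score can separate a mispronunciation from a decoding error.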
--- (Edited on 1/1/2013 19:33 [GMT+0300] by nsh) ---