VoxForge
I am using Julius + HTK and the voxforge scripts (latest versions of everything).
Say I have a grammar in the form:
FRUIT ( $FRUITS )
$FRUITS : ( APPLE | ORANGE | PEAR | ... )
Plus a lot of words to exercise the otherwise unused triphones.
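In the Julius/VoxForge grammar format that would be a .grammar/.voca pair along these lines (the file names, the NS_B/NS_E silence entries, and the phone transcriptions are illustrative, not my actual files):

```
# sample.grammar
S         : NS_B FRUIT_CMD NS_E
FRUIT_CMD : FRUITWORD FRUITS

# sample.voca
% NS_B
<s>       sil
% NS_E
</s>      sil
% FRUITWORD
FRUIT     f r uw t
% FRUITS
APPLE     ae p ah l
ORANGE    ao r ah n jh
PEAR      p eh r
```

These two files would then be compiled with mkdfa.pl before being passed to Julius.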
With about 10 (ten) members in $FRUITS I can get 100% accurate detection from Julius. Very satisfying. Now I am trying to increase the number of members of $FRUITS to over 150 (one hundred and fifty), and recognition falls to pieces, with only about 10% correct. Julius's responses for what it considers the best fit appear quite irrational; I can't see a pattern at all.
Are there any theoretical reasons why I should be getting such bad results? Julius has more possible combinations to choose from and so has to be pickier, but the words all seem quite distinct in their own way.
The members of $FRUITS are nearly all words that do not appear in obelisk_lexicon, so I have added them; however, I think my allocation of phones is reasonably good, since it works in the case of 10 items. Combinations that work with 10 items fail with the larger set.
I will try working through sets of 20, 30, 40, etc. to see what happens. Just wondering if there are known limits to accuracy in this kind of context.
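To step through those sizes, one approach is to generate the % FRUITS section of the .voca file from a single master word list, truncated to each test size. A minimal sketch (the word list, transcriptions, and file names here are hypothetical placeholders):

```python
# Build the "% FRUITS" section of a Julius .voca file at increasing sizes,
# to test how recognition accuracy changes as the category grows.

entries = [
    ("APPLE",  "ae p ah l"),
    ("ORANGE", "ao r ah n jh"),
    ("PEAR",   "p eh r"),
    # ... the rest of the 150-word list goes here
]

def voca_section(entries, size):
    """Return a % FRUITS voca section limited to the first `size` words."""
    lines = ["% FRUITS"]
    for word, phones in entries[:size]:
        lines.append(f"{word}\t{phones}")
    return "\n".join(lines) + "\n"

# Write one partial voca section per test size (hypothetical file names).
for size in (10, 20, 30, 40):
    with open(f"fruits_{size}.voca_part", "w") as f:
        f.write(voca_section(entries, size))
```

Each generated section would then be spliced into the full .voca file and recompiled with mkdfa.pl before rerunning Julius.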
--- (Edited on 2/2/2009 9:23 am [GMT-0600] by colbec) ---
> Are there any theoretical reasons why I should be getting such bad results?
Well, it's kind of expected. The nature of speech doesn't allow you to recognize everything precisely, and recognition accuracy depends on two factors: speed and vocabulary size.
The typical values are:
10 words - 0.2 RT - 99.9%
100 words - 0.5 RT - 98%
5000 words - 1.5 RT - 93%
50000 words - 4 RT - 80%
For example, see http://cmusphinx.sourceforge.net/sphinx4/#speed_and_accuracy; such values are typical across recognizers.
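The rough trend in those numbers, accuracy falling more or less linearly with the logarithm of the vocabulary size, can be eyeballed with a quick least-squares fit. The data points are just the figures quoted above; the linear-in-log-N model itself is an assumption, not something the recognizers guarantee:

```python
import math

# (vocabulary size, word accuracy %) as quoted in the table above.
points = [(10, 99.9), (100, 98.0), (5000, 93.0), (50000, 80.0)]

# Least-squares fit of: accuracy = a + b * log10(N)   (assumed model)
xs = [math.log10(n) for n, _ in points]
ys = [acc for _, acc in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx

def predicted_accuracy(vocab_size):
    """Extrapolated accuracy (%) at a given vocabulary size."""
    return a + b * math.log10(vocab_size)
```

On these four points the fitted slope works out to roughly five accuracy points lost per tenfold increase in vocabulary size, so going from 10 to 150 words should on its own only cost a few percent.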
So, give or take the quality of the HTK VoxForge model, you can get an estimate. The Sphinx VoxForge model should be more accurate, though; it has around 94% accuracy on a 3000-word task.
There are ways to improve accuracy, both simple ones and very complicated ones. Most will require coding work on the engine itself, though acoustic model work is also worthwhile. For example, you could clean up the VoxForge database and gain several percent of improvement.
--- (Edited on 2/2/2009 5:24 pm [GMT-0600] by nsh) ---
Thanks nsh, interesting reading.
From a few tests with my own voice model and about 100 total words, the pattern emerging with my setup is that 10-14 elements in $FRUITS give really reliable 100% recognition. Above 14, accuracy starts to suffer quickly, and at 20 I average 50%. Further testing at a hundred elements and above shows less than 10%, and often 1%.
I have not done much with adapting the voxforge model yet. It will be interesting to see if it makes much difference.
--- (Edited on 2/3/2009 11:59 am [GMT-0600] by colbec) ---