VoxForge
Is it possible to get similar results with VoxForge/Julius? Or possibly CMU Sphinx?
http://www.youtube.com/watch?v=09HMLolKhp0
If not yet, how long will it take open source community to catch up? Your best guess
--- (Edited on 12/8/2008 12:14 pm [GMT-0600] by Visitor) ---
I did some more research, and it looks like the above-mentioned Google technology, Vlingo (a Yahoo-backed company), and Microsoft (with tellme.com) all use cloud computing for speech recognition.
I'm not sure if it is a sane idea, but would it make sense to organize a distributed project (BOINC-based, or similar) with thousands of computers using VoxForge to process user queries one by one? The point would be to spend 1000x the computing power on the audio while still keeping the response time within seconds.
I remember reading here somewhere about an acoustic model that takes five times real time on a 2.8GHz processor... Imagine a model that takes 2000x the power of a single quad core - wouldn't it be way better?
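As a back-of-envelope check of the idea above (a sketch only; the 2000x figure comes from the post, and perfect parallel scaling across cores is an assumption that real decoders rarely achieve):

```python
# Rough wall-clock latency for farming one utterance out to many cores,
# assuming the decoding work splits evenly with no coordination overhead.
def response_time_seconds(audio_seconds, realtime_factor, cores):
    """Time to decode `audio_seconds` of speech with a decoder that runs
    at `realtime_factor` x real time on one core, split across `cores`."""
    return audio_seconds * realtime_factor / cores

# A 3-second query through a hypothetical 2000x-real-time model:
print(response_time_seconds(3, 2000, 1))     # 6000.0 s on one core
print(response_time_seconds(3, 2000, 1000))  # 6.0 s on 1000 cores
```

So keeping the response within seconds would indeed take on the order of a thousand cooperating cores per query, even under ideal scaling.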
Disclaimer: sorry, I'm a noob here, and it could very well be that what I'm saying doesn't make much sense...
--- (Edited on 12/8/2008 8:54 pm [GMT-0600] by Visitor) ---
Hi TP,
>how long will it take open source community to catch up? Your best guess
Interesting question!
The current limiter is not the speech recognition engine. The current limiter is having enough speech to train decent acoustic models.
The VoxForge project has been collecting speech for a little over 2 years, and we only have 56 hours of speech. The Sphinx family of speech recognition engines uses 140 hours of speech for reasonably good command-and-control speech recognition. Assuming the same rate of submissions, we are going to need about three more years for something similar to Sphinx's acoustic models...
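The estimate above works out as follows (assuming the submission rate stays constant):

```python
# Worked version of the estimate: hours still needed divided by
# the observed collection rate.
collected_hours = 56    # VoxForge corpus collected so far
collection_years = 2    # roughly how long that took
target_hours = 140      # what the Sphinx models were trained on

hours_per_year = collected_hours / collection_years          # 28.0
years_remaining = (target_hours - collected_hours) / hours_per_year
print(years_remaining)  # 3.0
```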
Dictation speech recognition needs at least 1000 hours (maybe less depending on the quality of the audio and transcriptions...). Command and control applications can get away with less speech because they use a constrained vocabulary (only need to recognize a few predetermined words). Dictation applications have an unconstrained vocabulary, but up until the past few years, tended to be on PCs. PC-based dictation apps were able to benefit from wider audio bandwidth (16kHz sampling rate at 16 bits per sample) of the speech to be recognized - i.e. there is more 'information' that a speech recognition engine can use to find a match in its acoustic model.
Unconstrained speech recognition over the telephone is quite a feat, because of the narrower data pipe of telephone lines (8kHz-8bit) - there is less 'information' that a speech recognition engine can use to find a match in its acoustic model. This means that much, much more audio is required to compensate for this reduced amount of audio information - I would *guess* 5000 hours or greater.
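The gap between the two channels can be quantified directly (a sketch comparing raw, uncompressed bit rates; it ignores codec effects on telephone audio):

```python
# Raw data rate = sampling rate (samples/s) * sample depth (bits/sample).
pc_rate_bps  = 16000 * 16   # PC dictation audio: 16 kHz at 16 bits
tel_rate_bps = 8000 * 8     # telephone audio:     8 kHz at 8 bits

print(pc_rate_bps)                  # 256000 bits/s
print(tel_rate_bps)                 # 64000 bits/s
print(pc_rate_bps / tel_rate_bps)   # 4.0 -- telephone carries a quarter
                                    # of the raw information
```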
So I don't see similar Google technology coming to open source for a long time - at least using the approach we are currently using.
Brough Turner has another approach, but there are Copyright issues that I am not sure how to resolve...
Ken
--- (Edited on 12/10/2008 12:20 pm [GMT-0500] by kmaclean) ---