Speech Recognition in the News

Rest in Peas: The Unrecognized Death of Speech Recognition
User: kmaclean
Date: 5/3/2010 2:19 pm
Views: 3413
Rating: 11

An interesting article on speech recognition. The author, Robert Fortner, is not impressed with the rate of improvement in speech recognition over the years. The passage that gives the gist of his argument is:

We have learned that speech is not just sounds. The acoustic signal doesn’t carry enough information for reliable interpretation, even when boosted by statistical analysis of terabytes of example phrases. As the leading lights of speech recognition acknowledged last May, “it is not possible to predict and collect separate data for any and all types of speech…” The approach of the last two decades has hit a dead end.[...]

However, what is more interesting is the rebuttal by Jeff Foley (Nuance), who says in a comment:

First of all, any discussion of speech recognition is useless without defining the task--with the references to Dragon I'll assume we're talking about large vocabulary speaker dependent general purpose continuous automatic speech recognition (ASR) using a close-talking microphone. Remember that "speech recognition" is successfully used for other tasks from hands-free automotive controls to cell phone dialing to over-the-phone customer service systems. For this defined task, accuracy goes well beyond the 20% WERR cited here. Accuracy even bests that for speaker independent tasks in noisy environments without proper microphones, but of course those have constricted vocabularies making them easier tasks. In some cases, you write about the failure to recognize "conversational speech," which is a different task involving multiple speakers who are not aware of an ASR system trying to transcribe their words. Software products such as Dragon do not purport to accomplish this task; for that, you need other technologies which are still tackling this task.

And with respect to Fortner's comment that "The core language machinery had not changed since the 50s and 60s", Foley says:

[...]  Actually, it was the Bakers' reliance on Hidden Markov Models (HMM) that made NaturallySpeaking possible. Where other ASR attempts focused on either understanding words semantically (what does this word mean?) or on word bigram and trigram patterns (which words are most likely to come next?), both techniques you described, the HMM approach at the phoneme level was far more successful. HMM's are pretty nifty; it's like trying to guess what's happening in a baseball game by listening to the cheers of the crowd from outside the stadium.[...]

Good thing Sphinx, HTK and Julius all use HMM-based acoustic models...
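
For readers unfamiliar with the technique Foley describes, here is a minimal sketch of the HMM idea: given only the observable "sounds", compute how likely a hidden-state model is to have produced them (the forward algorithm). The numbers below are made up for illustration; real acoustic models in Sphinx, HTK and Julius are phoneme-level HMMs with far richer observation distributions, not two-state toys like this.

```python
def forward_likelihood(obs, init, trans, emit):
    """P(obs) under a discrete HMM, via the forward algorithm.

    obs   : list of observation symbol indices
    init  : init[s]     = P(state s at t=0)
    trans : trans[s][t] = P(next state t | current state s)
    emit  : emit[s][o]  = P(observing o | state s)
    """
    n_states = len(init)
    # alpha[s] = P(obs[0..t], state at time t is s)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[s] * trans[s][t] for s in range(n_states)) * emit[t][o]
            for t in range(n_states)
        ]
    return sum(alpha)

# Two hidden "phoneme" states, two observable acoustic symbols (toy values).
init  = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit  = [[0.9, 0.1],
         [0.2, 0.8]]

p = forward_likelihood([0, 1, 0], init, trans, emit)
```

This is the "guessing the baseball game from the cheers" picture in miniature: the states are never observed directly, yet the model still assigns a probability to the sound sequence.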
