Speech Recognition Engines

engine comparison?
User: trevarthan
Date: 4/1/2007 1:06 am
Views: 12836
Rating: 22

Hello,

You all obviously know a lot more about these engines than I do. However, I'll try to sum up what I've discovered through research and testing here. Hopefully everyone will add to my findings and correct my misconceptions. In particular, I'm interested in comparing Julius and Sphinx2 (perhaps with pocketsphinx as a stand-in; I don't think they're all that different). I haven't had much luck with sphinx3, and sphinx4 is basically sphinx3 rewritten in Java. And HTK is more of a toolkit than an ASR, right?

I think most English-speaking people have more experience with sphinx because it comes with English language models and acoustic models. Julius comes with Japanese models. I've got a Japanese for Dummies book that I read from time to time, but overall I suck at Japanese, so I'm sticking with English.

Gentoo Linux has a sphinx ebuild, and I believe I've seen RPMs for Fedora Core 5, but Linux distribution penetration is still quite low for sphinx. Julius, however, while not having an English language or acoustic model (apart from those provided by this site), comes with a win32 build, which is a great advantage to the masses not running Linux. I haven't seen any julius RPMs or ebuilds, but they do have a Linux binary on the website. I personally run win32 and Linux, so I've had fun playing with both sphinx and julius, but I've only used julius on win32 because I don't enjoy compiling code under win32. I'll compile and test julius on Linux shortly.

Julius also has grammar-based recognition (Julian.exe), which this project uses extensively. Julius has dictation capabilities as well (Julius.exe), but I haven't seen English language models (n-gram files, etc.), so I haven't been able to test them. I think sphinx is similar: I think it ships with a limited grammar and has the capability for dictation with an expanded grammar.

So far I've had much better recognition accuracy (it worked really well, actually) with julian + voxforge on win32 than with sphinx + default AM/LM on Linux. But my win32 machine is much faster than my Linux machine, and I bought a new headset mic before doing the win32 test, so I'll have to repeat my sphinx test on Linux with the new mic and test julius on my Linux machine to be sure.

What I don't know is how the two systems compare under the hood. Does julius have functionality that sphinx lacks? What about the other way around?

Thanks!

 

--- (Edited on 4/ 1/2007 1:06 am [GMT-0500] by trevarthan) ---

Re: engine comparison?
User: trevarthan
Date: 4/2/2007 12:26 am
Views: 334
Rating: 23

OK, I had a chance to test out julian + voxforge on my 933 MHz Linux machine today (yeah, old school, baby) with the same headset I used on the win32 machine. First, I had to swap out the onboard sound card (old i810 chipset - snd_intel8x0) for an equally old PCI Sound Blaster (snd_ens1371) that I had lying around to get rid of the awful mic static. After that the recognition quality was identical to what I had experienced on win32 earlier in the day.

sphinx2-demo still has a terrible recognition rate with the new sound card and the same mic that I used to test julian. However, I'm comparing apples to oranges until I can use the voxforge grammar and AM with sphinx2. Has anyone tried this yet? Is it involved?

--- (Edited on 4/ 2/2007 12:26 am [GMT-0500] by trevarthan) ---

Re: engine comparison?
User: kmaclean
Date: 4/2/2007 1:26 pm
Views: 513
Rating: 20

Hi trevarthan,

>until I can use the voxforge grammar and AM with sphinx2. Has anyone tried this yet? Is it involved?

Creating Sphinx group AMs from the VoxForge Speech Corpus is still on the 'to-do' list for the VoxForge site. 

You might want to take a look at Keith Vertanen's Sphinx training recipe to get an idea of how involved it might be, or the CMU Sphinxtrain docs.

Thanks for reporting the results you have found thus far.

Ken 

--- (Edited on 4/ 2/2007 2:26 pm [GMT-0400] by kmaclean) ---

Re: engine comparison?
User: dbaxter
Date: 9/30/2007 7:34 am
Views: 246
Rating: 16

I am also interested in the comparison of sphinx2 and julian, especially for a grammar-based task where sphinx2 uses the same (voxforge) acoustic model as julian.

Can someone report results? Or - equally interesting - theoretical reasons why one should perform better than the other? I read somewhere that HTK is much more thoroughly developed than sphinx, but I'm not sure whether this applies only to the HMM training or also to julian's recognition engine.

I'd also be willing to do the comparison myself. However, I have not managed to use the voxforge AM with sphinx. Does a complete tutorial exist on how to convert the model?

--- (Edited on 9/30/2007 7:34 am [GMT-0500] by Visitor) ---

Re: engine comparison?
User: kmaclean
Date: 9/30/2007 9:10 am
Views: 3117
Rating: 15

>I am also interested in the comparison of sphinx2 and julian, especially for a grammar-based task where sphinx2 uses the same (voxforge) acoustic model as julian.

Keith Vertanen created acoustic models using the Wall Street Journal (WSJ) corpus for HTK and Sphinx. He includes test results for both. You could run Julian using the HTK acoustic models and compare with Sphinx 2.
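A common way to put a number on such a comparison is word error rate (WER): the word-level edit distance between each engine's output and a reference transcript, divided by the number of words in the reference. Below is a minimal Python sketch, assuming you have each engine's recognized text as a plain string. The function name and the sample sentences are made up for illustration; nothing here is part of Julius, Sphinx, or HTK.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Illustrative helper, not part of any recognizer's API. Both inputs are
    lower-cased and split on whitespace before comparison.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Word-level Levenshtein distance, keeping only one previous row.
    # prev[j] is the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical outputs from two engines for the same spoken prompt.
    reference = "dial one two three"
    engine_outputs = {
        "engine A": "dial one two three",
        "engine B": "dial one too three",
    }
    for name, text in engine_outputs.items():
        print(name, word_error_rate(reference, text))
```

For the comparison to be fair, both engines should be scored on the same recordings, with the same grammar or vocabulary, and ideally with acoustic models trained on the same data.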

>Or - equally interesting - theoretical reasons why one should perform better than the other?

See Arthur Chan's article in this post: Speech Recognition Engine comparison. Basically, Julius was designed for dictation applications. My understanding is that Sphinx was not, though there is a project working on a Sphinx 4 dictation SRE (evaldictator).

>However, I have not managed to use the voxforge AM with sphinx. Does a complete tutorial exist on how to convert the model?

See these posts:

Trefnydd is a project by Ivan A. Uemlianin, and he is planning to complete Wout's work on his HTK-to-Sphinx3 acoustic model converter.

Ken 

--- (Edited on 9/30/2007 10:10 am [GMT-0400] by kmaclean) ---

Re: engine comparison?
User: dbaxter
Date: 10/1/2007 3:09 am
Views: 247
Rating: 23

Thank you for this overview! I'll try both Vertanen's models and the Speech Recognition Model Converter.

--- (Edited on 10/1/2007 3:10 am [GMT-0500] by Visitor) ---
