VoxForge
I was wondering if it is possible to merge a completed acoustic model (let's say CMU Communicator, because it's the best 'free' one that I know of at the moment) with your own training data, or even with another acoustic model (say the one from VoxForge, for example)? Or would I have to scrap CMU Communicator and start from scratch...
Currently I am getting about a 70% recognition rate with CMU communicator using Sphinx 2 on a large vocabulary over VoIP.
I would hate to have to scrap my 'success' and start from scratch with potentially a much higher WER.
Speaking of which, what kind of success rates have people been seeing on an AM trained with the VoxForge data?
--- (Edited on 2/24/2008 9:01 am [GMT-0600] by oeginc) ---
> Currently I am getting about a 70% recognition rate with CMU communicator using Sphinx 2 on a large vocabulary over VoIP
Everything depends on the size of the vocabulary and the complexity of the grammar you'd like to use. If you want to use a really large vocabulary like the one used in Switchboard, you won't get a better rate than 70%. If your vocabulary is medium-sized, another acoustic model won't bring you better accuracy either. 70% is what IBM gets on Switchboard.
I suggest you define your task more precisely: post the number of speakers and the size and complexity of the vocabulary. Then you should think about building your language model, about adapting the model to your speakers, and about optimizing parameters. Adaptation will help you, and it is actually the way to "merge" a model and new data.
--- (Edited on 2/24/2008 9:20 am [GMT-0600] by nsh) ---
Ok, if you are looking for more details, here you go.. :)
Calls come in over VoIP to an Asterisk server, which in turn calls an AGI application. I record the speaker using the Asterisk RECORD command and save the output as a .wav file, process it with sox to convert it to RAW format (sox in.wav -r 8000 -c 1 -s -w out.wav resample -ql), and at that point I call Sphinx to decode the audio data using the CMU Communicator acoustic model.
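For what it's worth, the conversion step can be scripted rather than shelled out by hand. Here's a rough Python sketch of building and running that same sox command (purely illustrative; the filenames are placeholders and it assumes sox is on the PATH):

```python
import subprocess

def build_sox_cmd(in_wav, out_raw):
    """Build the sox command that downsamples to 8 kHz, mono,
    16-bit signed words, which is what the telephone-bandwidth
    acoustic model expects."""
    return ["sox", in_wav,
            "-r", "8000",   # sample rate
            "-c", "1",      # mono
            "-s", "-w",     # signed 16-bit words (old sox flags)
            out_raw,
            "resample", "-ql"]

def convert(in_wav, out_raw):
    # Raises CalledProcessError if sox exits non-zero.
    subprocess.run(build_sox_cmd(in_wav, out_raw), check=True)
```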
Speakers will be in the hundreds or thousands pretty easily.
Vocabulary contains approximately 45,000 entries (not sure what you consider a "really large" vocabulary, but mine is all of the cities and states in the United States). There are applications out there (Free-411, for example) where you speak the city/state and they "appear" to have a much better than 70% recognition rate, but maybe I am wrong.
That being said, the original question still stands... I want to know if there is some way to take the "compiled" CMU Communicator data and add other training data to it (i.e., could you take all of the already compiled corpora and combine them into one really good one?)
--- (Edited on 2/24/2008 9:45 pm [GMT-0600] by oeginc) ---
> Vocabulary contains approximately 45,000 entries (not sure what you consider a "really large" vocabulary, but mine is all of the cities and states in the United States).
Hm, it should be better for sure. There are too many places where you can make a mistake: using a bad language model, compression artifacts affecting the quality, beams that are too small, and so on. If you can share examples on which you get errors, we can try to optimize things.
> I want to know if there is some way to take the "compiled" CMU Communicator data and add other training data to it
It's called adaptation. Sphinxtrain supports MLLR for example.
--- (Edited on 2/25/2008 12:22 am [GMT-0600] by nsh) ---
Ok, here is what I did...
0. I downloaded SphinxBase & Sphinx2, compiled & installed. I downloaded the CMU Communicator grammar (because it's the best one that I know of at the moment) and installed it.
1. Created a dictionary with all of the cities, states, and then city/state combinations. (I'm not sure why, but one of the other programmers here suggested I include the cities and states by themselves as well as concatenated, which is how we ask the caller to say them.)
2. I uploaded the dictionary to the Sphinx Knowledge Base Tool to create my language model. Doing only the cities and states in Michigan (just for testing purposes; the SKBT won't accept a dictionary of 44,000+ words, so I had to limit it), I ended up with 973 lines in my .dic file (so 973 unique words, right?). After installing this dictionary I needed to replace the EL phone with L in the .dic file for GOBLES MICHIGAN (Sphinx2 doesn't like EL for some reason).
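The EL-to-L substitution can be done over the whole .dic file rather than by hand. A minimal sketch, assuming the usual dictionary format of one word followed by its phones per line (sample entries are illustrative, not from the real file):

```python
def fix_el_phone(dic_text):
    """Replace the EL phone with L in every pronunciation,
    leaving the word field (first token) untouched."""
    out = []
    for line in dic_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        word, phones = parts[0], parts[1:]
        phones = ["L" if p == "EL" else p for p in phones]
        out.append(word + " " + " ".join(phones))
    return "\n".join(out)
```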
3. I setup Asterisk to pass incoming calls to my script via the AGI-PHP interface.
4. My script answers the call, asks for a city/state, records a .wav file, runs sox on it with the following:
First run: sox <filename>.wav -e stat -V
to get the volume information, then I use that to increase the sound file to the maximum volume without clipping.
Second run: sox -v <volume scale> <in filename>.wav -r 8000 -c 1 -s -w -V <out filename>.raw resample -ql
I'm sure there is a better way to "clean" the audio sample, but I haven't found it yet. I've been playing with the compand option of sox, but I can't quite figure out a good way to remove or reduce the background noise that comes with a telephone call.
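The "maximum volume without clipping" computation from the stat pass is just one division. A tiny sketch of how I derive the -v scale factor (assuming stat's "Maximum amplitude" is reported on a 0.0 to 1.0 scale; the headroom figure is my own choice, not from sox):

```python
def volume_scale(peak, headroom=0.99):
    """Largest gain that keeps the waveform just below full scale.
    `peak` is the maximum absolute sample amplitude reported by
    `sox ... stat` (0.0 .. 1.0)."""
    if peak <= 0:
        return 1.0  # silent file: leave it alone
    return headroom / peak
```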
5. After I've "cleaned" the audio and converted it to RAW format, I run it thru sphinx with the following options:
/usr/local/bin/sphinx2_continuous -verbose 2 -samp 8000 -agcmax false \
  -adcin TRUE -adcext raw -ctlfn $tmpFilename -ctloffset 0 -ctlcount 100000000 \
  -datadir . -langwt 6.5 -fwdflatlw 8.5 -rescorelw 9.5 -ugwt 0.5 \
  -fillpen 1e-10 -silpen 1e-10 -inspen 0.65 \
  -top 1 -topsenfrm 3 -topsenthresh -70000 \
  -beam 2e-06 -npbeam 2e-06 -lpbeam 2e-05 -lponlybeam 0.0005 -nwbeam 0.0005 \
  -fwdflat TRUE -fwdflatbeam 1e-08 -fwdflatnwbeam 0.0003 -bestpath TRUE \
  -kbdumpdir ${TASK} -lmfn ${TASK}/localredirect.lm -dictfn ${TASK}/localredirect.dic \
  -ndictfn ${HMM}/noisedict -phnfn ${HMM}/phone -mapfn ${HMM}/map \
  -hmmdir ${HMM} -hmmdirlist ${HMM} -8bsen TRUE -sendumpfn ${HMM}/sendump -cbdir ${HMM}
And this yields (after 3 different speakers and 55 or so recorded .wavs) about a 71% sentence recognition rate.
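In case it helps anyone reproduce my numbers, here's how I'm scoring: an utterance counts as correct only if the whole hypothesis matches the reference exactly. A quick sketch (the city names in the test are made up, not from my test set):

```python
def sentence_accuracy(hyps, refs):
    """Fraction of utterances whose full hypothesis matches the
    reference exactly, after case/whitespace normalization.
    This is sentence accuracy, which is stricter than 1 - WER."""
    norm = lambda s: " ".join(s.upper().split())
    correct = sum(norm(h) == norm(r) for h, r in zip(hyps, refs))
    return correct / len(refs)
```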
I ran some tests this weekend trying to tweak the langwt, fwdflatlw, rescorelw, and ugwt options to see if I could get better results, but since I really have no idea what I'm doing and the documentation out there is sparse at best, I wasn't able to improve beyond the 71%.
I've uploaded the dictionary source (in the .txt) and the results I got back from the SKBT here:
http://www.gigasize.com/get.php?d=rb4o5wxymvc
I've uploaded the .wav files I was using for testing here:
http://www.gigasize.com/get.php?d=2p6wdw8xc7f
Let's see, did I forget anything? Probably.. Just let me know what else you might need.
Any help you could give me on increasing the accuracy, and creating larger dictionaries would be greatly appreciated.
--- (Edited on 2/25/2008 11:59 am [GMT-0600] by Visitor) ---
Well, there is huge room for improvement. I fixed a few mistakes in your dictionary and language model, and now I get only one word recognized incorrectly, probably due to a bad pronunciation. Word accuracy is 99.3% with the wsj1 model and the latest pocketsphinx. You can find all my files here:
http://www.mediafire.com/?camj5ujy1xw
What I did:
Fixed belleville pronunciation
Fixed okomos vs okemos
Added farmington hills to the corpus
Fixed brighton pronunciation
So everything comes down to using a proper dictionary and a proper model. Also, don't expect the WER to stay at 1.3%: the actual performance of WSJ is around 96%, and that is the accuracy you'll approach once you have more data.
A few more general pieces of advice:
1. Use pocketsphinx instead of sphinx2. Sphinx2 is deprecated, and from the sphinx svn you can check out the latest perl module to interoperate with pocketsphinx from asterisk via AGI.
2. Fix your dictionary and restrict your model as I wrote before.
3. When you reach the limit, use MAP adaptation to improve accuracy. Once you have set things up properly, we can return to this.
4. Consider speech donation to Voxforge :)
--- (Edited on 2/25/2008 2:21 pm [GMT-0600] by nsh) ---
--- (Edited on 2/25/2008 2:22 pm [GMT-0600] by nsh) ---
First of all, THANK YOU. Wow, this made such a tremendous difference, I was shocked..
Secondly, I apologize for taking so long to get back to you, I was just so excited things were working I went on a coding marathon. :)
I had tried PocketSphinx originally, but couldn't find any good examples of how to get it working (whereas there were many examples for Sphinx2, which is why I "downgraded").
I set up PocketSphinx, switched back to the WSJ1 AM, loaded up the fixed LM, ran my tests, and sure enough things worked beautifully.
Now the questions remain:
1. How do I take a list of sentences (i.e., city/state combinations) and turn that into my LM? As I mentioned above, I have been using the Sphinx Knowledge Base Tool to do it for me (you upload a text file, then download the complete LM). This has been working for "smaller" dictionaries, but it chokes on my entire United States list of cities/states.
2. What is MAP adaptation?
3. I was looking into the LumenVox Speech Recognition Engine and noticed they claim to have all sorts of "technology" to remove background noise, etc. I imagine the majority of this could be done by running the incoming speech through sox, maybe with some compand options or something. Obviously I'm not an audio genius, but I would have thought that by now there would be more documented examples of exactly how to "clean up" audio and prepare it for speech recognition... My audio, by the way, comes in over VoIP through an Asterisk server.
P.S. I actually had already donated some speech to the VoxForge project, and plan to do some more later as well as have some of my employees help out.
--- (Edited on 2/28/2008 4:23 pm [GMT-0600] by Visitor) ---
> How do I take a list of sentences (i.e., city/state combinations) and turn that into my LM? As I mentioned above, I have been using the Sphinx Knowledge Base Tool to do it for me (you upload a text file, then download the complete LM). This has been working for "smaller" dictionaries, but it chokes on my entire United States list of cities/states.
Well, it's actually a complicated question. The online tool is a hacked version of cmuclmtk, available from the sphinx site. I don't know how it was modified, but I suppose it's possible to try to reproduce it. Join #cmusphinx on freenode or ask on the sphinx-sdmeet mailing list on sourceforge, and we'll try to get it working.
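The core of what such a toolkit computes before smoothing is just n-gram counts over the sentence list. A minimal sketch of the bigram case with sentence-boundary markers (the smoothing and backoff that make a real ARPA-format LM are deliberately left out):

```python
import math
from collections import Counter

def bigram_lm(sentences):
    """Maximum-likelihood bigram statistics from a sentence list.
    Returns unigram counts, bigram counts, and log10 conditional
    probabilities P(w2 | w1) = count(w1 w2) / count(w1)."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.upper().split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    logp = {pair: math.log10(c / uni[pair[0]]) for pair, c in bi.items()}
    return uni, bi, logp
```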
The second and, in my opinion, more important issue is that you probably want a higher probability for bigger towns, so that the language model cares more about New York than New Vasuki. It's not so straightforward to build such a model, although it would be a very interesting task. Again, this needs additional discussion and is a separate issue.
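One crude way to get that effect without touching the toolkit: repeat each city line in the training corpus roughly in proportion to the log of its population, so the n-gram counts favor big towns. A sketch (the population figures here are made up for illustration):

```python
import math

def weighted_corpus(cities):
    """Expand (name, population) pairs into corpus lines, repeating
    each city max(1, round(log10(population))) times so that bigger
    towns get proportionally larger n-gram counts."""
    lines = []
    for name, pop in cities:
        lines.extend([name] * max(1, round(math.log10(pop))))
    return lines
```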
> 2. What is MAP adaption?
Adaptation is described here:
http://www.speech.cs.cmu.edu/cmusphinx/moinmoin/AcousticModelAdaptation
Although recent sphinxtrain allows much more interesting things.
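In a nutshell, MAP adaptation re-estimates each model parameter as a blend of the prior (the original model) and the statistics of your adaptation data. A toy sketch for a single Gaussian mean, which is the simplest case (tau is the usual prior-weight knob; the real sphinxtrain update also touches variances and mixture weights):

```python
def map_update_mean(prior_mean, obs, tau=10.0):
    """MAP re-estimate of one Gaussian mean: a weighted blend of the
    prior mean and the sample mean of the adaptation data. Larger
    `tau` means the prior resists the new data more strongly."""
    n = len(obs)
    sample_mean = sum(obs) / n
    return (tau * prior_mean + n * sample_mean) / (tau + n)
```

With little data the estimate stays near the prior; as the amount of adaptation data grows, it converges to the sample mean.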
> I was looking into LumenVox Speech Recognition Engine and notice they claim to have all sorts of "technology" to remove background noises
No, actually lumenvox is much more advanced than sphinx. Of course it's not just simple sox, but rather complicated normalization and noise reduction. Even the interface they use is much better than AGI and allows things like proper endpointing (detecting the end of the phrase instead of a fixed-length recording, as with sphinx). But it's a rather big job.
If you are ready for any of the above, we can continue, either with the LM or with proper noise reduction/endpointing/adaptation. We just need to choose a direction, since it's hard to cover everything at once.
> P.S. I actually had already donated some speech to the VoxForge project
--- (Edited on 2/28/2008 4:52 pm [GMT-0600] by Visitor) ---
Well, which path to head down is pretty tricky because I really need ALL of them, LOL.
My application will be accepting incoming calls from numerous (i.e., an unlimited number of) different speakers; the majority will probably be on cell phones, and most will probably be driving, hence a large amount of background noise (radio, road noise, passengers talking, etc.). In order for this to be successful, I will need to find a way of maintaining a 90% or better recognition rate even under those conditions. This makes cleaning the incoming audio stream important.
I've tried some things with sox and compand that removed the "background" noise, but unfortunately in the process it also clipped part of my speech (and it didn't work consistently among speakers).
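To show the kind of thing I mean (and why it clips speech): what compand is doing is essentially a gate, muting anything quieter than a threshold. A toy version over raw samples, purely to illustrate the failure mode; quiet speech onsets fall below the threshold and get zeroed along with the noise:

```python
def noise_gate(samples, frame=160, threshold=0.02):
    """Zero out 20 ms frames (160 samples at 8 kHz) whose mean
    absolute amplitude falls below the threshold. A crude gate;
    real front ends use spectral methods instead."""
    out = list(samples)
    for i in range(0, len(out), frame):
        chunk = out[i:i + frame]
        if sum(abs(x) for x in chunk) / len(chunk) < threshold:
            out[i:i + frame] = [0.0] * len(chunk)
    return out
```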
I looked into MAP adaptation via that link you sent, and it sounds pretty good, but here's what I was thinking (and you definitely sound like you have enough knowledge on the subject): why hasn't someone taken something like WSJ1 and adapted it using all of these other speech files that are available (i.e., the ones from VoxForge, CMU, etc.)? Wouldn't that increase the recognition rate overall, or is adaptation limited to increasing the recognition rate for only one speaker? My idea was to create a sort of 'Super Acoustic Model'. Perhaps I'm all wet here, but I think this is something everyone would benefit from. Or is WSJ1 really the best we can expect to get?
I also really need to figure out how to build good-quality large language models, but I believe I may be able to figure that one out on my own (especially if you have any links handy to information on the subject).
So.... Any help you could give me on cleaning up the incoming audio would be excellent and greatly appreciated. In the meantime, I'll start putzing with the LM on my own and see what I come up with.
Thanks!
--- (Edited on 2/28/2008 5:50 pm [GMT-0600] by oeginc) ---