One of the issues that arises when multiple audio input/output options are available is whether, theoretically, it is better to put all the eggs in one basket or to keep them separate.
Take a household multimedia centre. The user could have a range of options, from Bluetooth wireless to wired USB to headsets wired directly into the motherboard, each of which offers different advantages and disadvantages: audio quality, portability, etc.
There are two options. Build the SRE model using all the headsets equally, so that the same monolithic model is used for recognition no matter which headset is chosen. Alternatively, build a separate model for each headset and run the matching model for whichever headset is in use, keeping each model tuned to the quirks of that headset.
Are there any reasons to expect better results from the one approach or the other?
>Are there any reasons to expect better results from the one approach
>or the other?
The general rule is that an acoustic model best recognizes speech from the type of audio it was trained with. As David Gelbart says in this post:
My attitude (which I believe is the norm among ASR engineers) is the more targeted data, the better for performance, although the performance curve may flatten due to diminishing returns.
Your "monolithic" acoustic model might be good enough with lots of samples from the cleanest microphone you have (good for monophone and triphone coverage), and with a smaller proportion from each of the 'noisy' mics you plan to use.
If the performance of recognition is still not satisfactory on one of your mics, then building separate acoustic models might be required, but this does not necessarily mean you have to re-record all your prompts:
Thanks Ken, this makes sense.
What I should do is, over time, create a number of models, each recorded with a different mic and based on identical grammars so they can stand alone; get some baseline recognition data from each separately; write a script that creates a monolithic model from various combinations of the audio prompts and samples; and test the result. Proof of the pudding...
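The combining script described above could be sketched roughly like this (the directory layout, the per-mic `prompts` transcript file, and the function name are all assumptions for illustration, not the actual project's tooling):

```python
import shutil
from pathlib import Path

def build_combined_set(mic_dirs, out_dir):
    """Pool per-microphone training data (wav files plus a 'prompts'
    transcript file per mic) into one directory, so a single monolithic
    acoustic model can be trained on the combination."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    combined_prompts = []
    for mic in map(Path, mic_dirs):
        # Copy audio, prefixing with the mic name to avoid filename clashes.
        for wav in sorted(mic.glob("*.wav")):
            shutil.copy(wav, out / f"{mic.name}_{wav.name}")
        # Each line of 'prompts' maps an audio filename to its transcription.
        for line in (mic / "prompts").read_text().splitlines():
            if line.strip():
                combined_prompts.append(f"{mic.name}_{line}")
    (out / "prompts").write_text("\n".join(combined_prompts) + "\n")
    return len(combined_prompts)
```

For example, `build_combined_set(["jabra", "sennheiser"], "combined_I_II")` would produce the I+II training set; the output directory would then be fed to the usual model-training run.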
I'll post back if I get anything interesting.
>I'll post back if I get anything interesting.
I'd be interested in your results even if you don't get "anything interesting" - it might save someone else from going down the same rabbit hole :)
I'm falling.... I'm falling...
I have set up a little project page for this at
If there are any comments, now is the time before I get started.
>I have set up a little project page for this at
It seems that all the combined acoustic models performed better (I+III, II+III, I+II+III), except for the I+II combined model (the recordings were made with these microphones: I - Jabra bt2040; II - Sennheiser PC131).
In addition, your Logitech USB Wireless (III) seems to have performed the best.
>* denotes a first test in a series, Julius is more liable to make errors on the
>first thing heard.
You might want to look at the Julius CMN parameter (from this post):
When you first start up Julian, you should see a notice like this:
------------- System Info end -------------
* NOTICE: The first input may not be correctly recognized *
* since no CMN parameter is available on startup. *
This is telling you that Julian takes the cepstral mean of the last 5 seconds of speech as the initial cepstral mean at the beginning of each input. In other words, Julian needs an average (the cepstral mean) over recent speech in order to normalize what it hears. That is why, in its default configuration, Julian never recognizes what you say for the first few utterances: it is still trying to work out this average.
You can get around this by using "-cmnsave filename" to record a representative average for your environment, and then "-cmnload filename" and "-cmnnoupdate" to use the saved CMN instead of recalculating it on the fly. Theoretically your confidence scores should then start looking reasonable, and you should be able to determine whether a word is in your grammar or not.
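The effect being described can be sketched numerically. This is only an illustration of cepstral mean subtraction, not Julius's actual implementation; the function names are invented for the example:

```python
def cepstral_mean(frames):
    """Per-dimension mean over a list of cepstral frame vectors."""
    dims = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dims)]

def apply_cmn(frames, prior_mean=None):
    """Subtract a cepstral mean from every frame.  With no prior mean
    (Julius at startup), the mean must be estimated from the input itself,
    which is why the first utterance tends to suffer: until enough frames
    have arrived, the estimate is poor."""
    mean = prior_mean if prior_mean is not None else cepstral_mean(frames)
    return [[x - m for x, m in zip(f, mean)] for f in frames]

# A saved mean (the "-cmnload" idea) lets you normalize correctly
# from the very first frame of the very first utterance.
utterance = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
saved_mean = [3.0, 4.0]
normalized = apply_cmn(utterance, prior_mean=saved_mean)
```

With a representative saved mean, normalization is stable from input one; without it, the first few utterances are normalized against a half-formed estimate.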
Thanks for your thoughts, Ken. While my experiment is more lithomancy than science I think for me it was useful.
I'm very concerned about the difference between the first and second tests of the Bluetooth+Sennheiser model tested with the BT headset: 10 followed by zero 24 hours later? This is highly suspicious and probably indicates something else at work. I tried to give the Jabra a freshly recharged battery for each daily set of tests, but it is still possible that one or more of my batteries is not as good as required. Or something else was happening. The BT was the least consistent overall and requires more rigorous treatment. My initial impression was that the BT sets in general performed a lot better than this, and my feeling is that more testing would show the BT in a better light.
I was hoping for more disparity between the combinations of two and the final combination of three. Choosing to record only two prompts instead of four in the initial round would probably have produced more errors at each stage. Having so few errors at the higher levels means I have no measure of how much a poor component "pollutes" a model.
The lack of errors in the combined acoustic models is probably purely a matter of the increased number of audio samples to work with. The bottom layer is working with about 150 prompt samples, next layer about 300 and the top layer about 500. Later I intend to try the topmost model with a headset entirely foreign to the model and see what results come of that.
On the whole I am encouraged. I will certainly consider preparing models with audio samples from a range of devices, with the thought that I can grab the most suitable device for the situation. PulseAudio does a good job of managing which device is active at any one time; now I just need to build into the grammar a verbal instruction to switch from one device to another on the fly.
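That on-the-fly switch could be as simple as mapping a recognized grammar word to a `pactl set-default-source` call. The device names and keyword mapping below are hypothetical; real source names come from `pactl list short sources`:

```python
import subprocess

# Hypothetical mapping from a word in the recognition grammar to a
# PulseAudio source name (find the real names with: pactl list short sources).
DEVICE_BY_KEYWORD = {
    "BLUETOOTH": "bluez_source.00_11_22_33_44_55",
    "USB": "alsa_input.usb-Logitech_Wireless_Headset-00.mono-fallback",
}

def switch_command(keyword):
    """Build the pactl command that makes the chosen device the default
    capture source; returns None for words that aren't device names."""
    source = DEVICE_BY_KEYWORD.get(keyword)
    if source is None:
        return None
    return ["pactl", "set-default-source", source]

def switch_device(keyword):
    """Run the switch; called from the dialogue manager when the
    recognizer reports one of the device keywords."""
    cmd = switch_command(keyword)
    if cmd:
        subprocess.run(cmd, check=True)
```

The dialogue manager would call `switch_device(word)` on every recognized utterance; non-device words simply fall through.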
On the topic of the CMN parameter, I did think about this but, on Occam's Razor grounds, decided against it. It is easy enough in a dialogue manager to ignore the first couple of inputs, or to make the first inputs meaningful only in a general way that lets Julius get set up. After all, in normal speech we say things like "How are things?" without really caring about the content of the response, just so we can hear the other person say something equally meaningless and process their tone of voice, gestures, etc.
Now that's a word you don't see often... sounds like something that should be in a Monty Python skit...