VoxForge
Email discussion I had with Arthur Chan (author of the article "Do we have a true open source dictation machine?")
Hi Arthur,
Just read your article titled "Do we have a true open source
dictation machine?" - great article. I hope you don't mind, but I
posted it (in its entirety) on the VoxForge Web site
(www.voxforge.org). VoxForge just happens to be trying to address
one of the concerns you mentioned in your article about the lack of
good acoustic models. We are collecting user submitted speech for
incorporation into a GPL Acoustic Model. Currently we have a
Julius/HTK AM being created daily, incorporating newly submitted audio
on a nightly basis. One for Sphinx is planned.
I have a question ...
I am confused as to which approach to take in the creation of the
VoxForge speech corpora. Up until now (and until I decide
whether to change things, or not ...) I have been asking users to
submit 'clean' speech - i.e. record their submission to ensure that all
noise (i.e. non-speech noise such as echo, hiss, ...) is kept to an
absolute minimum. One guy (very ingeniously I thought) records
his submissions in his closet or in his car!
But some people, whose opinions I respect, say that I should not be
collecting clean speech, but collecting speech in its 'natural
environment', warts and all, with echo and hiss and all that (but
avoiding other background noise such as people talking or radios or
TVs). On some recordings, the hiss is quite noticeable.
What confuses me is that I see that some speech recognition microphones
are sold with built-in echo and noise cancellation, and they say that
this improves a (commercial) speech recognizer's performance.
This indicates to me that I should be collecting clean speech, and then
use noise reduction and echo cancellation front-end on the speech
recognizer, because that is what commercial speech recognition engines
seem to be doing.
And further, if clean speech is required, should I be using noise
reduction software on the submitted audio? My attempts at noise
reduction have not been that successful, with the resulting 'musical
noise' that replaces the removed noise giving me very poor recognition
results.
I was wondering what your thoughts on this might be,
thanks for your time,
regards,
Ken MacLean
--- (Edited on 1/29/2007 10:10 am [GMT-0500] by kmaclean) ---
Hi Ken,
First of all, my regards for your work. You chose a difficult task and I believe the community should thank you.
I have left CMU but I am still very interested in speech recognition. I hope these comments help you.
>Just read your article titled "Do we have a true open source dictation machine?" - great article.
:-) Ah, I love ranting too much... But I am glad you like it. Please make sure the disclaimer stays attached, though; I hope that this is interpreted as my own opinion.
>I hope you don't mind, but I posted it (in its entirety) on the VoxForge Web site (www.voxforge.org). VoxForge just happens to be trying to address one of the concerns you mentioned in your article about the lack of good acoustic models. We are collecting user submitted speech for incorporation into a GPL Acoustic Model. Currently we have a Julius/HTK AM being created daily, incorporating newly submitted audio on a nightly basis. One for Sphinx is planned.
I appreciate this. Though if someone in the community could create a model converter between HTK and Sphinx, I think in the end it wouldn’t matter.
>I have a question ...
>I am confused as to which approach to take in the creation of the VoxForge speech corpora. Up until now (and until I decide whether to change things, or not ...) I have been asking users to submit 'clean' speech - i.e. record their submission to ensure that all noise (i.e. non-speech noise such as echo, hiss, ...) is kept to an absolute minimum. One guy (very ingeniously I thought) records his submissions in his closet or in his car!
>But some people, whose opinions I respect, say that I should not be collecting clean speech, but collecting speech in its 'natural environment', warts and all, with echo and hiss and all that (but avoiding other background noise such as people talking or radios or TVs). On some recordings, the hiss is quite noticeable.
>What confuses me is that I see that some speech recognition microphones are sold with built-in echo and noise cancellation, and they say that this improves a (commercial) speech recognizer's performance. This indicates to me that I should be collecting clean speech, and then use noise reduction and echo cancellation front-end on the speech recognizer, because that is what commercial speech recognition engines seem to be doing.
>And further, if clean speech is required, should I be using noise reduction software on the submitted audio? My attempts at noise reduction have not been that successful, with the resulting 'musical noise' that replaces the removed noise giving me very poor recognition results.
This is a good question. The best answer to it is “it depends”. I prefer clean speech, but let me give you my view on why there is such a controversy in the first place.
Usually, the type of data you need to train an acoustic model depends on the noise conditions in which the recognizer will actually be used. If we consider the earliest days of commercial dictation development, they chose your path, that is, to use clean speech. In the past, and even in the present, this makes a lot of sense. The theory of speech recognition is essentially statistical, so one has to assume that the training and testing (user) environments actually match. At that point the assumption was that an “office” is an environment which has “no noise”. That’s probably why it was done this way.
Research-wise, this decision was made because noisy speech recognition at that point was not well understood. (Even now, it probably is not.) Business-wise, the decision was made because at that time there were only Dragon and VV in the market, so they could largely control what the users needed.
The research and business communities don’t usually spell out the reasoning. A practical way to think of it is this: when a model trained on clean speech is operated in a clean environment, the recognition rate is usually the best. That’s probably why people choose to collect clean data and assume the environment to be clean.
So what if there is noise? The first thing to know is that noisy speech recognition is difficult. It is probably an order of magnitude harder than speech recognition in general, so it is an area which requires specific consideration. Generally, when we hit noisy conditions with a clean-speech model, there are two ways to solve it: one is to collect noisy data (just as the folks suggested to you), the other is to devise techniques to deal with the noise. According to a conversation between me and Jim Baker (the founder of Dragon), what Dragon did at that time was to use audio books and carefully process them for training. When they hit a mismatch problem such as echo or air-conditioning noise, they devised several techniques which made recognition more robust in noise. To me, it means that if one can spend human time on improving the speech recognizer itself, noisy conditions can also be tackled. Another way to think about it is that because they had a model trained on clean speech, they could easily devise a new noise technique for the recognizer when they needed one.
Now for you, until you can collaborate with people who have the time and energy to devise such techniques, collecting only clean data will probably mean restricting the user’s environment. Many users don’t understand why it has to be done this way, so they might not like it. That’s probably the *true* reason why a user dislikes a clean model.
On the other hand, what’s the problem if you want to collect noisy data? This is usually tougher than one thinks. The major problem is that there is only one clean condition but there are many possible noisy conditions. Noisy conditions could be “a user is using a cell phone”, “a user is calling from a telephone”, “the environment has air conditioning”, or “a user is on the battlefield”.
A clean condition could simply be defined as one where “the background noise is inaudible”. The professional way to get it is to record the speech in a soundproofed environment. More usually, it is obtained by asking users to “operate the recognizer in a quiet environment”.
So, the industry regards noisy speech recognition as a tough problem. The AURORA evaluation is one attempt to standardize different noisy conditions. People try to deal with noise by categorizing it into different conditions. They further characterize noise by its magnitude relative to the clean speech (i.e. the Signal-to-Noise Ratio (SNR)). To me, these are quite artificial standards. That just shows people don’t really understand noisy speech recognition yet. One thing I learnt, though, is that training a recognizer using very noisy speech (-20 dB or lower) doesn’t give one reasonable recognition accuracy in any environment.
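As a rough illustration of the SNR measure mentioned above, here is a minimal sketch in Python. The function name `snr_db` and the synthetic sine-wave signals are assumptions made for the example; a real measurement would need separate speech and noise recordings (or a noise estimate taken from silent stretches). On this scale, -20 dB means the noise carries 100 times the power of the speech.

```python
import math


def snr_db(signal, noise):
    """Return the signal-to-noise ratio in decibels.

    `signal` and `noise` are sequences of audio samples (floats).
    SNR(dB) = 10 * log10(mean signal power / mean noise power).
    """
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    if signal_power == 0 or noise_power == 0:
        raise ValueError("signal and noise must both have non-zero power")
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical stand-ins: one second of a 440 Hz tone as "speech" and a
# much quieter 1000 Hz tone as "hiss", both sampled at 16 kHz.
rate = 16000
speech = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
hiss = [0.1 * math.sin(2 * math.pi * 1000 * t / rate) for t in range(rate)]

# About 20 dB: the hiss has 1/100 of the power of the speech.
print(round(snr_db(speech, hiss), 2))
```

Swapping the two arguments gives the negative value, which is the regime of "more noise than speech" that the -20 dB figure refers to.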
That’s probably all I will say about your problem. My take is this: if you don’t have many resources, training the model with clean speech is probably the better idea. This gives you a first system so that you can gather more data. (Say you build a dictation system: with the users’ permission, you could probably send their data to a server.) When gathering data under different conditions, you might need to consider it on a case-by-case basis. Say you see a lot of users hit noise A; then you probably want to collect more of that data.
-a
--- (Edited on 1/29/2007 10:14 am [GMT-0500] by kmaclean) ---
--- (Edited on 1/29/2007 10:16 am [GMT-0500] by kmaclean) ---
>I think I will take the following approach: I will collect speech from anyone who wants to supply it, but carefully classify it as either clean speech or noisy speech (I will likely use measurements that will be much more fine-grained than that, but you get the idea...). I do not want to dissuade anyone from contributing their speech. A big part of this is trying to create an active Free/Open Source Speech Recognition community that will be self-sustaining.
>I can then generate Acoustic Models from the audio set that makes the most sense (I realize that as I get more and more data, computation times will prevent this from easily being done, but I am hoping Moore's law will help me out in that respect - this is a long term project!). From the outset, I can see that there might be 2 broad groupings of audio that might be useful for speech recognition: one containing only clean speech and one containing 'cleaned' noisy speech (but not too noisy) and clean speech together.
>I agree with you (not that I am a speech recognition expert by any measure ...) that noiseless speech offers the most flexibility. Because on the recognition side, you have the option of running the audio to be recognized through filters/noise reduction algorithms, and on the Acoustic Model creation side, if you know the target environment beforehand, and can model it, run the clean audio through a simulation to 'add' the desired environmental noise to the audio before training the Acoustic Model. The problem will be to find algorithms to reduce the 'musical noise' generated by some noise reduction algorithms.
I think this is a good plan. Keeping the submission requirement is always good. My only comment is that you could also try to balance the amount of training data you use from each group after you obtain it. In this way, you keep control over what goes into each model.
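The simulation idea in the quoted plan above (running clean audio through a step that 'adds' the target environment's noise before training) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a description of any VoxForge tooling: the function name `mix_at_snr`, the plain sample lists, and the sine-wave stand-ins are all hypothetical, and a realistic simulation would also have to model echo and channel effects.

```python
import math


def mix_at_snr(speech, noise, target_snr_db):
    """Add `noise` to `speech`, scaled so the mix has the requested SNR.

    Both inputs are sequences of float samples at the same sample rate.
    The noise is trimmed to the length of the speech before mixing.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]

    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both have non-zero power")

    # Noise power needed for the target ratio: speech_power / 10^(SNR/10).
    wanted_noise_power = speech_power / (10 ** (target_snr_db / 10.0))
    # Amplitude gain is the square root of the power ratio.
    gain = math.sqrt(wanted_noise_power / noise_power)

    return [s + gain * n for s, n in zip(speech, noise)]


# Hypothetical stand-ins for a clean submission and a recorded noise sample.
rate = 16000
clean = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
fan = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(rate)]

# One clean recording can yield training copies at several noise levels.
noisy_versions = {snr: mix_at_snr(clean, fan, snr) for snr in (20, 10, 5)}
```

Because the clean original is kept, the same submission can be reused for any noise condition that turns out to matter later, which is the flexibility argument made in the quoted text.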
>I'd like to post this discussion on the VoxForge web site, please let me know if this is OK.
Ok.
>BTW, I actually do think that even if there were algorithms developed that could convert Sphinx AMs to HTK, Julius or ISIP, we would still have a problem to solve ... because the audio used to generate those AMs is 'closed source'. There may be differences on which Open Source license is the best to ensure that the source audio always remains accessible (for better or worse, I chose GPL ...), but I think that it is important that the source be available, you never know how it might be used to move Open Source Speech Recognition forward.
I think we are on the same page. I agree with you that having only the model, without the audio data, is not a long-term solution for Open Source.
This is a great conversation. If you need any help, just send me an email.
-a
--- (Edited on 1/29/2007 10:17 am [GMT-0500] by kmaclean) ---