VoxForge
My email to Joe Picone, ISIP (Institute for Signal and Information Processing)
Hi Joe,
I am the admin with VoxForge (www.voxforge.org).
We collect user submitted speech for
incorporation into a GPL Acoustic Model. Currently we build a Julius/HTK
Acoustic Model (AM) that incorporates newly submitted speech audio
on a nightly basis. We plan to create AMs for ISIP and Sphinx.
We corresponded a while back on licensing of the Switchboard corpus.
I have a question ...
I am confused as to which approach to take in the creation of the
VoxForge speech corpora. Up until now (and until I decide
whether to change things, or not ...) I have been asking users to
submit 'clean' speech - i.e. record their submissions so that all
non-speech noise (echo, hiss, ...) is kept to an
absolute minimum. One guy (very ingeniously, I thought) records
his submissions in his closet or in his car!
But some people, whose opinions I respect, say that I should not be
collecting clean speech, but collecting speech in its 'natural
environment', warts and all, with echo and hiss and all that (but
avoiding other background noise such as people talking or radios or
TVs). On some recordings, the hiss is quite noticeable.
What confuses me is that I see that some speech recognition microphones
are sold with built-in echo and noise cancellation, and they say that
this improves a (commercial) speech recognizer's performance.
This indicates to me that I should be collecting clean speech, and then
use a noise reduction and echo cancellation front-end on the speech
recognizer, because that is what commercial speech recognition engines
seem to be doing.
And further, if clean speech is required, should I be using noise
reduction software on the submitted audio? My attempts at noise
reduction have not been that successful, with the resulting 'musical
noise' that replaces the removed noise giving me very poor recognition
results.
I was wondering what your thoughts on this might be,
thanks for your time,
regards,
Ken MacLean
--- (Edited on 2/13/2007 10:47 am [GMT-0500] by kmaclean) ---
I think I was one of those who recommended "avoiding other people talking or radios or TVs". But I was sloppy to do so in that fashion. I should have written in terms of signal-to-noise ratio (SNR), the ratio of speech power to noise power. I suspect (although I'm not completely certain) that when building an acoustic model for dictation, it's not a good idea to include audio which has a SNR too low for dictation to be feasible, since that increases acoustic model variance while not necessarily making the acoustic model better prepared for feasible test data. So I should have recommended against including those noise types when they are too loud, not against including them period.
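To make the SNR notion above concrete, here is a minimal sketch (pure Python; the function name and the toy sample sequences are my own, for illustration, not from any speech toolkit) of how SNR in decibels is computed from the average power of the speech and of the noise:

```python
import math

def snr_db(speech_samples, noise_samples):
    """Signal-to-noise ratio in dB: 10 * log10(P_speech / P_noise),
    where each power is the mean of the squared samples."""
    p_speech = sum(s * s for s in speech_samples) / len(speech_samples)
    p_noise = sum(n * n for n in noise_samples) / len(noise_samples)
    return 10.0 * math.log10(p_speech / p_noise)

# Toy example: speech at amplitude 1.0 over noise at amplitude 0.1
# gives a power ratio of 100, i.e. 20 dB.
speech = [1.0, -1.0] * 100
noise = [0.1, -0.1] * 100
print(round(snr_db(speech, noise), 1))  # 20.0
```

In real use the speech and noise powers would be estimated from speech-active and speech-free stretches of the same recording.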
--- (Edited on 2/13/2007 5:45 pm [GMT-0600] by Visitor) ---
I feel like elaborating a bit on what I wrote earlier:
"I suspect ...it's not a good idea to include audio which has a SNR too low for dictation to be feasible"
Today's speech recognition software needs a much higher minimum SNR to take dictation than a human listener taking dictation does.
But actually, the SNR that can be handled depends on the type of noise, because some noises are easier for the computer to deal with than others. Background speech is particularly hard.
Dictation is very challenging for the computer since there are so many possible sentences that the user might say. A small-vocabulary command-and-control application is easier for the computer, so it can work at a lower SNR.
--- (Edited on 11/29/2007 6:00 pm [GMT-0600] by DavidGelbart) ---
I seem to be somewhat more optimistic than Prof. Picone regarding the use of noise reduction as a preprocessor. Both of the finalist entries in the ETSI Aurora competition on speech recognition in noise used a noise reduction stage. The noise reduction stage from one of those systems (http://www.icsi.berkeley.edu/speech/papers/qio/) has also been used successfully to remove background noise from meeting room recordings prior to speech recognition. When I have listened to its output, I've never heard any tonal noises, although I am not sure about the fricative problem mentioned by Prof. Picone.

However, the algorithm is designed for noise which has a magnitude spectrum that is roughly constant over time (you can check for this by looking at a spectrogram). So I believe it is well suited for fan noise, for example, but not appropriate for background speech, keystrokes, or music. (It might be well suited for the hiss you've observed, depending on the spectrogram of the hiss.)

Also, the algorithm must be given some audio data which contains only the background noise, and no foreground speech, so that it can estimate the noise magnitude spectrum. What if that audio data contains an inappropriate kind of background noise, such as background speech? Perhaps the noise suppression would actually make performance worse, in that case, and that may be one reason for Prof. Picone's skepticism.
"This indicates to me that I should be collecting clean speech, and then use a noise reduction and echo cancellation front-end on the speech recognizer" If you want to use noise reduction, I recommend that you collect training data that matches the intended application as much as possible, including noises, and then apply noise reduction both to the training data and in the application. This is because there will be residual noise left after the noise reduction has run, and this residual noise should be taken into account during acoustic model training. Note that this means you will collect the same training data whether or not you will use noise reduction. So you don't need to decide now whether you will use noise reduction. You could even build two acoustic models from the same training data, one with noise reduction and one without, and offer users the choice.
--- (Edited on 2/13/2007 6:32 pm [GMT-0600] by Visitor) ---
I was reading the various messages on the topic and decided to add my 2 cents to the debate.
According to wave theory in physics, waves are superposable, and thus the result of a superposition is a complex wave. The Fourier transform is based on this principle and produces a spectrum of all the sinusoidal components (frequencies and amplitudes) present in the compound wave.
Following the theory, the principle works both ways: addition AND subtraction of waves. Thus the problem is not recording noisy speech, but how to effectively subtract the unwanted noise, background information, etc.
The ideal trick for this job is to have 2 sound sources, one close to the mouth - to capture voice - and another one away from it - to capture the background sounds.
Then subtract. This might not be a reality yet, but I am convinced that it will come ... we are actually working on it...
Thus in my opinion the Corpus should only contain CLEAN stuff.
Which, by the way, can lead to another interesting approach: the corpus data can be "polluted" with background noises (as a pre-processing step, by applying the same wave theory) and then matched against the incoming waves... which is effectively what is being suggested in one of the previous posts!
Essentially we need to treat 2 channels independently : the voice channel AND the background channel... I'm convinced that this is the holy grail.
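A toy, pure-Python sketch of the spectral subtraction idea described above (the DFT helpers and the constant "hiss" are illustrative only; real systems use FFTs, overlapping windowed frames, and an over-subtraction factor):

```python
import cmath
import math

def dft(x):
    # Discrete Fourier transform (naive O(N^2) version, for clarity).
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    # Inverse DFT, returning real samples.
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

def spectral_subtract(noisy_frame, noise_mag):
    """Subtract an estimated noise magnitude spectrum from one frame,
    keeping the noisy phase. The hard floor at zero is one source of
    the 'musical noise' artifact mentioned earlier in the thread."""
    out = []
    for k, Xk in enumerate(dft(noisy_frame)):
        mag = max(abs(Xk) - noise_mag[k], 0.0)
        out.append(cmath.rect(mag, cmath.phase(Xk)))
    return idft(out)

# Toy demo: a cosine "voice" plus a constant (DC) "hiss". The noise
# magnitude spectrum is estimated from a noise-only stretch of audio.
N = 8
tone = [math.cos(2 * math.pi * n / N) for n in range(N)]
hiss = [0.5] * N
noise_mag = [abs(c) for c in dft(hiss)]
cleaned = spectral_subtract([t + h for t, h in zip(tone, hiss)], noise_mag)
# cleaned is now numerically very close to the original tone
```

This only works cleanly here because the "noise" occupies different frequency bins than the tone; with noise that overlaps the speech spectrum (background talkers, music), subtraction inevitably damages the speech too.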
serial_strat
--- (Edited on 4/ 3/2007 10:16 am [GMT-0500] by Visitor) ---
Hi serial_strat,
thanks for the post,
>The ideal trick for this job is to have 2 sound sources, one close to the mouth - to capture voice - and another one away from it - to capture the background sounds. Then subtract. This might not be a reality yet but I am convinced that it will come ... we are actually working on it...
I have also seen references to cellphones that use this approach, but they are not yet a commercial reality. I assume computer headsets will someday use something similar.
>Thus in my opinion the Corpus should only contain CLEAN stuff.
Most of the audio on LibriVox is pretty clean. Since all the audio on VoxForge is stored in SVN, we can easily create sets of audio for different purposes and hedge our bets - i.e. one tag (or set) for all clean audio, one for all noisy audio, and another for all audio.
Ken
--- (Edited on 4/ 3/2007 9:51 pm [GMT-0400] by kmaclean) ---
Sorry for replying to a 2 year old statement. This was taken from http://www.voxforge.org/home/forums/message-boards/audio-discussions/more-on-collecting-speech-audio-for-free-gpl-speech-corpus
which was an email to Joe Picone, ISIP (Institute for Signal and Information Processing)
>And further, if clean speech is required, should I be using noise reduction software on the submitted audio? My attempts at noise reduction have not been that successful, with the resulting 'musical noise' that replaces the removed noise giving me very poor recognition results.
I have taken quite noisy WAV files and run them through Adobe SoundBooth CS4's noise reduction.
You can adjust the trade-off between residual noise and musical noise to a middle ground that is 'acceptable.'
I am not a salesman for Adobe. I am, however, struggling to create a quality ASR system for a hospital in southern India.
I'm in quite a naturally noisy environment, and the doctors will be using inexpensive, noisy headset microphones.
A sound engineer told me you should record in the best environment possible: with quality data you can do whatever you want afterwards.
But the point of an effective ASR system is to understand human speech, and here that will have to happen at a hospital under working conditions.
If I use a noise-cancelling filter for one user, the software will have to adapt to each user's different conditions, which will require different filters.
How do you incorporate that?
It does leave me a little bewildered at times.
Thanks for your thoughts,
Chetanji
--- (Edited on 2/6/2009 3:57 am [GMT-0600] by Visitor) ---
Hi Chetanji,
>If I use a noise-cancelling filter for one user, the software will have to
>adapt to each user's different conditions, which will require
>different filters.
As I stated in this post:
Theoretically, you might be able to use two microphones, one to recognize the target speech and another to pick up the background noise, which you would feed into the Julius spectral subtraction algorithm. I have read of noise-cancelling headsets that use two microphones in a similar way.
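A toy illustration of the two-microphone idea (my own simplification; a real system would estimate an adaptive filter per frequency, and Julius's spectral subtraction works on magnitude spectra rather than raw samples):

```python
def two_mic_denoise(primary, reference, gain=1.0):
    """Naive two-microphone cancellation: subtract a scaled copy of
    the reference (noise-only) channel from the primary (speech +
    noise) channel. In practice the right gain differs per frequency
    and over time, which is why real systems use adaptive filtering."""
    return [p - gain * r for p, r in zip(primary, reference)]

# The primary mic hears speech plus noise; the reference mic, placed
# away from the mouth, ideally hears only the noise.
speech = [0.0, 1.0, -1.0, 0.5]
noise = [0.2, 0.2, 0.2, 0.2]
primary = [s + n for s, n in zip(speech, noise)]
cleaned = two_mic_denoise(primary, noise)
# cleaned recovers the speech samples (up to floating point error)
```

The catch, as Chetanji's question suggests, is that the reference channel never contains exactly the noise heard at the primary mic, so per-user and per-room adaptation is still needed.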
Ken
--- (Edited on 2/6/2009 10:26 am [GMT-0500] by kmaclean) ---