VoxForge
Re: More on Collecting Speech Audio for Free GPL Speech Corpus
I am somewhat more optimistic than Prof. Picone about using noise reduction as a preprocessor. Both finalist entries in the ETSI Aurora competition on speech recognition in noise used a noise reduction stage. The noise reduction stage from one of those systems (http://www.icsi.berkeley.edu/speech/papers/qio/) has also been used successfully to remove background noise from meeting room recordings prior to speech recognition. When I have listened to its output, I have never heard any tonal noises, although I am not sure about the fricative problem Prof. Picone mentioned.
However, the algorithm is designed for noise whose magnitude spectrum is roughly constant over time (you can check for this by looking at a spectrogram). So I believe it is well suited to fan noise, for example, but not to background speech, keystrokes, or music. (It might be well suited to the hiss you've observed, depending on the spectrogram of the hiss.)
Also, the algorithm must be given some audio that contains only the background noise, with no foreground speech, so that it can estimate the noise magnitude spectrum. What if that audio contains an inappropriate kind of background noise, such as background speech? In that case the noise suppression might actually make performance worse, and that may be one reason for Prof. Picone's skepticism.
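To make the two requirements above concrete (a noise-only segment, and a noise spectrum that stays roughly constant over time), here is a minimal sketch in Python with NumPy. It uses plain magnitude spectral subtraction as a stand-in; it is not the algorithm from the system linked above. All function names, the frame length, hop size, and spectral floor are my own assumptions for illustration.

```python
import numpy as np

def stft_frames(x, frame_len=256, hop=128):
    """Split x into overlapping Hann-windowed frames and return their FFTs."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def estimate_noise_spectrum(noise_only, frame_len=256, hop=128):
    """Average magnitude spectrum of a segment assumed to hold only noise."""
    return np.abs(stft_frames(noise_only, frame_len, hop)).mean(axis=0)

def stationarity_score(noise_only, frame_len=256, hop=128):
    """Rough check (hypothetical metric): per-bin std/mean of the magnitude
    over time, averaged over bins. Lower means the spectrum is steadier."""
    mag = np.abs(stft_frames(noise_only, frame_len, hop))
    return float(np.mean(mag.std(axis=0) / (mag.mean(axis=0) + 1e-12)))

def spectral_subtract(noisy, noise_mag, frame_len=256, hop=128, floor=0.05):
    """Subtract the noise magnitude estimate in each frame, keep the noisy
    phase, and resynthesize by overlap-add."""
    spec = stft_frames(noisy, frame_len, hop)
    mag = np.abs(spec)
    # The floor keeps a small fraction of the original magnitude so that
    # bins are never driven to zero or below.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    clean_spec = clean_mag * np.exp(1j * np.angle(spec))
    frames = np.fft.irfft(clean_spec, n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        # Hann windows at 50% overlap sum to approximately one.
        out[i * hop : i * hop + frame_len] += frame
    return out

# Synthetic stand-ins for real recordings.
rng = np.random.default_rng(0)
fan_like = 0.1 * rng.standard_normal(16000)          # steady noise
bursty = fan_like * np.repeat([0.01, 1.0], 8000)     # level changes over time

noise_mag = estimate_noise_spectrum(fan_like[:4000])
denoised = spectral_subtract(fan_like[4000:], noise_mag)
score_steady = stationarity_score(fan_like)
score_bursty = stationarity_score(bursty)
```

On the synthetic signals, the steady noise should get a lower score than the bursty one. If the "noise-only" segment actually contained background speech, `noise_mag` would be a poor estimate and the subtraction could remove parts of the foreground speech instead.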
"This indicates to me that I should be collecting clean speech, and then use a noise reduction and echo cancellation front-end on the speech recognizer" If you want to use noise reduction, I recommend that you collect training data that matches the intended application as much as possible, including noises, and then apply noise reduction both to the training data and in the application. This is because there will be residual noise left after the noise reduction has run, and this residual noise should be taken into account during acoustic model training. Note that this means you will collect the same training data whether or not you will use noise reduction. So you don't need to decide now whether you will use noise reduction. You could even build two acoustic models from the same training data, one with noise reduction and one without, and offer users the choice.
--- (Edited on 2/13/2007 6:32 pm [GMT-0600] by Visitor) ---