VoxForge
Hello all,
I'm new here. Lately I've been absorbing everything I can get my hands on regarding speech recognition for a few projects I have in mind. I've read Arthur Chan's letter to the community and I agree with his assessment. In addition, I'm hoping this site will be instrumental in bringing the community its first open source text/speech corpus.
I spent more than a few hours playing with Sphinx in its various incarnations tonight and well ... it's a start. I'll likely be downloading from this site soon in an attempt to make my own acoustic model. (Right after I get a decent mic.)
But anyway, on to the reason for this message:
Why don't we auto generate a couple hundred hours of text to speech from festival and any cheap commercial text-to-speech engine out there that we can get our hands on?
Sure, it won't be as good as a human-read corpus, but with all the different voices and accents (i.e. British vs. American, male vs. female) it's got to be better than nothing, right? Besides, is bad speech really undesirable? Don't we want less-than-perfect samples? Won't that make the model more robust?
What do you think? Cheap couple hundred hours?
--- (Edited on 3/31/2007 2:11 am [GMT-0500] by Visitor) ---
Hi trevarthan,
Thanks for the great question.
Speech Recognition is difficult. The general rule is that you need to train your Acoustic Models with audio as close as possible to the type of speech that your Speech Recognition System will be recognizing. If you train with Speech generated from a Text-to-Speech engine, your Acoustic Models will be great at recognizing speech generated from a Text-to-Speech Engine, not so good at recognizing real speech.
We've got access to tons of speech from LibriVox. It is 128 kbit/s MP3 audio, which is not perfect, but should be good enough for our purposes.
My understanding of Text-to-Speech systems (such as Festival) is that they use a couple of different approaches to generate speech. One is diphone-based speech generation, which requires a person (usually an actor) to record many nonsense words, to capture as many of the diphones of the English language as possible (we use triphones for the VoxForge Acoustic Models). These recordings would not give us adequate coverage of all the triphones in the English language, and would not necessarily give us the correct context for the triphones. In addition, such systems only have a few voices - this is not good for us, since we need audio from many people.
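To make the coverage point concrete, here is a minimal sketch of how you might measure the triphone coverage of a prompt set. The phone sequences below are toy, hand-written stand-ins; a real pipeline would look each word up in a pronunciation dictionary such as CMUdict.

```python
def triphones(phones):
    """Yield each (left, center, right) triphone context in a phone sequence,
    padding the edges with silence ("sil")."""
    padded = ["sil"] + list(phones) + ["sil"]
    for i in range(1, len(padded) - 1):
        yield tuple(padded[i - 1 : i + 2])

# Toy "corpus": hypothetical phone sequences for a few prompts.
prompts = [
    ["hh", "ah", "l", "ow"],                          # "hello"
    ["w", "er", "l", "d"],                            # "world"
    ["hh", "ah", "l", "ow", "w", "er", "l", "d"],     # "hello world"
]

seen = set()
for phones in prompts:
    seen.update(triphones(phones))

print(f"distinct triphones covered: {len(seen)}")  # → 10
```

With roughly 40 English phones there are tens of thousands of possible triphones, which is why a small set of TTS voices recorded from diphone prompts cannot cover them all in realistic contexts.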
In a TTS engine that only uses diphones to generate speech, the system must come up with different tricks (or heuristics, or rules, ...) to compensate for situations where diphones do not fully describe the sound required to make a word intelligible. The human brain can sort of "fill in the blanks" of the sounds generated from a Text-to-Speech engine, so the TTS does not have to be perfect. Speech Recognition Engines are not so good at doing this.
Another approach used by Festival is the HTS engine (replacing the diphone engine), which essentially uses Acoustic Models to generate speech (the reverse of how Speech Recognition engines use HMMs). The generated speech is only as good as the speech used to create the Acoustic Model in the first place - if you don't have enough speech data, the generated speech does not sound very good. Again, this speech generally comes from only a few people - we need many.
In the VoxForge Acoustic Model, we actually already incorporate much of the speech audio that was used to generate most of the voices provided by Festival.
So the general rule is that it is better to use speech from real people, as many of them as possible, using uncompressed audio recording formats if possible, reading as many different texts as possible.
Another important factor in using speech audio from real people is that we then have the source for the derivative Acoustic Models. You never know how someone might use the audio to create new and improved Acoustic Models.
Regardless, I have been wrong before, and you might want to try out a small-scale system: use your own voice to train a text-to-speech engine (the FestVox site provides instructions on how to do this), then use the generated TTS voice to train an Acoustic Model, and see how well it recognizes your voice. It might be easier to get the audio for a specific Festival voice and create an AM from the Festival output to see how it works, but this would not be as good a test, because you would be trying to recognize the very audio the Festival voice was trained with. You should not need much audio (30 minutes or so) for a quick test to determine whether your approach is worth pursuing. The VoxForge site provides a tutorial and a how-to showing you how to create your own Acoustic Model.
Hope this helps,
Ken
--- (Edited on 3/31/2007 3:15 pm [GMT-0400] by kmaclean) ---
"If you train with Speech generated from a Text-to-Speech engine, your Acoustic Models will be great at recognizing speech generated from a Text-to-Speech Engine, not so good at recognizing real speech."
Well, let's assume that it would only improve recognition quality for TTS, as you say. This doesn't make sense to me, but I'll take your word for it as I'm not as familiar as I'd like to be with the internal workings of these speech recognition engines. Would training on a combined TTS sample and human sample detract from the quality of the human voice recognition as opposed to training on just the human part of the sample? Or would it simply improve the engine's ability to recognize TTS?
I hope that it's an additive process (similar to the way humans learn to recognize speech: we can learn to understand different dialects of English without hindering our ability to recognize dialects we already know). If not, that's too bad, and it makes our job a lot more difficult, because people speak with very different dialects in different geographic areas.
On a side note, I'm also surprised to learn that changing the sample rate of the audio is unacceptable (i.e. phone system audio at 8 kHz vs. mic audio from a sound card at 16 kHz). I had hoped that speech engines would be intelligent enough to abstract the recognition process from the audio format. But apparently that isn't the case. I hope this limitation can be overcome in the future. It makes speech recognition a lot less flexible.
I'm also surprised to learn that you're using MP3 audio samples for training. If the sample rate can't be changed without hindering recognition performance, what would audio compression do?
--- (Edited on 3/31/2007 9:47 pm [GMT-0500] by Visitor) ---
Hi,
My comments follow:
>Well, let's assume that it would only improve recognition quality for TTS, as you say. This doesn't make sense to me, but I'll take your word for it as I'm not as familiar as I'd like to be with the internal workings of these speech recognition engines.
It's good to be skeptical ... playing around with the VoxForge Acoustic Model Creation Tutorial will help you get up to speed. Once you are comfortable with things, try some experiments with different sources of audio - don't take my word as gospel.
>Would training on a combined TTS sample and human sample detract from the quality of the human voice recognition as opposed to training on just the human part of the sample? Or would it simply improve the engine's ability to recognize TTS?
I don't know. I'm not sure anyone has tried this before. It would make for a great experiment. Logically speaking, you would think it would be an additive process (as you have stated). However, when I played around with noise removal on some user submissions and integrated them into the VoxForge Corpus, it caused "artifacts" to appear in the resulting audio that prevented recognition from working properly. See these posts for more information:
What are Best Practices for Collecting Speech for a Free GPL Speech Corpus?
More on Collecting Speech Audio for Free GPL Speech Corpus
Comments on: "A good acoustic model needs to be trained with speech recorded in the environment it is targeted to recognize"
I think that using TTS speech audio would cause similar artifacts to appear, with the same resulting degradation in speech recognition.
>I hope that it's an additive process (Similar to the way that humans learn to recognize speech in that we can learn to understand different dialects of English without hindering our ability to recognize dialects we already know).
A Speech Recognition Engine *can* make adjustments for differences in dialects, given enough speech audio (hundreds of hours). The problem is that speech generated from a TTS engine has "missing information": it's not close enough to the real thing, and a speech recognition engine would likely have trouble using it to recognize real speech.
>I'm also surprised to learn that changing the sample rate of the audio is unacceptable (i.e. phone system audio at 8 kHz vs. mic audio from a sound card at 16 kHz)
You *can* downsample from 16 kHz to 8 kHz, and an acoustic model generated from this audio works fine for recognizing audio at an 8 kHz sampling rate (that's what we do to create the VoxForge Acoustic Models). But you cannot use upsampled audio data in the creation of Acoustic Models (this post on the Asterisk site provides a brief explanation).
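As a minimal illustration of the downsampling direction that *does* work, the sketch below synthesizes one second of 16 kHz audio and decimates it to 8 kHz using only the Python standard library. The file names are hypothetical, and the 2:1 decimation is deliberately naive; a real pipeline would use a tool like sox (e.g. `sox mic_16k.wav -r 8000 phone_8k.wav`), which low-pass filters first to avoid aliasing.

```python
import math
import struct
import wave

rate_in, rate_out = 16000, 8000

# One second of a 440 Hz tone at 16 kHz as a stand-in "microphone" recording.
samples = [int(10000 * math.sin(2 * math.pi * 440 * n / rate_in))
           for n in range(rate_in)]

with wave.open("mic_16k.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 16-bit PCM
    w.setframerate(rate_in)
    w.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Downsample 16 kHz -> 8 kHz by keeping every second sample.
# (Naive decimation for illustration only; real resamplers filter first.)
with wave.open("mic_16k.wav", "rb") as r:
    pcm = struct.unpack("<%dh" % r.getnframes(), r.readframes(r.getnframes()))
down = pcm[::2]

with wave.open("phone_8k.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(rate_out)
    w.writeframes(struct.pack("<%dh" % len(down), *down))

with wave.open("phone_8k.wav", "rb") as check:
    print(check.getframerate(), check.getnframes())   # → 8000 8000
```

The reverse direction (upsampling 8 kHz audio to 16 kHz) cannot restore the high-frequency content that was never captured, which is why upsampled audio is unsuitable for Acoustic Model training.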
>I'm also surprised to learn that you're using MP3 audio samples for training. If the sample rate can't be changed without hindering recognition performance, what would audio compression do?
This was discussed in this thread (see Tony Robinson's second post).
Ken
--- (Edited on 4/ 1/2007 12:39 pm [GMT-0400] by kmaclean) ---
2019 update:
paper: Almost Unsupervised Text to Speech and Automatic Speech Recognition
samples: https://speechresearch.github.io/unsuper/
--- (Edited on 5/27/2019 6:42 pm [GMT-0400] by kmaclean) ---