VoxForge
Creating a new thread from comments made by David Gelbart in another thread:
Even if you have some targeted data, though, data from other, less targeted environments can still be useful.
For instance, my research group has built (with SRI) an experimental system for transcribing meetings, using recordings from mics on a meeting room table. We start off with a system which is trained on huge amounts of TELEPHONE conversations, which helps us get coverage of triphones and speaker variations. We then continue the training with meeting room data, which we have much less of. I don't recall that we've ever done an experimental comparison between this approach and training only on the meeting room data, but it's at the least our expectation that this approach beats training only on the meeting room data.
My attitude (which I believe is the norm among ASR engineers) is the more targeted data, the better for performance, although the performance curve may flatten due to diminishing returns.
For the sake of completeness, though, I want to give a theoretical counter-example to the idea that more targeted data is always better. This counter-example is based on two ideas: the targeted category is broad in some acoustic sense, and we are using an adaptation algorithm which can cope with that sort of broadness. (By 'adaptation algorithm' I mean algorithms such as MLLR speaker adaptation.) My counter-example is only for one specific kind of broadness and one specific adaptation algorithm, but I would not be surprised if it were possible to make the same sort of argument with other adaptation algorithms also.
Let's say that you are trying to target the broad category of non-headset-mic speech recognition in rooms which have widely varying degrees of reverberation. (I haven't done the acoustic calculations, but perhaps such a variation in reverberation level could happen if some have acoustic damping ceiling tiles and thick carpets, and some have no damping tiles and hardwood floors. Or perhaps not.) If you train on data from this broad category, there will be a sizable loss of sharpness (in other words, added variance) in the acoustic models due to the varying degrees of reverberation (although I suspect SAT would help preserve sharpness to an extent). Loss of sharpness is NOT necessarily a bad thing, since it's very important for the acoustic model to model variation, but unnecessary amounts of variance in the models can hurt performance, as I mentioned in my earlier comp.speech.research post regarding desktop vs. headset mics.
Now for the adaptation algorithm. Guenter Hirsch has recently developed an adaptation algorithm which (if I recall correctly) starts with a low-reverberation (thus, sharp) acoustic model, and then reverberates the model to match the specific degree of reverberation in the user's room. The resulting model can be sharper than a model trained on a lot of reverberant data with widely varying amounts of reverberation, and thus, it seems possible that it would perform better.
This counter-example doesn't change my "the more targeted data, the better for performance" attitude, because
1) It's just a theoretical counter-example. I don't recall ever coming across this effect in practice, either in my own work or when reading others' work. (I may post about this on comp.speech.research to see if anyone thinks differently, though.)
2) If you are worried about this kind of effect, you can always try it both ways and see what works better.
Regards,
David
--- (Edited on 2/ 8/2007 10:06 pm [GMT-0500] by kmaclean) ---
Hi David,
Just to clarify, did you 'adapt' your Telephony Acoustic Models using the meeting room audio you collected, or did you train your Acoustic Models using the Telephony and Meeting Room audio data mixed together? I assume you downsampled the Meeting Room audio to the same sampling rate and bits per sample as the Telephony audio, and thus created 8kHz-8bit AMs?
The reason I am asking is that I had an email discussion with Brough Turner about an old post he made on his blog called Large Speech Corpora. In it he discusses the use of podcasting audio as a possible source of audio data for creating Acoustic Models ('AM's) for Speech Recognition Engines. I commented that the use of lossy compressed audio (such as MP3 or Ogg recordings) was not a good source of audio for training AMs. However, he rightly notes that mp3 audio, although lossy, is probably better quality speech than what you would find in telephony audio (G.711) ...
Would this be a possible large source of audio to get coverage for triphones and speaker variation (similar to your telephone audio), that could then be used to either create an AM that is then adapted with headset microphone data, or used together with targeted audio for the creation of a new acoustic model? I've always been under the impression that lossy compressed audio is not suitable for AM creation.
thanks,
Ken
--- (Edited on 2/ 8/2007 10:54 pm [GMT-0500] by kmaclean) ---
> Just to clarify, did you 'adapt' your Telephony Acoustic Models
> using the meeting room audio you collected, or did you train
> your Acoustic Models using the Telephony and Meeting Room
> audio data mixed together?
The first one ('adapt'). I wasn't involved in that work myself, but I suppose it is much faster to do it that way, considering the huge amounts of telephony data involved.
If you want to read more about it, here are some references:
http://www.icsi.berkeley.edu/speech/papers/meeteval04_amispringer.pdf
http://www.icsi.berkeley.edu/speech/papers/icslp2004-meeting-system.pdf
http://www.icsi.berkeley.edu/ftp/global/pub/speech/papers/nist2005-meeting-system.pdf
> I assume you downsampled the [...]
We downsampled the Meeting Room audio to 8 kHz. Other than this, we did not do anything that I know of to make the Meeting Room audio more like telephone audio. We did use cepstral mean normalization throughout, which should have helped normalize for the spectral differences caused by the different transmission channels (microphone characteristics and room acoustics in the meeting rooms, and many different types of telephone connections and telephone equipment in the telephone data).
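To illustrate why cepstral mean normalization helps with channel differences, here is a minimal sketch. It rests on an idealised assumption: a fixed channel shows up as a constant additive offset on every cepstral frame. The feature values, the `channel` offset, and the function name `cepstral_mean_normalize` are made up for illustration and are not taken from the meeting recognition system described above.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean of each cepstral coefficient."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

# Toy "cepstra": 4 frames of 3 coefficients (made-up numbers).
clean = [
    [1.0, 0.5, -0.2],
    [0.8, 0.1, 0.3],
    [1.2, -0.4, 0.0],
    [0.9, 0.2, -0.1],
]

# Idealised channel: the same offset added to every frame (assumption).
channel = [0.7, -0.3, 0.25]
filtered = [[c + h for c, h in zip(frame, channel)] for frame in clean]

norm_clean = cepstral_mean_normalize(clean)
norm_filtered = cepstral_mean_normalize(filtered)

# Largest difference between the two normalized versions.
max_diff = max(
    abs(a - b)
    for fa, fb in zip(norm_clean, norm_filtered)
    for a, b in zip(fa, fb)
)
print(max_diff)
```

Under that assumption the two normalized utterances come out the same up to floating-point rounding. Real channels and reverberation are only approximately constant offsets, so in practice the normalization is partial.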
By 8-bit encoding I guess you mean u-law. As far as I know we did not convert the Meeting Room audio down to 8 bit u-law. Interesting idea, though. Using fewer bits for the encoding amounts to adding noise (quantization noise) to the data, and maybe that's a bad thing, but on the other hand, perhaps the telephony acoustic models have been affected by that noise in some systematic way that would make it worthwhile to treat the Meeting Room audio the same way, so that the Meeting Room audio matches the telephony models better? U-law has higher quantization error at higher loudness levels, so I guess louder phonemes will have more quantization noise, which could lead to more acoustic model variance for those phonemes. On the other hand, MFCC features use a logarithmic magnitude scale, which could counteract this effect. Also, the variance due to quantization noise may be quite swamped by other sources of variance. I'll ask on comp.speech.research about this, and I think I will try an experiment. Please check back with me in a few months about this. If you run any tests yourself, I would be curious to know the result.
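The point about u-law quantization error growing with loudness can be checked with a small sketch. It uses the textbook continuous mu-law companding formula with mu = 255 and a uniform quantizer on the compressed value, which is an approximation of the segmented G.711 codec rather than a bit-exact implementation. The sine-wave test signal and all function names are assumptions chosen for illustration.

```python
import math

MU = 255.0
LEVELS = 127  # quantization steps per polarity (sign takes the 8th bit)

def compress(x):
    """Continuous mu-law compression of x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize_8bit(x):
    """Compress, round to the nearest uniform level, then expand."""
    code = round(compress(x) * LEVELS)
    return expand(code / LEVELS)

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

def rms_error(amplitude, n=2000):
    """RMS quantization error for a sine wave with the given peak."""
    signal = [amplitude * math.sin(2 * math.pi * 7.3 * i / n) for i in range(n)]
    return rms([quantize_8bit(x) - x for x in signal])

for amp in (0.01, 0.1, 1.0):
    err = rms_error(amp)
    print(f"peak {amp:5.2f}  rms error {err:.6f}  error/peak {err / amp:.4f}")
```

The absolute error rises with the peak amplitude, while the last column shows how the error compares with the signal level, which is the quantity a logarithmic feature scale would be more sensitive to.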
> I've always been under the impression that lossy compressed
> audio is not suitable for AM creation.
I expect lossless audio is better. But then, 16 kHz audio is better than 8 kHz audio, yet my group used 8 kHz audio because we had loads of 8 kHz data. So if you can get lots of podcast data, it may help you.
> Would this be a possible large source of audio to get coverage
> for triphones and speaker variation (similar to your telephone
> audio), that could then be used to either create an AM that is
> then adapted with headset microphone data, or used together
> with targeted audio for the creation of a new acoustic model?
If you can get accurate transcriptions, I think it's well worth a try.
You may want to get in touch with Udhyakumar Nallasamy (http://udhyakumar.tripod.com/). He has done some experiments on the effect of MP3 coding on speech recognition. I have some of his results, but I don't know whether he used normalization methods such as cepstral mean subtraction, and without that information (along with information about the length of time the cepstral mean was calculated over, if CMS was used) it's harder for me to interpret the results. Also, the results I have don't cover increasing the amount of training data by mixing compressed and uncompressed data, which is what you are really interested in.
Regards,
David Gelbart
--- (Edited on 2/ 9/2007 8:06 pm [GMT-0600] by Visitor) ---
> By 8-bit encoding I guess you mean u-law. As far as I know we
> did not convert the Meeting Room audio down to 8 bit u-law.
> Interesting idea, though.
It looks like my research group is going to try your idea. I expect we'll have the results within two months. However, even if this doesn't help our performance, it may still help yours, since our ASR system for meeting recognition is a tough baseline to improve on because it's a slow-running, multi-pass system with lots of features.
You can find code for u-law conversion in sox and other places. There is even official code from the ITU at
http://www.itu.int/rec/T-REC-G.191/en
The STL2005 ITU code costs money to download, but it's redistributable under a roughly BSD-style license. The license appears in a PDF file that comes with the download. The STL1996 ITU code can be downloaded for free and is under the same license or a similar one.
--- (Edited on 2/10/2007 6:31 pm [GMT-0600] by Visitor) ---
Hi David,
Sorry for the delay in getting back to you ... my replies are below:
>By 8-bit encoding I guess you mean u-law. As far as I know we did not convert the Meeting Room audio down to 8 bit u-law.
When I was talking about "8kHz-8bit AMs", I should have clarified that I meant an 8kHz sampling rate at 8 bits per sample (called 'data size' in sox terminology).
My understanding is that a POTS (Plain Old Telephone Service) land-line telephone has a bandwidth limit of 64 kbps. Therefore any audio captured from a POTS line is limited to a sampling rate of 8kHz at 8 bits per sample (8000 samples per second * 8 bits per sample = 64000 bps = 64 kbps), regardless of encoding. Your Meeting Room Audio data would likely have been recorded at a 16kHz sampling rate (or higher) at 16 bits per sample. I was just wondering if you downsampled your Meeting Audio data (from 16kHz to 8kHz) and reduced the bits per sample (from 16 bits to 8 bits). I have always assumed that the sample rate and bits per sample of the audio used in the creation of the AM had to match the target audio to be recognized. I did not do any testing to support this view; I just assumed that was the way it was done. Based on your comments that does not seem to be the case, and thinking about it, it makes sense, because it is just the resolution of the data that is different.
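The bit-rate arithmetic above can be written out as a short sketch. The helper name `pcm_bit_rate` is hypothetical, and the 16kHz/16-bit figure for the meeting recordings is only the guess made in this post, not a confirmed value.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Raw bit rate of an uncompressed audio stream, in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

telephone = pcm_bit_rate(8000, 8)       # 8 kHz, 8 bits per sample
meeting = pcm_bit_rate(16000, 16)       # assumed 16 kHz, 16 bits per sample
downsampled = pcm_bit_rate(8000, 16)    # 8 kHz, still 16 bits per sample

print(telephone)    # 64000
print(meeting)      # 256000
print(downsampled)  # 128000
```

Downsampling alone halves the rate; only reducing the bits per sample as well brings it down to the 64 kbps telephone figure.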
In addition, I was not even thinking about telephony data encoding (i.e. your reference to u-law). In my experiments with Julius, HTK and Asterisk I was using PCM data encoding. Though I am glad that my question helped you look in a new direction with respect to your work.
>I expect lossless audio is better. But then, 16 kHz audio is better than 8 kHz audio, [...] So if you can get lots of podcast data, it may help you.
Thanks, that's good to know. VoxForge will likely stick with uncompressed (and lossless compressed) audio for the near term. However, Librivox and Gutenberg have loads of transcribed audio in various compressed formats, and using these in the same way you used your telephony audio would be a good experiment to see if they could help improve the VoxForge AM triphone coverage.
>You may want to get in touch with Udhyakumar Nallasamy. He has done some experiments on the effect of MP3 coding on speech recognition.
Thanks, I will get in touch with him.
All the best,
Ken
--- (Edited on 2/12/2007 3:23 pm [GMT-0500] by kmaclean) ---
>It looks like my research group is going to try your idea. I expect we'll have the results within two months.
Now that's service!
As I alluded to in my previous email, I never really got into different encodings; I just made sure that the sample rate and bits per sample of the audio used to create the AM matched the target speech audio to be recognized. In my experiments with Asterisk a while back, I used 8kHz sampling rate at 16 bits per sample and PCM data encoding. I think Asterisk converts everything to PCM internally, so the different encoding methods were transparent to me, at least on the interface I was using to get the call audio to a Julius speech recognition server. In addition, I think Julius only accepts audio at 16 bits per sample, so it worked out well that Asterisk provided this conversion internally.
Ken
--- (Edited on 2/12/2007 3:34 pm [GMT-0500] by kmaclean) ---
Hey Ken,
What were the results of your Asterisk + Julian tests? I noticed that the default Asterisk U-law codec (you referred to it as 'PCM', I think) is only 8 bit 8 kHz. But Julius is 16 bit 8 kHz. Won't that cause problems?
--- (Edited on 4/ 3/2007 11:37 am [GMT-0500] by trevarthan) ---
Hi Jesse,
It's been a while (Asterisk version 1.0.9 I think), but from what I remember, Asterisk converts all audio channels to 8kHz-16bit PCM (on FD 3 using AGI) internally, so codec conversion really was not an issue for speech recognition with Julian. Recognition was 'OK', but as I said, I did not have a good enough Acoustic Model at the time.
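For anyone wondering how 8-bit u-law audio can end up as 16-bit PCM, here is a minimal sketch of the standard G.711 mu-law expansion of one byte into a signed linear sample. It is not Asterisk's own code, and the function names are hypothetical. It only shows that the conversion is a fixed per-sample mapping.

```python
BIAS = 0x84  # constant used by the G.711 mu-law segment layout

def ulaw_to_linear(code):
    """Expand one 8-bit mu-law byte to a signed linear PCM sample."""
    code = ~code & 0xFF            # mu-law bytes are stored bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07  # segment number
    mantissa = code & 0x0F         # position within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude

def decode(ulaw_bytes):
    """Decode a sequence of mu-law bytes into linear samples."""
    return [ulaw_to_linear(b) for b in ulaw_bytes]

samples = decode(bytes([0xFF, 0x7F, 0x80, 0x00]))
print(samples)  # [0, 0, 32124, -32124]
```

Every decoded value fits in a signed 16-bit integer, so a recognizer that expects 16 bits per sample can take the expanded stream directly, although the audio still carries only the resolution of the original 8-bit codes.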
Asterisk now has an API for Speech Recognition, so I am not sure how that changes things.
Ken
--- (Edited on 4/ 3/2007 9:14 pm [GMT-0400] by kmaclean) ---