VoxForge
Hi Ken,
I just made some short recordings, but used a short piece of classical
music so that the input would be the same for every recording. That way
you'll be judging the recording quality rather than my voice.
They're not that big, but I posted them on the web anyway. I don't
want to swamp your inbox.
You can find them at:
...
Enjoy your New Year's Eve!
Robin
--- (Edited on 1/28/2007 9:15 am [GMT-0500] by kmaclean) ---
--- (Edited on 1/28/2007 9:16 am [GMT-0500] by kmaclean) ---
--- (Edited on 1/28/2007 9:17 am [GMT-0500] by kmaclean) ---
--- (Edited on 1/28/2007 9:18 am [GMT-0500] by kmaclean) ---
$ arecord --list-devices
card 0: I82801DBICH4 [Intel 82801DB-ICH4], device 0: Intel ICH [Intel 82801DB-ICH4]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 0: I82801DBICH4 [Intel 82801DB-ICH4], device 1: Intel ICH - MIC ADC [Intel 82801DB-ICH4 - MIC ADC]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 0: I82801DBICH4 [Intel 82801DB-ICH4], device 2: Intel ICH - MIC2 ADC [Intel 82801DB-ICH4 - MIC2 ADC]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 0: I82801DBICH4 [Intel 82801DB-ICH4], device 3: Intel ICH - ADC2 [Intel 82801DB-ICH4 - ADC2]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 1: default [Samson C01U ], device 0: USB Audio [USB Audio]
Subdevices: 1/1
Subdevice #0: subdevice #0
Then execute the arecord command again, but this time specifying the USB microphone detected in your listing (Card #1 Device #0 in this example):
$ arecord -f dat -D hw:1,0 -d 5 test.wav
Let me know how you make out.
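If it helps, here is a small sketch for checking what actually ended up in test.wav after the arecord run. It only reads the WAV header using Python's standard wave module. The function name describe_wav is made up for this example. With -f dat, arecord asks for 16-bit little-endian samples at 48000 Hz in stereo, so the header would be expected to show 2 channels, 2 bytes per sample, and 48000 Hz, with a duration of about 5 seconds because of -d 5.

```python
import wave


def describe_wav(path):
    """Return (channels, bytes_per_sample, sample_rate_hz, duration_seconds).

    Hypothetical helper for illustration; it reads only the WAV header.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        bytes_per_sample = wav.getsampwidth()
        sample_rate = wav.getframerate()
        frame_count = wav.getnframes()
    # Duration follows from the frame count and the rate stored in the header.
    return channels, bytes_per_sample, sample_rate, frame_count / sample_rate
```

Calling describe_wav("test.wav") in the directory where you ran arecord should report the format of the file as written. Note that this shows the rate stored in the file, which is not necessarily the rate the microphone hardware samples at internally.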
--- (Edited on 1/28/2007 9:18 am [GMT-0500] by kmaclean) ---
--- (Edited on 1/28/2007 9:19 am [GMT-0500] by kmaclean) ---
--- (Edited on 1/28/2007 9:21 am [GMT-0500] by kmaclean) ---
It may be useful to use a spectrogram view of a recorded sentence. There should be essentially zero energy in the spectrogram at frequencies above one half the physical sampling rate, even if the audio was upsampled by software after being recorded. Still, it may be tricky to determine exactly what the max sampling rate of the hardware is, especially considering that (as far as I know) it may lie somewhere in between the suggested possibilities. So I think Ken's suggestion of simply using 16 kHz is very sensible.
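As a rough numerical companion to eyeballing a spectrogram, the sketch below measures what fraction of a frame's spectral energy lies at or above a chosen cutoff frequency. Everything in it is an assumption made for illustration: the function names, the Hann window, and the use of a plain (slow) DFT on one short frame so that it runs with the standard library only. For a candidate physical sampling rate R, you would set the cutoff to R/2 and look for a fraction close to zero.

```python
import cmath
import math


def high_band_energy_fraction(samples, sample_rate, cutoff_hz):
    """Fraction of (non-DC) spectral energy at or above cutoff_hz in one frame."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    # A Hann window limits leakage from strong low-frequency components.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(samples)
    ]
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):  # bins from just above DC up to Nyquist
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


def tone(freq_hz, sample_rate, n):
    """Synthetic sine wave, used here only to exercise the function."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
```

For example, with a file stored at 48 kHz, a frame whose fraction above 8 kHz is essentially zero would be consistent with hardware that really sampled at 16 kHz. Real speech frames are noisier than a test tone, so it is probably wise to average over many frames that contain fricatives such as "s" before drawing a conclusion.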
--- (Edited on 2/ 6/2007 9:59 pm [GMT-0600] by Visitor) ---
"I think Ken's suggestion of simply using 16 kHz is very sensible."
I wrote that before I read the second page with the information about arecord. Never mind.
"A good acoustic model needs to be trained with speech recorded in the environment it is targeted to recognize."
Even if you have some targeted data, though, data from other, less targeted environments can still be useful.
For instance, my research group has built (with SRI) an experimental system for transcribing meetings, using recordings from mics on a meeting room table. We start off with a system which is trained on huge amounts of TELEPHONE conversations, which helps us get coverage of triphones and speaker variations. We then continue the training with meeting room data, which we have much less of. I don't recall that we've ever done an experimental comparison between this approach and training only on the meeting room data, but it is at least our expectation that this approach beats training only on the meeting room data.
My attitude (which I believe is the norm among ASR engineers) is the more targeted data, the better for performance, although the performance curve may flatten due to diminishing returns.
For the sake of completeness, though, I want to give a theoretical counter-example to the idea that more targeted data is always better. This counter-example is based on two ideas: the targeted category is broad in some acoustic sense, and we are using an adaptation algorithm which can cope with that sort of broadness. (By 'adaptation algorithm' I mean algorithms such as MLLR speaker adaptation.) My counter-example is only for one specific kind of broadness and one specific adaptation algorithm, but I would not be surprised if it were possible to make the same sort of argument with other adaptation algorithms also.
Let's say that you are trying to target the broad category of non-headset-mic speech recognition in rooms which have widely varying degrees of reverberation. (I haven't done the acoustic calculations, but perhaps such a variation in reverberation level could happen if some rooms have acoustic damping ceiling tiles and thick carpets, and others have no damping tiles and hardwood floors. Or perhaps not.) If you train on data from this broad category, there will be a sizable loss of sharpness (in other words, added variance) in the acoustic models due to the varying degrees of reverberation (although I suspect SAT, speaker adaptive training, would help preserve sharpness to an extent). Loss of sharpness is NOT necessarily a bad thing, since it's very important for the acoustic model to model variation, but unnecessary amounts of variance in the models can hurt performance, as I mentioned in my earlier comp.speech.research post regarding desktop vs. headset mics.
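To make the "added variance" point concrete, here is a toy sketch. It assumes, purely for illustration, that one feature dimension is modeled by a single Gaussian and that each reverberation condition shifts the feature mean while leaving the per-condition variance alone. Real reverberation effects are more complicated than a mean shift, and the function name pooled_variance is invented for this example. Under that assumption, the variance of the pooled data is the average within-condition variance plus the variance of the condition means.

```python
def pooled_variance(means, variances, weights):
    """Variance of a mixture of conditions, each with its own mean and variance.

    Uses the identity Var = E[x^2] - (E[x])^2, where E[x^2] for one condition
    is its variance plus its squared mean.
    """
    total_weight = sum(weights)
    w = [x / total_weight for x in weights]
    pooled_mean = sum(wi * m for wi, m in zip(w, means))
    second_moment = sum(
        wi * (v + m * m) for wi, v, m in zip(w, variances, means)
    )
    return second_moment - pooled_mean ** 2
```

With two equally weighted conditions that each have variance 1.0, the pooled variance stays at 1.0 when both means are 0.0, and grows to 2.0 when the means are -1.0 and 1.0. The further apart the conditions sit, the broader (less sharp) the single model trained on all of them becomes.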
Now for the adaptation algorithm. Guenter Hirsch has recently developed an adaptation algorithm which (if I recall correctly) starts with a low-reverberation (thus, sharp) acoustic model, and then reverberates the model to match the specific degree of reverberation in the user's room. The resulting model can be sharper than a model trained on a lot of reverberant data with widely varying amounts of reverberation, and thus, it seems possible that it would perform better.
This counter-example doesn't change my "the more targeted data, the better for performance" attitude, because
1) It's just a theoretical counter-example. I don't recall ever coming across this effect in practice, either in my own work or when reading others' work. (I may post about this on comp.speech.research to see if anyone thinks differently, though.)
2) If you are worried about this kind of effect, you can always try it both ways and see what works better.
Regards,
David
--- (Edited on 2/ 6/2007 11:13 pm [GMT-0600] by Visitor) ---
Created a new thread at Comments on: "A good acoustic model needs to be trained with speech recorded in the environment it is targeted to recognize"
Ken
--- (Edited on 2/ 8/2007 10:17 pm [GMT-0500] by kmaclean) ---