Audio and Prompts Discussions

Re: sample freq issue not covered by FAQ
User: kmaclean
Date: 1/28/2007 8:15 am
Views: 1526
Rating: 53

Hi Ken,

I just made some short recordings, using a short piece of classical
music so that the input is the same in every recording. That way
you'll be judging the recording quality rather than my voice.

They're not that big, but I posted them on the web anyway. I don't
want to swamp your inbox.

You can find them at:

...


Enjoy your New Year's Eve!

Robin

--- (Edited on 1/28/2007 9:15 am [GMT-0500] by kmaclean) ---

Re: sample freq issue not covered by FAQ
User: kmaclean
Date: 1/28/2007 8:16 am
Views: 402
Rating: 63
Hi Robin,

Thanks for the samples,

From what I can tell so far, your microphone is likely sampling at 32kHz-16bit.  I can't tell the difference between your 32kHz and 48kHz recordings. 

I usually listen to the silence portions of the audio in question and listen for differences in the pitch of the hiss (at high volumes).  Not really scientific, but it's the best approach I have come across so far ...

If you can, just record yourself saying "test test test", leaving 2-3 seconds of silence before and after your recording, at 32kHz and at 48kHz, and make links to the files as you did before.  If there is no difference, I would assume that your mic records at 32kHz-16bit.

Remember, we are not looking for TV announcer quality voices (just listen to my voice recordings ...  :) ) or audio quality.  For free Speech Recognition to work, we need a large variety of speech (people, prompts files with various phonemes and triphones) recorded in a variety of environments (rooms with echo, such as those with hardwood floors or tiles, and rooms with no echo, such as those with carpet, etc.) and with a large variety of recording equipment (headset mics, desktop mics, built-in mics, and yes, USB mics, ...).  A good acoustic model needs to be trained with speech recorded in the environment it is targeted to recognize.  A post by David Gelbart explains this a bit better (see this link, scroll down until you get to his message).  Unfortunately I have not yet made this clear on the web site - I will shortly.

Sorry for drawing this out, but it is good we are addressing a process of figuring this out now,

thanks for all your help,

Ken

--- (Edited on 1/28/2007 9:16 am [GMT-0500] by kmaclean) ---

Re: sample freq issue not covered by FAQ
User: kmaclean
Date: 1/28/2007 8:17 am
Views: 271
Rating: 6
Thanks for looking at the samples.

>  From what I can tell so far, your microphone is likely sampling at
> 32kHz-16bit.  I can't tell the difference between your 32kHz and 48kHz
> recordings.

Perhaps I should've tried 44.1 kHz as well; that (or something in that
region) seems to be sort of a standard frequency too. I'll record
some "test test test" (plus silence) samples asap.

>  Sorry for drawing this out, but it is good we are addressing a process of
> figuring this out now,

Quite happy to learn which method to use to record good
sound files for speech models. The new posts on that thread were
interesting.

A website/project like this is bound to be a work in progress... It's
nice to see that it attracts the attention of some experienced people.
It was already good to start out with though, which is why it attracts
a serious and critical audience.

I'll gladly help out more (and finally add some recordings) in the new year.

Cheers,
Robin

--- (Edited on 1/28/2007 9:17 am [GMT-0500] by kmaclean) ---

Re: sample freq issue not covered by FAQ
User: kmaclean
Date: 1/28/2007 8:18 am
Views: 265
Rating: 20
Hi Ken,

Last time we spoke we were trying to figure out the sample frequencies
of my microphone.
I finally made the samples you asked for and hope you can still have a 'look'.
They're at:
 
...
 
It wasn't such a lot of work obviously, but I had some other things to
do and tried to focus on those. I am still determined to contribute
though.

The build script you have created sounds very interesting by the way!

Thanks in advance,
Robin

--- (Edited on 1/28/2007 9:18 am [GMT-0500] by kmaclean) ---

Re: sample freq issue not covered by FAQ
User: kmaclean
Date: 1/28/2007 8:18 am
Views: 275
Rating: 15
Hi Robin,

I can't tell the difference in the recordings - which would seem to indicate that the max sample rate for your mic is 32kHz.  But in retrospect, I don't think anyone can really tell the difference.

Luckily, I think I found a solution to our problem.  I found some information on USB mics on the Audacity site (http://audacityteam.org/wiki/index.php?title=USB_mic_on_Linux).

Try the following command:

$ arecord -f dat  -d 5 test.wav

If this works OK, then your USB mic can record at 48kHz.  Unlike Audacity, arecord will not try to upsample audio received from a mic with a lower maximum sampling rate.

You may need to determine your card and device id (if you have an integrated sound card on your motherboard).

If so, use this command:

     $ arecord --list-devices

card 0: I82801DBICH4 [Intel 82801DB-ICH4], device 0: Intel ICH [Intel 82801DB-ICH4]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 0: I82801DBICH4 [Intel 82801DB-ICH4], device 1: Intel ICH - MIC ADC [Intel 82801DB-ICH4 - MIC ADC]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 0: I82801DBICH4 [Intel 82801DB-ICH4], device 2: Intel ICH - MIC2 ADC [Intel 82801DB-ICH4 - MIC2 ADC]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 0: I82801DBICH4 [Intel 82801DB-ICH4], device 3: Intel ICH - ADC2 [Intel 82801DB-ICH4 - ADC2]
Subdevices: 1/1
Subdevice #0: subdevice #0
card 1: default [Samson C01U ], device 0: USB Audio [USB Audio]
Subdevices: 1/1
Subdevice #0: subdevice #0

Then execute the arecord command again, but this time specifying the USB microphone detected in your listing (Card #1 Device #0 in this example):

     $ arecord -f dat -D hw:1,0 -d 5 test.wav

Let me know how you make out.
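
Reading the card and device numbers off that listing can also be scripted. The following is only a sketch in Python: it assumes the listing has the line format shown above (English messages, hence the LC_ALL=C setting), and the names `list_capture_devices` and `read_listing` are made up for illustration.

```python
import os
import re
import subprocess

# Matches lines such as:
#   card 1: default [Samson C01U ], device 0: USB Audio [USB Audio]
CARD_LINE = re.compile(
    r"^card (\d+): \S+ \[(.*?)\], device (\d+): (.*?) \[.*\]\s*$"
)

def list_capture_devices(listing):
    """Return (hw_name, card_description, device_name) for each device line."""
    devices = []
    for line in listing.splitlines():
        match = CARD_LINE.match(line.strip())
        if match:
            card, card_desc, device, dev_name = match.groups()
            devices.append(
                (f"hw:{card},{device}", card_desc.strip(), dev_name.strip())
            )
    return devices

def read_listing():
    """Run arecord --list-devices and return its output (assumes ALSA tools are installed)."""
    result = subprocess.run(
        ["arecord", "--list-devices"],
        capture_output=True, text=True, check=True,
        env=dict(os.environ, LC_ALL="C"),  # assumption: forces untranslated output
    )
    return result.stdout
```

With the listing above, the Samson entry would come back as hw:1,0, which is the value to pass to -D.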

thanks,

Ken

--- (Edited on 1/28/2007 9:18 am [GMT-0500] by kmaclean) ---

Re: sample freq issue not covered by FAQ
User: kmaclean
Date: 1/28/2007 8:19 am
Views: 323
Rating: 33
Success!

arecord is definitely a good tool for this type of job. The format shortcuts (cd, cdr and dat) yielded no success (perhaps because they are stereo?), but specifically asking for 'Signed 16 bit Little Endian' and trying all sorts of frequencies resulted in good feedback, e.g.

arecord -f S16_LE -r 60000 -D hw:1,0 -d 10 testS16_LE.wav

gave:

Recording WAVE 'testS16_LE.wav' : Signed 16 bit Little Endian, Rate 60000 Hz, Mono
Warning: rate is not accurate (requested = 60000Hz, got = 48000Hz)
please, try the plug plugin (-Dplug:hw:1,0)

and

arecord -f S16_LE -r 500000 -D hw:1,0 -d 10 testS16_LE.wav

gave:

arecord: main:467: bad speed value 500000

However, arecord did not warn at 50000 Hz, which seems odd given the first example!
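
When trying several rates in a row, the "requested" and "got" figures can be pulled out of that warning mechanically. A minimal sketch in Python, assuming the warning keeps exactly the wording quoted above; the name `parse_rate_warning` is made up for illustration.

```python
import re

# Wording copied from the arecord warning quoted in this post.
RATE_WARNING = re.compile(r"requested = (\d+)Hz, got = (\d+)Hz")

def parse_rate_warning(console_text):
    """Return (requested_hz, got_hz) if the warning is present, else None."""
    match = RATE_WARNING.search(console_text)
    if match is None:
        # No warning is not proof that the requested rate was honoured
        # (see the 50000 Hz case above).
        return None
    return int(match.group(1)), int(match.group(2))

text = "Warning: rate is not accurate (requested = 60000Hz, got = 48000Hz)"
print(parse_rate_warning(text))  # (60000, 48000)
```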

The file sizes seemed to suggest the ability to record at many different frequencies, but on 'closer' inspection arecord seems to record until it reaches the requested number of samples (i.e. freq*duration). However, the file's metadata says that the file is only as long as requested!

So it would be safe practice to only use frequencies specifically mentioned in the warnings.
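
The size arithmetic behind that observation can be sketched in a few lines of Python. This takes the description above at face value (uncompressed 16-bit mono PCM, capture stops after freq*duration samples); the function names are invented for the example and only the standard `wave` module is used.

```python
import wave

def expected_data_bytes(rate_hz, seconds, channels=1, sample_bytes=2):
    """Payload size of uncompressed PCM: frames * channels * bytes per sample."""
    return rate_hz * seconds * channels * sample_bytes

def describe_wav(path):
    """Report what the WAV header claims about rate, frame count and duration."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "rate_hz": rate,
            "channels": wav.getnchannels(),
            "sample_bytes": wav.getsampwidth(),
            "frames": frames,
            "header_seconds": frames / rate,
        }

def real_seconds(frames, actual_rate_hz):
    """Time needed to capture `frames` at the rate the hardware really used."""
    return frames / actual_rate_hz

# 60000 Hz requested for 10 s, 16-bit mono:
print(expected_data_bytes(60000, 10))  # 1200000
# If the mic really delivered 48000 Hz, those 600000 frames took:
print(real_seconds(600000, 48000))     # 12.5
```

So a file whose header says "60000 Hz, 10 s" would, on this reading, hold 12.5 seconds of sound and play back too fast, which is one more reason to stick to rates the hardware reports as real.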

Anyway, my problem does seem to be solved!

Thanks for all your help, I hope this feedback was useful.

Robin

--- (Edited on 1/28/2007 9:19 am [GMT-0500] by kmaclean) ---

Re: sample freq issue not covered by FAQ
User: kmaclean
Date: 1/28/2007 8:21 am
Views: 299
Rating: 23
Hi Robin,

Excellent!

thanks for all your help,

Ken

--- (Edited on 1/28/2007 9:21 am [GMT-0500] by kmaclean) ---

Re: sample freq issue not covered by FAQ
User: David Gelbart
Date: 2/6/2007 9:59 pm
Views: 243
Rating: 16

It may be useful to use a spectrogram view of a recorded sentence.  There should be essentially zero energy in the spectrogram at frequencies above one half the physical sampling rate, even if the audio was upsampled by software after being recorded.   Still, it may be tricky to determine exactly what the max sampling rate of the hardware is, especially considering that (as far as I know) it may lie somewhere in between the suggested possibilities.   So I think Ken's suggestion of simply using 16  kHz is very sensible.
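
A rough numeric version of this spectrogram check can be sketched in Python. It is an illustration only: it assumes a mono 16-bit PCM WAV file, uses a plain radix-2 FFT so that nothing outside the standard library is needed, and the function names and the 1024-sample frame size are arbitrary choices, not anything from this thread.

```python
import cmath
import math
import struct
import wave

def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def energy_fraction_above(samples, rate_hz, cutoff_hz, frame=1024):
    """Fraction of spectral energy above cutoff_hz, summed over whole frames."""
    total = high = 0.0
    for start in range(0, len(samples) - frame + 1, frame):
        chunk = samples[start:start + frame]
        # Hann window reduces leakage from strong low-frequency components.
        windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (frame - 1)))
                    for i, s in enumerate(chunk)]
        spectrum = fft(windowed)
        for k in range(frame // 2 + 1):
            power = abs(spectrum[k]) ** 2
            total += power
            if k * rate_hz / frame > cutoff_hz:
                high += power
    return high / total if total else 0.0

def read_mono_16bit(path):
    """Load a mono 16-bit PCM WAV file as a list of ints plus its header rate."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM")
        raw = wav.readframes(wav.getnframes())
        rate = wav.getframerate()
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw)), rate
```

For a file labelled 48 kHz whose hardware is suspected of capturing at 32 kHz, a cutoff of 16000 Hz (half of 32 kHz) is the value to try. A fraction close to zero would be consistent with upsampling, though microphone noise and filter roll-off make the exact threshold a judgment call.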

--- (Edited on 2/ 6/2007 9:59 pm [GMT-0600] by Visitor) ---

Re: sample freq issue not covered by FAQ
User: David Gelbart
Date: 2/6/2007 11:13 pm
Views: 2945
Rating: 14

"I think Ken's suggestion of simply using 16  kHz is very sensible."

I wrote that before I read the second page with the information about arecord. Never mind.

"A good acoustic model needs to be trained with speech recorded in the environment it is targeted to recognize."

Even if you have some targeted data, though, data from other, less targeted environments can still be useful.

For instance, my research group has built (with SRI) an experimental system for transcribing meetings, using recordings from mics on a meeting room table.  We start off with a system which is trained on huge amounts of TELEPHONE conversations, which helps us get coverage of triphones and speaker variations.  We then continue the training with meeting room data, which we have much less of.   I don't recall that we've ever done an experimental comparison between this approach and training only on the meeting room data, but it's at the least our expectation that this approach beats training only on the meeting room data.

My attitude (which I believe is the norm among ASR engineers) is the more targeted data, the better for performance, although the performance curve may flatten due to diminishing returns.

For the sake of completeness, though, I want to give a theoretical counter-example to the idea that more targeted data is always better.   This counter-example is based on two ideas: the targeted category is broad in some acoustic sense, and we are using an adaptation algorithm which can cope with that sort of broadness.  (By 'adaptation algorithm' I mean algorithms such as MLLR speaker adaptation.)  My counter-example is only for one specific kind of broadness and one specific adaptation algorithm, but I would not be surprised if it were possible to make the same sort of argument with other adaptation algorithms also.

Let's say that you are trying to target the broad category of non-headset-mic speech recognition in rooms which have widely varying degrees of reverberation. (I haven't done the acoustic calculations, but perhaps such a variation in reverberation level could happen if some have acoustic damping ceiling tiles and thick carpets, and some have no damping tiles and hardwood floors.  Or perhaps not.)   If you train on data from this broad category, there will be a sizable loss of sharpness (in other words, added variance) in the acoustic models due to the varying degrees of reverberation (although I suspect SAT would help preserve sharpness to an extent).   Loss of sharpness is NOT necessarily a bad thing, since it's very important for the acoustic model to model variation, but unnecessary amounts of variance in the models can hurt performance, as I mentioned in my earlier comp.speech.research post regarding desktop vs. headset mics.
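
One way to make the "added variance" concrete is the law of total variance. The setup here is a simplifying assumption for illustration, not something taken from the post: x is a single acoustic feature for one phone, and r is the reverberation condition of the room it was recorded in.

```latex
\operatorname{Var}(x)
  = \mathbb{E}_{r}\!\left[\operatorname{Var}(x \mid r)\right]
  + \operatorname{Var}_{r}\!\left(\mathbb{E}[x \mid r]\right)
```

The first term is the average spread within a room. The second is extra spread that comes only from room-to-room differences in the mean, and it vanishes when every room gives the same mean. A model trained on pooled data from rooms with widely varying reverberation carries both terms.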

Now for the adaptation algorithm.  Guenter Hirsch has recently developed an adaptation algorithm which (if I recall correctly) starts with a low-reverberation (thus, sharp) acoustic model, and then reverberates the model to match the specific degree of reverberation in the user's room.  The resulting model can be sharper than a model trained on a lot of reverberant data with widely varying amounts of reverberation, and thus, it seems possible that it would perform better.

This counter-example doesn't change my "the more targeted data, the better for performance" attitude, because

1) It's just a theoretical counter-example.  I don't recall ever coming across this effect in practice, either in my own work or  when reading others' work.   (I may post about this on comp.speech.research to see if anyone thinks differently, though.) 

2) If you are worried about this kind of effect, you can always try it both ways and see what works better. 

Regards,

David

--- (Edited on 2/ 6/2007 11:13 pm [GMT-0600] by Visitor) ---

Re: sample freq issue not covered by FAQ
User: kmaclean
Date: 2/8/2007 9:17 pm
Views: 298
Rating: 24

Created a new thread at Comments on: "A good acoustic model needs to be trained with speech recorded in the environment it is targeted to recognize"

Ken 

--- (Edited on 2/ 8/2007 10:17 pm [GMT-0500] by kmaclean) ---
