Frequently Asked Questions

Why record at highest sampling/bits per sample rates?
User: kmaclean
Date: 1/1/2010 12:06 pm


Speech Recognition Engines need Acoustic Models trained on speech audio that has the same sampling rate and bits per sample as the speech they will recognize.  Each speech medium has its own limitations that affect speech recognition.

Telephony Bandwidth Limitations 

For example, for telephony speech recognition, the limitation is the 64kbps bit rate of a telephone channel.  This only permits a sampling rate of 8kHz at a sampling resolution of 8 bits per sample (8,000 samples per second x 8 bits = 64,000 bits per second).  Therefore, to perform speech recognition on a telephone line, you need Acoustic Models trained using audio recorded at an 8kHz sampling rate with 8 bits per sample.  VoIP applications usually have the same limitations, since they allow interconnection to the Public Switched Telephone Network (PSTN).
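The arithmetic behind the 64kbps figure can be checked with a few lines of Python. This is only a sketch for uncompressed single-channel PCM; the function name pcm_bit_rate is made up for this illustration.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Return the raw (uncompressed) PCM bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# Telephony: 8kHz at 8 bits per sample fills a 64kbps channel exactly.
telephony_bps = pcm_bit_rate(8_000, 8)

# For comparison: 16kHz/16-bit desktop audio and a 48kHz/16-bit recording.
desktop_bps = pcm_bit_rate(16_000, 16)
recording_bps = pcm_bit_rate(48_000, 16)

print(telephony_bps, desktop_bps, recording_bps)
```

A 48kHz/16-bit recording therefore carries twelve times as many bits per second as a telephone channel can.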

Desktop Sound Card and Processor Limitations 

For desktop Command and Control applications, your PC's sound card determines your maximum sampling rate and bits per sample, and the power of your CPU determines what kinds of Acoustic Models your Speech Recognition Engine can process efficiently.

So why record at highest sampling/bits per sample rates?

Speech Recognition Engines work best with Acoustic Models trained with audio recorded at a higher sampling rate and more bits per sample.  However, since current hardware (CPUs and/or sound cards) is not powerful enough to support Acoustic Models trained at higher sampling rates and bits per sample, and telephony applications have bandwidth limitations (as discussed above), a compromise is required.  VoxForge has decided that the best approach (for now) is to collect speech recorded at the highest sampling rate your audio card supports, at 16 bits per sample, and then downsample the audio to sampling rates that can be supported by the speech medium.

For example, for Command and Control applications on a desktop PC, you can downsample the 48kHz/16-bit audio to 16kHz/16-bit audio, and create Acoustic Models from this.  This approach keeps us backward compatible with older Sound Cards that may not support the higher sampling rates/bits per sample.  It also lets us look to the future: any audio submitted at higher sampling rates/bits per sample will still be usable down the road, when Sound Cards that support them become more common and processing power increases.
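As a rough illustration of what downsampling 48kHz audio to 16kHz involves, here is a minimal pure-Python sketch: low-pass filter the signal so nothing above the new Nyquist frequency remains, then keep every third sample. The function names, the 63-tap Hamming-windowed filter and the cutoff choice are assumptions made for this example, not the method VoxForge uses; in practice you would normally use an audio tool or signal-processing library.

```python
import math


def lowpass_taps(cutoff, num_taps=63):
    """Hamming-windowed sinc low-pass filter taps.

    cutoff is a fraction of the input sampling rate (0 < cutoff < 0.5).
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at 0 Hz


def downsample(samples, factor, num_taps=63):
    """Low-pass filter the samples, then keep every factor-th one."""
    # Cutoff sits a little below the new Nyquist frequency (0.5 / factor).
    taps = lowpass_taps(0.45 / factor, num_taps)
    half = (num_taps - 1) // 2
    out = []
    for i in range(0, len(samples), factor):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = i + k - half  # filter is centred on sample i
            if 0 <= j < len(samples):
                acc += tap * samples[j]
        out.append(acc)
    return out


# 0.1 seconds of a 1kHz tone recorded at 48kHz, reduced to 16kHz.
rate_in = 48_000
tone = [math.sin(2 * math.pi * 1_000 * n / rate_in) for n in range(4_800)]
tone_16k = downsample(tone, 3)
print(len(tone), "->", len(tone_16k), "samples")
```

The output is a list of floats; for 16-bit storage each value would still need to be scaled and rounded to an integer.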

For Telephony applications, to create Acoustic Models from audio recorded at a sampling rate of 48kHz with 16 bits per sample, you must first downsample the audio to a sampling rate of 8kHz with 8 bits per sample, and then create an Acoustic Model from this.
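The two reductions involved here (fewer samples per second, fewer bits per sample) can be sketched as follows. Both helper names are hypothetical, and the bit-depth step assumes plain linear 8-bit PCM stored as unsigned values, as in 8-bit WAV files. Telephony codecs typically compand the signal rather than simply truncating it, so treat this only as an illustration of the loss of resolution.

```python
def decimation_factor(source_rate_hz, target_rate_hz):
    """Return how many input samples map to one output sample.

    Only whole-number ratios are handled; anything else (for example
    44.1kHz to 8kHz) needs a proper rational resampler.
    """
    if source_rate_hz % target_rate_hz != 0:
        raise ValueError(
            f"{source_rate_hz} Hz is not a whole multiple of {target_rate_hz} Hz"
        )
    return source_rate_hz // target_rate_hz


def to_unsigned_8bit(sample_16bit):
    """Keep the top 8 bits of a signed 16-bit sample, shifted into 0..255."""
    return (sample_16bit >> 8) + 128


print(decimation_factor(48_000, 8_000))   # 48kHz to 8kHz keeps 1 sample in 6
print([to_unsigned_8bit(s) for s in (-32768, 0, 32767)])
```

The low-pass filtering that must precede the sample-rate reduction is not shown here.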

Some VoIP PBXs, such as Asterisk, actually represent audio data internally at 8kHz with 16 bits per sample, even though the codec used might only support 8kHz with 8 bits per sample.  Therefore, VoIP PBXs like Asterisk can use Acoustic Models trained on 8kHz/16-bit audio.
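To see how an 8-bit codec sample can end up as a 16-bit sample inside the PBX, here is a hedged sketch. The text above does not name the codec; this example assumes mu-law companding (the scheme behind the common G.711 telephony codec) and uses the continuous mu-law curve rather than the bit-exact G.711 encoding. The function names are made up for this illustration.

```python
import math

MU = 255  # companding constant for mu-law


def mulaw_compress(sample_16bit):
    """Map a signed 16-bit sample to an 8-bit code in 0..255."""
    x = max(-1.0, min(1.0, sample_16bit / 32768.0))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255))


def mulaw_expand(code_8bit):
    """Map an 8-bit code back to an approximate signed 16-bit sample."""
    y = code_8bit / 255 * 2.0 - 1.0
    x = math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
    return int(round(x * 32767))


# Round trip: 16-bit sample -> 8-bit code -> 16-bit sample again.
for sample in (-20_000, -1_000, 0, 1_000, 20_000):
    code = mulaw_compress(sample)
    print(sample, "->", code, "->", mulaw_expand(code))
```

The expanded values are close to the originals but not identical, so the 16-bit internal representation does not restore detail lost in the 8-bit codec.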