
Acoustic Model Creation

Speech Recognition Engine Files 

Speech Recognition engines require two types of files to recognize speech.  They require an Acoustic Model, which is created by taking audio recordings of speech and their text transcriptions, and 'compiling' them into statistical representations of the sounds that make up each word.  They also require a Language Model or Grammar file.  A Language Model is a file containing the probabilities of sequences of words.  A Grammar is a much smaller file containing sets of predefined combinations of words.  Language Models are used for dictation applications, whereas Grammars are used in desktop Command and Control or telephony IVR applications.
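As a rough illustration of the difference, here is a toy sketch in Python (the phrases and probabilities are invented and do not correspond to any engine's actual file format):

    # Toy illustration only -- real engines use formats such as ARPA
    # language models and JSGF grammars; these values are invented.

    # A Language Model: probabilities of word sequences (here, word pairs).
    language_model = {
        ("turn", "on"): 0.40,
        ("turn", "off"): 0.35,
        ("on", "the"): 0.55,
        ("the", "light"): 0.20,
    }

    # A Grammar: a small, fixed set of phrases the application accepts.
    grammar = {
        "turn on the light",
        "turn off the light",
        "call home",
    }

    def grammar_accepts(utterance):
        # A grammar either accepts an utterance or rejects it outright.
        return utterance in grammar

    print(grammar_accepts("turn on the light"))       # True
    print(grammar_accepts("open the pod bay doors"))  # False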

Acoustic Models 

Audio can be encoded at different Sampling Rates (i.e. samples per second - the most common being 8kHz, 16kHz, 32kHz, 44.1kHz, 48kHz and 96kHz) and at different Bits per Sample (the most common being 8-bit, 16-bit or 32-bit).  Speech Recognition engines work best if the Acoustic Model they use was trained with speech audio recorded at the same Sampling Rate/Bits per Sample as the speech being recognized.
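For example, you can inspect a WAV file's sampling rate and bits per sample with Python's standard-library wave module (the file name below is a placeholder):

    import wave

    # "recording.wav" is a placeholder; substitute any WAV file.
    with wave.open("recording.wav", "rb") as wav:
        rate = wav.getframerate()      # samples per second, e.g. 8000 or 16000
        bits = wav.getsampwidth() * 8  # bytes per sample -> bits per sample
        channels = wav.getnchannels()
        print("%d Hz, %d-bit, %d channel(s)" % (rate, bits, channels))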

Telephony 

For Telephony, the limiting factor is the bandwidth at which speech can be transmitted.  For example, a standard land-line telephone has a bandwidth of only 64kbps, at a sampling rate of 8kHz and 8 bits per sample (8000 samples per second * 8 bits per sample = 64,000bps = 64kbps).  Therefore, for Telephony-based speech recognition, you need Acoustic Models trained with 8kHz/8-bit speech audio files.
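The same arithmetic in Python, extended to the other formats discussed on this page:

    def bandwidth_kbps(rate_hz, bits):
        # Uncompressed bit rate for mono audio.
        return rate_hz * bits / 1000.0

    print(bandwidth_kbps(8000, 8))    # 64.0  -> land-line telephone
    print(bandwidth_kbps(8000, 16))   # 128.0 -> Asterisk's internal format
    print(bandwidth_kbps(16000, 16))  # 256.0 -> desktop standard (see below)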

For Voice over IP ("VoIP"), the codec used usually determines the sampling rate/bits per sample of speech transmission.  If you use a codec with a higher sampling rate/bits per sample for speech transmission (to improve the sound quality), then your Acoustic Model must be trained with audio data that matches that sampling rate/bits per sample.  In the specific case of the Asterisk PBX system, audio is upsampled internally to 8kHz/16-bit regardless of the codec's sampling rate/bits per sample.  Therefore, Asterisk needs an Acoustic Model trained with 8kHz/16-bit audio data.
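A minimal sketch of such a conversion to 8kHz/16-bit mono, using Python's standard-library wave and audioop modules (note that audioop is deprecated as of Python 3.11 and removed in 3.13; the file names are placeholders):

    import audioop
    import wave

    # File names are placeholders; substitute your own.
    with wave.open("input.wav", "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)

    width = params.sampwidth                  # bytes per sample
    if params.nchannels == 2:                 # mix stereo down to mono
        frames = audioop.tomono(frames, width, 0.5, 0.5)
    if width != 2:                            # convert samples to 16-bit
        frames = audioop.lin2lin(frames, width, 2)
        width = 2
    if params.framerate != 8000:              # resample to 8 kHz
        frames, _ = audioop.ratecv(frames, width, 1, params.framerate, 8000, None)

    with wave.open("output_8k16.wav", "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(8000)
        dst.writeframes(frames)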

Desktop 

For speech recognition on your PC, the limiting factor is your sound card.  Most sound cards today can record audio at sampling rates between 16kHz and 48kHz, at bit depths of 8 to 16 bits per sample, and play back at up to 96kHz.

As a general rule, a Speech Recognition Engine works better with Acoustic Models trained with speech audio recorded at higher sampling rates/bits per sample.  But using audio with too high a sampling rate/bits per sample can slow your recognition engine down.  You need a balance.  Thus, for desktop speech recognition, the current standard is Acoustic Models trained with speech audio recorded at a sampling rate of 16kHz with 16 bits per sample.
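A minimal check along those lines, assuming your desktop engine expects the 16kHz/16-bit standard (the file name is a placeholder):

    import wave

    EXPECTED_RATE = 16000  # 16 kHz, the desktop standard described above
    EXPECTED_BITS = 16

    def matches_desktop_standard(path):
        # True if the WAV file is already 16 kHz / 16-bit.
        with wave.open(path, "rb") as wav:
            return (wav.getframerate() == EXPECTED_RATE
                    and wav.getsampwidth() * 8 == EXPECTED_BITS)

    if not matches_desktop_standard("recording.wav"):  # placeholder name
        print("Resample this file before training or recognition.")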

You can still use Acoustic Models trained at 8kHz for desktop applications, but you generally need at least twice (and usually more ...) the audio data to get recognition results comparable to those of Acoustic Models trained at 16kHz.

Additional information can be found at the following link:

How Speech Recognition Works 

 


Comments


By Visitor - 1/27/2015

I want to use Julius - can you explain Julius? I don't know how to use that software or how to make an interface. Please, can you help me?

By mmm - 6/9/2010 - 1 Reply

hi

I have some questions - can anyone answer me?

1- Can I get ready-made recording files (prompts + wav files) from somewhere?

2- Is it necessary to record the files myself?

3- Should I use test files that I recorded myself? In my project I used ready-made files for testing that I did not record.

I do not understand adaptation and why we need it. Is adaptive training necessary? I did not do this step.

4- The website contains only 30 files for training; do you think they are enough for good training?

By kmaclean - 2/27/2008

With respect to using higher sampling rates for speech, the following excerpt from SPEECH and LANGUAGE PROCESSING: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, by Daniel Jurafsky and James H. Martin (second edition draft chapters - I don't think the draft chapters are on-line anymore; however, the book is well worth the price if you are interested in Speech Recognition), is very helpful:

Recall that the first step in processing speech is to convert the analog representations (first air pressure, and then analog electric signals in a microphone) into a digital signal. This process of analog-to-digital conversion has two steps: sampling and quantization. A signal is sampled by measuring its amplitude at a particular time; the sampling rate is the number of samples taken per second. In order to accurately measure a wave, it is necessary to have at least two samples in each cycle: one measuring the positive part of the wave and one measuring the negative part.

More than two samples per cycle increases the amplitude accuracy, but less than two samples will cause the frequency of the wave to be completely missed. Thus the maximum frequency wave that can be measured is one whose frequency is half the sample rate (since every cycle needs two samples). This maximum frequency for a given sampling rate is called the Nyquist frequency.

Most information in human speech is in frequencies below 10,000 Hz; thus a 20,000 Hz sampling rate would be necessary for complete accuracy. But telephone speech is filtered by the switching network, and only frequencies less than 4,000 Hz are transmitted by telephones. Thus an 8,000 Hz sampling rate is sufficient for telephone-bandwidth speech like the Switchboard corpus.  A 16,000 Hz sampling rate (sometimes called wideband) is often used for microphone speech.

Even an 8,000 Hz sampling rate requires 8000 amplitude measurements for each second of speech, and so it is important to store the amplitude measurement ef?ciently. They are usually stored as integers, either 8-bit (values from -128–127) or 16 bit (values from -32768–32767). This process of representing real-valued numbers as integers is called quantization because there is a minimum granularity (the quantum size) and all values which are closer together than this quantum size are represented identically.