A Statistical Language Model is a file used by a Speech Recognition Engine to recognize speech. It contains a large list of words and their probability of occurrence. It is used in dictation applications.
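To make "words and their probability of occurrence" concrete, here is a minimal sketch that estimates word probabilities from a short sample sentence. The function name and the sample text are illustrative assumptions. Real language models are built from far larger texts and also store the probabilities of word sequences, not just single words.

```python
from collections import Counter


def unigram_probabilities(text):
    """Estimate each word's probability as its relative frequency in `text`."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


# Hypothetical sample text; "the" accounts for 2 of the 6 words.
probs = unigram_probabilities("the cat sat on the mat")
print(probs["the"])
```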
An Acoustic Model is a file used by a Speech Recognition Engine for
Speech Recognition. It contains a statistical representation of
the distinct sounds that make up each word in the Language Model or
Grammar. More information can be found on the Background re: Acoustic Model Creation Document.
A Speaker Dependent Acoustic Model is exactly what its name suggests: it is an Acoustic Model that has been tailored to recognize a particular person's speech. Such Acoustic Models are usually trained using audio from a particular person's speech. However, you can also take a generic Acoustic Model and adapt it to a particular person's speech to create a Speaker Dependent Acoustic Model.
A Speaker Independent Acoustic Model can recognize speech from a person who did not submit any speech audio that was used in the creation of the Acoustic Model.
The reason for the distinction is that it takes much more speech audio training data to create a Speaker Independent Acoustic Model than a Speaker Dependent Acoustic Model.
A Speech Corpus (or Spoken Corpus) is a database of speech audio files and text transcriptions of these audio files in a format that can be used to create Acoustic Models (which can then be used with a Speech Recognition Engine). ISIP's Switchboard database is a good example of this.
A corpus is one such database. Corpora is the plural of corpus (i.e. it is many such databases).
There are two types of Speech Corpora:
(1) Read Speech - which includes:
(2) Spontaneous Speech - which includes:
A Speech Decoder (or simply "Decoder") is the software portion of the speech recognition engine. In addition to a Decoder, Speech Recognition engines need an Acoustic Model and a Language Model or Grammar in order to recognize speech.
An acoustic model is a file that contains statistical representations of each of the distinct sounds that make up a word. Each of these statistical representations is assigned a label called a phoneme. The English language has about 40 distinct sounds that are useful for speech recognition, and thus we have 40 different phonemes.
An acoustic model is created by taking a large database of speech (called a speech corpus) and using special training algorithms to create statistical representations for each phoneme in a language. These statistical representations are called Hidden Markov Models ("HMM"s). Each phoneme has its own HMM.
For example, if the system is set up with a simple grammar file to recognize the word "house" (whose phonemes are: "hh aw s"), here are the (simplified) steps that the speech recognition engine might take:
This gets a little more complicated when you start using Language Models (which contain the probabilities of a large number of different word sequences), but the basic approach is the same.
Downsampling (or subsampling) is the process of reducing the sampling rate of a signal. This is usually done to reduce the data rate or the size of the data. For details, please refer to this Wikipedia link.
A paper by Mitchel Weintraub and Leonardo Neumeyer called CONSTRUCTING TELEPHONE ACOUSTIC MODELS FROM A HIGH-QUALITY SPEECH CORPUS provides some background on the use of downsampled High Quality Speech Audio in applications that can only use lower sampling rates.
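As a rough illustration of downsampling, the sketch below reduces the sampling rate by an integer factor by averaging groups of consecutive samples. The function name, the sample values and the 16 kHz / 8 kHz rates are assumptions chosen for the example. The averaging is only a crude stand-in for the low-pass filtering a real resampler performs.

```python
def downsample(samples, factor):
    """Reduce the sampling rate by an integer `factor`.

    Each output sample is the average of `factor` consecutive input samples.
    Averaging acts as a crude low-pass filter before the rate is reduced.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    usable = len(samples) - len(samples) % factor  # drop an incomplete final group
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


# Eight samples at a notional 16 kHz become four samples at 8 kHz.
audio_16k = [0, 2, 4, 6, 8, 6, 4, 2]
audio_8k = downsample(audio_16k, 2)
print(audio_8k)  # [1.0, 5.0, 7.0, 3.0]
```

Halving the rate halves the number of samples, which is the reduction in data size described above.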
Forced alignment is the process of taking the text transcription of an audio speech segment and determining where in time particular words occur in the speech segment. This is the opposite of speech recognition, where the object is to take an audio speech segment and generate its text transcription.
For more info, see:
Speech Recognition Engines that can perform forced alignment (from command line or script):
Free software is software that gives users the four essential freedoms:
For more information, see the definition of Free Software on the Free Software Foundation's (FSF) website. The FSF promotes the development and use of free software, particularly the GNU operating system, used widely in its GNU/Linux variant.
G2P refers to grapheme-to-phoneme conversion. This is the process of using rules to generate a pronunciation for a word (for creating a pronunciation dictionary). The rules are usually created by an automated statistical analysis of a pronunciation dictionary.
The G2P algorithm is used to generate the most probable phone list for a word not contained in the pronunciation dictionary (i.e. out-of-vocabulary words) used to create the G2P rules.
The process of converting a sequence of letters into a sequence of phones is called grapheme-to-phoneme conversion, sometimes shortened g2p. The job of a grapheme-to-phoneme algorithm is thus to convert a letter string like cake into a phone string like [K EY K].
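The idea of deriving rules from a pronunciation dictionary and applying them to out-of-vocabulary words can be sketched with a deliberately simplified toy. It assumes each letter maps to exactly one phone and learns the most frequent phone per letter. The dictionary entries and function names are hypothetical. Real G2P tools use context-dependent rules and would be needed for a word like "cake", where the final "e" is silent.

```python
from collections import Counter, defaultdict


def learn_letter_rules(dictionary):
    """Learn the most frequent phone for each letter from aligned entries."""
    counts = defaultdict(Counter)
    for word, phones in dictionary.items():
        if len(word) != len(phones):
            continue  # skip words that do not align one letter to one phone
        for letter, phone in zip(word, phones):
            counts[letter][phone] += 1
    return {letter: c.most_common(1)[0][0] for letter, c in counts.items()}


def g2p(word, dictionary, rules):
    """Return the dictionary pronunciation, or apply the learned rules."""
    if word in dictionary:
        return dictionary[word]
    # Out-of-vocabulary word: map each known letter to its most likely phone.
    return [rules[letter] for letter in word if letter in rules]


# Hypothetical three-word pronunciation dictionary.
toy_dictionary = {
    "cat": ["K", "AE", "T"],
    "bat": ["B", "AE", "T"],
    "tab": ["T", "AE", "B"],
}
rules = learn_letter_rules(toy_dictionary)
print(g2p("cab", toy_dictionary, rules))  # ['K', 'AE', 'B']
```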
GPL refers to the 'GNU General Public License'. Copyright provides an author with the right to control copies and changes to a work, whereas the GPL license (also referred to as "copyleft") provides a user with the right to copy and change a work.
The preamble to the GPL license follows:
The GNU General Public License is a free, copyleft license for
software and other kinds of works.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users. We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors. You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received. You must make sure that they, too, receive
or can get the source code. And you must show them these terms so they
know their rights.
Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software. For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so. This is fundamentally incompatible with the aim of
protecting users' freedom to change the software. The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable. Therefore, we
have designed this version of the GPL to prohibit the practice for those
products. If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.
Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary. To prevent this, the GPL assures that
patents cannot be used to render the program non-free.
A grapheme is basically a letter ("a", "b", ...).
Open-source software is an antonym for closed source and refers to any computer software whose source code is available under a license that permits users to study, change, and improve the software, and to redistribute it in modified or unmodified form. It is often developed in a public, collaborative manner.
See this Wikipedia entry for more information.
In addition, see the Open Source Initiative (OSI) web site. OSI is a non-profit corporation dedicated to managing and promoting the Open Source Definition for the good of the community, specifically through the OSI Certified Open Source Software certification mark and program.
IVR is an acronym for: Interactive Voice Response.
Older IVR systems allowed users to call in to a system and use the keys on their telephone (also called 'touch-tones') to navigate a series of menus to get information or conduct a transaction. The system would respond to the user over the phone using Text-to-Speech.
Newer IVR systems use a speech-based interface (using Speech Recognition and Text to Speech) to permit a caller to get similar information or conduct similar transactions. The menu structure of speech-based IVRs tends to be 'flatter' than with touch-tone menus, because the available options are no longer limited to the keys on a telephone keypad.
The CMU_ARCTIC database was constructed at the Language Technologies Institute at Carnegie Mellon University. It consists of around 1150 utterances selected from out-of-copyright texts from Project Gutenberg.
The prompt files used in the CMU_ARCTIC database were originally designed as US English single-speaker prompt files for Speech Synthesis research (i.e. Text to Speech). Since they are phonetically balanced, we are using them to generate prompt files for the creation of Speech Recognition Acoustic Models.
Groups sharing an identifiable accent may be defined by any of a wide variety of common traits. An accent may be associated with the region in which its speakers reside (a geographical accent), the socio-economic status of its speakers, their ethnicity, their caste or social class, their first language (when the language in which the accent is heard is not their native language), and so on.
A phoneme is the smallest structural unit that distinguishes meaning in a language. Phonemes are not the physical segments themselves, but are cognitive abstractions or categorizations of them.
On the other hand, phones refer to the instances of phonemes in the actual utterances - i.e. the physical segments.
For example (from this article):
the words "madder" and "matter" obviously are composed of distinct phonemes; however, in american english, both words are pronounced almost identically, which means that their phones are the same, or at least very close in the acoustic domain.
Speech Recognition Engines ("SRE"s) are made up of the following components:
A Speech Recognition System ('SRS') on a desktop computer does what a typical user of speech recognition would expect it to do: you speak a command into your microphone and the computer does something, or you dictate something to the computer and it types out the corresponding text on your screen.
An SRS typically includes a Speech Recognition Engine and a Dialog Manager (and may or may not include a Text to Speech Engine).
A VoiceXML Interpreter does just what it says: it interprets VoiceXML documents. It does not, by itself, recognize speech or output speech responses. It is the core smarts of a VoiceXML platform, but does not have the Application Programming Interfaces ("APIs") necessary to communicate with ASR, TTS and/or PBX systems.
A VoiceXML Browser contains a VoiceXML Interpreter and includes generic APIs to ASR, TTS and PBX systems. However, it does not, by itself, recognize speech or output speech responses. It still requires separate ASR and TTS and/or PBX systems, and its APIs still need to be modified to work with specific ASR/TTS/PBX systems.
A VoiceXML platform is a 'turnkey' VoiceXML Speech Recognition and Text to Speech System that works with a PBX. It works 'out of the box' and includes a VoiceXML Browser and the required TTS, ASR and PBX subsystems. A VoiceXML platform may also be called a 'VoiceXML Spoken Dialog System'.
Uncompressed audio is audio without any compression applied to it. This includes audio recorded in PCM or WAV format.
Lossless audio compression is where audio is compressed without losing any information or degrading the quality at all. Examples of lossless formats include WMA Lossless and FLAC in Matroska.
Lossy audio compression attempts to discard as much 'irrelevant' data as possible from the original audio, thereby producing a file that sounds almost identical to the original but is much smaller than lossless or uncompressed audio. Lossy audio formats include AC3, DTS, AAC, MPEG-1/2/3, Vorbis, and Real Audio.
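The difference between the two kinds of compression can be sketched as follows. This is only an illustration under stated assumptions: zlib is a general-purpose lossless compressor rather than an audio codec, the sample values are made up, and discarding the low bits of each sample is a crude stand-in for what a real lossy codec does.

```python
import zlib

# Hypothetical 8-bit audio samples (values 0-255).
samples = bytes([0, 3, 7, 12, 18, 25, 33, 42, 52, 63, 75, 88])

# Lossless: after compressing and decompressing, every sample is unchanged.
restored = zlib.decompress(zlib.compress(samples))
print(restored == samples)  # True

# Lossy (crude stand-in): discard the 4 least significant bits of each sample.
lossy = bytes(s & 0xF0 for s in samples)
print(lossy == samples)  # False: the discarded detail cannot be recovered
```

Storing the lossy result in an uncompressed container afterwards does not bring the discarded bits back.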
The pronunciation dictionary is HTK specific and is different from the file you created in Step 1. The HTK file is used in creating Acoustic Models. The sample.dict file used in Step 1 is part of the Julian Grammar. The Julian sample.dict file will usually only contain a subset of the words and pronunciation information contained in the HTK pronunciation dictionary.
Note: There is duplication of pronunciation information in the Julius sample.dict file (part of the Julius Grammar) and the HTK pronunciation dictionary (used in the creation of HTK Acoustic Models). This can cause errors if you don't get your pronunciation data just right - so be careful.
Monophone: The pronunciation of a word can be given as a series of symbols that correspond to the individual units of sound that make up a word. These are called 'phonemes' or 'phones'. A monophone refers to a single phone.
Triphone: A triphone is simply a group of 3 phones in the form "L-X+R" - where the "L" phone (i.e. the left-hand phone) precedes the "X" phone and the "R" phone (i.e. the right-hand phone) follows it.
Below is an example of the conversion of a monophone declaration of the word "TRANSLATE" to a triphone declaration (the first line shows the "monophone" declaration, and the second line shows the "triphone" declaration):
TRANSLATE [TRANSLATE] t r @ n s l e t
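The monophone-to-triphone conversion can be sketched with a few lines of code. This is an illustration, not the tool used to build the acoustic models. It assumes word-internal expansion, where the first phone has no left context and the last phone has no right context, and the function name is hypothetical.

```python
def to_triphones(phones):
    """Expand a word's phone list into word-internal triphone labels (L-X+R)."""
    labels = []
    for i, phone in enumerate(phones):
        label = phone
        if i > 0:
            label = phones[i - 1] + "-" + label  # add the left-hand phone
        if i < len(phones) - 1:
            label = label + "+" + phones[i + 1]  # add the right-hand phone
        labels.append(label)
    return labels


monophones = "t r @ n s l e t".split()
print(" ".join(to_triphones(monophones)))
# t+r t-r+@ r-@+n @-n+s n-s+l s-l+e l-e+t e-t
```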
In the CMU dictionary, which has close to 130,000 word pronunciations, there are only 43 phones, but there are close to 6000 triphones.
Phoneme  Example  Translation
-------  -------  -----------
AA       odd      AA D
AE       at       AE T
AH       hut      HH AH T
AO       ought    AO T
AW       cow      K AW
AY       hide     HH AY D
B        be       B IY
CH       cheese   CH IY Z
D        dee      D IY
DH       thee     DH IY
EH       Ed       EH D
ER       hurt     HH ER T
EY       ate      EY T
F        fee      F IY
G        green    G R IY N
HH       he       HH IY
IH       it       IH T
IY       eat      IY T
JH       gee      JH IY
K        key      K IY
L        lee      L IY
M        me       M IY
N        knee     N IY
NG       ping     P IH NG
OW       oat      OW T
OY       toy      T OY
P        pee      P IY
R        read     R IY D
S        sea      S IY
SH       she      SH IY
T        tea      T IY
TH       theta    TH EY T AH
UH       hood     HH UH D
UW       two      T UW
V        vee      V IY
W        we       W IY
Y        yield    Y IY L D
Z        zee      Z IY
ZH       seizure  S IY ZH ER
In order for Speech Audio Files to be 'compiled' into Acoustic Models, the speech contained in the audio file must be labelled in some way. This can be done using orthographic transcriptions (transcriptions containing the actual words uttered) or using phonetic transcriptions (transcriptions containing the sounds that make up the words). These transcriptions are usually contained in a separate text file, and are linked to the speech audio file by a common file name.
Transcriptions can also be 'time aligned' (where the file contains the start and end time of each word or phone) or not (no time stamps indicating the start or end of a word or phone).
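The difference between the two kinds of transcription can be pictured with a small sketch. The words, the times (in seconds) and the in-memory layout are all hypothetical and do not represent any particular file format.

```python
# Transcription without time alignment: just the words that were spoken.
plain = ["the", "house"]

# Time-aligned transcription: each word carries a start and end time in seconds.
aligned = [("the", 0.00, 0.18), ("house", 0.18, 0.62)]


def word_durations(aligned_words):
    """Return how long each word lasts, rounded to hundredths of a second."""
    return {word: round(end - start, 2) for word, start, end in aligned_words}


# Dropping the time stamps leaves the plain transcription.
print([word for word, _, _ in aligned] == plain)  # True
print(word_durations(aligned))  # {'the': 0.18, 'house': 0.44}
```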
Training Acoustic Models with short segments of transcribed speech (10-15 words long), with no time alignments, seems to create the best acoustic models.
In addition, please do not submit any lossy compressed audio (such as MP3 or Ogg Vorbis) converted to an uncompressed format (such as WAV or AIFF). For example, if you convert your audio from MP3 to WAV, information will still be missing from the audio stream, even though it has been converted to WAV.
If you have contributed speech for all the Prompts in the Speech Submission Section, you may be interested in contributing additional speech recordings using your own prompts or transcribing audio recorded by others.
These types of submissions will help to ensure that we get speech audio for as many different words as possible (especially words not already included in our Phoneme Coverage Prompts), and thus provide coverage for as many triphones as possible. It is not enough to get many different people reading the same VoxForge-created Phoneme Prompts files, because the resulting Acoustic Models will only be as good as the triphones covered in those files. We need a large variety of texts to ensure we cover as many of the triphones in the English language as possible.
Suggestions for user-submitted prompts:
Don't worry if you don't have the time (or the inclination) to create VoxForge style prompts and/or audio files. We can convert your "one big prompt file" and corresponding "one big speech audio file" (in uncompressed wav format) into VoxForge style prompt and audio files. What is important is that we get as many varied speech audio contributions as possible.
Linux is the kernel of a GNU/Linux operating system; it does not include all the other software needed to create an operating system.
Linux is not an operating system. Linux is the kernel: the program in the system that allocates the machine's resources to the other programs that you run. The kernel is an essential part of an operating system, but useless by itself; it can only function in the context of a complete operating system.
See this link for more information.
You can download a FLAC decoder here:
The Phoneme Prompts files were adapted from the prompt files contained in the CMU_ARCTIC database, which was originally designed for creating voices for the Festival Text to Speech engine. Since
it is phonetically balanced, we used it to generate prompt files for the creation of the VoxForge Speech Corpus.