VoxForge
In the dict file you created in Step 2, the pronunciation of a word was given by a series of phonemes (also called monophones - i.e. a single phone). To generate a triphone (i.e. a group of 3 phones) declaration from monophones, the "L" phone (i.e. the left-hand phone) precedes the "X" phone, and the "R" phone (i.e. the right-hand phone) follows it. The triphone is declared in the form "L-X+R".
Below is an example of the conversion to a triphone declaration of the word "TRANSLATE" (the first line shows the "monophone" declaration, and the second line shows the "triphone" declaration):
TRANSLATE [TRANSLATE] t r ae n z l ey t
TRANSLATE [TRANSLATE] t+r t-r+ae r-ae+n ae-n+z n-z+l z-l+ey l-ey+t ey-t
(Note that we may also get biphones (i.e. a group of 2 phones) at the beginning and end of the word.)
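The conversion rule can be sketched as follows (an illustrative Python snippet, not part of the HTK toolchain - HLEd performs this conversion for you in the steps below):

```python
# Illustrative sketch: turn a monophone pronunciation into HTK-style
# "L-X+R" triphone labels. The first and last entries become biphones
# ("X+R" and "L-X"), matching the note above about word boundaries.
def to_triphones(phones):
    labels = []
    for i, p in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""
        right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
        labels.append(left + p + right)
    return labels

print(" ".join(to_triphones("t r ae n z l ey t".split())))
# t+r t-r+ae r-ae+n ae-n+z n-z+l z-l+ey l-ey+t ey-t
```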
We are therefore moving to an improved level of recognition accuracy. So far, we have created a monophone Acoustic Model, which can be used with Julius. But with such a model, we are not looking at the 'context' of the monophone. The SRE (speech recognition engine) is trying to match the sound that it has heard to a single phone - a single sound.
With a triphone acoustic model, we are essentially looking for a monophone in the "context" of other monophones - i.e. the one immediately before and the one immediately after (if they exist - it may be the beginning or end of the word). This greatly improves recognition accuracy, because the SRE is looking to match a specific sequence of 3 sounds together (a triphone), rather than only one sound. This is like using a 3-word Google search rather than a single-word Google search - you get more accurate results. Triphones reduce the possibility of error caused by confusing one sound with another, because we are now looking for a distinct sequence of 3 sounds.
Up until now, we have glossed over what Hidden Markov Models (HMMs) are by saying that they are essentially statistical representations of the phones that make up a word. But an HMM is made up of many 'states', and these states can be shared (in the same way that the sp and sil phones now share their centre 'state' after Step 7). These clustered or 'tied' states are sometimes called senones.
It does not make sense to share states between monophones, because they are so different. Otherwise, why define the monophone? The point is that you want different sounds to be modelled separately, so the speech recognition engine can tell them apart.
However, when you start looking at triphones, each with its own HMM definition, you start getting multiple instances of triphones with states that are similar enough that the data can be shared among a group of triphones. This sharing process is called 'tying'. Therefore, we can 'tie' the states of many triphone HMMs so that they share the same set of parameters. This way, when we reestimate these new tied parameters, the data from each of the original untied parameters is pooled so that a better estimate can be obtained.
Basically, we don't have enough speech data to model all possible triphone combinations contained in the words of our training set, so we 'cheat' and share parts of the data amongst similar triphones to improve recognition.
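The effect of pooling can be sketched with a toy example (illustrative Python with made-up numbers - real HTK states are Gaussian mixtures re-estimated with Baum-Welch, not plain averages):

```python
# Three similar triphone states, each with only a few training examples
# (the observation values and triphone names here are made up).
observations = {
    "t-ae+n": [2.1, 1.9],
    "d-ae+n": [2.0],
    "s-ae+n": [2.2, 2.0, 1.8],
}

# Untied: each state's parameter is estimated from its own scarce data.
untied = {k: sum(v) / len(v) for k, v in observations.items()}

# Tied: all three states share one parameter, so the data is pooled and
# the estimate is based on six samples instead of one to three.
pooled = [x for v in observations.values() for x in v]
tied = sum(pooled) / len(pooled)
print(round(tied, 3))
# 2.0
```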
To convert the monophone transcriptions in the aligned.mlf file you created in Step 8 to an equivalent set of triphone transcriptions, you need to execute the HLEd command. HLEd can be used to generate a list of all triphones for which there is at least one example in the training data.
First you need to create the mktri.led edit script:
WB sp
WB sil
TC
Then you execute the HLEd (label file editor) command as follows:
Linux:
HLEd -A -D -T 1 -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf
Windows:
HLEd -A -D -T 1 -n triphones1 -l * -i wintri.mlf mktri.led aligned.mlf
This creates 2 files: triphones1 (a list of all the triphones found in the training data) and wintri.mlf (the triphone transcriptions).
Next, download the Julia script mktrihed.jl to your 'voxforge/bin' folder, then create the mktri.hed file by executing:
julia ../bin/mktrihed.jl monophones1 triphones1 mktri.hed
This creates the mktri.hed file. This file contains a clone command 'CL' followed by a series of 'TI' commands to 'tie' HMMs so that they share the same set of parameters. This way, when we reestimate these new tied parameters (with HERest below), the data from each of the original untied parameters is pooled so that a better estimate can be obtained.
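For reference, the generated script looks roughly like this (an illustrative excerpt based on the HTK Book's example - the actual 'TI' commands depend on your monophone list). The 'CL' command clones each monophone HMM for every triphone in triphones1, and each 'TI' command ties the transition matrices of all triphones sharing the same base phone:

```
CL triphones1
TI T_ah {(*-ah+*,ah+*,*-ah).transP}
TI T_ax {(*-ax+*,ax+*,*-ax).transP}
TI T_ey {(*-ey+*,ey+*,*-ey).transP}
...
```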
Then create 3 more folders: hmm10, hmm11 and hmm12.
Next, execute the HHEd command:
(HHEd is the HTK hmm definition editor and is mainly used for applying 'tyings' across selected HMM parameters.)
HHEd -A -D -T 1 -H hmm9/macros -H hmm9/hmmdefs -M hmm10 mktri.hed monophones1
The files created by this command are: hmm10/macros and hmm10/hmmdefs.
Next run HERest 2 more times:
HERest -A -D -T 1 -C config -I wintri.mlf -t 250.0 150.0 3000.0 -S train.scp -H hmm10/macros -H hmm10/hmmdefs -M hmm11 triphones1
You will also get lots of warnings (-2331). These are occurring because we don't have much training data. They can be safely ignored for this tutorial.
The files created by this command are: hmm11/macros and hmm11/hmmdefs.
HERest -A -D -T 1 -C config -I wintri.mlf -t 250.0 150.0 3000.0 -s stats -S train.scp -H hmm11/macros -H hmm11/hmmdefs -M hmm12 triphones1
The files created by this command are: hmm12/macros, hmm12/hmmdefs, and the stats file (written by the -s option).