Acoustic Model Discussions

VoxForge: acoustic model parameters
User: kmaclean
Date: 10/17/2007 8:28 pm
Views: 10309
Rating: 20
Email from David Gelbart: 
Hi Ken,

Congratulations on your 1 year anniversary and reaching 27 hours of data.  I'm writing to comment on a technical issue that I'm not sure if you are aware of:

The general rule I have seen with ASR systems is that, as the amount of training data increases, it eventually becomes necessary to add more acoustic model parameters in order to get the full benefit of the additional data.  On the other hand, using too many acoustic model parameters may cause overfitting (in other words, the system starts modeling quirks of the training data to the point where the system's performance on non-training data is worsened).

Thus, you may need to periodically tune the number of acoustic model parameters you are using.  I suppose the easiest way to do this is to create a test set which does not overlap with the training set, and measure word recognition accuracy on the test set for various acoustic model sizes.
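
The tuning loop described above can be sketched as follows. This is only an illustration: it assumes a hypothetical results file with one line per trained model, holding the number of Gaussians per state and the word accuracy measured on the held-out test set. The function name and file layout are not part of HTK or of the VoxForge scripts.

```shell
# Pick the acoustic model size with the highest held-out word accuracy.
# Assumption: the results file has two whitespace-separated columns per line:
#   <gaussians-per-state> <word-accuracy-percent>
# On a tie, the first (earlier) line wins.
best_model_size() {
    awk 'NR == 1 || $2 > best { best = $2; size = $1 } END { print size }' "$1"
}
```

Calling `best_model_size results.txt` prints the size from the best-scoring line, so the choice can be repeated each time the corpus grows and the models are retrained.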

One way to increase the number of parameters is to use more Gaussians in the Gaussian mixtures.  (One way to do this in HTK is to add one or more additional mixup stages.  This has the advantage that you can use your test set to compare recognition accuracy before and after the mixup, so that you can obtain your recognition accuracy numbers without having to retrain a system from scratch each time.)

Another way to increase the number of parameters is to move from monophones to triphones (unless you are using triphones already).

Another way is to reduce the amount of state-tying.

Regards,
David

--- (Edited on 10/17/2007 9:28 pm [GMT-0400] by kmaclean) ---

Re: VoxForge: acoustic model parameters
User: kmaclean
Date: 10/17/2007 8:33 pm
Views: 372
Rating: 23

My reply: 

Hi David,

thanks for keeping an eye on the VoxForge project!


> Thus, you may need to periodically tune the number of acoustic model
> parameters you are using.
I did not realize I needed to do this on an on-going basis (as the corpus gets larger ...), thanks.

> I suppose the easiest way to do this is to
> create a test set which does not overlap with the training set, and
> measure word recognition accuracy on the test set for various acoustic
> model sizes.
I basically have no real acoustic model testing (just some 'sanity-testing' using recordings of my own voice) - I agree it needs to be done.
 
> One way to increase the number of parameters is to use more Gaussians
> in the Gaussian mixtures.
My training recipe is based on the HTK Tutorial, which describes the creation of "continuous density mixture Gaussian tied-state triphones with clustering performed using phonetic decision trees". I have not looked at increasing the number of Gaussian mixtures per state, and am not sure how.  I am not even sure I understand what a Gaussian is (more reading is required on my part...)

Keith Vertanen's HTK training recipe site has a paper where he describes the results of using different combinations of parameters.  With respect to the Number of Gaussians (section 2.2) he says:

Recognition experiments were conducted on models with a varying number of Gaussians per state. Results for a single Gaussian per state were omitted from the graphs for clarity. In all cases the omitted single Gaussian model performed much worse than the semi-continuous or two Gaussian model.

Both Nov'92 (figures 1 and 2) and si_dt_s2 (figures 7 and 8) tasks show continued reductions in WER as exponentially more Gaussians are added to the models. Noticeable gains were made even from 16 to 32 Gaussians, suggesting even more Gaussians might prove advantageous.

The large number of Gaussians per state does not come for free: the real-time factor increases significantly as more Gaussians are added (figures 4, 5, 10 and 11).

Using the Sphinx recognizer, further tests were done on models with 64 and 128 Gaussians. As shown in figure 13, more Gaussians provided no additional benefit on either the Nov'92 or si_dt_s2 test sets. Using so many Gaussians also slows the recognizer to significantly below real-time (figure 14).

> (One way to do this in HTK is to add one or
> more additional mixup stages.  This has the advantage that you can use
> your test set to compare recognition accuracy before and after the
> mixup, so that you can obtain your recognition accuracy numbers
> without having to retrain a system from scratch each time.)
I am not sure what you mean by additional "mixup stages".  In Keith's training recipe, his train_mixup.sh script seems to be doing what you are talking about.  From the comments in the script:
# Mixup the number of Gaussians per state, from 1 up to 8.
# We do this in 4 steps, with 4 rounds of reestimation
# each time.  We mix to 8 to match paper "Large Vocabulary
# Continuous Speech Recognition Using HTK"
#
# Also per Phil Woodland's comment in the mailing list, we
# will let the sp/sil model have double the number of
# Gaussians.
#
# This version does sil mixup to 2 first, then from 2->4->6->8 for
# normal and double for sil.

The following is a section from his train_mixup.sh script:
#######################################################
# Mixup sil from 1->2
HHEd -B -H $TRAIN_WSJ0/hmm17/macros -H $TRAIN_WSJ0/hmm17/hmmdefs -M $TRAIN_WSJ0/hmm18 $TRAIN_WSJ0/mix1.hed $TRAIN_WSJ0/tiedlist >$TRAIN_WSJ0/hhed_mix1.log

#HERest -B -m 0 -A -T 1 -C $TRAIN_COMMON/config -I $TRAIN_WSJ0/wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H $TRAIN_WSJ0/hmm18/macros -H $TRAIN_WSJ0/hmm18/hmmdefs -M $TRAIN_WSJ0/hmm19 $TRAIN_WSJ0/tiedlist >$TRAIN_WSJ0/hmm19.log

#HERest -B -m 0 -A -T 1 -C $TRAIN_COMMON/config -I $TRAIN_WSJ0/wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H $TRAIN_WSJ0/hmm19/macros -H $TRAIN_WSJ0/hmm19/hmmdefs -M $TRAIN_WSJ0/hmm20 $TRAIN_WSJ0/tiedlist >$TRAIN_WSJ0/hmm20.log

#HERest -B -m 0 -A -T 1 -C $TRAIN_COMMON/config -I $TRAIN_WSJ0/wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H $TRAIN_WSJ0/hmm20/macros -H $TRAIN_WSJ0/hmm20/hmmdefs -M $TRAIN_WSJ0/hmm21 $TRAIN_WSJ0/tiedlist >$TRAIN_WSJ0/hmm21.log

#HERest -B -m 0 -A -T 1 -C $TRAIN_COMMON/config -I $TRAIN_WSJ0/wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H $TRAIN_WSJ0/hmm21/macros -H $TRAIN_WSJ0/hmm21/hmmdefs -M $TRAIN_WSJ0/hmm22 $TRAIN_WSJ0/tiedlist >$TRAIN_WSJ0/hmm22.log

$TRAIN_TIMIT/train_iter.sh $TRAIN_WSJ0 hmm18 hmm19 tiedlist wintri.mlf 0
$TRAIN_TIMIT/train_iter.sh $TRAIN_WSJ0 hmm19 hmm20 tiedlist wintri.mlf 0
$TRAIN_TIMIT/train_iter.sh $TRAIN_WSJ0 hmm20 hmm21 tiedlist wintri.mlf 0
$TRAIN_TIMIT/train_iter.sh $TRAIN_WSJ0 hmm21 hmm22 tiedlist wintri.mlf 0

where mix1.hed contains:
MU 2 {sil.state[2-4].mix}

It seems to me that this is one of *many* additional training steps that occur after Step 10 of the HTK Tutorial (which creates the tied-state triphones), in which you incrementally increase the number of Gaussians per state - i.e. the "mixup stages" that you were referring to.
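
As an illustration of what one later mixup stage might look like, here is a minimal sketch that prints the HHEd edit commands for a given target. It follows the quoted script comments (silence gets double the number of Gaussians) but has not been checked against the actual .hed files in Keith's recipe. The function name is hypothetical, and the model name "sil" and the state range 2-4 are taken from mix1.hed above.

```shell
# Print HHEd "MU" (mix-up) commands for one mixup stage.
# $1 = target number of Gaussians per state for ordinary models.
# The silence model gets twice as many, as the quoted script comments describe.
# Assumptions: three emitting states (state[2-4]) and a model named "sil".
mixup_hed() {
    n=$1
    printf 'MU %d {%s.state[2-4].mix}\n' "$n" '*'
    printf 'MU %d {%s.state[2-4].mix}\n' "$((n * 2))" 'sil'
}
```

For example, `mixup_hed 4 > mix3.hed` would write a script that HHEd could apply in the same way mix1.hed is applied above, followed by a few rounds of re-estimation before the next stage.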

> Another way to increase the number of parameters is to move from
> monophones to triphones (unless you are using triphones already).
yes, we already use triphones

> Another way is to reduce the amount of state-tying.
Keith also describes his approach to training acoustic models with varying numbers of tied states:
HTK and Sphinx acoustic models were trained varying the number of tied-states (senones) between 4000, 6000, 8000 and 10000.

In the case of HTK, the exact number of tied-states cannot be specified, but instead thresholds are given to the phonetic decision tree state clustering step. The outlier threshold (RO) was held constant and the threshold controlling clustering termination (TB) was varied (see table 3).

On the "easy" 5K vocabulary Nov'92 task, there was little or no WER advantage in using more tied-states for either Sphinx (figure 1) or HTK (figure 2). On the "harder" 60K vocabulary si_dt_s2 task there appears to be a modest advantage to more tied-states using Sphinx (figure 7), but little difference using HTK (figure 8).

Of course having more tied-states requires the decoder to compute more Gaussian likelihoods per observation. This is shown by the increased xRT factor for the higher numbers of tied-states in figures 4, 5, 10, and 11.

So from Keith's discussion, I think I can figure out how to adjust the number of tied-states in HTK (using the RO and TB thresholds).
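
A minimal sketch of how the TB threshold could be varied, assuming a tree.hed like the one built in the HTK Tutorial, where each clustering command starts with "TB <threshold> ". The function name is hypothetical. As a rule of thumb, a larger TB stops tree splitting earlier and so gives fewer tied states, while a smaller TB gives more.

```shell
# Rewrite the threshold on every TB line of an HHEd decision-tree script
# and write the result to stdout. All other lines (RO, TR, QS, ...) pass
# through unchanged.
# $1 = new TB threshold, $2 = path to an existing tree.hed
set_tb_threshold() {
    sed "s/^TB [0-9.][0-9.]* /TB $1 /" "$2"
}
```

For example, `set_tb_threshold 600 tree.hed > tree_600.hed` produces a variant script that can be run through HHEd, after which the resulting tied-state count can be compared across thresholds.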

One thing I noticed from Keith's scripts is that it seems like you need to chunk the process in order to avoid errors with large speech corpora.  From his train_iter.sh script:

# Does a single iteration of HERest training.
#
# This handles the parallel splitting and recombining
# of the accumulator files.  This is necessary to
# prevent inaccuracies and eventual failure with large
# amounts of training data.
#
# According to Phil Woodland, one accumulator file
# should be generated for about each hour of training
# data.
#
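
The splitting step described in these comments might look roughly like the sketch below. It is not taken from train_iter.sh; the function name and the chunk file naming are assumptions. The HERest calls are shown only as comments, based on the parallel mode (-p) described in the HTK Book, with all other options elided.

```shell
# Split a training list into N roughly equal chunks (round-robin), one per
# HERest run. Writes <prefix>.1 ... <prefix>.N.
# $1 = training list (one file per line), $2 = number of chunks, $3 = prefix
split_scp() {
    list=$1; chunks=$2; prefix=$3
    awk -v n="$chunks" -v p="$prefix" \
        '{ print > (p "." ((NR - 1) % n + 1)) }' "$list"
}

# Hypothetical use, one accumulator per chunk and then a combining pass
# ("..." stands for the remaining HERest options, which are not shown here):
#   HERest ... -S train.scp.1 -p 1 -M hmm19 tiedlist    # dumps HER1.acc
#   HERest ... -S train.scp.2 -p 2 -M hmm19 tiedlist    # dumps HER2.acc
#   HERest ... -p 0 -M hmm19 tiedlist hmm19/HER*.acc    # combines them
```

With about one chunk per hour of training data, as the comment suggests, a 27-hour corpus would be split into roughly 27 chunks.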

Do you do something similar in your AM training?

Basically, it seems like I've got to study Keith's scripts to ensure that the VoxForge Acoustic Models are as accurate as possible as the corpus increases in size.

Can I include this as a thread on the VoxForge site?

thanks,

Ken

 

--- (Edited on 10/17/2007 9:33 pm [GMT-0400] by kmaclean) ---

Re: VoxForge: acoustic model parameters
User: kmaclean
Date: 10/17/2007 8:35 pm
Views: 311
Rating: 18
> I am not even sure I understand what a Gaussian is (more reading is required
> on my part...)
I have some tutorial material linked at
http://www.icsi.berkeley.edu/~gelbart/edu.html that may be
useful.  Among the online material, I especially recommend the
Columbia/IBM slides.  Week 3 talks about Gaussians.  A Gaussian in
speech recognition is the same as a Gaussian probability density
function in probability & statistics.  Along with the slides for Week
3, you can find a list of textbook readings that go along with it.
These books may be hard to find in public libraries but you could try
inter-library loan or a university library (or buy them).

>> (One way to do this in HTK is to add one or
>> more additional mixup stages.  This has the advantage that you can use
>> your test set to compare recognition accuracy before and after the
>> mixup, so that you can obtain your recognition accuracy numbers
>> without having to retrain a system from scratch each time.)
>
> I am not sure what you mean by additional "mixup stages".  In Keith's
> training recipe, his train_mixup.sh script seems to be doing what you are
> talking about.
Yes.  I think the section in the HTK manual that describes this is
titled 'Mixture Incrementing'.

> One thing I noticed from Keith's scripts is that it seems like you need to
> chunk the process in order to avoid errors with large speech corpora.  From
> his train_iter.sh script:
...
> Do you do something similar in your AM training?
I have only used HTK with small corpora and whole-word modeling (not
triphones).  So I cannot provide much advice regarding chunking or
state-tying.

I think the htk-users mailing list is the best forum for your HTK
questions.  If you write to that list, I think it would be good to
include a description of the VoxForge project and what you've
accomplished so far.  That may help motivate people to help you, and
it will spread awareness of your project.

> Can I include this as a thread on the VoxForge site?
Please do.

Regards,
David

--- (Edited on 10/17/2007 9:35 pm [GMT-0400] by kmaclean) ---

Re: VoxForge: acoustic model parameters
User: kmaclean
Date: 10/17/2007 8:36 pm
Views: 386
Rating: 22
> I suppose the easiest way to do this is to
> create a test set which does not overlap with the training set, and
> measure word recognition accuracy on the test set for various acoustic
> model sizes.
Whether the test set should include data from speakers that are in the
training set or only contain data from speakers which are not in the
training set depends on how you envision the ASR being used.  You
could even have test sets of both types, or a test set that had
speakers of both types.
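
A speaker-disjoint split of the kind described above can be sketched as follows. The input layout is an assumption: one line per utterance with a speaker ID followed by the utterance path, plus a separate file listing the speakers to hold out. Neither file format comes from the VoxForge scripts, and the function name is hypothetical.

```shell
# Split "<speaker-id> <utterance-path>" lines into training and test lists
# so that no held-out speaker appears in the training list.
# $1 = utterance list, $2 = file with one held-out speaker ID per line,
# $3 = training list to write, $4 = test list to write
split_by_speaker() {
    : > "$3"; : > "$4"
    awk -v tr="$3" -v te="$4" '
        FILENAME == ARGV[1] { held[$1] = 1; next }   # first file: held-out speakers
        { if ($1 in held) print $2 > te; else print $2 > tr }
    ' "$2" "$1"
}
```

A test set with speakers of both types could then be built by also moving a few utterances from training speakers into the test list, as long as those utterances are removed from the training list.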

Regards,
David

--- (Edited on 10/17/2007 9:36 pm [GMT-0400] by kmaclean) ---

Re: VoxForge: acoustic model parameters
User: nsh
Date: 10/18/2007 1:46 am
Views: 2894
Rating: 31

A simple HDecode + HResults run can tell you a lot about the quality of the database. Until you do that, it's very hard to understand what's going on there.

For example, for Russian we are still at around 80% accuracy and can't improve it, partly because of bad transcriptions and partly because of a poor language model.
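
For reference, the accuracy figure that HResults reports is computed from the alignment counts as Acc = (N - D - S - I) / N * 100. A minimal sketch of that arithmetic (the function name is hypothetical and the counts in the example are made up):

```shell
# Word accuracy as HResults defines it: Acc = (N - D - S - I) / N * 100,
# where N = reference words, D = deletions, S = substitutions, I = insertions.
# $1 = N, $2 = D, $3 = S, $4 = I
word_accuracy() {
    awk -v n="$1" -v d="$2" -v s="$3" -v i="$4" \
        'BEGIN { printf "%.2f\n", (n - d - s - i) / n * 100 }'
}
```

For example, `word_accuracy 100 5 10 5` prints 80.00. Because insertions are subtracted, accuracy can be lower than the percent-correct figure, which ignores them.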

--- (Edited on 10/18/2007 1:46 am [GMT-0500] by nsh) ---
