Acoustic Model Discussions

int32 wraparound Error during decoding
User: joebob
Date: 2/9/2010 5:52 am
Views: 6158
Rating: 6

Hi,

 I am trying to train up some telephony acoustic models from the Fisher corpus using SphinxTrain and I have been running into a couple of problems during decoding that I can't seem to suss, which I think are related to my training environment.


 In the past I have successfully trained acoustic models with SphinxTrain for several languages but seem to be having trouble with the Fisher corpus for some reason.  I suspect that the issue has something to do with either,

  • The conversion to .wav format:  the Fisher data is in compressed NIST SPHERE format, so in order to obtain the mfc files I need to jump through some hoops.  At present I am converting the 8kHz compressed .sph files to .wav format with the sph2pipe tool,
    $ sph2pipe -f rif -p orig.sph orig.wav
  • the segmentation:  the Fisher corpus contains very long utterances, accompanied by transcriptions containing detailed timing information for individual speakers.  I have been using sox to merge the original 2 channels into one and then break the utterances into smaller segments based on the timing information in the transcripts: 
    $ sox orig.wav -c 1 -w -s orig-1.wav trim segstart segstop-segstart
    I also segment the transcripts as necessary and have checked as best I can with random selection of segments that the resulting audio files and transcriptions really do match.
  • the feature extraction: I'm using the SphinxTrain setup with make_feats.pl to perform the wave2feat feature extraction on the corpus that results from the segmentation.  My feature setup at present is as follows,
    -alpha 0.97
    -dither yes
    -doublebw no
    -nfilt 31
    -ncep 13
    -lowerf 200
    -upperf 3500
    -nfft 512
    -wlen 0.0256
    -srate 8000
    -transform legacy
    -feat 1s_c_d_dd
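
As a quick cross-check of the feature settings in the last bullet, here is a minimal Python sketch of the arithmetic behind them. The function name and the three rules it applies (the analysis window must fit inside the FFT, the upper filter edge must not exceed the Nyquist frequency, and the lower edge must sit below the upper edge) are illustrative assumptions, not something taken from wave2feat itself.

```python
# Values copied from the wave2feat settings listed above.
SRATE = 8000      # -srate, samples per second
WLEN = 0.0256     # -wlen, analysis window length in seconds
NFFT = 512        # -nfft, FFT size in samples
LOWERF = 200      # -lowerf, lower filter bank edge in Hz
UPPERF = 3500     # -upperf, upper filter bank edge in Hz


def check_frontend(srate, wlen, nfft, lowerf, upperf):
    """Return a list of problems found; an empty list means the values agree."""
    problems = []
    window_samples = wlen * srate  # 0.0256 s * 8000 Hz = 204.8 samples
    if window_samples > nfft:
        problems.append(
            f"window of {window_samples:.1f} samples does not fit in nfft={nfft}"
        )
    if upperf > srate / 2:
        problems.append(f"upperf={upperf} exceeds the Nyquist frequency {srate / 2}")
    if lowerf >= upperf:
        problems.append(f"lowerf={lowerf} is not below upperf={upperf}")
    return problems


if __name__ == "__main__":
    found = check_frontend(SRATE, WLEN, NFFT, LOWERF, UPPERF)
    print(found if found else "front-end settings are mutually consistent")
```

With the values above the window is 204.8 samples, well inside the 512-point FFT, and 3500 Hz is below the 4000 Hz Nyquist limit, so these checks do not point at the feature setup.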

I am currently using just a very small subset of this rather enormous corpus, which consists of just 1.5 hours of data. I've set the final number of densities to 8, 3-state hmms, 1000 senones, and continuous models.  I'd like to work out these kinks before moving forward with a larger chunk of training data.


Using these parameters and the conversion setup above I am able to train my models without incident, and the only errors that appear in the logs during my training steps:

  1. 00.verify/verify_all.pl
  2. 20.ci_hmm/slave_convg.pl
  3. 30.cd_hmm_untied/slave_convg.pl
  4. 40.buildtrees/slave.treebuilder.pl
  5. 45.prunetree/slave.state-tying.pl
  6. 50.cd_hmm_tied/slave_convg.pl

are typical errors related to a very small number of utterances where no final state was reached, e.g.,

utt:   706            fe_03_00085-58  137    0    28 23 ERROR: "backward.c", line 431: final state not reached


This gave me the initial impression that my models were OK, but when I try to run sphinx3_decode on a couple of the training utterances as a sanity check, using all the same decoding parameters as were used during training, I invariably get many of the following errors,

ERROR: "srch_time_switch_tree.c", line 848: ***ERROR*** Fr 9, best HMM score > 0 (2146874566); int32 wraparound?


I checked these out in the sphinxtrain faq,

http://www.speech.cs.cmu.edu/sphinxman/logfiles.html#201

but altering the filler model dictionary didn't seem to have any effect, and the transitions matrices all looked OK. 


The decoding does complete, and the hypotheses being generated are reasonable, but they are invariably truncated, i.e., the hypotheses are much shorter than the input utterances.  This strikes me as a possible silence modeling issue, but I can't figure out how to fix it, or why it isn't causing problems during training.

 

One other thing I noticed was, when looking at the variances,

$ printp -gaufn variances

there are quite a few zero entries,

mgau 141                                                                                                                                                                                                 
feat 0                                                                                                                                                                                                   
density    0 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+0\
0 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00\
 

which seem likely to be wrong.
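
To list every affected density rather than eyeballing the dump, something like the following Python sketch could be run over the saved printp output. It assumes the text layout shown above (lines beginning with `mgau`, `feat` and `density`) and that each density sits on a single unwrapped line; the function name is hypothetical.

```python
def zero_variance_densities(dump_text):
    """Return (mgau, feat, density) for each density whose variances are all zero.

    Assumes the layout pasted above: 'mgau N', 'feat N', then
    'density N v1 v2 ...' with all values on one line.
    """
    results = []
    mgau = feat = None
    for line in dump_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue
        if tokens[0] == "mgau":
            mgau = int(tokens[1])
        elif tokens[0] == "feat":
            feat = int(tokens[1])
        elif tokens[0] == "density":
            values = [float(v) for v in tokens[2:]]
            if values and all(v == 0.0 for v in values):
                results.append((mgau, feat, int(tokens[1])))
    return results


if __name__ == "__main__":
    # 'variances.txt' is a hypothetical file holding the saved printp output.
    with open("variances.txt") as handle:
        for mgau, feat, density in zero_variance_densities(handle.read()):
            print(f"mgau {mgau} feat {feat} density {density}: all-zero variance")
```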

I also tried including the force alignment steps in the training procedure but this did not have any effect.

Any advice would be greatly appreciated and I can provide further information if it will be helpful.

--- (Edited on 2/9/2010 5:52 am [GMT-0600] by joebob) ---

--- (Edited on 2/9/2010 6:03 am [GMT-0600] by joebob) ---

Re: int32 wraparound Error during decoding
User: nsh
Date: 2/9/2010 6:21 am
Views: 68
Rating: 6

It's easy to find in the mdef file which phone, in which context, gau 141 belongs to. Then you need to find out why this context is not well represented in the training data. Fisher by default has too many fillers. Another possible mistake could be in trimming the silence around each utterance: there must be about 0.2 s of silence.
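
A minimal Python sketch of that mdef lookup follows. It rests on two assumptions that should be checked against the actual file: that the text mdef lists one phone per line as `base lft rt p attrib tmat <state ids> N`, and that the mgau index from printp is the same number as the senone id (true for continuous models as far as I know). The function name is hypothetical.

```python
def triphones_for_senone(mdef_text, senone_id):
    """Return (base, left, right, position) for each phone that uses senone_id.

    Assumed line layout: base lft rt p attrib tmat id id ... N
    Header, count and comment lines are skipped.
    """
    wanted = str(senone_id)
    matches = []
    for line in mdef_text.splitlines():
        tokens = line.split()
        # A phone line has 6 leading fields, at least one state id, and a final N.
        if len(tokens) < 8 or tokens[0].startswith("#") or tokens[-1] != "N":
            continue
        state_ids = tokens[6:-1]
        if wanted in state_ids:
            matches.append(tuple(tokens[:4]))
    return matches


if __name__ == "__main__":
    # 'model.mdef' is a hypothetical path to the model definition file.
    with open("model.mdef") as handle:
        for base, left, right, position in triphones_for_senone(handle.read(), 141):
            print(base, left, right, position)
```

Counting how often the reported contexts occur in the training transcripts then shows whether the senone was simply starved of data.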

--- (Edited on 2/9/2010 15:21 [GMT+0300] by nsh) ---

Re: int32 wraparound Error during decoding
User: joebob
Date: 2/9/2010 6:57 am
Views: 241
Rating: 7

Hi, 

 

Thanks for the quick reply.  I've already removed the superfluous fillers that aren't represented in my subset, so that is probably OK. 

 

It sounds like the issue is most likely the silence padding on either end of the utterance.  I have just used the transcript timing information exactly as-is, so this is almost certainly truncating things in some inappropriate places.  Also, some of the segments are extremely short and occasionally contain just an 'um' or an 'oh yeah'.  I had thought about merging these based on conversational turns rather than strict adherence to the transcripts, but strictly adhering made for simpler scripts.  Looks like I may have to revise that approach.

 

Is it sufficient to use sox to pad each of the utterance segments with 0.2s silence?  I've never tried that and wonder if it is acceptable?


Is it normal for this not to raise error issues during the training process?

--- (Edited on 2/9/2010 6:57 am [GMT-0600] by joebob) ---

Re: int32 wraparound Error during decoding
User: nsh
Date: 2/10/2010 1:53 am
Views: 64
Rating: 6

> Is it sufficient to use sox to pad each of the utterance segments with 0.2s silence?  I've never tried that and wonder if it is acceptable?

It's easier to try that than to guess.
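
For anyone who wants to try it without sox, here is a rough Python sketch that pads a segment using only the standard `wave` module. It assumes 16-bit signed PCM .wav files, where zero-valued samples are silence, and the function name is hypothetical. The padding is exact digital zeros, so the `-dither yes` setting in the feature extraction is worth keeping.

```python
import wave


def pad_with_silence(src_path, dst_path, pad_seconds=0.2):
    """Copy src_path to dst_path with pad_seconds of zeros added at both ends.

    Assumes 16-bit signed PCM, where zero-valued samples represent silence.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    pad_frames = int(round(pad_seconds * params.framerate))
    silence = b"\x00" * (pad_frames * params.nchannels * params.sampwidth)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        dst.writeframes(silence + frames + silence)
    return pad_frames


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    added = pad_with_silence("shortseg.wav", "shortseg-padded.wav")
    print(f"added {added} frames of silence at each end")
```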

> Is it normal for this not to raise error issues during the training process?

It's a minor bug that could be fixed, but first it needs to be reported, of course. When utterances that were not aligned properly are dropped, the training material for some senone can disappear, and that can cause zero variance.


> I've already removed the superfluous fillers that aren't represented in my subset, so that is probably OK. 


Then it's still interesting to know which phone senone 141 belongs to.

 

--- (Edited on 2/10/2010 10:57 [GMT+0300] by nsh) ---

Re: int32 wraparound Error during decoding
User: joebob
Date: 2/10/2010 6:34 am
Views: 160
Rating: 6

Hi,

  Thanks again for the reply.  With respect to the padding I was a bit worried that maybe inadvertently adding zeros to the ends of the file would screw things up.  Applying a bit of dithering seems to make that issue moot. 

  I still cannot decode everything, and although I looked up the zero variance senones, I was unable to find anything particularly suspicious.  I'm thinking now that the issue is more basic and has to do with the way I've configured the trainer. I'll provide a response if I suss out the issue.

--- (Edited on 2/10/2010 6:34 am [GMT-0600] by joebob) ---

Re: int32 wraparound Error during decoding
User: joebob
Date: 2/11/2010 5:13 pm
Views: 2434
Rating: 6

Hi again,

  I sorted out the problem by eliminating the sox commands from my segmentation approach and relying entirely on sph2pipe for the whole process.  So instead of running the conversion to wav and then using sox to perform the segmentation according to the transcript, I'm now just running one sph2pipe command for each transcription segment,

 

$ sph2pipe -p -f rif -t starttime:endtime largefile.sph shortseg.wav


This eliminated the problem, so my guess is that the real issue was a mismatch in the timing acquired by sox versus the transcript.  Anyway, I've now segmented the massive Fisher corpus and am on my way to models comprising a couple of hundred hours of the data.
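
For reference, the per-segment commands can be generated along these lines. This is only a sketch: it assumes each transcript line looks like `start end speaker: text` with times in seconds, and the function name and output naming scheme are made up for illustration. The sph2pipe flags are exactly the ones in the command above.

```python
def sph2pipe_commands(transcript_lines, sph_path, out_prefix):
    """Build one sph2pipe argument list per transcript segment.

    Assumes lines of the form 'start end speaker: text'. Lines that do not
    begin with two numbers (headers, comments, blanks) are skipped.
    """
    commands = []
    for line in transcript_lines:
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue
        try:
            start, end = float(fields[0]), float(fields[1])
        except ValueError:
            continue
        if end <= start:
            continue
        out_path = f"{out_prefix}-{len(commands) + 1:04d}.wav"
        # Pass the times through exactly as written in the transcript.
        time_range = f"{fields[0]}:{fields[1]}"
        commands.append(
            ["sph2pipe", "-p", "-f", "rif", "-t", time_range, sph_path, out_path]
        )
    return commands


if __name__ == "__main__":
    example = ["# header line", "0.39 1.87 A: hello", "2.10 4.05 B: hi there"]
    for command in sph2pipe_commands(example, "largefile.sph", "shortseg"):
        print(" ".join(command))
```

Each list can be handed to `subprocess.run` as-is. Because the times are passed to sph2pipe untouched, no start/duration arithmetic is involved.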

 

Thanks again for the responses, they got me going in the right direction.

--- (Edited on 2/11/2010 5:13 pm [GMT-0600] by joebob) ---
