Acoustic Model Discussions

HTK MFCC Study
User: kmaclean
Date: 3/20/2011 8:28 pm
Views: 9050
Rating: 6

This study was originally published on the http://mphmedia.net web site, which is no longer active. I am posting it here, before it gets flushed from Google's cache, because it contains some excellent comparison information for HTK MFCC TARGETKINDs:

HTK MFCC Study

The goal is to compare all the possible MFCC TARGETKINDs, although some may be missing. This is an initial study; no feedback has yet been used to correct and improve the results. The low recognition rates should not be taken as a measure of the models' potential: the study was designed to compare TARGETKINDs, and it used a minimum of acoustic data to reduce model assembly and testing time.

Results

Acoustic Model Obtained

| TARGETKIND | Word % Correct | Word Accuracy (%) | Words Correct | Deletion Errors | Substitution Errors | Insertion Errors | Word Errors | Sentence Errors | No tokens at beam 250 | No tokens at final node |
|---|---|---|---|---|---|---|---|---|---|---|
| MFCC_0_D_A_Z | 9.11% | -2.36 | 3841 | 4949 (11.74%) | 33350 (79.14%) | 4834 (11.47%) | 102.36% | 100% | 7 | 0 |
| MFCC_D_A_E_Z | 8.51% | -1.39 | 3588 | 5477 (13.00%) | 33075 (78.49%) | 4175 (9.91%) | 101.39% | 100% | 4 | 0 |
| MFCC_0_D_Z | 8.41% | 2.62 | 3546 | 9057 (21.49%) | 29537 (70.09%) | 2442 (5.79%) | 97.38% | 100% | 4 | 0 |
| MFCC_0_D_A_T_Z | 8.38% | -6.87 | 3532 | 3616 (8.58%) | 34992 (83.04%) | 6427 (15.25%) | 106.87% | 100% | 9 | 0 |
| MFCC_D_A_Z | 8.12% | 0.55 | 3421 | 6963 (16.52%) | 31756 (75.36%) | 3191 (7.57%) | 99.45% | 100% | 4 | 0 |
| MFCC_D_E_Z | 7.80% | 3.19 | 3288 | 9653 (22.91%) | 29199 (69.29%) | 1943 (4.61%) | 96.81% | 100% | 4 | 0 |
| MFCC_D_A_T_Z | 7.32% | -2.96 | 3084 | 4962 (11.78%) | 34094 (80.91%) | 4331 (10.28%) | 102.96% | 100% | 10 | 0 |
| MFCC_D_Z | 7.30% | 3.07 | 3076 | 10150 (24.09%) | 28914 (68.61%) | 1784 (4.23%) | 96.93% | 100% | 2 | 0 |
| MFCC_D_A_T_E_Z | 6.33% | -2.27 | 2667 | 5386 (12.78%) | 34087 (80.89%) | 3625 (8.60%) | 102.27% | 100% | 6 | 0 |
| MFCC_D_A_T_E | 6.16% | -2.00 | 2597 | 5689 (13.50%) | 33854 (80.34%) | 3440 (8.16%) | 102.00% | 100% | 11 | 0 |
| MFCC_D_A_E | 5.81% | 0.01 | 2450 | 7642 (18.13%) | 32048 (76.05%) | 2447 (5.81%) | 99.99% | 100% | 39 | 1 |
| MFCC_0_D_A | 5.78% | -1.37 | 2434 | 6679 (15.85%) | 33027 (78.37%) | 3010 (7.14%) | 101.37% | 100% | 44 | 1 |
| MFCC_0_Z | 5.48% | 4.09 | 2308 | 15296 (36.30%) | 24536 (58.22%) | 585 (1.39%) | 95.91% | 100% | 0 | 0 |
| MFCC_0_D_A_T | 5.39% | -3.80 | 2273 | 5139 (12.20%) | 34728 (82.41%) | 3876 (9.20%) | 103.80% | 100% | 48 | 1 |
| MFCC_D_A | 5.28% | 1.18 | 2225 | 9611 (22.81%) | 30304 (71.91%) | 1726 (4.10%) | 98.82% | 100% | 39 | 2 |
| MFCC_0_E_Z | 5.23% | 3.61 | 2206 | 13989 (33.20%) | 25945 (61.57%) | 684 (1.62%) | 96.39% | 100% | 7 | 0 |
| MFCC_D_E | 5.21% | 2.83 | 2194 | 12284 (29.15%) | 27662 (65.64%) | 1003 (2.38%) | 97.17% | 100% | 57 | 2 |
| MFCC_D_A_T | 5.15% | -1.94 | 2170 | 6614 (15.70%) | 33356 (79.16%) | 2988 (7.09%) | 101.94% | 100% | 31 | 0 |
| MFCC_0_D | 5.12% | 2.57 | 2156 | 12573 (29.84%) | 27411 (65.05%) | 1073 (2.55%) | 97.43% | 100% | 58 | 1 |
| MFCC_E_Z | 4.96% | 3.95 | 2089 | 16136 (38.29%) | 23915 (56.75%) | 425 (1.01%) | 96.05% | 100% | 4 | 0 |
| MFCC_D | 4.24% | 2.74 | 1785 | 15510 (36.81%) | 24845 (58.96%) | 630 (1.50%) | 97.26% | 100% | 69 | 0 |
| MFCC_E | 2.61% | 2.02 | 1101 | 20264 (48.09%) | 20775 (49.30%) | 248 (0.59%) | 97.98% | 100% | 155 | 0 |
| MFCC_0_E | 2.47% | 1.79 | 1042 | 19072 (45.26%) | 22026 (52.27%) | 286 (0.68%) | 98.21% | 100% | 157 | 2 |
| MFCC_0 | 2.25% | 1.67 | 950 | 21507 (51.04%) | 19683 (46.71%) | 246 (0.58%) | 98.33% | 100% | 223 | 1 |

Acoustic Model Not Obtained

| TARGETKIND | HCopy Error |
|---|---|
| MFCC_0_D_A_E | cannot convert to TARGETKIND |
| MFCC_0_D_A_E_Z | cannot convert to TARGETKIND |
| MFCC_0_D_A_N_E | incompatible TARGETKIND |
| MFCC_0_D_A_N_E_Z | incompatible TARGETKIND |
| MFCC_0_D_A_T_E | cannot convert to TARGETKIND |
| MFCC_0_D_A_T_E_Z | cannot convert to TARGETKIND |
| MFCC_0_D_A_T_N_E | incompatible TARGETKIND |
| MFCC_0_D_A_T_N_E_Z | incompatible TARGETKIND |
| MFCC_0_D_E | cannot convert to TARGETKIND |
| MFCC_0_D_E_Z | cannot convert to TARGETKIND |
| MFCC_0_D_N_E | incompatible TARGETKIND |
| MFCC_0_D_N_E_Z | incompatible TARGETKIND |
| MFCC_D_A_N_E | incompatible TARGETKIND |
| MFCC_D_A_N_E_Z | incompatible TARGETKIND |
| MFCC_D_A_T_N_E | incompatible TARGETKIND |
| MFCC_D_A_T_N_E_Z | incompatible TARGETKIND |
| MFCC_D_N_E | incompatible TARGETKIND |
| MFCC_D_N_E_Z | incompatible TARGETKIND |

Notes and Operating Conditions

The wav files and their transcriptions were obtained from VoxForge: 8 kHz, 16-bit wav files from 1402 speakers, totalling 4120 sentences and 42140 words.

The acoustic model creation method is the one VoxForge provided with their acoustic model toward the beginning of 2009. This link provides a version I have assembled from the VoxForge method. The link does not show the actual method used for this report but is the basis for it. The primary HTK sequences are the same.

The mfcc creation configuration is the following with the TARGETKIND changed for each kind.

 

TARGETKIND     = MFCC_0_D_A_N_E
ZMEANSOURCE    = TRUE
TARGETRATE     = 100000.0
SAVECOMPRESSED = FALSE
SAVEWITHCRC    = FALSE
WINDOWSIZE     = 250000.0
USEHAMMING     = TRUE
PREEMCOEF      = 0.969
NUMCHANS       = 24
CEPLIFTER      = 22
NUMCEPS        = 12
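
As a rough aid to reading the configuration above, the following Python sketch converts the HTK time values (which are expressed in units of 100 ns) to milliseconds and samples, and estimates the feature vector size each TARGETKIND would produce. It is not part of the original study. The helper names are hypothetical, and the vector-size rule (static coefficients repeated once per D, A and T qualifier, with N dropping the absolute energy term) is an assumption based on the usual HTK conventions.

```python
HTK_UNIT_SECONDS = 100e-9  # HTK time parameters are given in 100 ns units


def htk_time_to_ms(value):
    """Convert an HTK time value (e.g. WINDOWSIZE) to milliseconds."""
    return value * HTK_UNIT_SECONDS * 1000.0


def htk_time_to_samples(value, sample_rate_hz):
    """Convert an HTK time value to a sample count at the given rate."""
    return round(value * HTK_UNIT_SECONDS * sample_rate_hz)


def vector_size(target_kind, num_ceps=12):
    """Estimate the feature vector length for a TARGETKIND string.

    Assumed rule: NUMCEPS cepstra, plus one for _0 and one for _E,
    repeated for each of _D, _A, _T; _N removes the absolute energy.
    _Z (zero mean) does not change the size.
    """
    qualifiers = set(target_kind.split("_")[1:])
    static = num_ceps + int("0" in qualifiers) + int("E" in qualifiers)
    blocks = 1 + sum(1 for q in ("D", "A", "T") if q in qualifiers)
    size = static * blocks
    if "N" in qualifiers:
        size -= 1
    return size


if __name__ == "__main__":
    # Values taken from the configuration above, with 8 kHz audio.
    print(htk_time_to_ms(250000.0), htk_time_to_samples(250000.0, 8000))
    print(htk_time_to_ms(100000.0), htk_time_to_samples(100000.0, 8000))
    for kind in ("MFCC_0", "MFCC_0_D_A_Z", "MFCC_0_D_A_T_Z"):
        print(kind, vector_size(kind))
```

Under these assumptions, WINDOWSIZE = 250000.0 is a 25 ms window (200 samples at 8 kHz) and TARGETRATE = 100000.0 is a 10 ms frame shift (80 samples), while the top-ranked MFCC_0_D_A_Z would carry 39 coefficients per frame against 13 for MFCC_0.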

The language model is created from the list of words in the sentences using 'HBuild wordList LanguageModel'. The idea is to create the weakest language model so that the acoustic model will dominate the results.

HVite is used as the decode method. HDecode is likely a better operational choice but implementation of the HVite method using the weak language model is fairly direct.

 

HVite -A -D -T 1 -H ./interim_files/hmm15/macros -H ./interim_files/hmm15/hmmdefs -S ./train.scp_01 -l '*' \
-i hvite/recout.mlf_01 -w /voxforge/master/LMNet -p 0.0 -s 5.0 ./interim_files/dict2 ./interim_files/tiedlist

'No tokens at beam 250' and 'No tokens at final node' give the count of those messages obtained from the HVite run just after the message 'realign hmm7' seen in the above VoxForge acoustic model generation bash file. These counts provide a measure of acoustic model quality from that HVite run.


The company I currently work for has a strategic goal to use speech recognition. However, this work is done on my own without their coordination, time or equipment.

Observations

The Acoustic Model Obtained Table is sorted in Word % Correct descending order. This may be the preferred ranking of the models in that the HTK Book on page 79, the line after the 5.9 section heading, holds that the Differential Coefficient models appearing higher in this ranking are better models.

Word Accuracy is sometimes cited as a good indicator, and the table can also be ranked on that column. Doing so shifts the model rankings significantly: the models having more parameters, including the Differential Coefficient models, tend toward the bottom, which is the opposite of the Word % Correct order. The model MFCC_0_D_A_T_Z gives the best example of this, ranking fourth in Word % Correct order and last in Word Accuracy order.
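
The gap between the two rankings follows from how the scores are defined. The sketch below, which is not from the original study, recomputes the table's percentages from the raw counts using the standard HResults definitions: with H correct words, D deletions, S substitutions and I insertions, N = H + D + S, Word % Correct = H/N and Word Accuracy = (H - I)/N. The function name is hypothetical.

```python
def hresults_scores(hits, deletions, substitutions, insertions):
    """Recompute HResults-style percentages from raw error counts."""
    n = hits + deletions + substitutions  # total words in the reference
    correct = 100.0 * hits / n
    accuracy = 100.0 * (hits - insertions) / n
    word_errors = 100.0 * (deletions + substitutions + insertions) / n
    return n, correct, accuracy, word_errors


if __name__ == "__main__":
    # Counts for MFCC_0_D_A_T_Z, copied from the results table.
    n, correct, accuracy, word_errors = hresults_scores(3532, 3616, 34992, 6427)
    print(n)                      # 42140
    print(round(correct, 2))      # 8.38
    print(round(accuracy, 2))     # -6.87
    print(round(word_errors, 2))  # 106.87
```

Because insertions are subtracted only in Word Accuracy, a model that inserts more words than it gets right (6427 insertions against 3532 correct here) scores below zero on accuracy while still ranking well on Word % Correct.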

All of the models having the N parameter failed during the initial HCopy (MFCC assembly) step. It may be that I have not used the parameter properly, or it may not work in the current HTK version.

Every parameter combination containing all three of 0, D, and E failed, and yet the models with just 0 and E, or just D and E, worked.

P.S. If anyone has any information on who the author is, please let me know so I can contact him/her and make sure it is OK to post this info here.

--- (Edited on 3/20/2011 9:28 pm [GMT-0400] by kmaclean) ---

Re: HTK MFCC Study
User: kmaclean
Date: 3/20/2011 10:44 pm
Views: 3685
Rating: 9

While cleaning up some old emails I noticed that Neil Nelson was the author of this study.  He posted a link to it on the HTK-Users web site, where it was noted that:

Two problems: first is that if your results are far below expected
accuracies, they are not likely to be of interest to anyone.  The
more serious issue is that at the error range you are talking about,
all that is happening is that the DP-search that HResults does is
desperately trying to match *any* correct words in the sentence to
any of the others.  Any variations in error rate you observe will be
no use.

 

--- (Edited on 3/20/2011 11:44 pm [GMT-0400] by kmaclean) ---
