VoxForge
This study was on the http://mphmedia.net web site, which is no longer active. I am posting it because it contains some excellent comparison information for HTK mfcc targetkinds (before it gets flushed from Google's cache):
The goal is to compare all the possible MFCC TARGETKINDs, though some may be missing. This is an initial study; no feedback has yet been used to correct or improve the results. The low recognition rates should not be taken as a reflection of the models' potential: the study was designed to compare TARGETKINDs, using a minimum of acoustic data to reduce model assembly and testing time.
TARGETKIND | Word % Correct | Word Accuracy | Words Correct | Deletion Errors | Substitution Errors | Insertion Errors | Word Errors | Sentence Errors | No tokens at beam 250 | No tokens at final node |
---|---|---|---|---|---|---|---|---|---|---|
MFCC_0_D_A_Z | 9.11% | -2.36 | 3841 | 4949 (11.74%) | 33350 (79.14%) | 4834 (11.47%) | 102.36% | 100% | 7 | 0 |
MFCC_D_A_E_Z | 8.51% | -1.39 | 3588 | 5477 (13.00%) | 33075 (78.49%) | 4175 (9.91%) | 101.39% | 100% | 4 | 0 |
MFCC_0_D_Z | 8.41% | 2.62 | 3546 | 9057 (21.49%) | 29537 (70.09%) | 2442 (5.79%) | 97.38% | 100% | 4 | 0 |
MFCC_0_D_A_T_Z | 8.38% | -6.87 | 3532 | 3616 (8.58%) | 34992 (83.04%) | 6427 (15.25%) | 106.87% | 100% | 9 | 0 |
MFCC_D_A_Z | 8.12% | 0.55 | 3421 | 6963 (16.52%) | 31756 (75.36%) | 3191 (7.57%) | 99.45% | 100% | 4 | 0 |
MFCC_D_E_Z | 7.80% | 3.19 | 3288 | 9653 (22.91%) | 29199 (69.29%) | 1943 (4.61%) | 96.81% | 100% | 4 | 0 |
MFCC_D_A_T_Z | 7.32% | -2.96 | 3084 | 4962 (11.78%) | 34094 (80.91%) | 4331 (10.28%) | 102.96% | 100% | 10 | 0 |
MFCC_D_Z | 7.30% | 3.07 | 3076 | 10150 (24.09%) | 28914 (68.61%) | 1784 (4.23%) | 96.93% | 100% | 2 | 0 |
MFCC_D_A_T_E_Z | 6.33% | -2.27 | 2667 | 5386 (12.78%) | 34087 (80.89%) | 3625 (8.60%) | 102.27% | 100% | 6 | 0 |
MFCC_D_A_T_E | 6.16% | -2.00 | 2597 | 5689 (13.50%) | 33854 (80.34%) | 3440 (8.16%) | 102.00% | 100% | 11 | 0 |
MFCC_D_A_E | 5.81% | 0.01 | 2450 | 7642 (18.13%) | 32048 (76.05%) | 2447 (5.81%) | 99.99% | 100% | 39 | 1 |
MFCC_0_D_A | 5.78% | -1.37 | 2434 | 6679 (15.85%) | 33027 (78.37%) | 3010 (7.14%) | 101.37% | 100% | 44 | 1 |
MFCC_0_Z | 5.48% | 4.09 | 2308 | 15296 (36.30%) | 24536 (58.22%) | 585 (1.39%) | 95.91% | 100% | 0 | 0 |
MFCC_0_D_A_T | 5.39% | -3.80 | 2273 | 5139 (12.20%) | 34728 (82.41%) | 3876 (9.20%) | 103.80% | 100% | 48 | 1 |
MFCC_D_A | 5.28% | 1.18 | 2225 | 9611 (22.81%) | 30304 (71.91%) | 1726 (4.10%) | 98.82% | 100% | 39 | 2 |
MFCC_0_E_Z | 5.23% | 3.61 | 2206 | 13989 (33.20%) | 25945 (61.57%) | 684 (1.62%) | 96.39% | 100% | 7 | 0 |
MFCC_D_E | 5.21% | 2.83 | 2194 | 12284 (29.15%) | 27662 (65.64%) | 1003 (2.38%) | 97.17% | 100% | 57 | 2 |
MFCC_D_A_T | 5.15% | -1.94 | 2170 | 6614 (15.70%) | 33356 (79.16%) | 2988 (7.09%) | 101.94% | 100% | 31 | 0 |
MFCC_0_D | 5.12% | 2.57 | 2156 | 12573 (29.84%) | 27411 (65.05%) | 1073 (2.55%) | 97.43% | 100% | 58 | 1 |
MFCC_E_Z | 4.96% | 3.95 | 2089 | 16136 (38.29%) | 23915 (56.75%) | 425 (1.01%) | 96.05% | 100% | 4 | 0 |
MFCC_D | 4.24% | 2.74 | 1785 | 15510 (36.81%) | 24845 (58.96%) | 630 (1.50%) | 97.26% | 100% | 69 | 0 |
MFCC_E | 2.61% | 2.02 | 1101 | 20264 (48.09%) | 20775 (49.30%) | 248 (0.59%) | 97.98% | 100% | 155 | 0 |
MFCC_0_E | 2.47% | 1.79 | 1042 | 19072 (45.26%) | 22026 (52.27%) | 286 (0.68%) | 98.21% | 100% | 157 | 2 |
MFCC_0 | 2.25% | 1.67 | 950 | 21507 (51.04%) | 19683 (46.71%) | 246 (0.58%) | 98.33% | 100% | 223 | 1 |

TARGETKIND | HCopy Error |
---|---|
MFCC_0_D_A_E | cannot convert to TARGETKIND |
MFCC_0_D_A_E_Z | cannot convert to TARGETKIND |
MFCC_0_D_A_N_E | incompatible TARGETKIND |
MFCC_0_D_A_N_E_Z | incompatible TARGETKIND |
MFCC_0_D_A_T_E | cannot convert to TARGETKIND |
MFCC_0_D_A_T_E_Z | cannot convert to TARGETKIND |
MFCC_0_D_A_T_N_E | incompatible TARGETKIND |
MFCC_0_D_A_T_N_E_Z | incompatible TARGETKIND |
MFCC_0_D_E | cannot convert to TARGETKIND |
MFCC_0_D_E_Z | cannot convert to TARGETKIND |
MFCC_0_D_N_E | incompatible TARGETKIND |
MFCC_0_D_N_E_Z | incompatible TARGETKIND |
MFCC_D_A_N_E | incompatible TARGETKIND |
MFCC_D_A_N_E_Z | incompatible TARGETKIND |
MFCC_D_A_T_N_E | incompatible TARGETKIND |
MFCC_D_A_T_N_E_Z | incompatible TARGETKIND |
MFCC_D_N_E | incompatible TARGETKIND |
MFCC_D_N_E_Z | incompatible TARGETKIND |
The wav files and their transcriptions were obtained from VoxForge: 8 kHz, 16-bit wav files, 1402 speakers, 4120 sentences, 42140 words.
The acoustic model creation method is one that VoxForge provided with their acoustic model toward the beginning of 2009. This link provides a version I have assembled from the VoxForge method. The link does not show the actual method used for this report but is the basis for it. The primary HTK sequences are the same.
The mfcc creation configuration is the following with the TARGETKIND changed for each kind.
    TARGETKIND = MFCC_0_D_A_N_E
    ZMEANSOURCE = TRUE
    TARGETRATE = 100000.0
    SAVECOMPRESSED = FALSE
    SAVEWITHCRC = FALSE
    WINDOWSIZE = 250000.0
    USEHAMMING = TRUE
    PREEMCOEF = 0.969
    NUMCHANS = 24
    CEPLIFTER = 22
    NUMCEPS = 12
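Because only TARGETKIND changes from run to run, the per-kind configuration text can be generated instead of edited by hand. The Python sketch below is one way to do that. The function name `render_hcopy_config` and the idea of producing one config per kind are assumptions made for illustration, not the author's actual scripts; the parameter values are copied from the configuration above.

```python
# Settings shared by every run, copied from the configuration in the post.
BASE_SETTINGS = [
    ("ZMEANSOURCE", "TRUE"),
    ("TARGETRATE", "100000.0"),
    ("SAVECOMPRESSED", "FALSE"),
    ("SAVEWITHCRC", "FALSE"),
    ("WINDOWSIZE", "250000.0"),
    ("USEHAMMING", "TRUE"),
    ("PREEMCOEF", "0.969"),
    ("NUMCHANS", "24"),
    ("CEPLIFTER", "22"),
    ("NUMCEPS", "12"),
]


def render_hcopy_config(target_kind: str) -> str:
    """Return config text with TARGETKIND set to target_kind (hypothetical helper)."""
    lines = [f"TARGETKIND = {target_kind}"]
    lines += [f"{name} = {value}" for name, value in BASE_SETTINGS]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two kinds from the results table, used only as examples.
    for kind in ["MFCC_0_D_A_Z", "MFCC_D_E_Z"]:
        print(render_hcopy_config(kind))
```

Each rendered string could then be written to its own file and passed to HCopy for the matching run.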
The language model is created from the list of words in the sentences using 'HBuild wordList LanguageModel'. The idea is to create the weakest language model so that the acoustic model will dominate the results.
HVite is used as the decode method. HDecode is likely a better operational choice but implementation of the HVite method using the weak language model is fairly direct.
    HVite -A -D -T 1 -H ./interim_files/hmm15/macros -H ./interim_files/hmm15/hmmdefs -S ./train.scp_01 -l '*' \
      -i hvite/recout.mlf_01 -w /voxforge/master/LMNet -p 0.0 -s 5.0 ./interim_files/dict2 ./interim_files/tiedlist
'No tokens at beam 250' and 'No tokens at final node' give the count of those messages obtained from the HVite run just after the message 'realign hmm7' seen in the above VoxForge acoustic model generation bash file. These counts provide a measure of acoustic model quality from that HVite run.
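Counts like these can be collected by scanning the captured HVite output for the two messages. The sketch below assumes the output has been saved as text lines. The post names the messages but does not quote HVite's exact wording, so the substrings in the example are placeholders to be adjusted to the real log text, and `count_messages` is a hypothetical helper.

```python
from typing import Dict, Iterable


def count_messages(log_lines: Iterable[str], patterns: Dict[str, str]) -> Dict[str, int]:
    """Count the lines containing each substring in patterns.

    patterns maps a label to the substring to look for. A line that
    contains several of the substrings is counted once per label.
    """
    counts = {label: 0 for label in patterns}
    for line in log_lines:
        for label, needle in patterns.items():
            if needle in line:
                counts[label] += 1
    return counts


if __name__ == "__main__":
    # Placeholder substrings: replace with the wording HVite actually prints.
    patterns = {
        "beam_250": "No tokens at beam 250",
        "final_node": "No tokens at final node",
    }
    # Synthetic lines standing in for a real log.
    sample_log = [
        "Aligning File: a.mfc",
        "No tokens at beam 250",
        "Aligning File: b.mfc",
        "No tokens at final node",
    ]
    print(count_messages(sample_log, patterns))
```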
The company I currently work for has a strategic goal to use speech recognition. However, this work is done on my own without their coordination, time or equipment.
The Acoustic Model Obtained Table is sorted in descending order of Word % Correct. This may be the preferred ranking of the models, in that the HTK Book (page 79, the line after the section 5.9 heading) holds that the Differential Coefficient models, which appear higher in this ranking, are better models.
Word Accuracy is sometimes cited as a good indicator. Sorting the table on that column in descending order shifts the model rankings significantly: the models with more parameters, including the Differential Coefficient models, tend toward the bottom, the opposite of the Word % Correct order. MFCC_0_D_A_T_Z is the best example, ranking fourth in Word % Correct order and last in Word Accuracy order.
All of the models with the N qualifier failed during the initial HCopy (MFCC assembly) step. It may be that I have not used the qualifier properly, or it may not work in the current HTK version.
The combinations containing all three of the qualifiers 0, D, and E also failed, yet the models with 0 and E but no D, and those with D and E but no 0, worked.
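The two rankings diverge because Word Accuracy subtracts insertion errors while Word % Correct ignores them. HResults defines %Correct as H/N and Accuracy as (H - I)/N, where N = H + D + S is the number of reference words. The Python sketch below recomputes the table columns from the raw counts. The function name is an assumption, and the Word Errors formula (D + S + I)/N is inferred from the table values rather than stated in the post.

```python
def hresults_word_scores(hits: int, deletions: int, substitutions: int, insertions: int) -> dict:
    """Recompute word-level scores (in percent) from raw error counts."""
    n = hits + deletions + substitutions  # number of reference words
    return {
        "percent_correct": 100.0 * hits / n,
        "accuracy": 100.0 * (hits - insertions) / n,
        # Inferred from the table: all three error types over N.
        "word_errors": 100.0 * (deletions + substitutions + insertions) / n,
    }


if __name__ == "__main__":
    # Counts copied from the results table: (H, D, S, I).
    rows = {
        "MFCC_0_D_A_T_Z": (3532, 3616, 34992, 6427),
        "MFCC_0_Z": (2308, 15296, 24536, 585),
    }
    for kind, counts in rows.items():
        scores = hresults_word_scores(*counts)
        print(kind, {name: round(value, 2) for name, value in scores.items()})
```

MFCC_0_D_A_T_Z has more correct words than MFCC_0_Z but roughly eleven times as many insertions, which is what pushes its accuracy below zero.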
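The failure pattern described above can be checked mechanically against the two tables. The Python sketch below encodes it as a rule: a kind with N gives "incompatible TARGETKIND", and a kind with 0, D, and E together gives "cannot convert to TARGETKIND". This rule only summarizes the observed results; it is not a statement of how HTK validates a TARGETKIND internally, and the function names are hypothetical.

```python
from typing import Optional

CANNOT = "cannot convert to TARGETKIND"
INCOMPATIBLE = "incompatible TARGETKIND"

# Kinds that failed in HCopy, with the error from the second table.
FAILED = {
    "MFCC_0_D_A_E": CANNOT, "MFCC_0_D_A_E_Z": CANNOT,
    "MFCC_0_D_A_N_E": INCOMPATIBLE, "MFCC_0_D_A_N_E_Z": INCOMPATIBLE,
    "MFCC_0_D_A_T_E": CANNOT, "MFCC_0_D_A_T_E_Z": CANNOT,
    "MFCC_0_D_A_T_N_E": INCOMPATIBLE, "MFCC_0_D_A_T_N_E_Z": INCOMPATIBLE,
    "MFCC_0_D_E": CANNOT, "MFCC_0_D_E_Z": CANNOT,
    "MFCC_0_D_N_E": INCOMPATIBLE, "MFCC_0_D_N_E_Z": INCOMPATIBLE,
    "MFCC_D_A_N_E": INCOMPATIBLE, "MFCC_D_A_N_E_Z": INCOMPATIBLE,
    "MFCC_D_A_T_N_E": INCOMPATIBLE, "MFCC_D_A_T_N_E_Z": INCOMPATIBLE,
    "MFCC_D_N_E": INCOMPATIBLE, "MFCC_D_N_E_Z": INCOMPATIBLE,
}

# Kinds that produced a model, from the first table.
WORKED = [
    "MFCC_0_D_A_Z", "MFCC_D_A_E_Z", "MFCC_0_D_Z", "MFCC_0_D_A_T_Z",
    "MFCC_D_A_Z", "MFCC_D_E_Z", "MFCC_D_A_T_Z", "MFCC_D_Z",
    "MFCC_D_A_T_E_Z", "MFCC_D_A_T_E", "MFCC_D_A_E", "MFCC_0_D_A",
    "MFCC_0_Z", "MFCC_0_D_A_T", "MFCC_D_A", "MFCC_0_E_Z",
    "MFCC_D_E", "MFCC_D_A_T", "MFCC_0_D", "MFCC_E_Z",
    "MFCC_D", "MFCC_E", "MFCC_0_E", "MFCC_0",
]


def qualifiers(kind: str) -> set:
    """Split 'MFCC_0_D_E' into the qualifier set {'0', 'D', 'E'}."""
    return set(kind.split("_")[1:])


def predicted_error(kind: str) -> Optional[str]:
    """Error this study's results would predict, or None if the kind worked."""
    quals = qualifiers(kind)
    if "N" in quals:
        return INCOMPATIBLE
    if {"0", "D", "E"} <= quals:
        return CANNOT
    return None


if __name__ == "__main__":
    mismatches = [k for k, msg in FAILED.items() if predicted_error(k) != msg]
    mismatches += [k for k in WORKED if predicted_error(k) is not None]
    print("kinds not explained by the rule:", mismatches)
```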
P.S. If anyone has any information on who the author is, please let me know so I can contact him/her and make sure it is OK to post this info here.
--- (Edited on 3/20/2011 9:28 pm [GMT-0400] by kmaclean) ---
While cleaning up some old emails I noticed that Neil Nelson was the author of this study. He posted a link to it on the HTK-Users web site, where it was noted that:
Two problems: first is that if your results are far below expected accuracies, they are not likely to be of interest to anyone. The more serious issue is that at the error range you are talking about, all that is happening is that the DP-search that HResults does is desperately trying to match *any* correct words in the sentence to any of the others. Any variations in error rate you observe will be no use.
--- (Edited on 3/20/2011 11:44 pm [GMT-0400] by kmaclean) ---