HTK MFCC Study - voxforge.org

HTK MFCC Study

User: kmaclean
Date: 3/20/2011 8:28 pm

This study was hosted on the http://mphmedia.net web site, which is no longer active. I am posting it here, before it gets flushed from Google's cache, because it contains some excellent comparison information for HTK MFCC TARGETKINDs:

HTK MFCC Study

The goal is to compare all the possible MFCC TARGETKINDs, though some may be missing. This is an initial study: no feedback has yet been used to correct and improve the results. The low recognition results should not be taken as a measure of the models' potential, because the study was designed only to compare TARGETKINDs and used a minimum of acoustic data to reduce model assembly and testing time.
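
To make the TARGETKIND names in the tables easier to read, here is a minimal sketch that splits a name into its qualifiers and estimates the width of the resulting feature vector. It assumes the usual HTK meanings of the qualifiers (_0 adds the zeroth cepstral coefficient, _E adds log energy, _D, _A and _T each append one more differential block, _N drops the absolute energy term, _Z does not change the width) and NUMCEPS = 12 as in the configuration used for this study. The function names are illustrative and are not part of HTK.

```python
# Sketch only: parse an HTK TARGETKIND name and estimate the vector width.
# Function names are hypothetical; qualifier meanings are assumed as stated
# in the text above.

def parse_targetkind(kind):
    """Return (base kind, set of qualifier letters), e.g. ('MFCC', {'0', 'D'})."""
    base, *qualifiers = kind.split("_")
    return base, set(qualifiers)

def vector_size(kind, numceps=12):
    """Estimate the number of components per frame for a TARGETKIND."""
    _, q = parse_targetkind(kind)
    # Static block: cepstra plus optional C0 and optional log energy.
    static = numceps + ("0" in q) + ("E" in q)
    # Each differential order (_D, _A, _T) appends one more static-sized block.
    blocks = 1 + ("D" in q) + ("A" in q) + ("T" in q)
    size = static * blocks
    if "N" in q:
        # Assumption: _N removes the single absolute energy component.
        size -= 1
    return size

for name in ("MFCC_0", "MFCC_D", "MFCC_0_D_A_Z", "MFCC_D_A_T_E"):
    print(name, vector_size(name))
```

Under these assumptions MFCC_0 has 13 components per frame and MFCC_0_D_A_Z has 39, which gives a rough sense of what "more parameters" means when comparing rows in the results table.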

Results

Acoustic Model Obtained

TARGETKIND	Word % Correct	Word Accuracy (%)	Words Correct	Deletion Errors (count, %)	Substitution Errors (count, %)	Insertion Errors (count, %)	Word Errors	Sentence Errors	No tokens at beam 250	No tokens at final node
MFCC_0_D_A_Z	9.11%	-2.36	3841	4949 11.74%	33350 79.14%	4834 11.47%	102.36%	100%	7	0
MFCC_D_A_E_Z	8.51%	-1.39	3588	5477 13.00%	33075 78.49%	4175 9.91%	101.39%	100%	4	0
MFCC_0_D_Z	8.41%	2.62	3546	9057 21.49%	29537 70.09%	2442 5.79%	97.38%	100%	4	0
MFCC_0_D_A_T_Z	8.38%	-6.87	3532	3616 8.58%	34992 83.04%	6427 15.25%	106.87%	100%	9	0
MFCC_D_A_Z	8.12%	0.55	3421	6963 16.52%	31756 75.36%	3191 7.57%	99.45%	100%	4	0
MFCC_D_E_Z	7.80%	3.19	3288	9653 22.91%	29199 69.29%	1943 4.61%	96.81%	100%	4	0
MFCC_D_A_T_Z	7.32%	-2.96	3084	4962 11.78%	34094 80.91%	4331 10.28%	102.96%	100%	10	0
MFCC_D_Z	7.30%	3.07	3076	10150 24.09%	28914 68.61%	1784 4.23%	96.93%	100%	2	0
MFCC_D_A_T_E_Z	6.33%	-2.27	2667	5386 12.78%	34087 80.89%	3625 8.60%	102.27%	100%	6	0
MFCC_D_A_T_E	6.16%	-2.00	2597	5689 13.50%	33854 80.34%	3440 8.16%	102.00%	100%	11	0
MFCC_D_A_E	5.81%	0.01	2450	7642 18.13%	32048 76.05%	2447 5.81%	99.99%	100%	39	1
MFCC_0_D_A	5.78%	-1.37	2434	6679 15.85%	33027 78.37%	3010 7.14%	101.37%	100%	44	1
MFCC_0_Z	5.48%	4.09	2308	15296 36.30%	24536 58.22%	585 1.39%	95.91%	100%	0	0
MFCC_0_D_A_T	5.39%	-3.80	2273	5139 12.20%	34728 82.41%	3876 9.20%	103.80%	100%	48	1
MFCC_D_A	5.28%	1.18	2225	9611 22.81%	30304 71.91%	1726 4.10%	98.82%	100%	39	2
MFCC_0_E_Z	5.23%	3.61	2206	13989 33.20%	25945 61.57%	684 1.62%	96.39%	100%	7	0
MFCC_D_E	5.21%	2.83	2194	12284 29.15%	27662 65.64%	1003 2.38%	97.17%	100%	57	2
MFCC_D_A_T	5.15%	-1.94	2170	6614 15.70%	33356 79.16%	2988 7.09%	101.94%	100%	31	0
MFCC_0_D	5.12%	2.57	2156	12573 29.84%	27411 65.05%	1073 2.55%	97.43%	100%	58	1
MFCC_E_Z	4.96%	3.95	2089	16136 38.29%	23915 56.75%	425 1.01%	96.05%	100%	4	0
MFCC_D	4.24%	2.74	1785	15510 36.81%	24845 58.96%	630 1.50%	97.26%	100%	69	0
MFCC_E	2.61%	2.02	1101	20264 48.09%	20775 49.30%	248 0.59%	97.98%	100%	155	0
MFCC_0_E	2.47%	1.79	1042	19072 45.26%	22026 52.27%	286 0.68%	98.21%	100%	157	2
MFCC_0	2.25%	1.67	950	21507 51.04%	19683 46.71%	246 0.58%	98.33%	100%	223	1

Acoustic Model Not Obtained

TARGETKIND	HCopy Error
MFCC_0_D_A_E	cannot convert to TARGETKIND
MFCC_0_D_A_E_Z	cannot convert to TARGETKIND
MFCC_0_D_A_N_E	incompatible TARGETKIND
MFCC_0_D_A_N_E_Z	incompatible TARGETKIND
MFCC_0_D_A_T_E	cannot convert to TARGETKIND
MFCC_0_D_A_T_E_Z	cannot convert to TARGETKIND
MFCC_0_D_A_T_N_E	incompatible TARGETKIND
MFCC_0_D_A_T_N_E_Z	incompatible TARGETKIND
MFCC_0_D_E	cannot convert to TARGETKIND
MFCC_0_D_E_Z	cannot convert to TARGETKIND
MFCC_0_D_N_E	incompatible TARGETKIND
MFCC_0_D_N_E_Z	incompatible TARGETKIND
MFCC_D_A_N_E	incompatible TARGETKIND
MFCC_D_A_N_E_Z	incompatible TARGETKIND
MFCC_D_A_T_N_E	incompatible TARGETKIND
MFCC_D_A_T_N_E_Z	incompatible TARGETKIND
MFCC_D_N_E	incompatible TARGETKIND
MFCC_D_N_E_Z	incompatible TARGETKIND

Notes and Operating Conditions

The wav files and their transcriptions were obtained from VoxForge: 8 kHz, 16-bit wav files, 1402 speakers, 4120 sentences, 42140 words.

The acoustic model creation method is one that was provided by VoxForge with their acoustic model toward the beginning of 2009. This link provides a version I have assembled from the VoxForge method. The link does not show the actual method used for this report, but it is the basis for it. The primary HTK sequences are the same.

The MFCC creation configuration is the following, with TARGETKIND changed for each kind.

TARGETKIND     = MFCC_0_D_A_N_E
ZMEANSOURCE    = TRUE
TARGETRATE     = 100000.0
SAVECOMPRESSED = FALSE
SAVEWITHCRC    = FALSE
WINDOWSIZE     = 250000.0
USEHAMMING     = TRUE
PREEMCOEF      = 0.969
NUMCHANS       = 24
CEPLIFTER      = 22
NUMCEPS        = 12
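
HTK expresses TARGETRATE and WINDOWSIZE in units of 100 ns, so the values above correspond to a 10 ms frame shift and a 25 ms analysis window. The sketch below does the conversion and, assuming the 8 kHz sampling rate of the VoxForge wav files used here, works out the equivalent sample counts. The helper names are illustrative only.

```python
# Sketch only: convert HTK time parameters (units of 100 ns) to
# milliseconds and to sample counts. Helper names are hypothetical.

HTK_UNIT_SECONDS = 100e-9  # one HTK time unit is 100 nanoseconds

def htk_to_ms(value):
    """Convert an HTK time value to milliseconds."""
    return value * HTK_UNIT_SECONDS * 1000.0

def htk_to_samples(value, sample_rate_hz):
    """Convert an HTK time value to a whole number of audio samples."""
    return round(value * HTK_UNIT_SECONDS * sample_rate_hz)

TARGETRATE = 100000.0   # from the configuration above
WINDOWSIZE = 250000.0   # from the configuration above
SAMPLE_RATE_HZ = 8000   # assumption: the 8 kHz audio described in these notes

print(f"frame shift: {htk_to_ms(TARGETRATE):.1f} ms, "
      f"{htk_to_samples(TARGETRATE, SAMPLE_RATE_HZ)} samples")
print(f"window:      {htk_to_ms(WINDOWSIZE):.1f} ms, "
      f"{htk_to_samples(WINDOWSIZE, SAMPLE_RATE_HZ)} samples")
```

At 8 kHz this works out to an 80-sample shift and a 200-sample window, so consecutive frames overlap by 120 samples.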

The language model is created from the list of words in the sentences using 'HBuild wordList LanguageModel'. The idea is to create the weakest language model so that the acoustic model will dominate the results.

HVite is used as the decoder. HDecode is likely a better operational choice, but implementing the HVite method with the weak language model is fairly direct.

HVite -A -D -T 1 -H ./interim_files/hmm15/macros -H ./interim_files/hmm15/hmmdefs -S ./train.scp_01 -l '*' \
-i hvite/recout.mlf_01 -w /voxforge/master/LMNet -p 0.0 -s 5.0 ./interim_files/dict2 ./interim_files/tiedlist

'No tokens at beam 250' and 'No tokens at final node' give the count of those messages obtained from the HVite run just after the message 'realign hmm7' seen in the above VoxForge acoustic model generation bash file. These counts provide a measure of acoustic model quality from that HVite run.

The company I currently work for has a strategic goal to use speech recognition. However, this work is done on my own without their coordination, time or equipment.

Observations

The Acoustic Model Obtained table is sorted by Word % Correct in descending order. This may be the preferred ranking: the HTK Book (page 79, the line after the 5.9 section heading) holds that differential coefficients make for better models, and the Differential Coefficient models appear higher in this ranking.

Word Accuracy is sometimes cited as a good indicator. Sorting the table on that column in descending order shifts the model rankings significantly: the models with more parameters, including the Differential Coefficient models, tend toward the bottom, which is the opposite of the Word % Correct order. The model MFCC_0_D_A_T_Z is the best example, ranking fourth in Word % Correct order and last in Word Accuracy order.
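
The difference between the two rankings follows from how the two figures are defined. With N words in the reference, H words correct and I insertions, Word % Correct is 100 * H / N and Word Accuracy is 100 * (H - I) / N, so a model that inserts many words can score well on the first and badly on the second. The sketch below recomputes both figures for three rows of the table from their raw counts. The function name is illustrative and is not an HTK API.

```python
# Sketch only: recompute Word % Correct and Word Accuracy from raw counts.
# The function name is hypothetical; the counts are copied from the table.

def word_scores(n_words, hits, insertions):
    """Return (percent correct, percent accuracy) for one recognition run."""
    correct = 100.0 * hits / n_words
    accuracy = 100.0 * (hits - insertions) / n_words
    return correct, accuracy

N_WORDS = 42140  # total words in the test transcriptions

# TARGETKIND: (words correct, deletions, substitutions, insertions)
rows = {
    "MFCC_0_D_A_Z":   (3841, 4949, 33350, 4834),
    "MFCC_0_D_A_T_Z": (3532, 3616, 34992, 6427),
    "MFCC_0_Z":       (2308, 15296, 24536, 585),
}

scores = {}
for kind, (hits, dels, subs, ins) in rows.items():
    # Every reference word is either correct, deleted or substituted.
    assert hits + dels + subs == N_WORDS
    scores[kind] = word_scores(N_WORDS, hits, ins)
    correct, accuracy = scores[kind]
    print(f"{kind:16s} correct {correct:5.2f}%  accuracy {accuracy:6.2f}%")
```

For MFCC_0_D_A_T_Z the 6427 insertions exceed the 3532 correct words, which is why its accuracy is negative even though its Word % Correct is among the highest.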

All of the models having the N parameter failed during the initial HCopy (MFCC assembly) step. It may be that I have not used the parameter properly, or it may not work in the current HTK version.

The parameter combinations containing all three of 0, D, and E failed, yet the models with just 0 and E, or just D and E, worked.

P.S. If anyone has any information on who the author is, please let me know so I can contact him/her and make sure it is OK to post this info here.

--- (Edited on 3/20/2011 9:28 pm [GMT-0400] by kmaclean) ---

Re: HTK MFCC Study

User: kmaclean
Date: 3/20/2011 10:44 pm

While cleaning up some old emails I noticed that Neil Nelson was the author of this study. He posted a link to it on the HTK-Users web site, where it was noted that:

Two problems: first is that if your results are far below expected
accuracies, they are not likely to be of interested to anyone. The
more serious issue is that at the error range you are talking about,
all that is happening is that the DP-search that HResults does is
desperately trying to match *any* correct words in the sentence to
any of the others. Any variations in error rate you observe will be
no use.

--- (Edited on 3/20/2011 11:44 pm [GMT-0400] by kmaclean) ---
