Acoustic Model Discussions

Acoustic model testing
User: tpavelka
Date: 2/10/2009 7:50 am
Views: 21013
Rating: 10

Hi,

I have downloaded the corpus and trained decision-tree-clustered triphones with it. Now I would like to evaluate its performance. Has anyone done this before me (I would like to compare my results so that I can spot possible mistakes in my training process)? Is there something like a standardized list of test files?

What I did was randomly take out approximately 1600 recordings from the corpus and use them as test data. At the moment I do not have any language model, so I used a uniform distribution for the word transition probabilities (plus a word transition penalty).

With the complete vocabulary supplied with VoxForge (130k words) the results are as follows (the scoring process was similar to the one used in HResults):

%Corr=33.14 Acc=29.80 H=6640 D=4306 S=9093 I=668 N=20039

If I restrict the vocabulary to only those words that can be found in the VoxForge prompts (approx. 14k words), the results are a bit better:

%Corr=42.62 Acc=38.52 H=8540 D=4317 S=7182 I=821 N=20039
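
(For anyone unfamiliar with the HResults numbers: H, D, S and I are hits, deletions, substitutions and insertions from a minimum-edit-distance alignment against the N reference words, so %Corr = H/N and Acc = (H-I)/N. Below is a minimal sketch of that scoring in Python; it is my own code, not taken from HTK, and HResults weighs the edit operations slightly differently internally.)

    # HResults-style scoring sketch (illustrative only).
    def align_counts(ref, hyp):
        """Return (H, D, S, I) for reference and hypothesis word lists."""
        n, m = len(ref), len(hyp)
        # dp[i][j] = (edit_cost, (hits, dels, subs, ins)) for ref[:i] vs hyp[:j]
        dp = [[None] * (m + 1) for _ in range(n + 1)]
        dp[0][0] = (0, (0, 0, 0, 0))
        for i in range(1, n + 1):
            dp[i][0] = (i, (0, i, 0, 0))      # delete all reference words
        for j in range(1, m + 1):
            dp[0][j] = (j, (0, 0, 0, j))      # insert all hypothesis words
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                c, (h, d, s, ins) = dp[i - 1][j - 1]
                if ref[i - 1] == hyp[j - 1]:
                    diag = (c, (h + 1, d, s, ins))        # hit
                else:
                    diag = (c + 1, (h, d, s + 1, ins))    # substitution
                cd, (hd, dd, sd, insd) = dp[i - 1][j]
                ci, (hh, di, si, insi) = dp[i][j - 1]
                dp[i][j] = min(diag,
                               (cd + 1, (hd, dd + 1, sd, insd)),   # deletion
                               (ci + 1, (hh, di, si, insi + 1)))   # insertion
        return dp[n][m][1]

    ref_words = "HOW DOES YOUR WAGER LOOK NOW".split()
    hyp_words = "OO HOUT DARES WAAG GOLOB NURRE".split()
    H, D, S, I = align_counts(ref_words, hyp_words)
    N = len(ref_words)
    print("%%Corr=%.2f Acc=%.2f" % (100.0 * H / N, 100.0 * (H - I) / N))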

Since I have not worked with a corpus of this size before (most of my previous ASR experiments were done with much smaller corpora and grammar-based tasks), I cannot tell whether these results are good or bad. I am not even sure whether it is a good idea to test acoustic models separately from language models.

What I would like to do next is to try to incorporate some kind of language model. I think even a simple unigram LM might help, because the results often consist of very rare (often non-English) words (e.g. HOW DOES YOUR WAGER LOOK NOW gets recognized as OO HOUT DARES WAAG GOLOB NURRE). In that regard, I would like to ask which texts the VoxForge prompts were generated from. I am guessing those texts should be used for LM training; I do not think (but maybe I am wrong) that an LM trained on a completely different source would perform very well.
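
(To make the unigram idea concrete, here is a sketch of what I have in mind, assuming the prompts are available as plain text, one sentence per line; "prompts.txt" is a placeholder name:)

    # Estimate unigram log-probabilities from the prompt texts and use them
    # in place of the uniform word transition probability (sketch only).
    import math
    from collections import Counter

    counts = Counter()
    with open("prompts.txt") as f:        # placeholder file name
        for line in f:
            counts.update(line.upper().split())

    total = sum(counts.values())
    unigram_logprob = {w: math.log(c / total) for w, c in counts.items()}

    # In the decoder, the score for entering word w then becomes
    #   acoustic_score + lm_scale * unigram_logprob[w] + word_penalty
    # instead of the same constant for every word.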

Tomas

--- (Edited on 2/10/2009 7:50 am [GMT-0600] by tpavelka) ---

Re: Acoustic model testing
User: nsh
Date: 2/10/2009 7:04 pm
Views: 417
Rating: 9

Often testing is done with a language model generated from the test prompts. Once you have this estimate, it will be possible to say whether the model is good or bad.

You can also compare it with a free WSJ model available here:

http://www.inference.phy.cam.ac.uk/kv227/htk/acoustic_models.html


And with the Sphinx VoxForge model results:

http://www.voxforge.org/home/forums/message-boards/general-discussion/acoustic-model-0_1_2?pn=3

 

--- (Edited on 2/10/2009 7:04 pm [GMT-0600] by nsh) ---

Re: Acoustic model testing
User: kmaclean
Date: 2/12/2009 10:23 am
Views: 298
Rating: 8

Hi Tomas,

Thanks for your work on this - this is very useful information.

>Did anyone do this before me (I would like to compare my results so that I

>could see possible mistakes in my training process)? Is there something

>like a standardized list of test files?

No, not yet, so your results are very helpful as a baseline going forward.

thanks again,

Ken

--- (Edited on 2/12/2009 11:23 am [GMT-0500] by kmaclean) ---

Re: Acoustic model testing
User: Visitor
Date: 2/13/2009 8:11 am
Views: 330
Rating: 11

Hi,

thanks for the replies. As nsh suggested, I have tried to build a language model from the test prompts. Unfortunately, my results are still far from the 90+ percent accuracies achieved by the Sphinx system here:

http://www.voxforge.org/home/forums/message-boards/general-discussion/acoustic-model-0_1_2?pn=3

I would welcome any suggestions as to what I might be doing wrong. Let me first describe the system:

Feature extraction: 16 kHz, 16-bit audio -> 32 ms window with 16 ms overlap (i.e. a 16 ms frame shift) -> 13 MFCC coefficients (including the 0th) + deltas + accelerations, 39 altogether.
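
(In HTK terms, those settings correspond roughly to the HCopy configuration below; times are in units of 100 ns, and the source-format lines are my guesses:)

    SOURCEKIND   = WAVEFORM
    SOURCEFORMAT = WAV            # assumed input format
    TARGETKIND   = MFCC_0_D_A     # 13 MFCCs incl. C0, + deltas + accelerations = 39
    WINDOWSIZE   = 320000.0       # 32 ms window
    TARGETRATE   = 160000.0       # 16 ms frame shift
    NUMCEPS      = 12             # 12 cepstra + C0 = 13
    USEHAMMING   = T
    PREEMCOEF    = 0.97
    NUMCHANS     = 26
    CEPLIFTER    = 22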

Acoustic models: decision-tree-clustered triphones trained with HTK (I used the questions from the VoxForge tutorial), number of mixtures = 8, total number of physical models = 6089.
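
(The clustering itself followed the standard HTK recipe, i.e. an HHEd script along the lines of the HTK book's tree.hed, applied with something like "HHEd -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones"; the thresholds and directory names here are the book's illustrative values, not necessarily mine:)

    RO 100.0 stats
    QS "L_Vowel" { aa-*,ae-*,ah-* }
    QS "R_Vowel" { *+aa,*+ae,*+ah }
    ...
    TB 350.0 "ST_aa_2_" {("aa","*-aa+*","aa+*","*-aa").state[2]}
    ...
    AU "fulllist"
    CO "tiedlist"
    ST "trees"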

The test set was created by randomly taking out 1800 sentences from the VoxForge corpus which were not used during training. The acoustic models were trained on all the remaining speech data (roughly 58 hours).

The first test I did was with no language model, but with the vocabulary restricted to only the words present in the test prompts (3763 words total). Here are the best results (only the first 200 test utterances, for lack of time):

%Corr=50.77 Acc=44.08 H=1115 D=232 S=849 I=147 N=2196

Next I used HTK to train bigrams with backoff on the test prompts (3763 unigrams and 10114 bigrams) and expected a large performance increase. Unfortunately this did not happen; the results were only slightly better (less than 1%) or even worse, depending on the pruning threshold.

Since I did this with my own recognizer, I thought my implementation of language models in the decoder might be wrong, so I used HBuild to create an HTK word network from those bigrams and used HTK to recognize with the bigram LM. Here is the best result:

%Corr=58.38, Acc=50.73 [H=1282, D=131, S=783, I=168, N=2196]
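
(For reference, the HTK side of this was roughly the following; the file names are placeholders:)

    # estimate a backoff bigram from the prompt transcriptions
    HLStats -b bigram.lm -o wordlist prompts.mlf
    # build a recognition word network from the bigram
    HBuild -n bigram.lm wordlist wdnet
    # decode; -s is the LM scale factor, -p the word insertion penalty
    HVite -H hmmdefs -S test.scp -i rec.mlf -w wdnet -s 15.0 -p 0.0 dict tiedlist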

The most surprising thing for me is how little the accuracy increased despite the relatively good language model (low perplexity and no unseen bigrams, although this was achieved artificially by training on the test data). Does this mean that if I want word-level results with more than 90% accuracy, I need acoustic models that give at least 80% without any language model (given that the vocabulary is restricted)?

 

--- (Edited on 2/13/2009 8:11 am [GMT-0600] by Visitor) ---

Re: Acoustic model testing
User: nsh
Date: 2/13/2009 9:33 pm
Views: 349
Rating: 10

There are too many unknowns there; you need to eliminate them first. For example, the voxforge-en-sphinx test set and language model are available inside the archive. Can you test on exactly the same test set with exactly the same model? Did you try to compare with wsj-htk?

Why do you use your own recognizer, which can have bugs? Did you try HDecode?

Why not just upload your files (except the audio, of course) and give us a link? We'll check them ourselves then.

> Feature extraction: 16 kHz, 16-bit audio -> 32 ms window with 16 ms overlap -> 13 MFCC coefficients (including the 0th) + D + A, 39 altogether


Not sure that I understood correctly, but isn't the standard frame shift a bit smaller? Usually the frame rate is 100 Hz, so the shift is 10 ms, and the Hamming window length is 0.0256 s (25.6 ms).
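
(In HCopy config terms that would be something like:)

    TARGETRATE = 100000.0    # 10 ms frame shift (100 Hz frame rate)
    WINDOWSIZE = 256000.0    # 25.6 ms window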

> Next I used HTK to train bigrams with backoff on the test prompts (3763 unigrams and 10114 bigrams)

Testing was done with trigrams, as you can see in the voxforge-en archive.

> Acoustic models: HTK trained decision tree clustered triphones (I used the questions from VoxForge tutorial), number of mixtures = 8, total number of physical models = 6089.

The number of models is quite high; something around 4000 is a more reasonable choice. The number of mixtures, on the other hand, could be bigger.

 

--- (Edited on 2/13/2009 9:33 pm [GMT-0600] by nsh) ---

Re: Acoustic model testing
User: tpavelka
Date: 2/14/2009 10:13 am
Views: 306
Rating: 10

The reason I downloaded VoxForge was that I wanted to try whether my recognizer can work with larger vocabularies and stochastic language models (at present it can't, as my results show ;-)). Before VoxForge I had worked with much smaller corpora which were only suitable for grammar-based recognition.

Now I realize that I first need to work with existing solutions such as HTK so that I can eliminate possible bugs in my code. I will try to train another acoustic model with HTK only and will also try to synchronize my test set with the voxforge-en-sphinx test set.

As for uploading my files, I can do that (with both HTK and my own system, if anyone is interested), but first I would like to work on it some more.

Thanks, nsh, for the suggestions; they gave me a few ideas about what to do next.

--- (Edited on 2/14/2009 10:13 am [GMT-0600] by tpavelka) ---

Re: Acoustic model testing
User: tpavelka
Date: 3/10/2009 9:11 am
Views: 279
Rating: 10

OK, I have almost finished training the HTK-only acoustic models, which were done with more standard window and shift values (25 ms window and 10 ms frame shift). I trained decision-tree-clustered word-internal triphones, but have not yet added more mixtures (anyway, the official VoxForge models also have only one mixture, so I can compare).

I used the same test set (only the first 100 files) as nsh did in the Sphinx test, with a dictionary containing only the words present in the test set (485 words total) and no language model. Here is the result:

WORD: %Corr=59.89, Acc=42.72 [H=666, D=32, S=414, I=191, N=1112]

To have a comparison I did the same with the official VoxForge acoustic model:

WORD: %Corr=58.54, Acc=49.91 [H=651, D=69, S=392, I=96, N=1112]

And to compare with the acoustic model I described at the start of this thread:

WORD: %Corr=55.04, Acc=50.54 [H=612, D=125, S=375, I=50, N=1112]

Given the small size of the test data, I guess I can say there are no big differences between the three. What I can say is that the resulting accuracy seems pretty low. I do not think that adding mixtures and some language model will make the resulting accuracy very high. So my question is: what did I do wrong? I have uploaded the tests (zipped, 35 MB) here:

http://liks.fav.zcu.cz/tomas/acoustic_model_test.zip

Any suggestions will be appreciated ;-)

I also looked at the tests here

http://www.dev.voxforge.org/projects/Main/browser/Trunk/Scripts/Testing_scripts/NightlyTest/TestResults

which were done with a very small vocabulary and a grammar. Now, the result from HTK

WORD: %Corr=86.24, Acc=44.44 [H=163, D=1, S=25, I=79, N=189]

seems very low for such a task, but my results are also rather low, so this is not that surprising. What is surprising is the result from Julian:

WORD: %Corr=97.35, Acc=96.83 [H=184, D=2, S=3, I=1, N=189]

If I understand it correctly, this was done with the same acoustic models, so how did this huge difference happen? I mean, there are about five times as many errors in the HTK test; where did these come from?

--- (Edited on 3/10/2009 9:11 am [GMT-0500] by tpavelka) ---

Re: Acoustic model testing
User: tpavelka
Date: 3/11/2009 2:48 am
Views: 194
Rating: 10

As for the difference between the HTK and Julius results, I might have an idea: here

http://www.dev.voxforge.org/projects/Main/browser/Trunk/Scripts/Testing_scripts/NightlyTest.pm

At line 117 you use the dictionary

$LexiconDirectory/VoxForge/VoxForgeDict

Unfortunately, I could not figure out from where the module NightlyTest.pm is called, so I could not check this dictionary (I could not figure out the value of $LexiconDirectory), but my suspicion is that it is a monophone dictionary used in a place where a triphone one should be used.

 

--- (Edited on 3/11/2009 2:48 am [GMT-0500] by tpavelka) ---

Re: Acoustic model testing
User: kmaclean
Date: 3/11/2009 11:02 am
Views: 179
Rating: 10

Hi tpavelka,

My apologies, I have not been able to look at this thread in detail for the past little while (and hope to do so shortly), but many thanks for the work that you are doing on this.

>$LexiconDirectory/VoxForge/VoxForgeDict

>unfortunatelly I could not figure out from where you call the module

>NightlyTest.pm so I could not check this dictionary

All the scripts (including NightlyTest.pm) get called with a $parm object (a relic/hack from when I was first learning how to code in Perl...) which contains the locations of all files/directories; it is created in VoxForge_config.pm.

The line I think you are looking for is:

$$parms{"LexiconDirectory"} = "$SpeechCorpus/Lexicon";

Basically, it uses the lexicon/pronunciation dictionary file in the Speech Corpus repository, VoxForgeDict, which is a monophone dictionary.

>but my suspicion is that it is a monophone dictionary used in place where a triphone one should be used.

Interesting... so I should be using a triphone dictionary for HTK recognition during testing? 

I have been assuming that triphone dictionaries are only required during training... and that the recognition engine can look these up on the fly... Julius only seems to need a monophone dictionary for recognition. 

I have not looked at any of this in a long while... if there are corrections required, please let me know,

thanks,

Ken

 

--- (Edited on 3/11/2009 12:02 pm [GMT-0400] by kmaclean) ---

Re: Acoustic model testing
User: tpavelka
Date: 3/11/2009 11:30 am
Views: 344
Rating: 10

Hi Ken,

I knew the parameters were filled in somewhere; I just could not find the config file. Thanks for the location ;-)

Using a monophone dictionary instead of a triphone one is a mistake I have made many times myself. Since you are using word-internal triphones, there must be diphones and monophones within your HMM set because of word beginnings/endings. But because of that, they are either poorly trained, or perhaps trained on something other than what proper monophones should be trained on.

Example:

the word "a" will lead to a phonetic unit "ax" because you are not doing cross word triphones and you cannot create triphones from a one-phoneme word. But the phonetic unit "ax" is not a proper monophone, since most of the other instances of "ax" in other words will be converted to triphones.

With a sufficiently large corpus you end up with the whole set of phonetic units with monophone-like names, and you can do recognition with them without HTK protesting (I do not think HTK does any kind of automatic triphone expansion, although there is some kind of cross-word expansion; I do not know exactly how that works since I have never worked with cross-word triphones). But the accuracy is rather poor, which is what I think happened in the case of your tests.

With grammar-based recognition I would expect the results from HTK and Julius to be exactly the same (even down to the acoustic scores). They may differ in recognition speed, though.

--- (Edited on 3/11/2009 11:30 am [GMT-0500] by tpavelka) ---

--- (Edited on 3/11/2009 11:37 am [GMT-0500] by tpavelka) ---
