Re: Create Phoneme Recogniser using VoxForge Models and HTK

Create Phoneme Recogniser using VoxForge Models and HTK

User: juge
Date: 11/17/2009 6:44 am

Views: 13710
Rating: 10

Is it possible to create a phoneme recognizer with HTK using the VoxForge Acoustic models?

I tried it with two different approaches:

a) using just the 44 (or so) phonemes without triphone

b) using all triphones (from the file 'tiedlist').

Here is what I did for approach a)

1. Downloaded the models: Julius_AcousticModels_16kHz-16bit_MFCC_O_D_(0_1alpha-build541).zip

2. create a model list from the file 'tiedlist' by leaving just the single phonemes and removing triphones etc.

it looks like this and is called 'phonemlist':

b
d
f
g

3. Create a dictionary from the file 'phonemlist'. It looks like this:

b b
d d
f f
g g
k k
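Steps 2 and 3 can also be scripted. The following is only a minimal sketch, assuming that each line of 'tiedlist' has the form "logical" or "logical physical" and that every context-dependent name contains a '-' or '+' (as in "s-ae+ng"). The function names are made up for illustration and are not part of HTK.

```python
def monophones_from_tiedlist(lines):
    """Return the context-independent model names, in order of first appearance.

    Assumption: the first field of each line is the logical model name, and
    triphone/biphone names contain '-' or '+'.
    """
    seen = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        name = fields[0]
        if "-" in name or "+" in name:
            continue  # context-dependent model, skip it
        if name not in seen:
            seen.append(name)
    return seen


def phone_dictionary(phones):
    """Build dictionary lines in which every 'word' is pronounced as itself.

    The entries are sorted because HTK expects a sorted dictionary.
    """
    return [f"{p} {p}" for p in sorted(phones)]


if __name__ == "__main__":
    # Tiny in-memory sample instead of the real 'tiedlist' file.
    sample = ["b", "s-ae+ng", "ae+v ae+d", "d", "b"]
    phones = monophones_from_tiedlist(sample)
    print("\n".join(phones))
    print("\n".join(phone_dictionary(phones)))
```

To use it on the real files, read 'tiedlist' line by line, pass the lines to monophones_from_tiedlist, and write the two results to 'phonemlist' and the dictionary file.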

4. Create a grammar and wordnet.

Grammar looks like this:

$PHON = b | d | f | g | k | l | m | n | p | r | s | t | v | w | y | z | aa | ae | ah | ao | aw | ax | ay | ch | dh | dx | eh | er | ey | hh | ih | ix | iy | jh | ng | ow | oy | sh | th | uh | uw | zh ;
( [sil] < $PHON [sp] > [sil] )

Create the wordnet with HParse gram wordnet
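The grammar file can be generated from the phoneme list rather than typed by hand. This is a rough sketch that reproduces the phone-loop grammar shown above. The function name is hypothetical, and it assumes 'sil' and 'sp' are the only silence models that should be kept out of $PHON.

```python
def build_phone_loop_grammar(phones, silence=("sil", "sp")):
    """Return HParse grammar text for a free phone loop.

    The silence models are left out of $PHON because the grammar
    places them explicitly around and inside the loop.
    """
    alternatives = [p for p in phones if p not in silence]
    phon = " | ".join(alternatives)
    return f"$PHON = {phon} ;\n( [sil] < $PHON [sp] > [sil] )\n"


if __name__ == "__main__":
    # Shortened phone list, for illustration only.
    print(build_phone_loop_grammar(["b", "d", "f", "sil", "sp"]), end="")
```

The returned text would be written to the file 'gram' before running HParse.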

5. I took a recording from the YOHO database (the file name says what is spoken in it) and ran feature extraction as follows:

------------------

HCopy -T 1 -D -A -C confighcopy 26_81_57.wav 26_81_57.mfc

HTK Configuration Parameters[12]
Module/Tool     Parameter                  Value
#                 SOURCERATE                 16000
#                 SOURCEFORMAT                WAVE
#                 NUMCEPS                       12
#                 CEPLIFTER                     22
#                 NUMCHANS                      26
#                 PREEMCOEF               0.970000
#                 USEHAMMING                  TRUE
#                 WINDOWSIZE         250000.000000
#                 SAVEWITHCRC                 TRUE
#                 SAVECOMPRESSED              TRUE
#                 TARGETRATE         100000.000000
#                 TARGETKIND              MFCC_0_D

------------

6. Try the recogniser:

----------------------

HVite -A -D -T 1 -H macros -H hmmdefs -C config -l * -i recout.mlf -w wordnet -p -40.0 -s 5.0 main.dict phonemlist 26_81_57.mfc

HTK Configuration Parameters[10]
Module/Tool     Parameter                  Value
#                 NUMCEPS                       12
#                 CEPLIFTER                     22
#                 NUMCHANS                      26
#                 PREEMCOEF               0.970000
#                 USEHAMMING                  TRUE
#                 WINDOWSIZE         250000.000000
#                 SAVEWITHCRC                 TRUE
#                 SAVECOMPRESSED              TRUE
#                 TARGETRATE         100000.000000
#                 TARGETKIND          MFCC_0_D_N_Z

Read 44 physical / 44 logical HMMs
Read lattice with 48 nodes / 217 arcs
WARNING [-8232] ExpandWordNet: Pronunciation 1 of sp is 'tee' word in HVite
Created network with 95 nodes / 264 links
File: 26_81_57.mfc
sil ae sp ow aw sp m ae sp hh aw uh ow ah ax uw sp ah sp eh sp m ow sil == [5571 frames] -61.2614 [Ac=-340287.4 LM=-1000.0] (Act=93.0)
-------------------------

OK, so it seems it didn't work very well.

It should recognise "twenty six, eighty one, fifty seven", in phonemes.

And it recognized: sil ae sp ow aw sp m ae sp hh aw uh ow ah ax uw sp ah sp eh sp m ow sil
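One sanity check on the log above is the frame count. HTK expresses times in units of 100 ns, so TARGETRATE = 100000 means one frame every 10 ms, and 5571 frames correspond to roughly 55.7 seconds of audio. The sketch below only does that arithmetic (the function name is hypothetical). If the recording is in fact just a few seconds long, the mismatch might indicate that the sample rate is being misinterpreted during feature extraction, but that is an assumption, not something the log proves.

```python
HTK_TIME_UNIT_S = 100e-9  # HTK time values are given in units of 100 ns


def implied_duration_seconds(num_frames, target_rate=100000.0):
    """Approximate audio duration implied by a frame count.

    target_rate is the HTK TARGETRATE (frame period in 100 ns units).
    The small offset from the analysis window length is ignored.
    """
    return num_frames * target_rate * HTK_TIME_UNIT_S


if __name__ == "__main__":
    # Frame count taken from the HVite output above.
    print(round(implied_duration_seconds(5571), 2))
```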

With variant b) (using triphones), it recognized "sil ae+v sil ao+r ae+d n-th ae+v sil s-ae+ng m-uw+n m-uw+n ao+r ae+d ow-n+l sil dx-ax sil ae+n sil dx-ax ax+n sil" from the same file.

Is there some fundamental thing I am doing wrong? Or some parameters I can tweak or so?

Thank you for any help!

--- (Edited on 11/17/2009 6:44 am [GMT-0600] by ) ---

Re: Create Phoneme Recogniser using VoxForge Models and HTK

User: kmaclean
Date: 11/17/2009 9:10 am

Views: 100
Rating: 9

Hi juge,

>5. I took a recording from the YOHO database and made the feature

>extraction with the following:

I think the YOHO database was recorded using an 8kHz sampling rate, and you are trying to use an acoustic model created with 16kHz audio.

They need to match...
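A quick way to check the match is to read the sample rate from the WAV header before extracting features. This is a minimal sketch using Python's standard wave module. It assumes the file is a plain RIFF/PCM WAV file (other formats, such as NIST SPHERE, would need converting first), and the function names are illustrative only.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in the WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def matches_model(path, model_rate_hz):
    """True if the recording's sample rate equals the acoustic model's rate."""
    return wav_sample_rate(path) == model_rate_hz
```

For example, matches_model("26_81_57.wav", 16000) would return False for an 8 kHz recording, signalling that the 8 kHz models (or resampled audio) are needed.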

Ken

--- (Edited on 11/17/2009 10:10 am [GMT-0500] by kmaclean) ---

Re: Create Phoneme Recogniser using VoxForge Models and HTK

User: juge
Date: 11/17/2009 10:03 am

Views: 113
Rating: 10

Thank you for this hint, you are right. I tried using the 8kHz models and the results were different now, but not really better.

There seem to be some larger issues.

--- (Edited on 11/17/2009 10:03 am [GMT-0600] by ) ---

Re: Create Phoneme Recogniser using VoxForge Models and HTK

User: kmaclean
Date: 11/17/2009 10:46 am

Views: 165
Rating: 8

>There seem to be some larger issues.

Our acoustic model might not be good enough for phonetic recognition.

Try a current nightly build acoustic model.

If results are still not better, try using some audio that the Acoustic Model was trained with.

Ken

--- (Edited on 11/17/2009 11:46 am [GMT-0500] by kmaclean) ---

Re: Create Phoneme Recogniser using VoxForge Models and HTK

User: juge
Date: 11/18/2009 6:11 am

Views: 86
Rating: 9

I found out the following:

The config file I use for HCopy looks like:

TARGETKIND = MFCC_0_D
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12

SOURCEFORMAT=WAVE
SOURCERATE = 8000

------

When I change SOURCERATE from 8000 to 800, the recognition results get better.

It now says 'sil ax r ow th hh zh ae n ah ow sil' when I spoke "professional". You can at least imagine that it is similar.

The source rate actually was 8000 Hz, not 800, but introducing a factor of 10 here seems to help. There must be a problem somewhere, but I don't know why.
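A possible explanation, offered as an assumption rather than something confirmed in this thread: HTK's SOURCERATE is not a frequency in Hz but the sample period in units of 100 ns. Under that reading, 8 kHz audio corresponds to SOURCERATE = 1250, the value 8000 describes 1.25 kHz audio, and 800 describes 12.5 kHz audio, which is much closer to the true rate. The sketch below just does the conversion, and the function names are hypothetical.

```python
HTK_UNITS_PER_SECOND = 1e7  # one HTK time unit is 100 ns


def htk_sample_period(sample_rate_hz):
    """SOURCERATE value (sample period in 100 ns units) for a rate in Hz."""
    return HTK_UNITS_PER_SECOND / sample_rate_hz


def implied_rate_hz(source_rate):
    """Sample rate in Hz that a given SOURCERATE value describes."""
    return HTK_UNITS_PER_SECOND / source_rate


if __name__ == "__main__":
    print(htk_sample_period(8000))   # value to use for 8 kHz audio
    print(implied_rate_hz(8000))     # rate described by SOURCERATE = 8000
    print(implied_rate_hz(800))      # rate described by SOURCERATE = 800
```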

--- (Edited on 11/18/2009 6:11 am [GMT-0600] by ) ---

Re: Create Phoneme Recogniser using VoxForge Models and HTK

User: kmaclean
Date: 11/18/2009 6:29 pm

Views: 132
Rating: 10

Hi Juge,

>SOURCEFORMAT=WAVE
>SOURCERATE = 8000

I don't think you need these parameters... HCopy can get this information from the header of the wav file... these parameters are normally used for raw (headerless) audio.

Ken

--- (Edited on 11/18/2009 7:29 pm [GMT-0500] by kmaclean) ---

Re: Create Phoneme Recogniser using VoxForge Models and HTK

User: Visitor
Date: 11/19/2009 2:15 am

Views: 91
Rating: 10

Thank you. I noticed it myself, though. Removing these parameters helps a bit.

But the recognition results are still not good enough. Maybe it's because there is no real dictionary and language model?

Or is it possible that the models are not good enough?

Are there any other things I can do to improve the accuracy?

--- (Edited on 11/19/2009 2:15 am [GMT-0600] by Visitor) ---

Re: Create Phoneme Recogniser using VoxForge Models and HTK

User: kmaclean
Date: 11/19/2009 9:06 am

Views: 465
Rating: 9

>Maybe it's because there is no real dictionary and language model?

Maybe it goes back to your original question in this thread: "Is it possible to create a phoneme recognizer with HTK using the VoxForge Acoustic models?" - the answer might be no...

>Or is it possible that the models are not good enough?

Likely - did you try Keith Vertanen's acoustic models, or use Sphinx and their acoustic models?

>Are there any other things I can do to improve the accuracy?

Use regular word-based dictionary (rather than phone based...) for your speech recognition.

nsh has some Sphinx pointers to look at: Speech Recognition With CMU Sphinx

Ken

--- (Edited on 11/19/2009 10:06 am [GMT-0500] by kmaclean) ---

Re: Create Phoneme Recogniser using VoxForge Models and HTK

User: SA
Date: 6/29/2011 5:46 pm

Views: 121
Rating: 8

Did you figure out what was the problem? Have you run into any study with successful results on phoneme recognition with adapted acoustic model?

Thanks

--- (Edited on 6/29/2011 5:46 pm [GMT-0500] by Visitor) ---

