VoxForge
Is it possible to create a phoneme recognizer with HTK using the VoxForge Acoustic models?
I tried it with two different approaches:
a) using just the 44 (or so) phonemes without triphone
b) using all triphones (from the file 'tiedlist').
Here is what I did for approach a)
1. Downloaded the models: Julius_AcousticModels_16kHz-16bit_MFCC_O_D_(0_1alpha-build541).zip
2. Created a model list from the file 'tiedlist' by keeping only the single phonemes and removing the triphones etc.
It is called 'phonemlist' and looks like this:
b
d
f
g
3. Created a dictionary from the file 'phonemlist'. It looks like this:
b b
d d
f f
g g
k k
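Steps 2 and 3 can also be scripted. The following is a minimal Python sketch, not the poster's actual procedure. It assumes that each line of 'tiedlist' holds one or two model names and that any name containing '-' or '+' is a context-dependent model; the function names are made up for illustration.

```python
def monophones(tiedlist_lines):
    """Return context-independent model names in order of first appearance."""
    seen = []
    for line in tiedlist_lines:
        for name in line.split():
            # Names such as 's-ae+ng' or 'ae+v' carry left/right context.
            if "-" in name or "+" in name:
                continue
            if name not in seen:
                seen.append(name)
    return seen


def dictionary_lines(phones):
    """Map every phone to itself, one 'word pronunciation' pair per line."""
    return [f"{p} {p}" for p in phones]


if __name__ == "__main__":
    # Hypothetical excerpt of a tiedlist, for demonstration only.
    sample = ["b", "d", "s-ae+ng", "ae+v ae+d", "f", "g", "k"]
    phones = monophones(sample)
    print("\n".join(phones))                    # contents for 'phonemlist'
    print("\n".join(dictionary_lines(phones)))  # contents for the dictionary
```

In practice the lines would be read from the real 'tiedlist' file and the two results written to 'phonemlist' and the dictionary file.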
4. Created a grammar and wordnet.
The grammar looks like this:
$PHON = b | d | f | g | k | l | m | n | p | r | s | t | v | w | y | z | aa | ae | ah | ao | aw | ax | ay | ch | dh | dx | eh | er | ey | hh | ih | ix | iy | jh | ng | ow | oy | sh | th | uh | uw | zh ;
( [sil] < $PHON [sp] > [sil] )
The wordnet is then created with: HParse gram wordnet
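The grammar above is a free phone loop with optional silence, so it can be generated from the phone list instead of being typed by hand. A small Python sketch under that assumption (the function name and the default silence names are illustrative choices, based on the 'sil' and 'sp' models used in the grammar above):

```python
def build_grammar(phones, silence=("sil", "sp")):
    """Build an HParse grammar for a free phone loop with optional silence."""
    # The silence models appear explicitly in the network, not in $PHON.
    loop_phones = [p for p in phones if p not in silence]
    alternatives = " | ".join(loop_phones)
    return (
        f"$PHON = {alternatives} ;\n"
        "( [sil] < $PHON [sp] > [sil] )\n"
    )


if __name__ == "__main__":
    # Short example list; the real one would come from 'phonemlist'.
    print(build_grammar(["sil", "sp", "b", "d", "f", "g", "k"]), end="")
```

The printed text can be saved as the file 'gram' and passed to HParse as shown above.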
5. I took a recording from the YOHO database and ran the feature extraction as shown below.
The name of the file states what is spoken in it.
------------------
HCopy -T 1 -D -A -C confighcopy 26_81_57.wav 26_81_57.mfc
HTK Configuration Parameters[12]
Module/Tool Parameter Value
# SOURCERATE 16000
# SOURCEFORMAT WAVE
# NUMCEPS 12
# CEPLIFTER 22
# NUMCHANS 26
# PREEMCOEF 0.970000
# USEHAMMING TRUE
# WINDOWSIZE 250000.000000
# SAVEWITHCRC TRUE
# SAVECOMPRESSED TRUE
# TARGETRATE 100000.000000
# TARGETKIND MFCC_0_D
------------
6. Tried the recogniser:
----------------------
HVite -A -D -T 1 -H macros -H hmmdefs -C config -l * -i recout.mlf -w wordnet -p -40.0 -s 5.0 main.dict phonemlist 26_81_57.mfc
HTK Configuration Parameters[10]
Module/Tool Parameter Value
# NUMCEPS 12
# CEPLIFTER 22
# NUMCHANS 26
# PREEMCOEF 0.970000
# USEHAMMING TRUE
# WINDOWSIZE 250000.000000
# SAVEWITHCRC TRUE
# SAVECOMPRESSED TRUE
# TARGETRATE 100000.000000
# TARGETKIND MFCC_0_D_N_Z
Read 44 physical / 44 logical HMMs
Read lattice with 48 nodes / 217 arcs
WARNING [-8232] ExpandWordNet: Pronunciation 1 of sp is 'tee' word in HVite
Created network with 95 nodes / 264 links
File: 26_81_57.mfc
sil ae sp ow aw sp m ae sp hh aw uh ow ah ax uw sp ah sp eh sp m ow sil == [5571 frames] -61.2614 [Ac=-340287.4 LM=-1000.0] (Act=93.0)
-------------------------
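The frame count in the HVite output allows a quick sanity check. HTK expresses times in units of 100 ns, so TARGETRATE = 100000 means one frame every 10 ms. The Python sketch below (function name chosen for illustration) converts the reported 5571 frames into an approximate duration, ignoring the extra length of the final analysis window.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK times are given in units of 100 ns


def duration_seconds(num_frames, target_rate):
    """Approximate audio duration implied by a frame count and TARGETRATE."""
    return num_frames * target_rate / HTK_UNITS_PER_SECOND


if __name__ == "__main__":
    # Values taken from the HCopy config and the HVite log above.
    print(duration_seconds(5571, 100000.0))  # roughly 55.7 seconds
```

A phrase of six spoken digits would normally last only a few seconds, so a duration near 56 seconds suggests, without proving it, that the sample rate of the recording was not interpreted as intended during feature extraction.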
OK, so it seems it didn't work very well.
It should recognise "twenty six, eighty one, fifty seven", in phonemes.
And it recognized: sil ae sp ow aw sp m ae sp hh aw uh ow ah ax uw sp ah sp eh sp m ow sil
With variant b) (using triphones), it recognized "sil ae+v sil ao+r ae+d n-th ae+v sil s-ae+ng m-uw+n m-uw+n ao+r ae+d ow-n+l sil dx-ax sil ae+n sil dx-ax ax+n sil" from the same file.
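To compare the triphone output of variant b) with the monophone output of variant a), the context markers can be stripped. HTK writes a context-dependent model as left-centre+right, so the centre phone is what remains after dropping everything up to '-' and everything from '+'. A minimal Python sketch (function names are illustrative):

```python
def base_phone(label):
    """Strip HTK context markers: 'l-c+r' -> 'c'."""
    centre = label.split("-")[-1]  # drop the left context, if any
    return centre.split("+")[0]    # drop the right context, if any


def to_monophones(transcript):
    """Convert a space-separated triphone transcript to centre phones."""
    return [base_phone(token) for token in transcript.split()]


if __name__ == "__main__":
    # Recognizer output for variant b), copied from the post above.
    triphone_output = (
        "sil ae+v sil ao+r ae+d n-th ae+v sil s-ae+ng m-uw+n m-uw+n "
        "ao+r ae+d ow-n+l sil dx-ax sil ae+n sil dx-ax ax+n sil"
    )
    print(" ".join(to_monophones(triphone_output)))
```

This only changes the notation; it says nothing about whether either variant recognized the right phones.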
Is there some fundamental thing I am doing wrong? Or are there some parameters I can tweak?
Thank you for any help!
--- (Edited on 11/17/2009 6:44 am [GMT-0600] by ) ---
Hi juge,
>5. I took a recording from the YOHO database and made the feature
>extraction with the following:
I think the YOHO database was recorded using an 8kHz sampling rate, and you are trying to use an acoustic model created with 16kHz audio.
They need to match...
Ken
--- (Edited on 11/17/2009 10:10 am [GMT-0500] by kmaclean) ---
Thank you for this hint, you are right. I tried using the 8kHz models and the results were different now, but not really better.
There seem to be some larger issues.
--- (Edited on 11/17/2009 10:03 am [GMT-0600] by ) ---
>There seem to be some larger issues.
Our acoustic model might not be good enough for phonetic recognition.
Try a current nightly build acoustic model.
If results are still not better, try using some audio that the Acoustic Model was trained with.
Ken
--- (Edited on 11/17/2009 11:46 am [GMT-0500] by kmaclean) ---
I found out the following:
The config file I use for HCopy looks like:
TARGETKIND = MFCC_0_D
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
SOURCEFORMAT=WAVE
SOURCERATE = 8000
------
When I change SOURCERATE from 8000 to 800, the recognition results get better.
It now says 'sil ax r ow th hh zh ae n ah ow sil' when I spoke "professional". You can at least imagine that it is similar.
The source rate actually was 8000 Hz, not 800, but introducing a factor of 10 here seems to help. There must be a problem somewhere, but I don't know why.
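One possible explanation, offered as an assumption rather than something confirmed in this thread: HTK expresses time parameters in units of 100 ns, and SOURCERATE is a sample period in those units, not a frequency in Hz. The Python sketch below (function names are illustrative) converts in both directions to show what the values 8000 and 800 would mean under that reading.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK times are given in units of 100 ns


def hz_to_htk_period(sample_rate_hz):
    """Sample period in 100 ns units for a sample rate given in Hz."""
    return HTK_UNITS_PER_SECOND / sample_rate_hz


def htk_period_to_hz(period):
    """Sample rate in Hz implied by a period given in 100 ns units."""
    return HTK_UNITS_PER_SECOND / period


if __name__ == "__main__":
    for hz in (8000, 16000):
        print(hz, "Hz -> SOURCERATE =", hz_to_htk_period(hz))
    for period in (8000, 800):
        print("SOURCERATE =", period, "->", htk_period_to_hz(period), "Hz")
```

Under this reading, 8 kHz audio corresponds to SOURCERATE = 1250, a value of 8000 describes 1250 Hz audio, and a value of 800 describes 12500 Hz audio. The value 800 would then simply be closer to the true period than 8000, which may be why the results improved.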
--- (Edited on 11/18/2009 6:11 am [GMT-0600] by ) ---
Hi Juge,
>SOURCEFORMAT=WAVE
>SOURCERATE = 8000
I don't think you need these parameters... HCopy can get this information from the header of the wav file... these parameters are normally used for raw (headerless) audio.
Ken
--- (Edited on 11/18/2009 7:29 pm [GMT-0500] by kmaclean) ---
Thank you. I noticed it myself, though. Removing these parameters helps a bit.
But the recognition results are still not good enough. Maybe it's because there is no real dictionary and language model?
Or is it possible that the models are not good enough?
Are there any other things I can do to improve the accuracy?
--- (Edited on 11/19/2009 2:15 am [GMT-0600] by Visitor) ---
>Maybe it's because there is no real dictionary and language model?
Maybe it goes back to your original question in this thread: "Is it possible to create a phoneme recognizer with HTK using the VoxForge Acoustic models?" - the answer might be no...
>Or is it possible that the models are not good enough?
Likely - did you try Keith Vertanen's acoustic models, or use Sphinx and their acoustic models?
>Are there any other things I can do to improve the accuracy?
Use a regular word-based dictionary (rather than a phone-based one) for your speech recognition.
nsh has some Sphinx pointers to look at: Speech Recognition With CMU Sphinx
Ken
--- (Edited on 11/19/2009 10:06 am [GMT-0500] by kmaclean) ---
Did you figure out what the problem was? Have you come across any study with successful results on phoneme recognition with an adapted acoustic model?
Thanks
--- (Edited on 6/29/2011 5:46 pm [GMT-0500] by Visitor) ---
It's been almost 2 years since I did that; I had almost forgotten about it.
I think I abandoned it back then and did not pursue it any further.
--- (Edited on 6/30/2011 12:55 am [GMT-0500] by Visitor) ---