VoxForge
Hi all,
I have a problem using HTK and Julius: I don't get the same recognition score with one as with the other.
I use HTK to train a word model and then use Julius to decode with that model.
As I use a very specific microphone and the language is French, I wasn't able to use a prerecorded corpus. I chose to train a word-based model because my corpus isn't very big (45 occurrences of each word in total, from 15 speakers) and I want to use my system for real-time spoken commands.
I use HInit to initialise the models under HTK and then HRest to train them. I get a mean of 100% correct recognition under HTK, using cross-validation on my corpus.
Here is one of the results from HResults for one iteration of the cross-validation:
HResults -A -D -T 1 -p -u 0.01 -e ??? sil -I iter7/test/testref.mlf listemot_sil.txt iter7/test/recog.mlf
No HTK Configuration Parameters Set
====================== HTK Results Analysis =======================
Date: Mon Sep 15 17:28:53 2008
Ref : iter7/test/testref.mlf
Rec : iter7/test/recog.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=100.00 [H=25, S=0, N=25]
WORD: %Corr=100.00, Acc=100.00 [H=25, D=0, S=0, I=0, N=25]
------------------------ Confusion Matrix -------------------------
       c   g   d   z   p
       a   a   r   e   i
       m   u   o   r   l
       e   c   i   o   o
       r   h   t       t  Del [ %c / %e]
came   5   0   0   0   0    0
gauc   0   5   0   0   0    0
droi   0   0   5   0   0    0
zero   0   0   0   5   0    0
pilo   0   0   0   0   5    0
Ins    0   0   0   0   0
===================================================================
No HTK Configuration Parameters Set
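For readers less familiar with the HResults summary line, here is a minimal sketch of how the two WORD figures relate to the counts in brackets. It assumes the usual HTK definitions (%Corr = H/N, Acc = (H-I)/N, with N = H + D + S); the function name is only illustrative.

```python
def hresults_scores(hits, deletions, substitutions, insertions):
    """Return (%Corr, Acc) from HResults-style counts."""
    # N is the number of labels in the reference transcription.
    n = hits + deletions + substitutions
    percent_correct = 100.0 * hits / n
    # Accuracy additionally penalises inserted labels.
    accuracy = 100.0 * (hits - insertions) / n
    return percent_correct, accuracy

# Counts taken from the WORD line above: H=25, D=0, S=0, I=0, N=25.
corr, acc = hresults_scores(hits=25, deletions=0, substitutions=0, insertions=0)
print(f"%Corr={corr:.2f}, Acc={acc:.2f}")  # prints "%Corr=100.00, Acc=100.00"
```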
When I use Julius for real-time decoding, however, I don't get a similar score; it is closer to 40%.
To make testing easier, I decided to run the same cross-validation with Julius, using sound files as input. I get a mean of around 30-40% correct recognition.
Here is the command line:
julius -input rawfile -realtime -filelist $sndfile -h $mmf -gramlist julius/gramlist.txt -multipath -lv 2500 -rejectshort 70 -headmargin 50 -tailmargin 50 -progout -sp sil -b 0
And here is part of the Julius output for the same MMF file as above:
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/AF_s2_camera.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels
pass1_best:
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_GAUCHE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAMERA
pass1_best: CAMERA
pass1_best: CAMERA
pass1_best_wordseq: 0
pass1_best_phonemeseq: camera
pass1_best_score: -5912.543457
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 198
sentence1: CAMERA
wseq1: 0
phseq1: camera
cmscore1: 1.000
score1: -5912.554688
grammar1: 2
------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/AF_s2_zero.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels
pass1_best:
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PILOTAGE
pass1_best: CAM_PILOTAGE
pass1_best_wordseq: 0
pass1_best_phonemeseq: pilotage
pass1_best_score: -5937.828613
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 205
sentence1: CAM_PILOTAGE
wseq1: 0
phseq1: pilotage
cmscore1: 1.000
score1: -5937.822754
grammar1: 1
------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/AF_s2_pilotage.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels
pass1_best:
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PILOTAGE
pass1_best: CAM_PILOTAGE
pass1_best: CAM_PILOTAGE
pass1_best: CAM_PILOTAGE
pass1_best: CAM_PILOTAGE
pass1_best_wordseq: 0
pass1_best_phonemeseq: pilotage
pass1_best_score: -5945.857422
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 212
sentence1: CAM_PILOTAGE
wseq1: 0
phseq1: pilotage
cmscore1: 1.000
score1: -5945.850098
grammar1: 1
------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/AF_s2_droite.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels
pass1_best:
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best_wordseq: 0
pass1_best_phonemeseq: droite
pass1_best_score: -5936.231934
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 184
sentence1: CAM_PIL_DROITE
wseq1: 0
phseq1: droite
cmscore1: 0.999
score1: -5936.239258
grammar1: 0
------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/AF_s2_gauche.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels
pass1_best:
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best_wordseq: 0
pass1_best_phonemeseq: zero
pass1_best_score: -5817.412109
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 189
sentence1: CAM_PIL_ZERO
wseq1: 0
phseq1: zero
cmscore1: 1.000
score1: -5834.108887
grammar1: 0
------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/NN_s1_gauche.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels
pass1_best:
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best_wordseq: 0
pass1_best_phonemeseq: droite
pass1_best_score: -4845.695801
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 194
sentence1: CAM_PIL_DROITE
wseq1: 0
phseq1: droite
cmscore1: 1.000
score1: -4865.897461
grammar1: 0
------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/NN_s1_pilotage.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels
pass1_best:
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best_wordseq: 0
pass1_best_phonemeseq: droite
pass1_best_score: -5589.427246
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 208
sentence1: CAM_PIL_DROITE
wseq1: 0
phseq1: droite
cmscore1: 1.000
score1: -5598.542969
grammar1: 0
------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/NN_s1_droite.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels
pass1_best:
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best_wordseq: 0
pass1_best_phonemeseq: droite
pass1_best_score: -4763.124023
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 198
sentence1: CAM_PIL_DROITE
wseq1: 0
phseq1: droite
cmscore1: 1.000
score1: -4772.822266
grammar1: 0
------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/NN_s1_camera.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels
pass1_best:
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_GAUCHE
pass1_best: CAMERA
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PILOTAGE
pass1_best: CAM_PILOTAGE
pass1_best_wordseq: 0
pass1_best_phonemeseq: pilotage
pass1_best_score: -5862.962891
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 206
sentence1: CAM_PILOTAGE
wseq1: 0
phseq1: pilotage
cmscore1: 0.658
score1: -5862.988281
grammar1: 1
------
### read waveform input
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/NN_s1_zero.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels
pass1_best:
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_ZERO
pass1_best: CAM_PIL_DROITE
pass1_best: CAM_PIL_DROITE
pass1_best_wordseq: 0
pass1_best_phonemeseq: droite
pass1_best_score: -5343.823730
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 5 generated, 5 pushed, 2 nodes popped in 206
sentence1: CAM_PIL_DROITE
wseq1: 0
phseq1: droite
cmscore1: 1.000
score1: -5358.856445
grammar1: 0
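To turn a log like the one above into a score, something along these lines could be used. It is only a sketch and rests on two assumptions that are not stated explicitly in the post: that the last underscore-separated field of each file name (camera, zero, ...) is the word that was spoken, and that the phseq1 line carries the name of the recognised word model. The function name is hypothetical.

```python
import os
import re

def score_julius_log(log_text):
    """Return (correct, total) by pairing each input file with its phseq1 line."""
    pairs = []        # (expected word, recognised word or None)
    expected = None   # word implied by the file currently being decoded
    for line in log_text.splitlines():
        m = re.search(r"input speechfile:\s*(\S+)", line)
        if m:
            if expected is not None:
                # Previous file produced no result (e.g. rejected): count as a miss.
                pairs.append((expected, None))
            stem = os.path.splitext(os.path.basename(m.group(1)))[0]
            expected = stem.rsplit("_", 1)[-1]  # assumption: last field is the word
            continue
        m = re.match(r"\s*phseq1:\s*(\S+)", line)
        if m and expected is not None:
            pairs.append((expected, m.group(1)))
            expected = None
    if expected is not None:
        pairs.append((expected, None))
    correct = sum(1 for want, got in pairs if want == got)
    return correct, len(pairs)

# Two utterances copied from the log above (other lines omitted).
sample = """\
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/AF_s2_camera.wav
sentence1: CAMERA
phseq1: camera
Stat: adin_sndfile: input speechfile: ../../svn/CorpusVoicis_word/wav16k/AF_s2_zero.wav
sentence1: CAM_PILOTAGE
phseq1: pilotage
"""
correct, total = score_julius_log(sample)
print(f"{correct}/{total} correct")  # prints "1/2 correct"
```

Applying the same rule by hand to the ten utterances shown in the log gives 4 matches out of 10, which is in line with the 30-40% figure quoted earlier.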
I'm using <MFCC_D_A>. I don't use C0 because my corpus was recorded at a much lower level than the output of my sound card.
Does anyone have an idea why I don't get the same result between HTK and Julius?
Thanks a lot for your help.
Best regards,
Bruno.
--- (Edited on 9/19/2008 5:34 am [GMT-0500] by brunal2496) ---
> I'm using <MFCC_D_A>. I don't use C0 because my corpus was recorded at a much lower level than the output of my sound card.
According to the Julius manual:
Note that Julius itself can only extract MFCC_E_D_N_Z features from speech data. If you use an acoustic HMM trained by other feature type, only the HTK parameter file of the same feature type can be used.
--- (Edited on 9/19/2008 9:59 am [GMT-0500] by nsh) ---
Can you give me the page reference in the manual?
I've checked directly in the Julius source files, and I found this comment in wav2mfcc.c:
The supported parameter is MFCC, with any combination of all the qualifiers in HTK: _0, _E, _D, _A, _Z, _N
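As a quick sanity check on the qualifiers, here is a small sketch that computes the feature-vector length a given MFCC parameter kind implies, which can then be compared with the vector size in the model header. It assumes 12 cepstral coefficients (the HTK default for NUMCEPS, as far as I know) and at most one energy term when _N is used; the function name is made up for illustration.

```python
def expected_vector_size(param_kind, num_ceps=12):
    """Expected vector length for an HTK MFCC parameter kind such as MFCC_D_A."""
    base, *qualifiers = param_kind.strip("<>").upper().split("_")
    if base != "MFCC":
        raise ValueError("this sketch only handles MFCC parameter kinds")
    quals = set(qualifiers)
    # Static part: cepstra, plus C0 (_0) and/or log energy (_E) if requested.
    static = num_ceps + int("0" in quals) + int("E" in quals)
    # Deltas (_D) and accelerations (_A) each add one more copy of the static part.
    copies = 1 + int("D" in quals) + int("A" in quals)
    size = static * copies
    if "N" in quals:
        # _N suppresses the absolute energy term (assumed: exactly one such term).
        size -= 1
    # _Z (cepstral mean normalisation) does not change the vector length.
    return size

print(expected_vector_size("<MFCC_D_A>"))    # prints 36
print(expected_vector_size("MFCC_E_D_N_Z"))  # prints 25
```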
I'm checking with Lee Akinobu whether the same applies when using Julius in real time with a microphone.
Anyway, does anybody have an idea why I don't get the same score? Is there something I should pay particular attention to?
--- (Edited on 9/22/2008 5:27 am [GMT-0500] by brunal2496) ---
Hi brunal2496,
The changelog for Julius 3.5.2 on the Julius front page says:
o Wider MFCC types support:
  - Added extraction of acceleration coefficients (_A). Now you
    can recognize waveform or microphone input with AM trained with _A.
  - Support all MFCC qualifiers (_0, _E, _N, _D, _A, _N, _Z) and their
    combination
  - Support for any vector lenth (will be guessed from AM header)
  - New option: "-accwin"
  - New option "-zmeanframe": frame-wise DC offset removal, like HTK
  - New options to specify detailed analysis parameters (see manual):
      -preemph, -fbank, -ceplif, -rawe / -norawe,
      -enormal / -noenormal, -escale, -silfloor
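Since these options exist so that Julius can mirror the analysis settings used at training time, one thing worth trying is to translate the HTK config used for feature extraction into the matching Julius options. The sketch below does that for the options named in the changelog. The pairing of HTK parameter names with Julius options is my own reading of the names and should be checked against the manual, and the config values in the example are invented, not taken from the original post.

```python
# Assumed correspondence between HTK config names and Julius options.
VALUE_OPTIONS = {
    "PREEMCOEF": "-preemph",
    "NUMCHANS": "-fbank",
    "CEPLIFTER": "-ceplif",
    "ESCALE": "-escale",
    "SILFLOOR": "-silfloor",
    "ACCWINDOW": "-accwin",
}
# HTK booleans (T/F) map to a pair of on/off switches.
FLAG_OPTIONS = {
    "RAWENERGY": ("-rawe", "-norawe"),
    "ENORMALISE": ("-enormal", "-noenormal"),
}

def julius_options_from_htk_config(config_text):
    """Return a list of Julius options for the settings found in an HTK config."""
    options = []
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if "=" not in line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        name = name.split(":")[-1].strip().upper()  # drop a module prefix such as "HCOPY:"
        if name in VALUE_OPTIONS:
            options += [VALUE_OPTIONS[name], value]
        elif name in FLAG_OPTIONS:
            on, off = FLAG_OPTIONS[name]
            options.append(on if value.upper() in ("T", "TRUE") else off)
    return options

# Hypothetical HTK config, for illustration only.
example_config = """\
TARGETKIND = MFCC_D_A   # not translated by this sketch
NUMCHANS = 26
CEPLIFTER = 22
PREEMCOEF = 0.97
ENORMALISE = F
"""
print(" ".join(julius_options_from_htk_config(example_config)))
# prints "-fbank 26 -ceplif 22 -preemph 0.97 -noenormal"
```

Any setting left out of the HTK config falls back to a default on each side, and the defaults of the two tools are not guaranteed to agree, so it may be safer to state every value explicitly in both places.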
The Julius book was written for an older release, Julius r3.2. On page 15 of the Julius book, in the Microphone Input section, the author says: "At present the only possible feature extraction method that can take place within Julius/Julian is MFCC_E_D_N_Z feature extraction".
>Anyway, does anybody have an idea why I don't have the same score?
I don't know. I have always seemed to get better recognition results with Julius: each nightly build includes a very rudimentary "sanity" test, and Julius seems to recognize better than HTK (at least for the feature set that I use).
Maybe you need a larger training set?
Ken
--- (Edited on 9/23/2008 8:05 pm [GMT-0400] by kmaclean) ---