VoxForge
Hi all,
I'm a beginner with Julius. With Dragon NS and Sphinx4 it is possible to return a timed result like this: "test(0,41 0,55) result(1,45 1,98)", where the numbers in brackets are the time stamps of the result word. I think this should also be possible with Julius, but I have not found the right option in the .jconf file. Can anybody help me? Thanks in advance.
Ian
--- (Edited on 6/16/2010 5:21 pm [GMT-0500] by Visitor) ---
>...the time stamp of the result word. i think it should also be possible by
>julius but have not found the right option in the .jconf file.
See Julius docs:
Forced Alignment
-walign
Do viterbi alignment per word units from the recognition result.
The word boundary frames and the average acoustic scores per frame are calculated.
-palign
Do viterbi alignment per phoneme (model) units from the recognition result.
The phoneme boundary frames and the average acoustic scores per frame are calculated.
-salign
Do viterbi alignment per HMM state from the recognition result.
The state boundary frames and the average acoustic scores per frame are calculated.
--- (Edited on 6/17/2010 3:33 pm [GMT-0400] by kmaclean) ---
Thanks for your reply! I have tried the -walign option, but it seems that the numbers in brackets, e.g. [0 123], are not the time stamps of the recognized words in the audio file. With Sphinx4, the timed result looks like "test (1,23 4,33) result (4,55 10,33)", where the numbers are the times in seconds at which the words appear in the audio. Can you please explain further? Thanks!
--- (Edited on 6/18/2010 6:14 pm [GMT-0500] by Ian) ---
> but it seems that the number in [0 123] are not the time stamp of the
>recognized words of the audio file which was recognized.
Don't know... I just assumed that forced alignment in Julius output the time stamps, as forced alignment using HTK's HVite command does.
Maybe just use HTK.
--- (Edited on 6/28/2010 2:24 pm [GMT-0400] by kmaclean) ---
-walign
Do viterbi alignment per word units from the recognition result.
The word boundary frames and the average acoustic scores per frame are calculated.
It seems to be the boundary frame. Be aware that the following is purely speculation on my part.
The first question is what "frame" means in this context. I looked at the following output.
### read waveform input
Stat: adin_sndfile: input speechfile: de2-19.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels
STAT: 70998 samples (4.44 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
pass1_best: AN UND </s>
pass1_best_wordseq: AN UND </s>
pass1_best_phonemeseq: Q a n | b U n t | sil
pass1_best_score: -11465.536133
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 1341 generated, 933 pushed, 143 nodes popped in 442
ALIGN: === word alignment begin ===
ALIGN: === phoneme alignment begin ===
sentence1: AN UND FREIHEIT </s>
wseq1: AN UND FREIHEIT </s>
phseq1: Q a n | b U n t | f r aI h aI t | sil
cmscore1: 0.281 0.580 0.207 1.000
score1: -11679.384766
=== begin forced alignment ===
-- word alignment --
id: from to n_score unit
----------------------------------------
[ 0 56] -22.901072 AN [AN]
[ 57 122] -29.492439 UND [UND]
[ 123 329] -27.835896 FREIHEIT [FREIHEIT]
[ 330 441] -22.891262 </s> [</s>]
re-computed AM score: -11578.753906
=== end forced alignment ===
It can't be an audio sample index, because then all the words would fall in the very first part of the file (441 samples at 16000 Hz is less than 30 ms). I looked further up and found this part of the log.
------------------------------------------------------------
Speech Analysis Module(s)
[MFCC01] for [AM00 _default]
Acoustic analysis condition:
parameter = MFCC_0_D_N_Z (25 dim. from 12 cepstrum + c0, abs energy supressed with CMN)
sample frequency = 16000 Hz
sample period = 625 (1 = 100ns)
window size = 400 samples (25.0 ms)
frame shift = 160 samples (10.0 ms)
pre-emphasis = 0.97
# filterbank = 24
cepst. lifter = 22
raw energy = False
energy normalize = False
delta window = 2 frames (20.0 ms) around
hi freq cut = OFF
lo freq cut = OFF
zero mean frame = OFF
use power = OFF
CVN = OFF
VTLN = OFF
spectral subtraction = off
cepstral normalization = sentence CMN
base setup from = Julius default
------------------------------------------------------------
I thought maybe a frame is one of the windows used to compute the features. Since the windows overlap, you can compute the number of window shifts as (70998 - 400) / 160 = 441.2375, which roughly matches the number of "frames" (the last frame index in the alignment above is 441).
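As a quick check of that arithmetic, here is a minimal Python sketch. The function name `count_frames` is my own and not part of Julius, and it assumes a frame is counted only when a full window fits into the signal, which is a guess about how Julius handles the end of the file.

```python
def count_frames(num_samples, window_size, frame_shift):
    """Number of full analysis windows that fit into the signal.

    Assumption: only complete windows count as frames.
    """
    if num_samples < window_size:
        return 0
    return (num_samples - window_size) // frame_shift + 1


# Values taken from the log above: 70998 samples,
# 400-sample window, 160-sample frame shift.
n_frames = count_frames(70998, 400, 160)
print(n_frames)  # 442 frames, i.e. frame indices 0..441
```

Under that assumption the result is 442 frames, numbered 0 to 441, which agrees with the last index in the word alignment.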
To get the times for [ 57 122], remember that frame 57 starts 10 ms * 57 to the right of frame 0.
So frame 57 goes from 0.570 s (570 ms) to 0.595 s (570 ms + 25 ms window size).
Frame 122 goes from 1.220 s (1220 ms) to 1.245 s (1220 ms + 25 ms).
So the word would be between 0.57 s and 1.245 s.
Hope this helps.
Be aware that how many ms 400 samples correspond to depends on the sample frequency.
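The conversion above could be sketched in Python roughly like this. It is only an illustration of my interpretation, not anything Julius provides: the names `frame_span_to_seconds`, `parse_word_alignment` and `ALIGN_LINE` are made up, the regular expression is written against the alignment lines shown in the log above, and the defaults (16000 Hz, 160-sample shift, 400-sample window) are the values from that log.

```python
import re

# Matches alignment lines such as "[ 57 122] -29.492439 UND [UND]".
ALIGN_LINE = re.compile(r"\[\s*(\d+)\s+(\d+)\]\s+(-?\d+\.\d+)\s+(\S+)")


def frame_span_to_seconds(first, last, sample_rate=16000,
                          frame_shift=160, window_size=400):
    """Convert a [first last] frame span to (start, end) in seconds.

    Assumption: a frame starts at index * frame_shift samples and
    covers one full analysis window.
    """
    start = first * frame_shift / sample_rate
    end = (last * frame_shift + window_size) / sample_rate
    return start, end


def parse_word_alignment(text, **params):
    """Return (word, start_sec, end_sec) for each alignment line in text."""
    words = []
    for line in text.splitlines():
        match = ALIGN_LINE.match(line.strip())
        if match:
            first, last = int(match.group(1)), int(match.group(2))
            start, end = frame_span_to_seconds(first, last, **params)
            words.append((match.group(4), start, end))
    return words


log = """\
[ 0 56] -22.901072 AN [AN]
[ 57 122] -29.492439 UND [UND]
[ 123 329] -27.835896 FREIHEIT [FREIHEIT]
[ 330 441] -22.891262 </s> [</s>]
"""

for word, start, end in parse_word_alignment(log):
    print(f"{word}({start:.3f} {end:.3f})")
```

For the second line this prints UND(0.570 1.245), the same values as worked out by hand. Because the windows overlap, the end of one word comes out slightly later than the start of the next; if you want non-overlapping times, you could pass window_size=160 so that each frame only covers its own 10 ms shift.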
binh
--- (Edited on 8/19/2013 5:31 am [GMT-0500] by Visitor) ---