 
    VoxForge
Hi all,
I'm a beginner with Julius. With Dragon NS and Sphinx4 it is possible to get a timed result like this: "test(0,41 0,55) result(1,45 1,98)", where the numbers in brackets are the time stamps of the recognized word. I think this should also be possible with Julius, but I have not found the right option in the .jconf file. Can anybody help me? Thanks in advance.
Ian
--- (Edited on 6/16/2010 5:21 pm [GMT-0500] by Visitor) ---
>...the time stamp of the result word. i think it should also be possible by
>julius but have not found the right option in the .jconf file.
See Julius docs:
Forced Alignment
       -walign
              Do viterbi alignment per word units from the recognition result.
              The word boundary frames and the average acoustic scores per
              frame are calculated.

       -palign
              Do viterbi alignment per phoneme (model) units from the
              recognition result. The phoneme boundary frames and the average
              acoustic scores per frame are calculated.

       -salign
              Do viterbi alignment per HMM state from the recognition result.
              The state boundary frames and the average acoustic scores per
              frame are calculated.
--- (Edited on 6/17/2010 3:33 pm [GMT-0400] by kmaclean) ---
Thanks for your reply! I have tried the -walign option, but it seems that the numbers in [0 123] are not the time stamps of the recognized words in the audio file. With Sphinx4, for example, the timed result looks like "test (1,23 4,33) result (4,55 10,33)", where the numbers are the times in seconds at which the words appear in the audio. Can you please explain a bit more? Thanks!
--- (Edited on 6/18/2010 6:14 pm [GMT-0500] by Ian) ---
> but it seems that the number in [0 123] are not the time stamp of the
>recognized words of the audio file which was recognized.
Don't know... I just assumed that forced alignment in Julius output the time stamps, as forced alignment using HTK's HVite command does.
Maybe just use HTK.
--- (Edited on 6/28/2010 2:24 pm [GMT-0400] by kmaclean) ---
       -walign
              Do viterbi alignment per word units from the recognition result.
              The word boundary frames and the average acoustic scores per
              frame are calculated.
It seems to be the boundary frame. Be aware that the following is purely speculation on my part.
The first question is what "frame" means in this context. I looked at the following output.
### read waveform input
Stat: adin_sndfile: input speechfile: de2-19.wav
Stat: adin_sndfile: input format = Microsoft WAV
Stat: adin_sndfile: input type = Signed 16 bit PCM
Stat: adin_sndfile: endian = file native endian
Stat: adin_sndfile: 16000 Hz, 1 channels
STAT: 70998 samples (4.44 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
pass1_best: AN UND </s>
pass1_best_wordseq: AN UND </s>
pass1_best_phonemeseq: Q a n | b U n t | sil
pass1_best_score: -11465.536133
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 1341 generated, 933 pushed, 143 nodes popped in 442
ALIGN: === word alignment begin ===
ALIGN: === phoneme alignment begin ===
sentence1: AN UND FREIHEIT </s>
wseq1: AN UND FREIHEIT </s>
phseq1: Q a n | b U n t | f r aI h aI t | sil
cmscore1: 0.281 0.580 0.207 1.000
score1: -11679.384766
=== begin forced alignment ===
-- word alignment --
 id: from  to    n_score    unit
 ----------------------------------------
[   0   56]  -22.901072  AN    [AN]
[  57  122]  -29.492439  UND    [UND]
[ 123  329]  -27.835896  FREIHEIT    [FREIHEIT]
[ 330  441]  -22.891262  </s>    [</s>]
re-computed AM score: -11578.753906
=== end forced alignment ===
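It can't be an audio sample index, because then all the words would lie in the very first part of the file (the file has 70998 samples, but the largest number in the alignment is 441). I looked further up and found this part of the log.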
------------------------------------------------------------
Speech Analysis Module(s)
[MFCC01]  for [AM00 _default]
 Acoustic analysis condition:
           parameter = MFCC_0_D_N_Z (25 dim. from 12 cepstrum + c0, abs energy supressed with CMN)
    sample frequency = 16000 Hz
       sample period =  625  (1 = 100ns)
         window size =  400 samples (25.0 ms)
         frame shift =  160 samples (10.0 ms)
        pre-emphasis = 0.97
        # filterbank = 24
       cepst. lifter = 22
          raw energy = False
    energy normalize = False
        delta window = 2 frames (20.0 ms) around
         hi freq cut = OFF
         lo freq cut = OFF
     zero mean frame = OFF
           use power = OFF
                 CVN = OFF
                VTLN = OFF
    spectral subtraction = off
  cepstral normalization = sentence CMN
     base setup from = Julius default
------------------------------------------------------------
I thought maybe a frame is one of the windows used to compute the features. Since the windows overlap, you can compute their number as (70998 - 400) / 160 = 441.2375, which roughly matches the number of "frames".
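The window count above can be checked with a few lines of Python. This is only a sketch of the guess described here, not something taken from the Julius sources: it assumes a frame is one full analysis window, that windows start every "frame shift" samples, and that a trailing partial window is dropped. The function name count_frames is made up for illustration.

```python
def count_frames(num_samples: int, window_size: int = 400, frame_shift: int = 160) -> int:
    """Count the full analysis windows that fit into the signal.

    Assumption (not confirmed from the Julius sources): windows start at
    sample 0, advance by frame_shift, and an incomplete last window is dropped.
    """
    if num_samples < window_size:
        return 0
    return (num_samples - window_size) // frame_shift + 1


if __name__ == "__main__":
    # 70998 samples, 400-sample window, 160-sample shift, as in the log above
    print(count_frames(70998))
```

With the values from the log this gives 442 windows, i.e. frame indices 0 to 441, which lines up with the last frame number (441) in the word alignment.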
To get the times for [  57  122], remember that frame 57 starts 10 ms * 57 later than frame 0.
So frame 57 goes from 0.570 s (570 ms) to 0.595 s (570 ms + 25 ms window size).
Frame 122 goes from 1.220 s (1220 ms) to 1.245 s (1220 ms + 25 ms).
So the word would be between 0.57 s and 1.245 s.
Hope this helps.
Be aware that how many ms 400 samples (and 160 samples) correspond to depends on the sample frequency.
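The conversion above can be sketched in Python so that the -walign lines are turned into seconds directly. This follows the same speculation as the rest of this post: a word starts at the start of its first frame and ends at the end of the analysis window of its last frame. The regular expression is written against the log lines shown above only, and the names frame_span_seconds and parse_word_alignment are made up for illustration. The window size, frame shift and sample rate are passed in samples and Hz, so other settings can be plugged in.

```python
import re

# Matches lines such as "[  57  122]  -29.492439  UND    [UND]"
ALIGN_LINE = re.compile(r"^\[\s*(\d+)\s+(\d+)\]\s+(-?\d+(?:\.\d+)?)\s+(\S+)")


def frame_span_seconds(first_frame, last_frame,
                       frame_shift=160, window_size=400, sample_rate=16000):
    """Return (start, end) in seconds for a frame range.

    Assumption: frame k covers samples [k * frame_shift, k * frame_shift + window_size).
    """
    start = first_frame * frame_shift / sample_rate
    end = (last_frame * frame_shift + window_size) / sample_rate
    return start, end


def parse_word_alignment(lines, **analysis):
    """Extract (word, start_sec, end_sec) from '-- word alignment --' lines."""
    words = []
    for line in lines:
        match = ALIGN_LINE.match(line.strip())
        if match is None:
            continue  # header, separator or unrelated log line
        first, last = int(match.group(1)), int(match.group(2))
        start, end = frame_span_seconds(first, last, **analysis)
        words.append((match.group(4), start, end))
    return words


if __name__ == "__main__":
    log = [
        " id: from  to    n_score    unit",
        "[   0   56]  -22.901072  AN    [AN]",
        "[  57  122]  -29.492439  UND    [UND]",
        "[ 123  329]  -27.835896  FREIHEIT    [FREIHEIT]",
        "[ 330  441]  -22.891262  </s>    [</s>]",
    ]
    for word, start, end in parse_word_alignment(log):
        print(f"{word}({start:.3f} {end:.3f})")
```

Because the windows overlap, the end of one word computed this way lies slightly after the start of the next word. If that is not wanted, taking (last_frame + 1) * frame_shift as the end is another possible convention, but which one matches what Julius means is again a guess.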
binh
--- (Edited on 8/19/2013 5:31 am [GMT-0500] by Visitor) ---