VoxForge
Hi Ken,
>Julius recognises speech perfectly when the input is noise free (using a mic). But now I am concerned about '.wav' files, which are usually a bit noisy (the input is from a telephone line). This causes an obvious decrease in performance.
I tried:
1) Removing the noise from the input '.wav'.
2) Using the noisy samples when training the acoustic model.
Both are yielding results that are not satisfactory. Please let me know the professional way to deal with this.
>One more issue: why is it necessary to place the grammar and vocabulary files in the 'auto' directory and execute the 'mkdfa.pl' script while preparing the acoustic model? (As explained in the 'How-To' and the 'Tutorial'.)
>How can I use the recognition 'score' to reject bad results? Sometimes a bad result's score is close to the score of the correctly matched words.
Thanks in Advance.
--- (Edited on 4/29/2009 1:21 am [GMT-0500] by vishu) ---
Hi vishu,
>But now I am concerned about '.wav' files, which are usually a bit noisy
>(the input is from a telephone line). This causes an obvious decrease in
>performance.
Are you using an acoustic model that was trained with telephony speech (i.e. 8kHz-8bit rather than 16kHz-16bit audio)?
>why is it necessary to place the grammar and vocabulary files in
>the 'auto' directory and execute the 'mkdfa.pl' script while preparing
>the acoustic model?
As described in the Julius book for rev.3.2:
Language Model
For the task grammar, sentence structures are
written in a BNF style using word categories as
terminating symbols to a grammar file. A voca
file contains the pronunciation (phoneme sequence)
for all words within each category.
These files are converted with mkdfa.pl(1) to a
deterministic finite automaton file (.dfa) and a
dictionary file (.dict).
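For illustration, a minimal pair of source files for a small digits task might look like the following (the file names and word list are my own example, not taken from the tutorial):

```
# digits.grammar -- sentence structure in BNF style,
# using word categories as terminal symbols
S : NS_B DIGIT NS_E

# digits.voca -- pronunciation (phoneme sequence) for
# every word in each category
% NS_B
<s>        sil
% NS_E
</s>       sil
% DIGIT
TWO        t uw
THREE      th r iy
```

Running `mkdfa.pl digits` then compiles these two source files into digits.dfa, digits.term and digits.dict, which are the files Julius actually loads at run time.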
>How can I use the recognition 'score' to reject bad results? Sometimes
>a bad result's score is close to the score of the correctly matched words.
Include non-target-grammar words in your grammar. See this post for more information: One word grammar, always recognized?
Ken
--- (Edited on 5/3/2009 8:11 pm [GMT-0400] by kmaclean) ---
Hi Ken, Thanks for the reply.
>Are you using an acoustic model that was trained with telephony speech (i.e. 8kHz-8bit rather than 16kHz-16bit audio)?
Yes. Exactly.
>Language Model:
I am not convinced. Even if I don't supply the '.grammar' and the other related files, I still get an acoustic model. I want to know what difference they make.
>Regarding the 'score value', observe the output below:
### read waveform input
Stat: adin_file: input speechfile: seven.wav
STAT: 12447 samples (1.56 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
............................................................................pass1_best: <s> 5
pass1_best_wordseq: 0 2
pass1_best_phonemeseq: sil | f ay v
pass1_best_score: -1867.966309
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 120 generated, 120 pushed, 14 nodes popped in 76
sentence1: <s> 5 </s>
wseq1: 0 2 1
phseq1: sil | f ay v | sil
cmscore1: 1.000 0.316 1.000
score1: -1944.799561
------
### read waveform input
Stat: adin_file: input speechfile: six.wav
STAT: 12448 samples (1.56 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
............................................................................pass1_best: <s> 8
pass1_best_wordseq: 0 2
pass1_best_phonemeseq: sil | ey t
pass1_best_score: -1802.496094
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 45 generated, 45 pushed, 6 nodes popped in 76
sentence1: <s> 8 </s>
wseq1: 0 2 1
phseq1: sil | ey t | sil
cmscore1: 1.000 0.686 1.000
score1: -1883.121582
------
### read waveform input
Stat: adin_file: input speechfile: three.wav
STAT: 11579 samples (1.45 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
......................................................................pass1_best: <s> 3
pass1_best_wordseq: 0 2
pass1_best_phonemeseq: sil | th r iy
pass1_best_score: -1674.601318
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 90 generated, 90 pushed, 12 nodes popped in 70
sentence1: <s> 3 </s>
wseq1: 0 2 1
phseq1: sil | th r iy | sil
cmscore1: 1.000 0.504 1.000
score1: -1753.032471
------
### read waveform input
Stat: adin_file: input speechfile: two.wav
STAT: 8973 samples (1.12 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
......................................................pass1_best: <s> 2
pass1_best_wordseq: 0 2
pass1_best_phonemeseq: sil | t uw
pass1_best_score: -1381.510620
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 90 generated, 90 pushed, 11 nodes popped in 54
sentence1: <s> 2 </s>
wseq1: 0 2 1
phseq1: sil | t uw | sil
cmscore1: 1.000 0.574 1.000
score1: -1455.194214
**********************************************************
For seven (7) the prediction is 5 with score -1944.799561,
for six (6) the prediction is 8 with score -1883.121582,
for three (3) the prediction is 3 with score -1753.032471,
for two (2) the prediction is 2 with score -1455.194214.
I hope this makes my question clear.
--- (Edited on 5/4/2009 8:18 am [GMT-0500] by vishu) ---
Hi vishu,
> Even if I don't supply the '.grammar' and the other related files, I still
>get an acoustic model. I want to know what difference they make.
I am not sure what you are asking here... As in Step 1 of the Tutorial, the two source files (.voca & .grammar) are "compiled" using the mkdfa script to create three files (.dfa, .term and .dict) so that Julius can use them... I don't know the details of how these are set up; you might need to look at the source of the mkdfa script for this type of info...
>>Regarding 'score value' observe below: [...]
You likely need more audio for your acoustic model. Since you have less audio information in a telephony stream (8kHz-8bit), you probably need double the audio in your acoustic model to get the same recognition rates as you would get with a wideband (16kHz-16bit) acoustic model.
You might also try using phrases for your grammar rather than isolated words. More context *usually* makes it easier for Julius to recognize utterances.
Ken
--- (Edited on 5/4/2009 12:09 pm [GMT-0400] by kmaclean) ---
Hi vishu,
I also do not understand your question about the grammar, but I think I know what you want from the scores. Julius outputs two types of scores:
The Viterbi score, e.g.:
score1: -1944.799561
This is the cumulative score of the most likely HMM path. The Viterbi algorithm (decoder) is just a graph search which compares the scores of all possible paths through the HMM and outputs the best one. The problem is that the score of a path (sentence) depends on the sound file's length, but also on the sound file itself (see this thread for more discussion). This means that Viterbi scores for different files are not comparable. I understand that you want some kind of measure which can tell you whether the result found by Julius is believable or not. In that case, have a look at
The confidence score, in your example:
cmscore1: 1.000 0.316 1.000
Julius outputs a separate score for each word, so in your example the starting silence has a confidence score of 1.0 (i.e. 100%), the word "five" has the score 0.316 (i.e. not that reliable) and the ending silence again has 1.0.
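As a practical sketch (the 0.5 threshold is an arbitrary assumption you would have to tune, and the parsing only covers the stdout lines shown in the log above), the per-word confidence scores could be used to reject dubious hypotheses like this:

```python
def parse_result(lines):
    """Extract the recognized words and per-word confidence scores
    from Julius stdout lines like the ones in the log above."""
    words, scores = [], []
    for line in lines:
        if line.startswith("sentence1:"):
            words = line.split()[1:]              # e.g. ['<s>', '5', '</s>']
        elif line.startswith("cmscore1:"):
            scores = [float(s) for s in line.split()[1:]]
    return list(zip(words, scores))

def accept(result, threshold=0.5):
    """Reject the hypothesis if any real word (ignoring the <s>/</s>
    silence markers) falls below the confidence threshold."""
    return all(score >= threshold
               for word, score in result
               if word not in ("<s>", "</s>"))

# The "seven" example above: recognized as "5" with cmscore 0.316
out = ["sentence1: <s> 5 </s>", "cmscore1: 1.000 0.316 1.000"]
print(accept(parse_result(out)))  # prints False (0.316 < 0.5)
```

With the same threshold, the "six" example (cmscore 0.686) would be accepted even though it is also a misrecognition, which is why the threshold needs tuning on your own data.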
Unfortunately the confidence score computation is tricky. I do not know the details of the Julius implementation, so I cannot tell you how reliable it is; you will have to test it for yourself.
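If the confidence score does not turn out to be reliable enough, another rough workaround (not something Julius does for you) is to normalize the Viterbi score by the utterance length before comparing across files. Using the sample counts and scores from the log above, and assuming 8 kHz telephony audio with a 10 ms frame shift:

```python
def score_per_frame(score, samples, rate=8000, frame_shift_ms=10):
    """Normalize a Julius Viterbi score by the number of analysis
    frames, making scores of different-length files roughly comparable."""
    frames = samples / (rate * frame_shift_ms / 1000.0)  # samples per frame
    return score / frames

# Sample counts and Viterbi scores taken from the log output above
for name, samples, score in [("seven.wav", 12447, -1944.799561),
                             ("six.wav",   12448, -1883.121582),
                             ("three.wav", 11579, -1753.032471),
                             ("two.wav",    8973, -1455.194214)]:
    print(f"{name}: {score_per_frame(score, samples):.2f} per frame")
```

Note that the per-frame values all land in a much narrower band (roughly -12 to -13) than the raw scores, but this still does not by itself separate the correct results from the wrong ones.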
--- (Edited on 05.05.2009 13:45 [GMT+0200] by tpavelka) ---