VoxForge
Hi, I used Julian to test my own acoustic model, created by following the VoxForge Tutorial steps, but I always get a very high WER. Why?
If I try the Linux QuickStart I get a lower WER.
For my test I used more than 100 audio files, downloaded from VoxForge, to train the model, with the same sample.grammar and sample.voca from the QuickStart.
Has anyone had good results with their own acoustic model?
Thanks
Manuel
--- (Edited on 9/11/2007 6:25 am [GMT-0500] by Visitor) ---
At first I tried to build an Italian model, but I wasn't able to create the corresponding questions for tree.hed, so now I'm trying an English acoustic model, created by following the VoxForge tutorial with my own audio recordings.
Manuel
--- (Edited on 9/11/2007 10:32 am [GMT-0500] by Visitor) ---
Hi, I'm trying to modify some of Julian's parameters, and I notice that recognition is much more accurate than before.
Can someone give me more information about these parameters?
-fsize 2400 # window size (samples)
-fshift 960 # frame shift (samples)
What are window size and frame shift?
If I increase them, the WER decreases.
Thanks
Manuel
--- (Edited on 9/12/2007 12:10 pm [GMT-0500] by Visitor) ---
Hi Manuel,
The Julian manual is a little sparse on the topics of window size and frame shift. A quick review of the HTK manual and Googling these search terms resulted in the following:
From section 5.2 (Speech Signal Processing) in the HTK book:
The segment of waveform used to determine each parameter vector is usually referred to as a window and its size is set by the configuration parameter WINDOWSIZE [in 100 ns units]. Notice that the window size and frame rate are independent. Normally, the window size will be larger than the frame rate so that successive windows overlap [...].
For example, a waveform sampled at 16kHz would be converted into 100 parameter vectors per second using a 25 msec window by setting the following configuration parameters:
SOURCERATE = 625
TARGETRATE = 100000
WINDOWSIZE = 250000
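For reference, here is a minimal Python sketch (mine, not from the HTK book) that converts those 100 ns units into a sample rate, frame shift, and window size, confirming the 16kHz / 25 msec numbers above:

    SOURCERATE = 625       # sample period, in 100 ns units
    TARGETRATE = 100000    # frame shift, in 100 ns units
    WINDOWSIZE = 250000    # window size, in 100 ns units

    sample_rate_hz = 1 / (SOURCERATE * 100e-9)    # 16000.0 Hz
    frame_shift_ms = TARGETRATE * 100e-9 * 1000   # 10.0 ms -> 100 frames per second
    window_size_ms = WINDOWSIZE * 100e-9 * 1000   # 25.0 ms window

    print(sample_rate_hz, frame_shift_ms, window_size_ms)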
From SPEECH and LANGUAGE PROCESSING: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, by Daniel Jurafsky and James H. Martin, second edition (draft chapters; an excellent book BTW, I have the first edition):
9.3.2 Windowing
Recall that the goal of feature extraction is to provide spectral features that can help us build phone or subphone classifiers. We therefore don't want to extract our spectral features from an entire utterance or conversation, because the spectrum changes very quickly. Technically, we say that speech is a non-stationary signal, meaning that its statistical properties are not constant across time. Instead, we want to extract spectral features from a small window of speech that characterizes a particular subphone and for which we can make the (rough) assumption that the signal is stationary (i.e. its statistical properties are constant within this region).
We’ll do this by using a window which is non-zero inside some region and zero elsewhere, running this window across the speech signal, and extracting the waveform inside this window.
We can characterize such a windowing process by three parameters: how wide is the window (in milliseconds), what is the offset between successive windows, and what is the shape of the window. We call the speech extracted from each window a frame, and we call the number of milliseconds in the frame the frame size and the number of milliseconds between the left edges of successive windows the frame shift.
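To make those three parameters concrete, here is a minimal Python sketch of that windowing process (my illustration, not Julian's actual front end). Note that the frame size and frame shift here are in samples, the same units as Julian's -fsize and -fshift options:

    import numpy as np

    def frames(signal, frame_size, frame_shift):
        # Slide a window across the signal; one frame per step of frame_shift.
        window = np.hamming(frame_size)   # the "shape of the window" parameter
        out = []
        for start in range(0, len(signal) - frame_size + 1, frame_shift):
            out.append(signal[start:start + frame_size] * window)
        return np.array(out)

    # At 16 kHz, a 25 ms window and a 10 ms shift are 400 and 160 samples
    # (16000 * 0.025 and 16000 * 0.010) -- roughly 100 frames per second.
    signal = np.random.randn(16000)        # one second of fake 16 kHz audio
    print(frames(signal, 400, 160).shape)  # (98, 400)

Note that this also means the right -fsize and -fshift values in samples depend on the sampling rate of your audio: the same window and shift in milliseconds correspond to different numbers of samples at different rates.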
One question with respect to the WER problems you have been experiencing: does all of the speech you used to create your acoustic models have the same sampling rate, and are you using this same sampling rate in your Julian jconf file?
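If it helps, here is a quick sketch for checking that (the "wav" directory name is just an assumption; point it at wherever your training audio lives):

    import wave
    from pathlib import Path

    rates = {}
    for path in Path("wav").glob("*.wav"):
        with wave.open(str(path), "rb") as w:
            rates.setdefault(w.getframerate(), []).append(path.name)

    for rate, files in sorted(rates.items()):
        print(rate, "Hz:", len(files), "files")
    # More than one line of output means mixed sampling rates in the training data.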
Ken
--- (Edited on 9/12/2007 2:53 pm [GMT-0400] by kmaclean) ---