julius config for short utterances like single letters?

Speech Recognition Engines

User: kburden000
Date: 7/23/2009 2:38 pm

Views: 5757
Rating: 6

What jconf parameters would be best for recognizing short utterances like single letters?

I am trying to do a voice keyboard with speaker dependence.

Recognizing 'delete', 'return', etcetera is good, but single-character key recognition like 'A' or 'B' is not good.

I assume that this is a config issue.

I have tried tuning but with no success.

Anyone have suggestions?

Thanks

--- (Edited on 7/23/2009 2:39 pm [GMT-0500] by kburden000) ---

Use Alpha, Bravo, Charlie, Delta for better recognition

User: ralfherzog
Date: 7/24/2009 1:43 pm

Views: 242
Rating: 8

Hello! It is better to choose words that are not too short to get a better recognition result. Here is my proposition: take a look at the NATO phonetic alphabet.

A = Alpha

B = Bravo

C = Charlie

D = Delta
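
If the application still needs to emit single letters, the recognized code words can be mapped back to keys after recognition. A minimal sketch, assuming the recognizer output arrives as one plain word per utterance; the names `NATO_TO_KEY` and `word_to_key` are hypothetical, and only the four words listed above are filled in:

```python
# Hypothetical lookup table: recognized code word -> key to emit.
# Only the four words from the list above are included.
NATO_TO_KEY = {
    "ALPHA": "a",
    "BRAVO": "b",
    "CHARLIE": "c",
    "DELTA": "d",
}


def word_to_key(word):
    """Return the key for a recognized code word, or None if it is not one."""
    return NATO_TO_KEY.get(word.strip().upper())


print(word_to_key("Bravo"))
print(word_to_key("delete"))
```

Words that are not in the table (such as 'delete') come back as None, so they can be handled as commands elsewhere.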

--- (Edited on 2009-07-24 1:43 pm [GMT-0500] by ralfherzog) ---

Re: Use Alpha, Bravo, Charlie, Delta for better recognition

User: kburden000
Date: 7/27/2009 9:42 am

Views: 411
Rating: 5

Thank you. Yes, what you suggest would work. It is what I am doing now while trying to get better recognition. The issue I have is that when generalizing the application, most people will not know those substitutions.

I was hoping to find a way to optimize for the shorter utterances. When looking at sonograms I can see the differences between each utterance. What configuration changes would make the system more granular? Perhaps the number of cepstral coefficients?

Or is what I am trying to do just not possible?

Thanks again.

--- (Edited on 7/27/2009 9:42 am [GMT-0500] by kburden000) ---

Re: Use Alpha, Bravo, Charlie, Delta for better recognition

User: kmaclean
Date: 7/28/2009 4:50 pm

Views: 49
Rating: 5

Hi kburden000,

>What configuration changes would make the system more granular?

>Perhaps the number of cepstral coefficients?

You might try using more speech in your training set...

Ken

--- (Edited on 7/28/2009 5:50 pm [GMT-0400] by kmaclean) ---

Re: Use Alpha, Bravo, Charlie, Delta for better recognition

User: kburden000
Date: 7/28/2009 10:37 pm

Views: 57
Rating: 6

I have 10 recordings per alphabet letter, sampled at 48 kHz and made in an above-average acoustic setting.

Could I have "over-trained"? Is there such a thing?

Would adding more speakers help smooth the hmmdefs?

Here's my voxforge/auto/scripts/config...

TARGETKIND = MFCC_0_D_N_Z

TARGETRATE = 100000.0

SAVECOMPRESSED = T

SAVEWITHCRC = T

WINDOWSIZE = 250000.0

USEHAMMING = T

PREEMCOEF = 0.97

NUMCHANS = 26

CEPLIFTER = 22

NUMCEPS = 12
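
For reference, the feature vector size implied by this config can be worked out from TARGETKIND and NUMCEPS. The sketch below is a rough calculation, assuming the usual HTK reading of the qualifiers (_0 appends C0, _D appends deltas, _A appends accelerations, _N drops the absolute energy term, _Z does not change the size); the function name `htk_vector_size` is hypothetical:

```python
def htk_vector_size(num_ceps, c0=False, deltas=False, accels=False,
                    suppress_abs_energy=False):
    """Estimate the feature vector size for an HTK MFCC target kind."""
    static = num_ceps + (1 if c0 else 0)   # _0 appends C0 to the cepstra
    size = static
    if deltas:
        size += static                     # _D appends one delta per static term
    if accels:
        size += static                     # _A appends one acceleration per static term
    if suppress_abs_energy:
        size -= 1                          # _N drops the static energy term only
    return size


# MFCC_0_D_N_Z with NUMCEPS = 12, as in the config above
print(htk_vector_size(12, c0=True, deltas=True, suppress_abs_energy=True))

# Same settings with accelerations added (MFCC_0_D_A_N_Z)
print(htk_vector_size(12, c0=True, deltas=True, accels=True,
                      suppress_abs_energy=True))
```

Under these assumptions the current config gives 25 values per frame, and adding accelerations would give 38. The prototype HMM vector size would need to change to match.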

Upon reviewing the HTK book, it appears the system training may be defaulting to 16k sampling, even though I am running julius with the following in voxforge/auto/julian.jconf...

 -smpFreq 48000

Is it necessary to include the following in voxforge/auto/scripts/config?

SOURCERATE = 20.8

Would changing the filter banks help?

Or perhaps adding higher order coefficients to the TARGETKIND?

Thanks

--- (Edited on 7/28/2009 10:37 pm [GMT-0500] by kburden000) ---

Re: Use Alpha, Bravo, Charlie, Delta for better recognition

User: kmaclean
Date: 7/29/2009 7:56 am

Views: 2319
Rating: 6

Hi kburden000,

> I have 10 recordings per alphabet letter

Try more recordings (25-50?)... maybe include words covering all the letter sounds in the alphabet.

It is more difficult for a speech recognition engine to recognize individual letter sounds without context...

However, Julius works reasonably well for number selection (using the acoustic model included in the VoxForge Quickstart), so you should be able to get good results for letter selection with a well trained acoustic model.

You might be better off 'adapting' the VoxForge acoustic model with your recordings.

>Could I have "over trained" ? Is there such a thing?

This only applies when you are trying to create a speaker-independent acoustic model where a large portion of the training data is from one person.

>Upon reviewing the HTK book it appears the system training may be

>defaulting to 16k sampling,

HCopy converts your training audio to feature sets, independent of the audio sampling rate... this is indicated by these entries in your config:

TARGETKIND = MFCC_0_D_N_Z

TARGETRATE = 100000.0

>Is it necessary to include in voxforge/auto/scripts/config

>SOURCERATE = 20.8

Not exactly sure how you are calculating this, but no.
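
For what it is worth, HTK expresses time parameters (SOURCERATE, TARGETRATE, WINDOWSIZE) in units of 100 ns, so 20.8 looks like the 48 kHz sample period in microseconds rather than in HTK units. A small sketch of the arithmetic; the helper names `sample_period_htk` and `htk_to_samples` are made up for illustration:

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK time values are in 100 ns units


def sample_period_htk(sample_rate_hz):
    """Sample period in 100 ns units (the unit SOURCERATE uses)."""
    return HTK_UNITS_PER_SECOND / sample_rate_hz


def htk_to_samples(duration_htk, sample_rate_hz):
    """Convert a duration in 100 ns units to a number of samples."""
    return round(duration_htk * sample_rate_hz / HTK_UNITS_PER_SECOND)


for rate in (16000, 48000):
    print(
        rate,
        round(sample_period_htk(rate), 2),   # sample period in HTK units
        htk_to_samples(250000.0, rate),      # WINDOWSIZE = 250000.0 (25 ms)
        htk_to_samples(100000.0, rate),      # TARGETRATE = 100000.0 (10 ms)
    )
```

At 16 kHz the sample period is 625 units and the window and shift are 400 and 160 samples; at 48 kHz they come out to about 208.33 units, 1200 samples and 480 samples. As far as I know, Julius takes its frame size and shift (-fsize, -fshift) in samples, so those would be the numbers to check against -smpFreq 48000.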

>Would changing the filter banks help?

>Or perhaps adding higher order coefficients to the TARGETKIND?

Don't know... much of what I have learned has been through trial and error... Google for some papers on this or review the HTK email archives.

I would try with more training data before experimenting with these settings, though.

Ken

--- (Edited on 7/29/2009 8:56 am [GMT-0400] by kmaclean) ---
