English Gigaword language model training recipe

User: JohnTitor
Date: 4/20/2016 9:55 am

Views: 9724
Rating: 0

When I try to use http://www.keithv.com/software/giga/

I'm getting this error:

mertyildiran@Corsair:~/Downloads/lm_giga_5k_nvp_3gram$ julius -d julius.bin -h hmmdefs -v lm_giga_5k_nvp.sphinx.dic -hlist tiedlist
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: rdhmmdef: ascii format HMM definition
Stat: rdhmmdef: limit check passed
Stat: check_hmm_restriction: an HMM with several arcs from initial state found: "sp"
Stat: rdhmmdef: this HMM requires multipath handling at decoding
Stat: init_phmm: defined HMMs: 16032
Stat: init_phmm: loading ascii hmmlist
Stat: init_phmm: logical names: 24402 in HMMList
Stat: init_phmm: base phones: 41 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: making pseudo bi/mono-phone for IW-triphone
Stat: hmm_lookup: 799 pseudo phones are added to logical HMM list
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default
Stat: init_voca: read 5884 words
ERROR: m_fusion: head sil word "<s>" not exist in voca
ERROR: m_fusion: failed to initialize dictionary
ERROR: Error in loading model

--- (Edited on 4/20/2016 9:55 am [GMT-0500] by Visitor) ---

Re: English Gigaword language model training recipe

User: colbec
Date: 4/21/2016 4:47 am

Views: 65
Rating: 1

I think the important line here is "head sil word "<s>" not exist in voca"

Check to see that you have the lines

<s> sil
</s> sil

somewhere in the file pointed to by the -v parameter. If not, insert them and try again. I don't think the position matters; see the discussion in https://github.com/julius-speech/julius/issues/15. I have found some variability in the output of language model generators, which can cause confusion when Julius loads the result.
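
A quick way to run that check is sketched below in Python. It assumes the dictionary is plain text with the word as the first whitespace-separated field on each line; the function name and the throwaway sample file are only illustrative, so point it at your own -v file instead.

```python
import os
import tempfile

def missing_silence_words(dict_path, required=("<s>", "</s>")):
    """Return the required head words that never start a line in the dictionary."""
    found = set()
    with open(dict_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            fields = line.split()
            # Only the first field (the word itself) is compared.
            if fields and fields[0] in required:
                found.add(fields[0])
    return [word for word in required if word not in found]

# Demo on a throwaway dictionary that deliberately lacks the </s> entry.
sample = "ABOUT ax b aw t\n<s> sil\n"
with tempfile.NamedTemporaryFile("w", suffix=".dic", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(sample)
    demo_path = tmp.name

missing = missing_silence_words(demo_path)
os.remove(demo_path)
print("missing:", missing)
```

An empty list means both entries are present, at least by this simple first-field test.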

--- (Edited on 2016-04-21 5:47 am [GMT-0400] by colbec) ---

Re: English Gigaword language model training recipe

User: JohnTitor
Date: 4/21/2016 12:10 pm

Views: 30
Rating: 0

Thanks, it worked. But now I'm getting this:

------

### read analyzed parameter

enter MFCC filename->

I plugged in the macros file:

------

### read analyzed parameter

enter MFCC filename->macros

and I'm getting this error:

input MFCC file: macros

Warning: rdparam: header says it has 2121206332 frames (more than 10 minutes)

Warning: rdparam: it may be a little endian MFCC

Warning: rdparam: now try reading with endian conversion

Error: rdparam: failed to read 39552 bytes

--- (Edited on 4/21/2016 12:10 pm [GMT-0500] by Visitor) ---

Re: English Gigaword language model training recipe

User: colbec
Date: 4/21/2016 1:55 pm

Views: 410
Rating: 1

"enter MFCC filename->" is the kind of prompt I get when I forget to tell Julius that I intend to use the mic as input, and it starts making assumptions. Are you sure that macros belongs here?
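
For illustration, here is a small Python sketch that adds an explicit input source to the command from the first post. It assumes Julius's `-input mic` option is what is wanted here; the helper name is made up, and the other arguments are copied unchanged from that post.

```python
import shlex

# Arguments copied from the command in the first post.
base_args = ["julius", "-d", "julius.bin", "-h", "hmmdefs",
             "-v", "lm_giga_5k_nvp.sphinx.dic", "-hlist", "tiedlist"]

def with_input_source(args, source="mic"):
    """Return a copy of args with -input appended if no -input is present."""
    if "-input" in args:
        return list(args)
    return list(args) + ["-input", source]

command = with_input_source(base_args)
# Print the command line so it can be pasted into a shell.
print(shlex.join(command))
```

The same option can also go in a jconf file rather than on the command line.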

--- (Edited on 2016-04-21 2:55 pm [GMT-0400] by colbec) ---

Re: English Gigaword language model training recipe

User: wildcard
Date: 5/16/2016 5:14 am

Views: 35
Rating: 0

Hi John,

I have just gone to follow the training recipe you linked to but can't seem to find a way to download the LDC's Gigaword text corpus. Is there a secret to this I don't know about?

--- (Edited on 5/16/2016 5:14 am [GMT-0500] by ) ---

Re: English Gigaword language model training recipe

User: kmaclean
Date: 5/16/2016 7:32 am

Views: 31
Rating: 0

LDC's Gigaword text corpus

--- (Edited on 5/16/2016 8:32 am [GMT-0400] by kmaclean) ---

Re: English Gigaword language model training recipe

User: wildcard
Date: 5/16/2016 7:45 am

Views: 26
Rating: 0

I tried that link earlier. I couldn't see a download link on the page, so I created an account and logged in. Now I just get a big error message when I try to load that page again:

We're sorry, but something went wrong.

--- (Edited on 5/16/2016 7:45 am [GMT-0500] by ) ---

--- (Edited on 5/16/2016 7:46 am [GMT-0500] by ) ---

Re: English Gigaword language model training recipe

User: colbec
Date: 5/16/2016 8:10 am

Views: 120
Rating: 0

The LDC controls copyright on the corpus. You cannot get to it unless you buy a membership and DVD or become a student and get permission that way. In any case you are probably taking too big a bite at the problem; you don't need the entire corpus if you are content to try the already existing LMs (see the downloads on the link you already have) that have been generated by others.

To find out how to use an LM, create a simple one from a Librivox book text with one of the freely available LM generators. Even better, convert a grammar that you know works into an LM: script a "corpus" based on the grammar, run an LM generator over it, and then set Julius to work on the resulting LM with the acoustic model you already use with the grammar. Once you have some experience, go back to the LMs derived from the LDC corpus if necessary.

But first, find out what an LM looks like: generate a few and toss them away, just so you know what they consist of.
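
As a rough sketch of the "script a corpus from a grammar" idea, the Python below expands a tiny made-up grammar into one sentence per line. The word lists are hypothetical, and whether your LM generator wants the <s> and </s> markers in its input is an assumption you should check against that tool's documentation.

```python
import itertools

# Hypothetical toy grammar: each slot lists the words allowed at that position.
GRAMMAR_SLOTS = [
    ["CALL", "DIAL"],
    ["STEVE", "YOUNG", "BOB"],
]

def grammar_to_corpus(slots, start="<s>", end="</s>"):
    """Expand every word combination into one sentence, with sentence markers."""
    sentences = []
    for words in itertools.product(*slots):
        sentences.append(" ".join([start, *words, end]))
    return sentences

corpus = grammar_to_corpus(GRAMMAR_SLOTS)
# One sentence per line, ready to be saved as a text corpus.
print("\n".join(corpus))
```

Write the output to a text file and feed that to the LM generator of your choice.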

--- (Edited on 2016-05-16 9:10 am [GMT-0400] by colbec) ---

Re: English Gigaword language model training recipe

User: wildcard
Date: 5/17/2016 5:15 am

Views: 4259
Rating: 0

Thanks (once again!) for the advice. You are really helping me get an understanding of something I knew nothing about!

I downloaded a 2gram and 3gram bundle from the link above. I plugged the 2 arpa files into mkbingram using the following command:

mkbingram -nlr lm_giga_64k_vp_2gram.arpa -nlr lm_giga_64k_vp_3gram.arpa outfile.bingram

I then set up my test.jconf file as follows:

## Language model file(s)

-d lm/outfile.bingram

## Word dictionary file

-v tutorial/sample.dict

## Acoustic HMM file

-h tutorial/hmm15/hmmdefs

-hlist tutorial/tiedlist

I run julius with this test.jconf but get:

STAT: include config: test.jconf

STAT: jconf successfully finalized

STAT: *** loading AM00 _default

Stat: init_phmm: Reading in HMM definition

Stat: rdhmmdef: ascii format HMM definition

Stat: rdhmmdef: limit check passed

Stat: check_hmm_restriction: an HMM with several arcs from initial state found: "sp"

Stat: rdhmmdef: this HMM requires multipath handling at decoding

Stat: rdhmmdef: no <SID> embedded

Stat: rdhmmdef: assign SID by the order of appearance

Stat: init_phmm: defined HMMs: 811

Stat: init_phmm: loading ascii hmmlist

Stat: init_phmm: logical names: 24402 in HMMList

Stat: init_phmm: base phones: 41 used in logical

Stat: init_phmm: finished reading HMM definitions

STAT: making pseudo bi/mono-phone for IW-triphone

Stat: hmm_lookup: 799 pseudo phones are added to logical HMM list

STAT: *** AM00 _default loaded

STAT: *** loading LM00 _default

Stat: init_voca: read 18 words

ERROR: m_fusion: head sil word "<s>" not exist in voca

ERROR: m_fusion: failed to initialize dictionary

ERROR: Error in loading model

As per your advice above, I checked the dict file and it has the <s> and </s> entries in there.

What am I missing?

--- (Edited on 5/17/2016 5:15 am [GMT-0500] by ) ---

--- (Edited on 5/17/2016 5:20 am [GMT-0500] by ) ---
