Speech Recognition Engines

CMUSLM *.arpa
User: chn
Date: 8/19/2008 7:21 am
Views: 17298
Rating: 9

I built a.arpa with CMUSLM 2.0, using <s> and </s> as context cues.

Why are <s> and </s> predicted in my a.arpa file? If a.text contains no <s></s> markers, the result is b.arpa, and if b.vocab has no <s></s>, the result is c.arpa. What should I do?

These are my files:

a.text

<s>  OPEN BROWSER </s>
<s>  NEW E-MAIL  </s>
<s>  FORWARD  </s>
<s>  BACKWARD </s>
<s>  NEXT WINDOW </s>
<s>  LAST WINDOW </s>
<s>  OPEN MUSIC PLAYER </s>

a.vocab

## Vocab generated by v2 of the CMU-Cambridge Statistcal
## Language Modeling toolkit.
##
## Includes 13 words ##
</s>
<s>
BACKWARD
BROWSER
E-MAIL
FORWARD
LAST
MUSIC
NEW
NEXT
OPEN
PLAYER
WINDOW 
 

a.arpa

#############################################################################
## Copyright (c) 1996, Carnegie Mellon University, Cambridge University,
## Ronald Rosenfeld and Philip Clarkson
## Version 3, Copyright (c) 2006, Carnegie Mellon University
## Contributors includes Wen Xu, Ananlada Chotimongkol,
## David Huggins-Daines, Arthur Chan and Alan Black
#############################################################################
=============================================================================
===============  This file was produced by the CMU-Cambridge  ===============
===============     Statistical Language Modeling Toolkit     ===============
=============================================================================
This is a 3-gram language model, based on a vocabulary of 13 words,
  which begins "</s>", "<s>", "BACKWARD"...
This is an OPEN-vocabulary model (type 1)
  (OOVs were mapped to UNK, which is treated as any other vocabulary word)
Witten Bell discounting was applied.
This file is in the ARPA-standard format introduced by Doug Paul.

p(wd3|wd1,wd2)= if(trigram exists)           p_3(wd1,wd2,wd3)
                else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2)
                else                         p(wd3|w2)

p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
            else              bo_wt_1(wd1)*p_1(wd2)

All probs and back-off weights (bo_wt) are given in log10 form.

Data formats:

Beginning of data mark: \data\
ngram 1=nr            # number of 1-grams
ngram 2=nr            # number of 2-grams
ngram 3=nr            # number of 3-grams

\1-grams:
p_1     wd_1 bo_wt_1
\2-grams:
p_2     wd_1 wd_2 bo_wt_2
\3-grams:
p_3     wd_1 wd_2 wd_3

end of data mark: \end\

\data\
ngram 1=14
ngram 2=18
ngram 3=24

\1-grams:
-1.1461 <UNK>    0.0000
-98.8037 </s>    -0.8451
-98.8037 <s>    -0.0348
-1.1461 BACKWARD    -0.3010
-1.1461 BROWSER    -0.3010
-1.1461 E-MAIL    -0.3010
-1.1461 FORWARD    -0.3010
-1.1461 LAST    -0.2341
-1.1461 MUSIC    -0.2688
-1.1461 NEW    -0.2688
-1.1461 NEXT    -0.2341
-0.8451 OPEN    -0.2341
-1.1461 PLAYER    0.0000
-0.8451 WINDOW    -0.4771

\2-grams:
-0.0669 </s> <s> 0.0348
-1.1139 <s> BACKWARD 0.0000
-1.1139 <s> FORWARD 0.0000
-1.1139 <s> LAST 0.0000
-1.1139 <s> NEW 0.0000
-1.1139 <s> NEXT 0.0000
-0.8129 <s> OPEN 0.0000
-0.3010 BACKWARD </s> 0.5441
-0.3010 BROWSER </s> 0.5441
-0.3010 E-MAIL </s> 0.5441
-0.3010 FORWARD </s> 0.5441
-0.3010 LAST WINDOW 0.1761
-0.3010 MUSIC PLAYER -0.3010
-0.3010 NEW E-MAIL 0.0000
-0.3010 NEXT WINDOW 0.1761
-0.6021 OPEN BROWSER 0.0000
-0.6021 OPEN MUSIC 0.0000
-0.1761 WINDOW </s> 0.3680

\3-grams:
-1.0792 </s> <s> BACKWARD
-1.0792 </s> <s> FORWARD
-1.0792 </s> <s> LAST
-1.0792 </s> <s> NEW
-1.0792 </s> <s> NEXT
-1.0792 </s> <s> OPEN
-0.3010 <s> BACKWARD </s>
-0.3010 <s> FORWARD </s>
-0.3010 <s> LAST WINDOW
-0.3010 <s> NEW E-MAIL
-0.3010 <s> NEXT WINDOW
-0.6021 <s> OPEN BROWSER
-0.6021 <s> OPEN MUSIC
-0.3010 BACKWARD </s> <s>
-0.3010 BROWSER </s> <s>
-0.3010 E-MAIL </s> <s>
-0.3010 FORWARD </s> <s>
-0.3010 LAST WINDOW </s>
-0.3010 MUSIC PLAYER </s>
-0.3010 NEW E-MAIL </s>
-0.3010 NEXT WINDOW </s>
-0.3010 OPEN BROWSER </s>
-0.3010 OPEN MUSIC PLAYER
-0.1761 WINDOW </s> <s>

\end\
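
For readers following along, the back-off rule printed in the file header above can be sketched in a few lines of Python. This is only an illustration, not how the toolkit or the Sphinx decoder implements it: `parse_arpa`, `logprob`, and `SAMPLE` are hypothetical names, and `SAMPLE` is a hand-picked excerpt of the a.arpa entries above. Because every value is stored as log10, the product bo_wt * p from the header becomes a sum.

```python
# Hypothetical sketch of an ARPA back-off lookup; not the toolkit's own code.

# Hand-picked excerpt of a.arpa (values copied from the listing above).
SAMPLE = r"""
\data\
ngram 1=4
ngram 2=3
ngram 3=2

\1-grams:
-98.8037 <s>    -0.0348
-1.1461 MUSIC    -0.2688
-0.8451 OPEN    -0.2341
-1.1461 PLAYER    0.0000

\2-grams:
-0.8129 <s> OPEN 0.0000
-0.3010 MUSIC PLAYER -0.3010
-0.6021 OPEN MUSIC 0.0000

\3-grams:
-0.6021 <s> OPEN MUSIC
-0.3010 OPEN MUSIC PLAYER

\end\
"""


def parse_arpa(text):
    """Return {order: {word tuple: (log10 prob, log10 back-off weight)}}."""
    tables = {}
    order = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:-len("-grams:")])
            tables[order] = {}
            continue
        if order is None:
            continue  # header text before the first section
        fields = line.split()
        words = tuple(fields[1:1 + order])
        # The highest-order section carries no back-off column.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        tables[order][words] = (float(fields[0]), backoff)
    return tables


def logprob(tables, words):
    """log10 p(last word | preceding words) using the header's back-off rule."""
    words = tuple(words)
    table = tables.get(len(words), {})
    if words in table:
        return table[words][0]
    if len(words) == 1:
        raise KeyError(words)  # this sketch does no <UNK> mapping
    # Back off: add the context's weight (0 if the context is unseen),
    # then retry with the context shortened by its first word.
    context_weight = tables.get(len(words) - 1, {}).get(words[:-1], (0.0, 0.0))[1]
    return context_weight + logprob(tables, words[1:])


tables = parse_arpa(SAMPLE)
print(logprob(tables, ("OPEN", "MUSIC", "PLAYER")))  # trigram is listed
print(logprob(tables, ("<s>", "OPEN", "PLAYER")))    # backs off twice
```

The second lookup finds neither the trigram nor the bigram OPEN PLAYER, so it adds bo_wt_2(<s>, OPEN), bo_wt_1(OPEN), and p_1(PLAYER).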

b.arpa

....

\data\
ngram 1=14
ngram 2=11
ngram 3=11

\1-grams:
-1.1461 <UNK>    0.0000
-98.8451 </s>    0.0000
-98.8451 <s>    0.0000
-1.1461 BACKWARD    -0.2688
-1.1461 BROWSER    -0.2688
-1.1461 E-MAIL    -0.2688
-1.1461 FORWARD    -0.2688
-1.1461 LAST    -0.2341
-1.1461 MUSIC    0.0000
-1.1461 NEW    -0.2688
-1.1461 NEXT    -0.2341
-0.8451 OPEN    -0.2341
-1.1461 PLAYER    0.0000
-0.8451 WINDOW    -0.1963

\2-grams:
-0.3010 BACKWARD NEXT 0.0000
-0.3010 BROWSER NEW 0.0000
-0.3010 E-MAIL FORWARD 0.0000
-0.3010 FORWARD BACKWARD 0.0000
-0.3010 LAST WINDOW -0.1761
-0.3010 NEW E-MAIL 0.0000
-0.3010 NEXT WINDOW -0.1761
-0.6021 OPEN BROWSER 0.0000
-0.6021 OPEN MUSIC -0.2688
-0.6021 WINDOW LAST 0.0000
-0.6021 WINDOW OPEN -0.1761

\3-grams:
-0.3010 BACKWARD NEXT WINDOW
-0.3010 BROWSER NEW E-MAIL
-0.3010 E-MAIL FORWARD BACKWARD
-0.3010 FORWARD BACKWARD NEXT
-0.3010 LAST WINDOW OPEN
-0.3010 NEW E-MAIL FORWARD
-0.3010 NEXT WINDOW LAST
-0.3010 OPEN BROWSER NEW
-0.3010 OPEN MUSIC PLAYER
-0.3010 WINDOW LAST WINDOW
-0.3010 WINDOW OPEN MUSIC

\end\

c.arpa

 ...

\1-grams:
-0.2840 <UNK>    0.2382
-1.4491 BACKWARD    0.3188
-1.4491 BROWSER    0.3188
-1.4491 E-MAIL    0.3188
-1.4491 FORWARD    0.3188
-1.4491 LAST    0.0362
-1.4491 MUSIC    0.0157
-1.4491 NEW    0.0157
-1.4491 NEXT    0.0362
-1.0969 OPEN    0.0320
-1.4491 PLAYER    0.0000
-1.0969 WINDOW    -0.2833

\2-grams:
-0.3358 <UNK> <UNK> 0.0726
-99.9990 <UNK> BACKWARD 0.0000
-99.9990 <UNK> FORWARD 0.0000
-99.9990 <UNK> LAST 0.0000
-99.9990 <UNK> NEW 0.0000
-99.9990 <UNK> NEXT 0.0000
-0.8129 <UNK> OPEN 0.0000
-99.9990 BACKWARD <UNK> 0.2688
-99.9990 BROWSER <UNK> 0.2688
-99.9990 E-MAIL <UNK> 0.2688
-99.9990 FORWARD <UNK> 0.2688
-99.9990 LAST WINDOW 0.6021
-99.9990 MUSIC PLAYER 0.3188
-99.9990 NEW E-MAIL 0.0000
-99.9990 NEXT WINDOW 0.6021
-99.9990 OPEN BROWSER 0.0000
-99.9990 OPEN MUSIC 0.0000
-0.1249 WINDOW <UNK> -0.2083

\3-grams:
-99.9990 <UNK> <UNK> BACKWARD
-99.9990 <UNK> <UNK> FORWARD
-99.9990 <UNK> <UNK> LAST
-99.9990 <UNK> <UNK> NEW
-99.9990 <UNK> <UNK> NEXT
-99.9990 <UNK> <UNK> OPEN
-99.9990 <UNK> BACKWARD <UNK>
-99.9990 <UNK> FORWARD <UNK>
-99.9990 <UNK> LAST WINDOW
-99.9990 <UNK> NEW E-MAIL
-99.9990 <UNK> NEXT WINDOW
-99.9990 <UNK> OPEN BROWSER
-99.9990 <UNK> OPEN MUSIC
-99.9990 BACKWARD <UNK> <UNK>
-99.9990 BROWSER <UNK> <UNK>
-99.9990 E-MAIL <UNK> <UNK>
-99.9990 FORWARD <UNK> <UNK>
-99.9990 LAST WINDOW <UNK>
-99.9990 MUSIC PLAYER <UNK>
-99.9990 NEW E-MAIL <UNK>
-99.9990 NEXT WINDOW <UNK>
-99.9990 OPEN BROWSER <UNK>
-99.9990 OPEN MUSIC PLAYER
-0.1761 WINDOW <UNK> <UNK>

\end\

--- (Edited on 8/19/2008 7:21 am [GMT-0500] by chn) ---

--- (Edited on 8/19/2008 7:31 am [GMT-0500] by chn) ---

--- (Edited on 8/19/2008 7:50 am [GMT-0500] by chn) ---

Re: CMUSLM *.arpa
User: nsh
Date: 8/19/2008 4:09 pm
Views: 205
Rating: 12

Neither of your models is correct, because you are using software that is too old and broken. I suggest you try lmtool:

  http://www.speech.cs.cmu.edu/tools/lmtool.html

 You can just upload your text there and get a proper model and a dictionary.

 

--- (Edited on 8/19/2008 4:09 pm [GMT-0500] by nsh) ---

Re: CMUSLM *.arpa
User: chn
Date: 8/19/2008 9:02 pm
Views: 87
Rating: 12

I used cmuclmtk V3, which is a new version, and I need to use this tool to build my LM.

Do you think this tool is broken?

 

--- (Edited on 8/19/2008 9:02 pm [GMT-0500] by chn) ---

Re: CMUSLM *.arpa
User: nsh
Date: 8/20/2008 1:22 am
Views: 94
Rating: 11

> I used cmuclmtk V3, which is a new version, and I need to use this tool to build my LM.

It's not the latest version; the latest one is in trunk, or available as a nightly build on the CMUSphinx SourceForge site.

> Do you think this tool is broken?

The tool itself is not broken, but you are using it incorrectly: the n-grams in your result contain unknowns and silences. Compare them with the proper result from the online lmtool.

It's possible to get the same result with the offline cmuclmtk, but since you haven't provided details on what you are doing, it's hard to help you.

 

--- (Edited on 8/20/2008 1:22 am [GMT-0500] by nsh) ---

Re: CMUSLM *.arpa
User: chn
Date: 8/21/2008 3:40 am
Views: 133
Rating: 11

Sorry, I just want to use this tool to build an LM, and what I get is not the same as the output of the online lmtool!

This is what I have done:

1. I prepared a.text:

OPEN BROWSER
NEW EMAIL
FORWARD
BACKWARD
NEXT WINDOW
LAST WINDOW
OPEN MUSIC PLAYER

and a.ccs: <s>

2. I generated a.sent.text with the `sed` command. a.sent.text:

<s>  OPEN BROWSER </s>
<s>  NEW EMAIL </s>
<s>  FORWARD </s>
<s>  BACKWARD </s>
<s>  NEXT WINDOW </s>
<s>  LAST WINDOW </s>
<s>  OPEN MUSIC PLAYER </s>

3. I generated a.wfreq and a.vocab with the text2wfreq and wfreq2vocab commands.

a.vocab:

 ## Vocab generated by v2 of the CMU-Cambridge Statistcal
## Language Modeling toolkit.
##
## Includes 13 words ##
</s>
<s>
BACKWARD
BROWSER
EMAIL
FORWARD
LAST
MUSIC
NEW
NEXT
OPEN
PLAYER
WINDOW

4. I generated a.idngram with text2idngram.

5. I generated a.arpa with idngram2lm:

#############################################################################
## Copyright (c) 1996, Carnegie Mellon University, Cambridge University,
## Ronald Rosenfeld and Philip Clarkson
## Version 3, Copyright (c) 2006, Carnegie Mellon University
## Contributors includes Wen Xu, Ananlada Chotimongkol,
## David Huggins-Daines, Arthur Chan and Alan Black
#############################################################################
=============================================================================
===============  This file was produced by the CMU-Cambridge  ===============
===============     Statistical Language Modeling Toolkit     ===============
=============================================================================
This is a 3-gram language model, based on a vocabulary of 13 words,
  which begins "</s>", "<s>", "BACKWARD"...
This is an OPEN-vocabulary model (type 1)
  (OOVs were mapped to UNK, which is treated as any other vocabulary word)
Witten Bell discounting was applied.
This file is in the ARPA-standard format introduced by Doug Paul.

p(wd3|wd1,wd2)= if(trigram exists)           p_3(wd1,wd2,wd3)
                else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2)
                else                         p(wd3|w2)

p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
            else              bo_wt_1(wd1)*p_1(wd2)

All probs and back-off weights (bo_wt) are given in log10 form.

Data formats:

Beginning of data mark: \data\
ngram 1=nr            # number of 1-grams
ngram 2=nr            # number of 2-grams
ngram 3=nr            # number of 3-grams

\1-grams:
p_1     wd_1 bo_wt_1
\2-grams:
p_2     wd_1 wd_2 bo_wt_2
\3-grams:
p_3     wd_1 wd_2 wd_3

end of data mark: \end\

\data\
ngram 1=14
ngram 2=18
ngram 3=24

\1-grams:
-1.3010 <UNK>    0.0000
-0.5229 </s>    -0.8451
-98.8386 <s>    -0.1487
-1.3010 BACKWARD    -0.1461
-1.3010 BROWSER    -0.1461
-1.3010 EMAIL    -0.1461
-1.3010 FORWARD    -0.1461
-1.3010 LAST    -0.2553
-1.3010 MUSIC    -0.2788
-1.3010 NEW    -0.2788
-1.3010 NEXT    -0.2553
-1.0000 OPEN    -0.2553
-1.3010 PLAYER    0.0000
-1.0000 WINDOW    -0.3222

\2-grams:
-0.0669 </s> <s> 0.0348
-1.1139 <s> BACKWARD 0.0000
-1.1139 <s> FORWARD 0.0000
-1.1139 <s> LAST 0.0000
-1.1139 <s> NEW 0.0000
-1.1139 <s> NEXT 0.0000
-0.8129 <s> OPEN 0.0000
-0.3010 BACKWARD </s> 0.5441
-0.3010 BROWSER </s> 0.5441
-0.3010 EMAIL </s> 0.5441
-0.3010 FORWARD </s> 0.5441
-0.3010 LAST WINDOW 0.1761
-0.3010 MUSIC PLAYER -0.1461
-0.3010 NEW EMAIL 0.0000
-0.3010 NEXT WINDOW 0.1761
-0.6021 OPEN BROWSER 0.0000
-0.6021 OPEN MUSIC 0.0000
-0.1761 WINDOW </s> 0.3680

\3-grams:
-1.0792 </s> <s> BACKWARD
-1.0792 </s> <s> FORWARD
-1.0792 </s> <s> LAST
-1.0792 </s> <s> NEW
-1.0792 </s> <s> NEXT
-1.0792 </s> <s> OPEN
-0.3010 <s> BACKWARD </s>
-0.3010 <s> FORWARD </s>
-0.3010 <s> LAST WINDOW
-0.3010 <s> NEW EMAIL
-0.3010 <s> NEXT WINDOW
-0.6021 <s> OPEN BROWSER
-0.6021 <s> OPEN MUSIC
-0.3010 BACKWARD </s> <s>
-0.3010 BROWSER </s> <s>
-0.3010 EMAIL </s> <s>
-0.3010 FORWARD </s> <s>
-0.3010 LAST WINDOW </s>
-0.3010 MUSIC PLAYER </s>
-0.3010 NEW EMAIL </s>
-0.3010 NEXT WINDOW </s>
-0.3010 OPEN BROWSER </s>
-0.3010 OPEN MUSIC PLAYER
-0.1761 WINDOW </s> <s>

\end\
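
As a rough illustration of steps 2 and 3 above, the sentence marking and the vocabulary listing can be sketched in Python. This is an assumption-laden stand-in, not the `sed` command or the toolkit programs actually used: `mark_sentences` and `build_vocab` are hypothetical names, and the real wfreq2vocab selects words by frequency, whereas this sketch simply keeps every distinct token.

```python
# Hypothetical sketch of steps 2 and 3; not the toolkit's own code.

A_TEXT = """OPEN BROWSER
NEW EMAIL
FORWARD
BACKWARD
NEXT WINDOW
LAST WINDOW
OPEN MUSIC PLAYER
"""


def mark_sentences(lines):
    """Wrap each non-empty line in the <s> ... </s> context cues."""
    return ["<s> " + " ".join(line.split()) + " </s>"
            for line in lines if line.strip()]


def build_vocab(marked_lines):
    """Collect every distinct token, sorted by code point."""
    return sorted({tok for line in marked_lines for tok in line.split()})


marked = mark_sentences(A_TEXT.splitlines())
vocab = build_vocab(marked)
print("\n".join(marked))
print(len(vocab), "words:", " ".join(vocab))
```

Sorting by code point puts </s> and <s> ahead of the upper-case words, which matches the order of a.vocab shown above.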

--- (Edited on 8/21/2008 3:40 am [GMT-0500] by chn) ---

Re: CMUSLM *.arpa
User: nsh
Date: 8/21/2008 5:05 pm
Views: 219
Rating: 12

Sorry for being so irritable. Mostly it's OK. A few things to correct in the LM:

1. The online lmtool uses QuickLM, available here, which counts 3-grams differently:

  http://www.speech.cs.cmu.edu/tools/download/quick_lm.pl

You can use it too. Alternatively, you can just filter the wngrams (don't generate idngrams directly; generate wngrams first and then strip them with grep):

 ./text2wngram < input.txt | grep -v "</s> <s>"  > input.wngram

 The rest works as usual:

 ./wngram2idngram -vocab input.vocab < input.wngram > input.idngram

 ./idngram2lm -vocab_type 0 -idngram input.idngram -vocab input.vocab -arpa input.arpa

Please note that you need the latest trunk to make this work more or less properly.

2. Don't generate an open-vocabulary model. Sphinx ignores unknowns, so pass -vocab_type 0 to idngram2lm.
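
To make the filtering in point 1 concrete, here is a small Python sketch of the idea. It assumes the word n-grams are counted over the text as one continuous token stream, which is consistent with entries such as "BACKWARD </s> <s>" in the a.arpa posted above. The names `stream_ngrams`, `crosses_boundary`, and `drop_cross_sentence` are hypothetical; the last one mimics what `grep -v "</s> <s>"` does to the wngram lines.

```python
# Hypothetical sketch of the cross-sentence n-gram filter; not toolkit code.

def stream_ngrams(marked_lines, n=3):
    """List n-grams over the whole text treated as one token stream."""
    tokens = [tok for line in marked_lines for tok in line.split()]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def crosses_boundary(ngram):
    """True if '</s>' is immediately followed by '<s>' inside the n-gram."""
    return any(a == "</s>" and b == "<s>" for a, b in zip(ngram, ngram[1:]))


def drop_cross_sentence(ngrams):
    """Keep only the n-grams that stay within a single sentence."""
    return [g for g in ngrams if not crosses_boundary(g)]


marked = ["<s> FORWARD </s>", "<s> BACKWARD </s>"]
trigrams = stream_ngrams(marked)
for gram in trigrams:
    status = "drop" if crosses_boundary(gram) else "keep"
    print(status, " ".join(gram))
```

On these two sentences the stream yields four trigrams, and the two that span the sentence boundary are the ones marked for removal.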

--- (Edited on 8/21/2008 5:05 pm [GMT-0500] by nsh) ---

Re: CMUSLM *.arpa
User: chn
Date: 8/22/2008 7:59 pm
Views: 73
Rating: 12

Thank you!

And I got a.wngram as you said:

<s> NEW EMAIL 1
<s> NEXT WINDOW 1
<s> OPEN BROWSER 1
<s> OPEN MUSIC 1
LAST WINDOW </s> 1
MUSIC PLAYER </s> 1
NEW EMAIL </s> 1
NEXT WINDOW </s> 1
OPEN BROWSER </s> 1
OPEN MUSIC PLAYER 1

When I generated a.idngram (with -write_ascii), I think the output was wrong:

 40000000 40000000 40000000 1141473616

Why? 

--- (Edited on 8/22/2008 7:59 pm [GMT-0500] by chn) ---

Re: CMUSLM *.arpa
User: chn
Date: 8/22/2008 8:02 pm
Views: 115
Rating: 13
 /home/cui/tutorial/cmuclmtk/bin/wngram2idngram -vocab /home/cui/new/6.temp.vocab </home/cui/new/6.wngram >/home/cui/new/6.idngram -write_ascii
Vocab           : /home/cui/new/6.temp.vocab
Buffer size     : 100
Hash table size : 200000
Temp directory  : /usr/tmp/
Max open files  : 20
n               : 3
FOF size               : 10
buffer size = 4166600
Initialising hash table...
Reading vocabulary...
Allocating memory for the buffer...
Writing non-OOV counts to temporary file /usr/tmp/12345Ghsqhi1
Sorting final buffer...
Writing sorted buffer to temporary file /usr/tmp/12345Ghsqhi2
Merging temporary files...

2-grams occurring:      N times         > N times       Sug. -spec_num value
      0                                               1              11
      1                               0               1              11
      2                               0               1              11
      3                               0               1              11
      4                               0               1              11
      5                               0               1              11
      6                               0               1              11
      7                               0               1              11
      8                               0               1              11
      9                               0               1              11
     10                               0               1              11

3-grams occurring:      N times         > N times       Sug. -spec_num value
      0                                               1              11
      1                               0               1              11
      2                               0               1              11
      3                               0               1              11
      4                               0               1              11
      5                               0               1              11
      6                               0               1              11
      7                               0               1              11
      8                               0               1              11
      9                               0               1              11
     10                               0               1              11
wngram2idngram : Done.

--- (Edited on 8/22/2008 8:02 pm [GMT-0500] by chn) ---

Re: CMUSLM *.arpa
User: nsh
Date: 8/22/2008 10:09 pm
Views: 76
Rating: 10

Yes, there was a bug that I fixed just yesterday. Please check out cmuclmtk from trunk and rebuild it.

 

--- (Edited on 8/22/2008 10:09 pm [GMT-0500] by nsh) ---

Re: CMUSLM *.arpa
User: chn
Date: 8/23/2008 8:59 pm
Views: 103
Rating: 12
Sorry, I still haven't resolved that problem!

--- (Edited on 8/23/2008 8:59 pm [GMT-0500] by chn) ---
