VoxForge
Hi,
I just started developing with sphinx4 (version: 5prealpha-snapshot). After some successful tests with the default English model, I tried to use the German VoxForge model. I downloaded the file cmusphinx-cont-voxforge-de-r20161117.tar.xz and tried to use it with sphinx4. On startup, the following error occurred:
18:12:20.936 INFO largeTrigramModel Loading n-gram language model from: file:vox_cont/etc/voxforge.lm.dmp
Exception in thread "main" java.lang.Error: Bad binary LM file magic number: 1701409364, not an LM dumpfile?
at edu.cmu.sphinx.linguist.language.ngram.large.BinaryLoader.readHeader(BinaryLoader.java:469)
at edu.cmu.sphinx.linguist.language.ngram.large.BinaryLoader.loadModelLayout(BinaryLoader.java:393)
at edu.cmu.sphinx.linguist.language.ngram.large.BinaryLoader.<init>(BinaryLoader.java:99)
at edu.cmu.sphinx.linguist.language.ngram.large.LargeNGramModel.allocate(LargeNGramModel.java:206)
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.allocate(LexTreeLinguist.java:334)
at edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager.allocate(WordPruningBreadthFirstSearchManager.java:243)
at edu.cmu.sphinx.decoder.AbstractDecoder.allocate(AbstractDecoder.java:103)
at edu.cmu.sphinx.recognizer.Recognizer.allocate(Recognizer.java:164)
at edu.cmu.sphinx.api.StreamSpeechRecognizer.startRecognition(StreamSpeechRecognizer.java:52)
at edu.cmu.sphinx.api.StreamSpeechRecognizer.startRecognition(StreamSpeechRecognizer.java:39)
at de.martin.sphinxtest.TranscriberDemo.test1(TranscriberDemo.java:61)
at de.martin.sphinxtest.TranscriberDemo.main(TranscriberDemo.java:149)
------------------------------------------------------------------------
The following configuration code is used:
configuration.setAcousticModelPath("file:vox_cont/model_parameters/voxforge.cd_cont_6000");
configuration.setDictionaryPath("file:vox_cont/etc/voxforge.dic");
configuration.setLanguageModelPath("file:vox_cont/etc/voxforge.lm.dmp");
I also tried an older German VoxForge model from 2014. It runs without exceptions.
If anyone has an idea where the error lies, I would be grateful for any hint.
Thanks in advance
Martin
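A side note on the "magic number" in the message above: assuming the reported value is simply the first four bytes of the file read as a 32-bit integer (an assumption about how the loader reports it, not something confirmed in this thread), it can be turned back into text to see what the file actually starts with. The class name `LmMagic` and the helper `asAscii` are made up for this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of sphinx4: decodes a 32-bit "magic number"
// back into the four ASCII bytes it was (presumably) read from.
final class LmMagic {

    /** Interprets a 32-bit value as four ASCII bytes in the given byte order. */
    static String asAscii(int value, ByteOrder order) {
        byte[] raw = ByteBuffer.allocate(4).order(order).putInt(value).array();
        return new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        int reported = 1701409364; // number taken from the error message

        System.out.println(asAscii(reported, ByteOrder.LITTLE_ENDIAN)); // prints "Trie"
        System.out.println(asAscii(reported, ByteOrder.BIG_ENDIAN));    // prints "eirT"
    }
}
```

Read in little-endian order, the bytes spell "Trie". That would suggest the file is not in the old DMP dump format the loader in the stack trace expects, which fits the later log line in this thread where the model is loaded by trieNgramModel from voxforge.lm.bin.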
Thanks a lot! The error does not occur anymore.
Unfortunately, another error now occurs when calling recognizer.startRecognition(stream):
19:37:34.708 INFO trieNgramModel Loading n-gram language model from: file:vox_cont/etc/voxforge.lm.bin
2017-01-08 19:37:34 SCHWERWIEGEND de.martin.sphinxtest.TranscriberDemo main null
java.lang.NullPointerException
at edu.cmu.sphinx.linguist.language.ngram.trie.NgramTrieQuant.setTable(NgramTrieQuant.java:50)
at edu.cmu.sphinx.linguist.language.ngram.trie.BinaryLoader.readQuant(BinaryLoader.java:95)
at edu.cmu.sphinx.linguist.language.ngram.trie.NgramTrieModel.allocate(NgramTrieModel.java:225)
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.allocate(LexTreeLinguist.java:334)
at edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager.allocate(WordPruningBreadthFirstSearchManager.java:243)
at edu.cmu.sphinx.decoder.AbstractDecoder.allocate(AbstractDecoder.java:103)
at edu.cmu.sphinx.recognizer.Recognizer.allocate(Recognizer.java:164)
at edu.cmu.sphinx.api.StreamSpeechRecognizer.startRecognition(StreamSpeechRecognizer.java:52)
at edu.cmu.sphinx.api.StreamSpeechRecognizer.startRecognition(StreamSpeechRecognizer.java:39)
at de.martin.sphinxtest.TranscriberDemo.test1(TranscriberDemo.java:61)
at de.martin.sphinxtest.TranscriberDemo.main(TranscriberDemo.java:149)
------------------------------------------------------------------------
Do you know a solution for this problem?
OK, the LM is also corrupted.
I repackaged the files at
https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/German/
Download them and try again; I verified that they work with the latest sphinx4.
Make sure you are using the latest sphinxbase for the conversion. Also, it's better to avoid such a big LM; the extra size is useless. You can prune it to the size currently uploaded on the cmusphinx site with no accuracy drawbacks.
ah, srilm, very cool :)
BTW: I was wondering whether I could or should build just one LM using srilm for both sphinx and kaldi, instead of my current approach, where I build a separate LM using cmuclmtk for sphinx.
Anyway, I will put this on my TODO list for the next iteration of the German model. Thanks again for your help and comments!