VoxForge
Hello everyone,
I'm working on training a German language model with the CMU-Cambridge Statistical Language Modeling Toolkit v2.
I have read the tutorials and found that the input text should be in ASCII format, but my training corpus is in UTF-8 and converting it is not easy. Can anybody tell me whether there is a way to configure the SLM toolkit for UTF-8 text, or should I use another tool?
Thanks in advance!!
IAN
--- (Edited on 9/10/2010 5:00 am [GMT-0500] by Ian) ---
But is there any option to configure the input text encoding? I have seen that the generated vocabulary file, such as 20000vocab.vocab, is not UTF-8 encoded but ISO-8859 English text. Will this not worsen the training?
--- (Edited on 9/10/2010 5:47 am [GMT-0500] by Ian) ---
Sorry, I don't quite see how encoding is relevant here. cmuclmtk (please make sure you are using the latest snapshot from http://cmusphinx.sourceforge.net) doesn't care about encoding at all. If your text is UTF-8, the vocab and LM will be in UTF-8. If your text is ISO-8859-1, the vocab and LM will be in ISO-8859-1. You can either convert the LM to a different encoding at a later stage or convert the source text before you build the LM. The language model toolkit only cares about spaces to separate words. Also make sure that all your text is in a single case (upper or lower), since the toolkit doesn't convert case and can't merge words that differ only in capitalization.
--- (Edited on 9/10/2010 15:05 [GMT+0400] by nsh) ---
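The preparation step described above (single case, space-separated words, optional re-encoding before building the LM) could be sketched in Python roughly as follows. This is only an illustration, not part of cmuclmtk: the function names normalize_corpus_text and convert_file and the file paths are made up for the example. Lower-casing is done on decoded text rather than raw bytes so that German characters such as Ä, Ö and Ü are handled as letters.

```python
def normalize_corpus_text(text: str) -> str:
    """Lower-case the text and separate words by single spaces.

    Hypothetical helper for illustration; it only folds case and
    collapses whitespace, it does not strip punctuation.
    """
    lines = []
    for line in text.splitlines():
        # str.lower() works on Unicode text, so umlauts are folded too.
        words = line.lower().split()
        if words:
            lines.append(" ".join(words))
    return "\n".join(lines) + "\n" if lines else ""


def convert_file(src_path, dst_path, src_encoding="utf-8", dst_encoding="utf-8"):
    """Read a corpus in src_encoding, normalize it, write it in dst_encoding.

    Hypothetical helper; keep dst_encoding equal to src_encoding if the
    vocab and LM should stay in the corpus encoding.
    """
    with open(src_path, encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding) as dst:
        dst.write(normalize_corpus_text(text))


if __name__ == "__main__":
    sample = "Über  den Wolken\nHALLO Welt\n"
    print(normalize_corpus_text(sample), end="")
```

The normalized file would then be passed to the toolkit in place of the raw corpus. Whether to keep UTF-8 or switch to ISO-8859-1 depends only on what the later stages expect, since the toolkit itself passes the bytes through unchanged.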