language model training problem with utf-8 text corpus

General Discussion

User: Ian
Date: 9/10/2010 5:00 am

Views: 6416
Rating: 2

Hello everyone,

I'm working on German language model training with the CMU-Cambridge Statistical Language Modeling Toolkit v2.

I have read the tutorials and found that the input text should be in ASCII format, but my training corpus is in UTF-8 and converting it is not easy. Can anybody tell me whether there is a way to configure the SLM toolkit for UTF-8 text, or should I use another tool?

Thanks in advance!!

IAN

--- (Edited on 9/10/2010 5:00 am [GMT-0500] by Ian) ---

Re: language model training problem with utf-8 text corpus

User: nsh
Date: 9/10/2010 5:14 am

Views: 69
Rating: 1

There is no such restriction. The tutorial you've found is certainly wrong. You can go ahead and train the model with utf-8 text.

--- (Edited on 9/10/2010 14:14 [GMT+0400] by nsh) ---

Re: language model training problem with utf-8 text corpus

User: Ian
Date: 9/10/2010 5:47 am

Views: 96
Rating: 2

But is there any option to configure the input text encoding? I have seen that the created vocabulary file, such as 20000vocab.vocab, is not UTF-8 encoded but is ISO-8859 English text instead. Won't this worsen the training?
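One way to check which encoding a generated file actually ended up in is to try decoding its bytes as UTF-8. The following is only a minimal sketch in Python; the helper names `looks_like_utf8` and `has_non_ascii` are made up for illustration and are not part of the toolkit.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def has_non_ascii(data: bytes) -> bool:
    """Return True if any byte is outside the 7-bit ASCII range."""
    return any(b > 0x7F for b in data)


def describe_file(path: str) -> str:
    """Give a rough description of a file's encoding (hypothetical helper)."""
    with open(path, "rb") as f:
        data = f.read()
    if not has_non_ascii(data):
        return "plain ASCII (also valid UTF-8)"
    if looks_like_utf8(data):
        return "valid UTF-8 with non-ASCII characters"
    return "not UTF-8 (possibly ISO-8859-1 or another 8-bit encoding)"
```

Calling `describe_file("20000vocab.vocab")` on the vocabulary file would then give a rough hint. Note that a successful UTF-8 decode is strong evidence but not proof, and a failed decode does not tell you which 8-bit encoding was used.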

--- (Edited on 9/10/2010 5:47 am [GMT-0500] by Ian) ---

Re: language model training problem with utf-8 text corpus

User: nsh
Date: 9/10/2010 6:05 am

Views: 108
Rating: 3

Sorry, I don't quite see how encoding is relevant here. cmuclmtk (please make sure you are using the latest snapshot from http://cmusphinx.sourceforge.net) doesn't care about encoding at all. If your text is UTF-8, the vocab and LM will be in UTF-8. If your text is ISO-8859-1, the vocab and LM will be in ISO-8859-1. You can either convert the LM to a different encoding at a later stage or convert the source text before you build the LM. The language model toolkit only cares about spaces to separate words. Also make sure that all your text is in a single case (upper or lower), since the toolkit doesn't convert case and can't merge words that differ only in capitalization.
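The preparation described above (one consistent encoding, one case, words separated by spaces) could be done before training with something like the following. This is only a rough sketch in Python, not part of cmuclmtk; the function names and the choice of lowercase are assumptions for illustration.

```python
def normalize_line(line: str) -> str:
    """Lowercase a line and collapse all whitespace runs to single spaces."""
    # str.split() with no arguments splits on any Unicode whitespace
    # and drops leading/trailing whitespace.
    return " ".join(line.lower().split())


def normalize_corpus(src_path: str, dst_path: str,
                     src_encoding: str = "utf-8",
                     dst_encoding: str = "utf-8") -> None:
    """Rewrite a corpus in a single case and, optionally, a new encoding.

    The default encodings are assumptions; pass e.g. src_encoding="iso8859-1"
    if the source text is not UTF-8.
    """
    with open(src_path, "r", encoding=src_encoding) as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="\n") as dst:
        for line in src:
            cleaned = normalize_line(line)
            if cleaned:  # skip lines that are empty after cleanup
                dst.write(cleaned + "\n")
```

For example, `normalize_corpus("corpus.txt", "corpus.norm.txt")` (hypothetical file names) would write a lowercased UTF-8 copy that can then be passed to the toolkit. Python's `lower()` handles German umlauts; whether lowercase or uppercase is the better choice depends on the rest of your setup, for instance the case used in your pronunciation dictionary.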

--- (Edited on 9/10/2010 15:05 [GMT+0400] by nsh) ---

Re: language model training problem with utf-8 text corpus

User: Ian
Date: 9/10/2010 6:37 am

Views: 2760
Rating: 1

Thanks a lot for your help!

--- (Edited on 9/10/2010 6:37 am [GMT-0500] by Ian) ---
