General Discussion

Re: German language
User: timobaumann
Date: 11/25/2007 10:47 am
Views: 593
Rating: 27

>Take a look at the second "message board" I just created on the forum web page ... see if that looks OK.

Looks perfect.

Re: Other languages
User: V
Date: 3/18/2008 5:37 pm
Views: 350
Rating: 46

Hi!

I am trying to figure out what is available to start putting together a Hungarian corpus, but I am not sure I understand your terminology.

* phonetic dictionary: a dictionary where every word [wurd] is written like this? Does this correspond to the pronunciation dictionary used in the tutorial or to the lexicon?

* prompts: are books allowed? Is it necessary to segment them like the prompts file in the tutorial?

* what about audiobooks + text? 

* licensing: what source material is allowed besides public domain? I presume GPL is fine, but what about CC (and which type of CC?), or MIT-like licences? 

I am sure that once you answer me, I'll have some more questions! :)

 Cheers,

Re: Other languages
User: nsh
Date: 3/19/2008 2:31 am
Views: 293
Rating: 38

> phonetic dictionary: a dictionary where every word [wurd] is written like this? Does this correspond to the pronunciation dictionary used in the tutorial or to the lexicon?

 yes

> prompts: are books allowed? is it necessary to segment them as the prompts file in the tutorial?

Yes, it's necessary to segment them, and that's the biggest problem with books.

> what about audiobooks + text?

OK, but we prefer raw WAV, not MP3.

>licensing: what source material is allowed besides public domain? I presume GPL is fine, but what about CC (and which type of CC?), or MIT-like licences?

GPL is better, though any other freely licensed speech is also suitable to start with. See the discussion on this forum.

Basically, to start you can just record yourself (10 minutes) and a few of your friends (5 x 10 minutes). As for the dictionary, you can build it with text2pho from

  http://tkltrans.sourceforge.net/
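
To illustrate the segmentation step mentioned above, here is a minimal sketch of how book text might be cut into short prompt lines. The function name, the ID scheme (`book0001`, ...), the 15-word limit and the upper-casing are all assumptions made for this example; they are not taken from the tutorial's prompts file and would need adjusting to whatever format the tutorial actually expects. The sentence splitting is deliberately naive (abbreviations such as "Dr." will be split wrongly).

```python
import re

def book_to_prompts(text, prefix="book", max_words=15):
    """Split running text into short prompt lines with hypothetical IDs."""
    # Collapse whitespace so line breaks in the book do not matter.
    flat = " ".join(text.split())
    # Naive sentence split: break after '.', '!' or '?' followed by a space.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]
    prompts = []
    for sentence in sentences:
        # Keep letters (accented ones too) and digits; drop punctuation.
        words = re.findall(r"[^\W_]+(?:'[^\W_]+)*", sentence)
        # Chop long sentences into chunks of at most max_words words.
        for start in range(0, len(words), max_words):
            chunk = words[start:start + max_words]
            prompt_id = f"{prefix}{len(prompts) + 1:04d}"  # assumed ID scheme
            prompts.append(f"{prompt_id} {' '.join(chunk).upper()}")
    return prompts

if __name__ == "__main__":
    sample = "Hello world. This is a test!"
    for line in book_to_prompts(sample, max_words=3):
        print(line)
```

Each returned line is one prompt to be read aloud and recorded as its own audio file. Cutting at a fixed word count can split a sentence at an unnatural place, so the output of a real book still needs a manual pass.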

Re: Other languages
User: kmaclean
Date: 3/19/2008 12:04 pm
Views: 3514
Rating: 37

Hi V,

One clarification:

> Does this correspond to the pronunciation dictionary used in the tutorial or to the lexicon?

The pronunciation dictionary used in the Tutorial and How-to is based on the ISIP Switchboard corpus (around 27,500 words), whereas the QuickStart and nightly AM builds are based on version 0.6 of the CMU Pronunciation Dictionary (around 130,000 words). Unfortunately, the Switchboard and CMU pronunciation dictionaries use slightly different phoneme syntax, which is enough to make them incompatible from a grammar and acoustic model testing perspective (see ticket #52).

Ken 
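
As a rough way to see whether two pronunciation dictionaries use the same phoneme symbols, the sketch below collects the symbols from each and reports the ones that appear in only one of them. It assumes one entry per line in the form `WORD PH1 PH2 ...`, optionally with a bracketed output symbol such as `[WORD]` after the word, and `;;;` as a comment marker. These format details and the sample entries are assumptions for illustration, not entries copied from the Switchboard or CMU dictionaries.

```python
def phoneme_set(lines):
    """Collect phoneme symbols from 'WORD PH1 PH2 ...' dictionary lines."""
    phones = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments (';;;' is an assumed comment marker).
        if not line or line.startswith(";;;"):
            continue
        fields = line.split()
        # Drop the word itself and any bracketed output symbol like [WORD].
        phones.update(f for f in fields[1:] if not f.startswith("["))
    return phones

def compare_phoneme_sets(dict_a, dict_b):
    """Return (symbols only in dict_a, symbols only in dict_b), both sorted."""
    a, b = phoneme_set(dict_a), phoneme_set(dict_b)
    return sorted(a - b), sorted(b - a)

if __name__ == "__main__":
    # Made-up entries: one dictionary marks stress with a digit, one does not.
    dict_a = ["CAT K AE T"]
    dict_b = [";;; comment line", "CAT [CAT] K AE1 T"]
    only_a, only_b = compare_phoneme_sets(dict_a, dict_b)
    print("only in A:", only_a)
    print("only in B:", only_b)
```

If either list is non-empty, a grammar or acoustic model built against one dictionary will refer to phoneme names the other does not define. For real files, pass an open file object in place of the lists.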
