German

first results from the dictionary acquisition project
User: timobaumann
Date: 6/15/2008 3:21 pm
Views: 17962
Rating: 29

Hi there,

I've checked in a first version of a hand-corrected (mostly by ralfherzog, thanks!) pronunciation lexicon that conforms to the Pronunciation Lexicon Specification. I'll try to train a G2P (grapheme-to-phoneme) model on this data and see whether it improves on our current espeak dictionary.

Cheers, Timo

Re: first results from the dictionary acquisition project
User: kmaclean
Date: 6/16/2008 9:35 am
Views: 190
Rating: 67

Hi Timo,

Well done, thanks!

Ken 

UTF-8 instead of ISO-8859-1
User: ralfherzog
Date: 6/16/2008 4:11 pm
Views: 161
Rating: 19

Hello Timo,

Thanks, great work.

There is a problem with the character encoding.  I am not a specialist when it comes to things like UTF-8 and ISO-8859-1.  I went to the Firefox menu "Tools/Page Info" and found that the dictionary acquisition project is apparently using ISO-8859-1.

Maybe it would help to change the character encoding of the dictionary acquisition project from ISO-8859-1 to UTF-8? Apparently, Wikipedia "used ISO-8859-1 but switched to UTF-8 when it became far to cumbersome to support foreign languages."  Maybe you should do the same?  Without UTF-8, the German umlauts and the special IPA characters are very complicated to handle.

Anyway, I was able to download the PLS lexicon in the original format and open it in Firefox (it looks fine).  But I wasn't able to display it correctly with Notepad++ or with OpenOffice.org.

Greetings, Ralf

Re: UTF-8 instead of ISO-8859-1
User: kmaclean
Date: 6/16/2008 4:30 pm
Views: 1058
Rating: 18

Hi Ralf,

I think the problem is with the way Trac is currently configured ... if you download the file in plain text (at the very bottom of the page), everything seems to display properly.

Ken 

Firefox displays pronunciation lexicon correctly
User: ralfherzog
Date: 6/16/2008 10:27 pm
Views: 182
Rating: 22

Hello Ken,

I just downloaded the file in "Plain Text" like you suggested.  It didn't work out (with Notepad++, and with OpenOffice.org; both under Windows XP).  Some special IPA characters are displayed correctly, others are not. But Firefox displays the "Plain Text" version (as well as the Original Format) 100% correctly.

So it is possible to display the pronunciation lexicon correctly with Firefox, but not with Notepad++.  What is the reason for this different behavior? Maybe the encoding is correct, and I need a different text editor.  But which text editor should I use?

Off-topic: A similar problem occurs when I download the German Prompts.tgz.  There seems to be a problem with the encoding.  When I started to submit prompts to VoxForge, I didn't care about UTF-8.  Instead, I submitted in ANSI (which probably means Windows-1252). Is it possible to fix this problem?

Greetings, Ralf 

Re: UTF-8 instead of ISO-8859-1
User: kmaclean
Date: 6/17/2008 9:03 am
Views: 93
Rating: 17

Trac's default character encoding is UTF-8.  I removed a setting in the trac.ini file (the Trac config file) in the German repository (and in all other languages...) that overrode the default and set it to ISO-8859-1 (I am not sure what I was thinking when I set it that way... :) )

German should now display correctly.  If there are any more problems, please let me know,

Ken 

Re: Firefox displays pronunciation lexicon correctly
User: kmaclean
Date: 6/17/2008 9:39 am
Views: 169
Rating: 22

Hi Ralf,

>But Firefox displays the "Plain Text" version (as well as the Original Format) 100% correctly.

I think this is because the text file is XML and the first line tells Firefox which encoding to use:

<?xml version="1.0" encoding="UTF-8"?>
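
As an illustration of how a program can pick up that declaration, here is a minimal Python sketch. The function name `declared_xml_encoding` and the sample bytes are hypothetical, and the sketch only handles ASCII-compatible encodings (it would not recognize a UTF-16 file); it is not how Firefox itself is implemented.

```python
import re


def declared_xml_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in a leading XML declaration, or `default`."""
    # The declaration must sit at the very start of the file, so re.match
    # (anchored at position 0) is enough.
    match = re.match(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return match.group(1).decode("ascii") if match else default


# Hypothetical first bytes of a PLS file.
sample = '<?xml version="1.0" encoding="UTF-8"?>\n<lexicon/>'.encode("utf-8")
print(declared_xml_encoding(sample))
```

A plain text editor does not necessarily read this line, which is one possible reason why it may guess a different encoding than the browser.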


>So it is possible to display the pronunciation lexicon correctly with Firefox, but not with Notepad++.  What is the reason for this different behavior?

I am not sure why Notepad++ behaves differently; it might work better if you download the original format version of the pronunciation lexicon.  You should not need to change text editors.  There might be an "encoding" or "charset" setting that needs to be changed in Notepad++.
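
To show why such a setting matters, the following Python sketch decodes the same UTF-8 bytes twice, once as UTF-8 and once as Windows-1252. The helper name `decode_both` and the sample word are made up for illustration; which encoding Notepad++ actually picked in this case is an assumption, not something established in this thread.

```python
def decode_both(raw: bytes) -> tuple[str, str]:
    """Decode the same bytes as UTF-8 and as Windows-1252 for comparison."""
    as_utf8 = raw.decode("utf-8")
    # Windows-1252 leaves a few byte values undefined, so replace those
    # instead of raising an error.
    as_cp1252 = raw.decode("cp1252", errors="replace")
    return as_utf8, as_cp1252


raw = "für".encode("utf-8")  # the umlaut becomes two bytes in UTF-8
good, garbled = decode_both(raw)
print(good)     # für
print(garbled)  # fÃ¼r
```

The bytes on disk are identical in both cases; only the interpretation differs, so changing the editor's encoding setting to UTF-8 should be enough if this is the cause.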

>A similar problem occurs when I download the German Prompts.tgz.

That is something that I noticed a while ago too... The prompt files in the individual submissions are correct; the problem arose when the prompt files were added to the master_prompt files, because the script was not using the correct encoding (I was not paying much attention to encoding back then either... :) ).  I fixed this problem a few months ago, but the prompts that were added to the master prompts files before the fix still need to be corrected.  It's on the todo list.
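
One possible way to normalize such prompt files is sketched below in Python. This is not the actual VoxForge script; the function name `to_utf8` and the choice of Windows-1252 as the legacy encoding are assumptions based on the "ANSI" submissions mentioned earlier in the thread.

```python
def to_utf8(raw: bytes, legacy: str = "cp1252") -> bytes:
    """Return `raw` re-encoded as UTF-8, assuming it is UTF-8 or `legacy`."""
    try:
        # Valid UTF-8 input passes through unchanged.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume the legacy single-byte encoding.
        text = raw.decode(legacy)
    return text.encode("utf-8")


old_prompt = "Grüße".encode("cp1252")  # hypothetical legacy prompt text
print(to_utf8(old_prompt).decode("utf-8"))
```

Trying strict UTF-8 first is a heuristic: text in a legacy encoding containing umlauts is very unlikely to be valid UTF-8 by accident, but the result should still be spot-checked by hand.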

Ken

thanks for switching to UTF-8
User: ralfherzog
Date: 6/17/2008 10:32 am
Views: 231
Rating: 27

Hello Ken,

Thanks for setting back the character encoding for German to UTF-8.  Yes, German is now displayed correctly.

From now on, we should try to exclusively use just UTF-8, and nothing else.  Otherwise, the result might be a mess.

RFC 4267 - file extension is .pls
User: ralfherzog
Date: 6/18/2008 11:14 am
Views: 280
Rating: 15

According to RFC 4267, the file extension for PLS files is .pls.

Re: thanks for switching to UTF-8
User: kmaclean
Date: 6/18/2008 12:35 pm
Views: 143
Rating: 15

Hi Ralf,

>we should try to exclusively use just UTF-8, and nothing else. 

I agree, encoding issues give me a migraine...  :) 
