Re: Mojibake - voxforge.org

Re: UTF-8 instead of ISO-8859-1

User: timobaumann
Date: 7/2/2008 1:55 am

Views: 142
Rating: 14

Hi Ken,

Trac still displays the lexicon wrongly, and it might still be misconfigured, as there is stuff like the following in the HTML source of http://dev.voxforge.org/projects/de/browser/Trunk/Lexicon/voxDE20080615.xml

Also, the wrong character pairs that look like one utf-8 character shown in latin1 are actually two nicely utf-8-encoded characters, which display what you would see if the original character were shown in latin1. I don't know if this babbling is understandable, but try setting your browser's character encoding to latin1 and you'll see that the problem doesn't go away; instead, each observed character pair becomes 4(!) question marks.
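A minimal Python sketch of the effect described above, assuming the lexicon was stored as utf-8, misread as latin1 somewhere along the way, and then encoded as utf-8 a second time. Where exactly this happens inside Trac is not confirmed in this thread; the sample character is only an illustration.

```python
# Simulate the suspected double encoding for one German character.
original = "ä"

# Step 1: the character is correctly stored as UTF-8 (two bytes).
utf8_bytes = original.encode("utf-8")        # b'\xc3\xa4'

# Step 2: something misreads those two bytes as Latin-1 (two characters).
misread = utf8_bytes.decode("latin-1")

# Step 3: the misread pair is encoded as UTF-8 again (now four bytes).
double_encoded = misread.encode("utf-8")     # b'\xc3\x83\xc2\xa4'

# Viewed as UTF-8, the page shows the wrong character pair ...
as_utf8 = double_encoded.decode("utf-8")
# ... and viewed as Latin-1, the pair turns into four characters.
as_latin1 = double_encoded.decode("latin-1")

print(len(utf8_bytes), len(double_encoded), len(as_utf8), len(as_latin1))
```

This is why switching the browser to latin1 makes things worse rather than better: the stored bytes are valid utf-8 for the wrong characters.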

Also, the page itself does not specify its character encoding (neither through an http-equiv meta tag nor with an XML processing instruction); specifying it would probably be better.

All this doesn't really pose a problem, but for completeness' sake we might want to fix it. Or do you think it might automagically go away with the next svn checkin? I don't know how Trac works, but it might be caching the page from before your trac.ini update.

Cheers,
Timo

inconsistencies because of different character encodings

User: ralfherzog
Date: 7/2/2008 7:53 am

Views: 291
Rating: 13

Hello Timo,

Very good. I just took a look at the source code of the HTML page containing the VoxForge dictionary. The charset in the source code is defined as "iso-8859-15". This should be changed to UTF-8. Firefox says that the encoding is UTF-8. This inconsistency should be corrected.
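A rough Python sketch of how such an inconsistency could be detected, assuming the charset is declared in a meta tag and the page body is really UTF-8. The helper names and the sample page are hypothetical; they are not taken from the actual Trac output.

```python
import re

def declared_charset(html_bytes):
    """Return the charset named in the page source, or None if absent."""
    match = re.search(rb'charset\s*=\s*["\']?([\w.:-]+)', html_bytes, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None

def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical page: declares iso-8859-15 but is stored as UTF-8.
sample_page = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15">'
    "</head><body>Äpfel, Öl, Übung, Straße</body></html>"
).encode("utf-8")

declared = declared_charset(sample_page)
# Pure ASCII pages are skipped because they look identical in every charset here.
mismatch = (
    declared != "utf-8"
    and not sample_page.isascii()
    and is_valid_utf8(sample_page)
)
print(declared, mismatch)
```

Valid UTF-8 with non-ASCII content is only strong evidence, not proof, that the declaration is wrong, so a flagged page should still be checked by eye.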

"I don't know if this babbling is understandable" - well, I hope that I understand your "babbling". In your post, you mention the character encoding "latin1". What do you mean by the expression "latin1"? I took a look at Wikipedia. Apparently, latin1 is the same as ISO-8859-1, and ISO-8859-15 seems to be the same as Latin-9. I don't know whether this distinction is relevant to our problem, but we should try to be as exact as possible. Once a problem with the character encoding has occurred (ISO-8859-1 versus ISO-8859-15 versus UTF-8), it seems to be almost impossible to reverse it (e.g. by changing the browser's character encoding).
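To see how far ISO-8859-1 and ISO-8859-15 actually differ, the two code pages can be compared byte by byte. This is a small Python sketch using the codec names from Python's standard library; it only illustrates the distinction and says nothing about what Trac does internally.

```python
# Collect every byte value that decodes to a different character
# under ISO-8859-1 (Latin-1) and ISO-8859-15 (Latin-9).
differences = []
for byte in range(256):
    latin1_char = bytes([byte]).decode("iso8859_1")
    latin9_char = bytes([byte]).decode("iso8859_15")
    if latin1_char != latin9_char:
        differences.append((byte, latin1_char, latin9_char))

# Print code points in ASCII-safe form so the output works on any console.
for byte, latin1_char, latin9_char in differences:
    print(f"0x{byte:02X}: Latin-1 {latin1_char!a} vs Latin-9 {latin9_char!a}")

# The German umlauts and the sharp s are encoded identically in both.
german = "äöüÄÖÜß"
same_for_german = german.encode("iso8859_1") == german.encode("iso8859_15")
print(len(differences), same_for_german)
```

Only a handful of positions differ (the euro sign among them), so for "ä,ö,ü,ß" the choice between these two encodings makes no difference; the real conflict is between either of them and UTF-8.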

"All this doesn't really pose a problem" - for me, this is a very big problem. And we won't be successful with the IPA as long as we don't find a consistent solution. Otherwise, the speech recognition engine will fail.

For example, take a look at the German VoxForge repository. If I open the text file "master_prompts_16kHz-16bit" (with Notepad++ under Windows XP), there are a lot of problems with the German characters "ä,ö,ü,ß". And how would it be possible to revert those mistakes? This file is partly correct and partly incorrect. I think that one reason for this inconsistency is the mixed use of the character encodings Windows-1252 and UTF-8. The file is not consistent, and we have to learn from that.
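One possible way to repair such a mixed file is to decode it line by line, trying UTF-8 first and falling back to Windows-1252. The following Python sketch assumes that each line is internally consistent (entirely UTF-8 or entirely Windows-1252), which has not been verified for the prompts file; the function names and the sample data are hypothetical.

```python
def decode_mixed_line(raw_line):
    """Decode one line as UTF-8, falling back to Windows-1252."""
    try:
        return raw_line.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" covers the few byte values cp1252 leaves undefined.
        return raw_line.decode("cp1252", errors="replace")

def normalize_to_utf8(raw_bytes):
    """Return the whole content re-encoded consistently as UTF-8."""
    lines = raw_bytes.split(b"\n")
    return "\n".join(decode_mixed_line(line) for line in lines).encode("utf-8")

# Hypothetical sample: first line stored as UTF-8, second as Windows-1252.
mixed = "größer".encode("utf-8") + b"\n" + "Müller".encode("cp1252")
fixed = normalize_to_utf8(mixed)
print(fixed.decode("utf-8") == "größer\nMüller")
```

This is a heuristic: a Windows-1252 line can, in rare cases, happen to be valid UTF-8 as well, so the result should be spot-checked before it is committed.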

As you can see, lots of character encodings are involved. We must face those disgusting problems, and we should control every step. UTF-8 is backwards compatible with US-ASCII. UTF-8 is used by famous projects like Wikipedia and WordPress.com. And there is an absolute minimum every speech recognition software developer positively must know about Unicode and character sets (I linked to this webpage before, but it is important, so I link to it again).

So it is our choice: Either we solve the encoding problems (by switching completely and exclusively to UTF-8), or we forget about the IPA and stick to SAMPA (or Arpabet which is used by the CMU pronouncing dictionary).

I think it is a good decision if we develop two versions of our German pronunciation dictionary:
- IPA, UTF-8; it takes time to solve the encoding issues; unfortunately, the IPA is not US-ASCII compatible;
- SAMPA, encoding doesn't matter, because all different encodings (Windows-1252, ISO-8859-1, etc.) are US-ASCII compatible.
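The difference between the two versions can be sketched in Python as follows. The transcriptions of German "schön" are only illustrative, and the helper name is hypothetical.

```python
ENCODINGS = ["ascii", "cp1252", "iso8859_1", "utf-8"]

def encode_everywhere(text):
    """Map each encoding name to the encoded bytes, or None if encoding fails."""
    result = {}
    for name in ENCODINGS:
        try:
            result[name] = text.encode(name)
        except UnicodeEncodeError:
            result[name] = None
    return result

# Illustrative transcriptions of "schön" (assumed, not taken from the lexicon).
sampa = "S2:n"
ipa = "ʃøːn"

sampa_result = encode_everywhere(sampa)
ipa_result = encode_everywhere(ipa)

# SAMPA: the same bytes in every encoding, so the encoding cannot go wrong.
print(len(set(sampa_result.values())) == 1)
# IPA: only UTF-8 (of the four tried here) can represent the string at all.
print([name for name, data in ipa_result.items() if data is not None])
```

A SAMPA file therefore survives any mix-up between these encodings unchanged, while an IPA file depends on every tool in the chain agreeing on UTF-8.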

US-ASCII is old-fashioned, but extremely reliable. But in the long term, we need UTF-8.

Greetings, Ralf

Re: UTF-8 instead of ISO-8859-1

User: kmaclean
Date: 7/11/2008 11:00 pm

Views: 103
Rating: 14

Hi Timo,

Sorry for the delay on this... it's fixed! The default character set for Trac is now explicitly set to utf-8.

Ken

Mojibake

User: ralfherzog
Date: 7/12/2008 9:05 am

Views: 170
Rating: 16

Hello Ken, thank you for setting the default charset value explicitly to UTF-8, because otherwise the default charset value may be ISO-8859-1. By the way, do you know the word Mojibake?

Re: Mojibake

User: kmaclean
Date: 7/14/2008 9:49 am

Views: 5415
Rating: 14

Hi Ralf,

>By the way, do you know the word Mojibake?

No, but it is apt...

My favourite is TLA.

thanks,

Ken

RFC 4267 - file extension is .pls

User: ralfherzog
Date: 6/18/2008 11:14 am

Views: 282
Rating: 15

According to RFC 4267, the file extension for PLS files is .pls.

Re: RFC 4267 - file extension is .pls

User: timobaumann
Date: 6/23/2008 6:42 am

Views: 242
Rating: 15

OK, I'll keep the file extension in mind for the next build.
Do any of the encoding errors persist, or are they resolved now?

Dictionary acquisition project uses UTF-8

User: ralfherzog
Date: 6/23/2008 3:51 pm

Views: 244
Rating: 61

Hello Timo,

I saw that the dictionary acquisition project now uses UTF-8. Thanks for changing that.

I am looking for a text editor that is able to display the IPA symbols correctly. Which text editor do you use?

Re: Dictionary acquisition project uses UTF-8

User: timobaumann
Date: 7/2/2008 2:03 am

Views: 105
Rating: 14

Hi Ralf,

Well, most of my editors (that is, nano and gedit) seem to work well. I mostly rely on console applications for the dictionary processing, and all the tools abide by my utf-8 wishes. NEdit unfortunately doesn't support utf-8, but openly tells me about that.

In terms of Windows editors, I don't know. Don't they all more or less work with utf-8 nowadays? At least Notepad worked for me when I last used it (it actually wanted to store text as utf-16, but I was able to change it).

Now, for IPA *editing*, I haven't found anything that allows me to simply type IPA symbols -- apart from the little textfield on the dictionary acquisition project's page :-) I frequently use it to type in my symbols and then copy-paste it into e-mails and other stuff. But then again, you can't really type anything *but* IPA in the text fields...

Hope that helps,
Timo

trouble with IPA symbols

User: ralfherzog
Date: 6/27/2008 12:26 am

Views: 271
Rating: 16

"do any of the encoding errors persist[...]?" - Yes, there is lots of trouble with IPA symbols.
