VoxForge
Well, I could build a model this weekend. Until then, you should probably install and try pocketsphinx, either on Windows or, better, on Linux. As for the language model, filtering is a trivial step that the language modelling toolkits already do. We'll return to this later, once we have an acoustic model, but you only need to use one of the toolkits.
Hi nsh,
Unfortunately I have not moved any German audio to subversion.
However, here is a quick and dirty way to get the audio:
1. $ wget -r -l2 http://www.voxforge.org/home/downloads/speech/german-speech-files -A "ralfherzog*"
This will create a directory called www.voxforge.org.
2. Search that directory for *.zip files using GNOME's search tool, and drag the results to the directory you want.
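If no graphical search tool is at hand, step 2 can also be scripted. The following is only a rough sketch in Python, not part of the original instructions: the function name collect_zips and the destination directory are made up for illustration, and it assumes the wget mirror sits in www.voxforge.org under the current directory.

```python
import shutil
from pathlib import Path


def collect_zips(mirror_dir, dest_dir):
    """Copy every *.zip found anywhere under mirror_dir into dest_dir.

    Returns the list of copied target paths. Both arguments are plain
    directory paths; the names are illustrative, not VoxForge conventions.
    """
    mirror = Path(mirror_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    # sorted() materialises the search results before anything is copied
    for zip_path in sorted(mirror.rglob("*.zip")):
        target = dest / zip_path.name
        # Archives with the same name in different subdirectories would
        # collide; keep the first one rather than overwrite it.
        if target.exists():
            continue
        shutil.copy2(zip_path, target)
        copied.append(target)
    return copied
```

For example, collect_zips("www.voxforge.org", "german-audio") would gather the archives into a (hypothetical) german-audio directory.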
Ken
OK, I created a model from a third of the audio data. You can download it here:
http://www.mediafire.com/?2bmbsmmzrm5
It decodes numbers quite well.
Hi nsh,
Yes, go for it! You should have commit access to the German svn repository... if not let me know.
I'll need to update the scripts to create the gzipped tar files and rsync to the VF repository to allow downloads.
thanks,
Ken
Hello Ken,
I can see that you are trying to find a solution for ticket #321 (Windows: SpeechSubmission app for German - umlauts not displaying properly).
Perhaps this might help. My different "prompts.txt" files (de1, de2, ... de150) should be encoded in "ANSI" (Notepad++ under Windows XP). According to Wikipedia:
"the phrase "ANSI" refers to the Windows ANSI code pages [...]."
Notepad++ offers the possibility to convert a prompts.txt file (obviously some kind of Windows ANSI code, perhaps encoded in Windows-1252?) into UTF-8. This option is available via the Notepad++ menu Format-"Convert to UTF-8."
So perhaps my prompts should be converted from ANSI into UTF-8 using Notepad++?
So you wouldn't have to find a solution via Java. You may just use a simple text editor to do the conversion.
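For many prompt files, the same conversion could be done in a script instead of by hand. This is a minimal sketch in Python that assumes the files really are Windows-1252 (Python codec name "cp1252"), which is only a guess in this thread; the function name convert_to_utf8 is made up for illustration.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode the text file src into UTF-8 and write it to dst.

    source_encoding defaults to Windows-1252, which is an assumption
    about how the prompts were saved, not a confirmed fact.
    """
    # Work on raw bytes so that line endings are left untouched
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

If a file contains bytes that are not valid in the assumed source encoding, the decode step raises UnicodeDecodeError, which is a useful hint that the guess about the encoding was wrong.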
Thanks and greetings, Ralf
Hi Ralf,
Thanks for the advice, though I think there might be more to it than just the character encoding of the text files (ANSI, UTF-8, ...). The reason I think so is that the prompts display fine on my install of Linux (FC6). The problem might be related to the default character set the user selects on their Windows or Linux machine.
I need to look into this further,
Ken
Hi Ralf,
I've updated the speech submission app (now on release 0.1.4). The encoding problem should now be fixed.
Basically, Java takes the default encoding of whatever computer it is running on. So even though the prompts might look OK on my computer (using UTF-8), they might look different on someone else's computer (usually Windows).
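The effect is easy to reproduce outside Java. The sketch below uses Python purely as an illustration and is not the submission app's code: it decodes the same UTF-8 bytes once as UTF-8 and once as Windows-1252, which is assumed here to be a typical default on a Western European Windows machine. The helper name decode_prompt is made up.

```python
import locale


def decode_prompt(raw, encoding):
    """Decode raw prompt bytes using an explicitly named encoding."""
    return raw.decode(encoding)


# What this machine would use if no encoding were named explicitly
platform_default = locale.getpreferredencoding(False)

raw = "für".encode("utf-8")            # bytes as stored in a UTF-8 prompts file

as_utf8 = decode_prompt(raw, "utf-8")    # round-trips to the original text
as_cp1252 = decode_prompt(raw, "cp1252")  # the two bytes of "ü" become two characters
```

Naming the encoding explicitly when reading the prompts, instead of relying on the platform default, makes the result the same on every machine.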
Please let me know if you still are having character display problems.
thanks,
Ken