Acoustic Model Discussions

Use "PIEC" instead of "pięć". Use "P E NG TS" instead of "pʲeɲʨ".
User: ralfherzog
Date: 11/16/2009 5:36 am
Views: 1418
Rating: 17

Hi johnyjj2!

Yes, you should use Arpabet (and not IPA). And you should use only US-ASCII characters ("PIEC" instead of "pięć"). Then it doesn't matter which encoding you are using. The following encodings are common for Polish:

"The standard 8-bit character encoding for the Polish alphabet is ISO 8859-2 (Latin-2), although both ISO 8859-13 (Latin-7) and ISO 8859-16 (Latin-10) encodings include glyphs of the Polish alphabet. Microsoft's format for encoding the Polish alphabet is Windows-1250."

All of these encodings are US-ASCII compatible: the first 128 code points are identical. If you use only US-ASCII (for the Polish words) and Arpabet (for the Polish phonemes), you should be fine. This approach is a very good start because you avoid encoding issues.
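A minimal sketch of how such a dictionary could be checked for stray non-ASCII characters before training. The function name `non_ascii_lines` and the sample lines are made up for illustration; they are not part of Sphinx.

```python
def non_ascii_lines(lines):
    """Return (line_number, line) pairs that contain non-US-ASCII characters."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if not line.isascii()  # str.isascii() needs Python 3.7 or newer
    ]


# Hypothetical dictionary content: the second entry still uses Polish letters.
sample = ["PIEC P E NG TS", "pięć P E NG TS"]
for number, line in non_ascii_lines(sample):
    print(f"line {number} is not pure US-ASCII: {line}")
```

If the list comes back empty, the file reads the same under ISO 8859-2, Windows-1250 and UTF-8.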

I don't know which line endings are accepted by Sphinx (CR/LF (Win) or LF (Unix) or CR (Mac)). Maybe it is a problem, maybe not. You can take a look at cmudict to find out which line endings they are using (I assume that they are using LF (Unix)).
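One way to find out which line endings a file such as cmudict uses is to count them in the raw bytes. This is only a sketch; the helper names are hypothetical, and whether Sphinx accepts anything other than LF is not confirmed here.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CR/LF (Windows), bare LF (Unix) and bare CR (old Mac) endings."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF bytes that are not part of CR/LF
    cr = data.count(b"\r") - crlf   # CR bytes that are not part of CR/LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}


def to_unix_line_endings(data: bytes) -> bytes:
    """Convert every line ending to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Made-up sample with one ending of each kind.
sample = b"JEDEN ...\r\nPIEC P E NG TS\nTAK ...\r"
print(count_line_endings(sample))
```

For a real file, read it in binary mode (`open(path, "rb").read()`) so that Python does not translate the line endings before they are counted.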

OK, the encoding problem is solved for your small 10-15 word dictionary (Sphinx format).

What is the pronunciation of pięć? It is [pʲeɲʨ] (IPA). You can now use the Arpabet table to translate the IPA symbols into Arpabet. The result could look like this:

PIEC P E NG TS

The X-SAMPA equivalent of ʨ is "ts\". So it is probably best to use TS as the symbol.

What can you learn from my post?

1. Look up each word that you want to add to your pronunciation dictionary in the Polish Wiktionary to find its IPA transcription.

2. Translate each IPA phoneme to the corresponding Arpabet phoneme.

3. Your Polish dictionary should look like cmudict (of course, you have just a few words, but for your project that is sufficient).

When you are ready with your Polish pronunciation dictionary (Sphinx format), can you please post a link to it? Thanks.

Regards, Ralf

--- (Edited on 2009-11-16 5:36 am [GMT-0600] by ralfherzog) ---

Re: Use "PIEC" instead of "pięć". Use "P E NG TS" instead of "pʲeɲʨ".
User: johnyjj2
Date: 11/16/2009 4:52 pm
Views: 173
Rating: 16

Thanks for your answer!

> And you should use only US-ASCII characters ("PIEC" instead of "pięć").

For a very basic dictionary, yes. However, in the future it wouldn't be a good choice. Poles can read Polish written without Polish letters, using only English letters. For a computer, however, it wouldn't be a good idea because it can lead to many ambiguities. For example, both piec and pięć exist as words, but they have different meanings.
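To illustrate the ambiguity, here is a small sketch that folds Polish letters to US-ASCII and reports which words end up with the same spelling. The function names and the word list are made up for the example.

```python
# Polish letters and the ASCII letters they are usually replaced with.
POLISH_TO_ASCII = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")


def fold_to_ascii(word):
    """Replace Polish letters with plain ASCII letters and upper-case the word."""
    return word.translate(POLISH_TO_ASCII).upper()


def find_collisions(words):
    """Group words by ASCII spelling and keep only groups with more than one word."""
    groups = {}
    for word in words:
        groups.setdefault(fold_to_ascii(word), []).append(word)
    return {key: group for key, group in groups.items() if len(group) > 1}


print(find_collisions(["piec", "pięć", "tak"]))
```

Any group reported here would need a distinguishing spelling in a US-ASCII dictionary.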

> The standard 8-bit character encoding for the Polish alphabet is ISO 8859-2 (Latin-2),

I guess it can be done with Word/WordPad. I've got both Ubuntu and Windows, however I still spend more time in Windows. I need to switch to Linux completely but it requires some time to change habits :-P.

> Take a look into the Polish Wiktionary

Unfortunately, http://pl.wiktionary.org/wiki/Specjalna:Linkujące/Aneks:IPA is a rather poor IPA dictionary. Only some of the words have an IPA transcription.

COPYING IPA FROM WIKTIONARY TO NOTEPAD
I opened the Polish Wiktionary and tried to copy the IPA for pięć into Notepad and MS Word, but some IPA characters showed up as rectangles. How can I copy IPA from Wiktionary to Notepad/Word/WordPad?

LIST OF ALL SOUNDS IN WRITTEN (ORDINARY) FORM + ORTHOGRAPHY
This is a list which I created from memory: a ą b c ć cz d dz dż dź e ę f g h=ch i j k l ł m n ń o p r s ś sz t u=ó w y z ź ż=rz. It contains all letters and the sounds written with two letters, and it also takes into account that some sounds can be written in two ways (a feature of Polish that only makes the orthography more difficult without changing the sound; in fact the spelling indicates which related words a word shares a root with, but I think this knowledge is not very useful for an ordinary speaker).

WRITING "I" AFTER A LETTER, E.G. Ź/Z(I), Ś/S(I); SUPERSCRIPT "J"
I noticed that http://en.wikipedia.org/wiki/Polish_phonology#Consonants lists both ś and s(i) as the same IPA sound. I think there is a small difference between e.g. śa and sia. For example, I can write siadaj (let's sit) and śakja (this "śa" doesn't exist in Polish, but I have seen it in some Polish books where a simple Polish transcription is used for words from other languages, e.g. Budda Śakjamuni). I would say that there is a small difference between those two spellings, śa and sia. On Wikipedia it looks like both are encoded with the same IPA sound. I would argue with that (especially for ń/n(i)). If I remember correctly, this letter "i" also softens the letter just before it. I also noticed that in some words, like piec, there is a superscript j in the IPA. How should I take those i/j into account? You gave me an example of how to write the word piec, but you omitted this little j. From my point of view, the lack of this j completely changes how the word sounds, and it cannot be omitted.

RESULT OF MY WORK + QUESTION ABOUT ARPABET
OK, I've got part of the list ready. I marked the Arpabet symbols in red. However, I couldn't find most of the Arpabet symbols (http://rapidshare.com/files/308030950/alfabet4.JPG.html). Can you give me a link to a full list of Arpabet symbols, not limited to the English Arpabet abbreviations? (The SAMPA symbols are not so useful, I guess, because they contain non-letter characters.) What should I do with the apostrophes, as in the IPA for jeden (one), and with the superscript j (as in five, pięć)?

Greetings!

--- (Edited on 11/16/2009 4:52 pm [GMT-0600] by johnyjj2) ---

invent the missing "Arpabet" phonemes
User: ralfherzog
Date: 11/16/2009 7:33 pm
Views: 1517
Rating: 16

Hi johnyjj2!

"piec and pięć [...] have different meaning"

You see that US-ASCII isn't sufficient to catch the details of the Polish language. That is the reason why I am using UTF-8 for Ralf's Polish dictionary (UTF-8 is great: no problem with Greek or even Tamil). Maybe you can help me with Ralf's Polish dictionary (UTF-8) when you have finished your small vocabulary Sphinx project (US-ASCII).

"switch to Linux completely"

I am using both on two different hard drives (swappable). I need Win XP (Windows Movie Maker) for the creation of videos like Dictation under Ubuntu: 148 German words recognized correctly. I use Ubuntu for eSpeak, HTK, simon (PDF), Audacity.

By the way, when you watch the video, you can see that simon recognizes even words with special German characters (äöüß). Of course, this concept could be applied for the Polish special characters, too.

"there is little difference between e.g. śa and sia"

You can treat different phones that sound similar as one single phoneme. Your small Sphinx pronunciation dictionary doesn't have to catch every detail (= phones) of the Polish language. You just have to catch the major characteristics (= phonemes).

"both are encoded with the same IPA sound"

Then treat them as one single Arpabet sound. If you don't find the specific sound in the Arpabet table, then create your own "Arpabet" sound. You should create your own pronunciation table. You have to decide which Polish phonemes your dictionary should have (and which phones you want to omit). In the end, your own pronunciation table should have about 40 phonemes.

So create your own phoneme table (and invent the missing "Arpabet" phonemes). It is up to you to solve the details.

"piec, but you ommited this little j"

I don't speak or understand Polish. So it obviously was a mistake to omit the "little j". Maybe it would be better to write:

PIEC P J E NG TS
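The translation from IPA to a self-made "Arpabet" table can be sketched as a simple lookup. The table below contains only the five symbols used in this thread; every further entry would be your own decision, and the function names are made up for the example.

```python
# IPA symbol -> pseudo-Arpabet symbol. Only the mappings discussed in this
# thread; a full Polish table has to be designed by the dictionary author.
IPA_TO_ARPABET = {
    "p": "P",
    "ʲ": "J",   # the "little j" (palatalization mark)
    "e": "E",
    "ɲ": "NG",
    "ʨ": "TS",
}


def ipa_to_arpabet(ipa):
    """Translate an IPA string symbol by symbol; fail loudly on unknown symbols."""
    phones = []
    for symbol in ipa:
        if symbol not in IPA_TO_ARPABET:
            raise ValueError(f"no Arpabet symbol defined for IPA symbol {symbol!r}")
        phones.append(IPA_TO_ARPABET[symbol])
    return " ".join(phones)


def dictionary_line(word, ipa):
    """Build one line in the cmudict style: the word, then its phonemes."""
    return f"{word} {ipa_to_arpabet(ipa)}"


print(dictionary_line("PIEC", "pʲeɲʨ"))
```

The lookup works one character at a time, so an IPA sound written with two characters would need a longer-match rule before it could be added to the table.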

Greetings,

Ralf

--- (Edited on 2009-11-16 7:33 pm [GMT-0600] by ralfherzog) ---

Re: invent the missing "Arpabet" phonemes
User: johnyjj2
Date: 11/18/2009 9:35 am
Views: 2235
Rating: 17

Thanks for the answer!

POCKETSPHINX VS SIMON
I don't know why I thought that Simon was a speech synthesizer. I watched the video and I see it is a really good speech recognition engine. Should I switch from Sphinx4 to Simon :-P? I've had unexpectedly many problems with things which should be the simplest possible in CMU Sphinx, i.e. how to edit/build/run PocketSphinx examples like HelloWorld (https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/3445960?message=7750375). (By the way, I think it is better to use Ant with demo.xml for Sphinx4 rather than Eclipse/NetBeans; I need to find out how to execute demo.xml with Ant.) From my point of view, this is partly because it is hard to tell what is where on the websites and in the documentation. I looked at the docs of HTK and Julius and found them better because they consist of one big PDF file. My goal is to run speech recognition with fifteen words on a mobile phone, so I think PocketSphinx would be the best choice. Or should I switch to Simon/HTK/Julius?

RESULTS OF MY WORK
Here I include both files from my etc directory and, as you asked, my pseudo-Arpabet list: http://rapidshare.com/files/308801152/etc.7z.html

HOW TO INVENT AUDIO AND TRANSCRIPTION FILES
I guess it doesn't matter how many spaces I've got in the .dic file between a word and its pronunciation. Is it a good idea to list two pronunciations of 'dziewiec' as shown in my file? How should I extend the .filler file to take into account lip smacking, throat clearing, etc.? I'm surprised that those are not already included in etc/an4.filler. In the directory wav\pl1_clstk I decided not to create any subdirectories but to put my wav (not sph) audio files directly there. And about those .transcription files: is it a good idea to create ten files with ten sets of ten words in random order (e.g. "JEDEN PIEC KONIEC TAK TRZY...")? How many people need to speak them? Should I use the same or different sentences for all of those people? How many minutes of recording do I need? Do I really need to create any 'test' files? If yes, can I simply use exactly the same transcriptions and audio files as in 'train'? By the way, I also edited sphinx_train.cfg according to Nsh's suggestion. I deleted feat.params, pl1.ug.lm and pl1.ug.lm.DMP; I hope that was OK, was it?

Greetings :-)!

--- (Edited on 11/18/2009 9:35 am [GMT-0600] by johnyjj2) ---

Sphinx dictionary can contain Polish special characters
User: ralfherzog
Date: 11/19/2009 10:11 am
Views: 490
Rating: 17

Hi johnyjj2!

"should I switch to Simon/HTK/Julius?"

You can use both systems. You can import 19 Polish words into simon.

It is possible to convert Ralf's Polish dictionary (PLS format) into Sphinx format. You could do that e.g. with Notepad++:
- remove the XML elements (with Search/Replace);
- convert the eSpeak phonemes into Arpabet phonemes (with Search/Replace). There is no encoding problem because eSpeak and Arpabet phonemes only consist of US-ASCII characters. By the way, I am interested to know which eSpeak phonemes correspond with which (pseudo-)Arpabet phonemes.
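Instead of Search/Replace, the two steps above could also be scripted. The following is only a sketch: it assumes the PLS file uses the standard W3C lexicon namespace with one grapheme and one phoneme element per lexeme, and both the sample phoneme string and the `PHONE_MAP` table are placeholders, not verified eSpeak notation.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C Pronunciation Lexicon Specification (PLS).
PLS_NS = {"pls": "http://www.w3.org/2005/01/pronunciation-lexicon"}

# Placeholder lexicon; the alphabet name and phoneme string are made up.
SAMPLE_PLS = """<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="x-placeholder">
  <lexeme><grapheme>tak</grapheme><phoneme>tak</phoneme></lexeme>
</lexicon>"""

# Hypothetical table: source phoneme symbol -> (pseudo-)Arpabet symbol.
PHONE_MAP = {"t": "T", "a": "A", "k": "K"}


def pls_to_sphinx(pls_text, phone_map):
    """Return Sphinx-style dictionary lines for every lexeme in a PLS document."""
    root = ET.fromstring(pls_text)
    lines = []
    for lexeme in root.findall("pls:lexeme", PLS_NS):
        word = lexeme.findtext("pls:grapheme", namespaces=PLS_NS)
        phones = lexeme.findtext("pls:phoneme", namespaces=PLS_NS)
        # One character per phoneme is a simplification; a KeyError here
        # means the table is missing a symbol.
        arpabet = " ".join(phone_map[symbol] for symbol in phones)
        lines.append(f"{word.upper()} {arpabet}")
    return lines


print("\n".join(pls_to_sphinx(SAMPLE_PLS, PHONE_MAP)))
```

The mapping table is the part that answers the question of which eSpeak phonemes correspond to which (pseudo-)Arpabet phonemes, so it has to be filled in by hand.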

The result would be a dictionary that you can
- use with simon/HTK/Julius;
- use with Sphinx.

Of course, the Sphinx dictionary can contain Polish special characters (be careful with the encoding, use UTF-8).

"I guess it doesn't matter how many spacebars I've got in .dic file between word and its pronunciation."

I don't know. At least, I could import your dictionary into simon as a Sphinx dictionary. Download cmudict and look closely at the file to find out whether they use spaces or a tab (or several tabs) between the word and the corresponding pronunciation.
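If you want to be on the safe side, you could normalize the dictionary so that every entry uses exactly one space as separator. A minimal sketch; the function names are made up, and it does not tell you what Sphinx itself accepts.

```python
def parse_dict_line(line):
    """Split a dictionary line into (word, [phonemes]) on any run of spaces or tabs."""
    fields = line.split()  # split() without arguments collapses all whitespace
    if len(fields) < 2:
        raise ValueError(f"expected a word and at least one phoneme: {line!r}")
    return fields[0], fields[1:]


def normalize_dict_line(line):
    """Rewrite a line with single spaces between all fields."""
    word, phones = parse_dict_line(line)
    return " ".join([word] + phones)


print(normalize_dict_line("PIEC \t  P J  E NG TS\n"))
```

Running every line of the .dic file through `normalize_dict_line` gives one consistent layout regardless of how the file was typed.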

Greetings,
Ralf

--- (Edited on 2009-11-19 10:11 am [GMT-0600] by ralfherzog) ---

Re: Sphinx dictionary can contain Polish special characters
User: johnyjj2
Date: 11/20/2009 8:09 am
Views: 2425
Rating: 18

Thank you for your answer!

Can I run Simon on a mobile phone? I will have a look at these eSpeak phonemes soon.

Let me also repeat some questions from my previous post which have not been answered yet:
Is it a good idea to create ten files with ten sets of ten words in random order (e.g. "JEDEN PIEC KONIEC TAK TRZY...")? How many people need to speak them? Should I use the same or different sentences for all of those people? How many minutes of recording do I need?
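The random-order prompts themselves are easy to generate. This sketch does not answer how many speakers or minutes are needed; the `<s> ... </s> (utterance_id)` layout is the transcription convention assumed here for SphinxTrain, and the function name and utterance IDs are made up.

```python
import random


def make_prompts(words, n_prompts, seed=0):
    """Return n_prompts transcription lines, each with all words in random order."""
    rng = random.Random(seed)  # fixed seed so the prompts can be regenerated
    prompts = []
    for index in range(1, n_prompts + 1):
        order = list(words)
        rng.shuffle(order)
        prompts.append(f"<s> {' '.join(order)} </s> (utt_{index:03d})")
    return prompts


# Only the five words quoted in the post; the real list would have more.
WORDS = ["JEDEN", "PIEC", "KONIEC", "TAK", "TRZY"]
for line in make_prompts(WORDS, 10):
    print(line)
```

Each line then corresponds to one recording, with the ID in parentheses matching the name of the audio file.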

Nsh, please, have a look here: https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/3465115

Greetings!

--- (Edited on 11/20/2009 8:09 am [GMT-0600] by johnyjj2) ---

You have to find out on your own
User: ralfherzog
Date: 11/21/2009 12:36 am
Views: 4330
Rating: 15

Hi johnyjj2! Sorry, I don't know the answers to your questions. You have to find out on your own. I don't want to give you false advice.

--- (Edited on 2009-11-21 12:36 am [GMT-0600] by ralfherzog) ---
