VoxForge
Hi,
Last week I prepared a pronunciation dictionary for Spanish (a word-to-phoneme dictionary). I used Festival to convert the words to their phonemes.
Some special phonemes appear in the phoneme list, and I have some doubts about them.
In Spanish there are 3 phonemes, the sounds of b, d and g, that each have 2 versions of the same sound, one stronger than the other. Festival produces the b phoneme (for example in "bajo") and the B phoneme (for example in "aBierto"), and the same for d and g. Should a voice recognition system treat the two versions as the same phoneme (the sounds are very similar)?
In Spanish, each word always has one accentuated vowel (the accent may be written explicitly or not): á, é, í, ó, ú. The difference between the phoneme of a (a) and the phoneme of á (a1) is that the first one is shorter than the second one, but the sound is the same.
In Spanish, the same word with a different accent sometimes gives two different words. For example, the phrase "ESTA es una chica" (THIS is a girl), and "¿Dónde ESTÁ el baño?" (Where IS the w.c.?). The word ESTA in one case means THIS, and in the other (ESTÁ) means IS...
Should these be considered the same phoneme, or different ones?
Thanks in advance.
They should be considered two different phonemes. When two words have exactly the same phonemes, the speech recognition engine cannot 'hear' the difference between them (the esta/está problem). When the sounds are very similar, the distinction may not matter much with poor engines or acoustic models, but the more accurate they are, the more it matters.
In the end I created two phonemes for each vowel sound (a, á, e, é, i, í, o, ó, u and ú) and one for each of the others.
To improve Spanish recognition, I suppose I need to prepare more phrases for Spanish, in particular phrases that cover the missing triphones. Is this the way?
Hi Ubanov,
>To improve Spanish recognition, I suppose I need to prepare more phrases
>for Spanish, in particular phrases that cover the missing triphones. Is
>this the way?
Although I think you answered your own question, I would say: Yes.
But you want to target the triphones that are most likely in Spanish, and make sure you have those covered.
You don't need to cover all the triphones... see nsh's post "Re: covering all nodes of a language" for more information, where he says: "... the biggest problem is that rare triphones give you zero improvement in the accuracy."
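One way to find which frequent triphones are still missing from your recorded prompts is to count them from the pronunciation dictionary. This is only a minimal sketch: it assumes an HTK-style dictionary with lines like "WORD [WORD] p1 p2 ..." and upper-case word entries, it counts full word-internal triphones only (no cross-word or word-edge contexts), and the function names and toy data are made up for illustration.

```python
from collections import Counter


def parse_dict_line(line):
    """Split 'WORD [OUTSYM] p1 p2 ...' into (word, phonemes).

    The bracketed output symbol is treated as optional (assumed format).
    """
    fields = line.split()
    word, rest = fields[0], fields[1:]
    if rest and rest[0].startswith("[") and rest[0].endswith("]"):
        rest = rest[1:]
    return word, rest


def word_internal_triphones(phones):
    """Return 'left-centre+right' labels for full word-internal triphones."""
    return [
        f"{phones[i - 1]}-{phones[i]}+{phones[i + 1]}"
        for i in range(1, len(phones) - 1)
    ]


def triphone_counts(dict_lines, words):
    """Count triphones over a sequence of words, skipping unknown words."""
    pron = dict(parse_dict_line(line) for line in dict_lines if line.strip())
    counts = Counter()
    for word in words:
        counts.update(word_internal_triphones(pron.get(word.upper(), [])))
    return counts


def uncovered_frequent(corpus_counts, recorded_counts, min_count=2):
    """Triphones seen at least min_count times in the corpus but never recorded."""
    return [
        tri
        for tri, n in corpus_counts.most_common()
        if n >= min_count and tri not in recorded_counts
    ]


# Toy data: the phoneme strings are illustrative, not real Festival output.
toy_dict = ["BAJO [BAJO] b a j o", "ESTA [ESTA] e s t a"]
corpus = triphone_counts(toy_dict, ["bajo", "esta", "bajo"])
recorded = triphone_counts(toy_dict, ["esta"])
print(uncovered_frequent(corpus, recorded))
```

In practice "corpus" would be a large body of Spanish text and "recorded" the prompts already submitted; the min_count threshold is where you drop the rare triphones that are not worth chasing.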
Ken
Hi Ubanov,
>>In Spanish there are 3 phonemes, the sounds of b, d and g, that each have 2
>>versions of the same sound, one stronger than the other [...] (the
>>sounds are very similar).
It depends... you will likely need to test these out and see if it makes a difference. If these sounds are never used in similar words in the Spanish language, then you can probably get away with using the same phoneme for both sounds (since they are being used in different 'triphone' contexts). If you get poor recognition results with this grouping, then you may need to split them into separate phonemes.
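Before running a full recognition test, you can at least check whether grouping two phonemes makes any dictionary entries indistinguishable. The following is a rough sketch under my own assumptions: pronunciations are already loaded into a Python dict, the function name is hypothetical, and the toy entries (W1, W2, W3) are invented placeholders rather than real Spanish words.

```python
from collections import defaultdict


def merged_collisions(pronunciations, merge_map):
    """Find words that only become homophones after merging phonemes.

    pronunciations: dict mapping word -> list of phonemes.
    merge_map: dict mapping a phoneme to its replacement, e.g. {"B": "b"}.
    Returns a list of sorted word groups whose pronunciations differ
    before merging but are identical afterwards.
    """
    groups = defaultdict(set)
    for word, phones in pronunciations.items():
        merged = tuple(merge_map.get(p, p) for p in phones)
        groups[merged].add((word, tuple(phones)))

    collisions = []
    for entries in groups.values():
        original_prons = {phones for _, phones in entries}
        if len(original_prons) > 1:  # distinct before, same after merging
            collisions.append(sorted(word for word, _ in entries))
    return collisions


# Invented placeholder entries, only to show the mechanics.
toy = {
    "W1": ["a", "b", "a"],
    "W2": ["a", "B", "a"],
    "W3": ["b", "a"],
}
print(merged_collisions(toy, {"B": "b"}))
```

An empty result on the real dictionary would suggest the grouping costs nothing at the dictionary level, though it says nothing about acoustic accuracy, which still has to be tested.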
>>In Spanish, each word always has one accentuated vowel (the accent may
>>be written explicitly or not): á, é, í, ó, ú. The difference between the
>>phoneme of a (a) and the phoneme of á (a1) is that the first one is
>>shorter than the second one, but the sound is the same.
It depends... see above comments...
>>In Spanish, the same word with a different accent sometimes gives two
>>different words. For example, the phrase "ESTA es una chica" (THIS is
>>a girl), and "¿Dónde ESTÁ el baño?" (Where IS the w.c.?). The word
>>ESTA in one case means THIS, and in the other (ESTÁ) means IS...
You don't have to worry about this in your acoustic model. This is addressed in your pronunciation dictionary where ESTA and ESTÁ would have the same pronunciation. Then you would need to use your grammar or language model to distinguish between them.
>In the end I created two phonemes for each vowel sound (a, á, e, é, i, í,
>o, ó, u and ú) and one for each of the others.
You might also look at how they did it in the Sphinx Spanish pronunciation dictionary. You might also take a look at how Festival (or eSpeak...) might have broken down their pronunciations ... i.e. don't "re-invent the wheel" ... :)
Ken
The phonemes for all the words are generated with Festival. Festival gives me 2 phonemes for the sounds b and B; as they are similar, I have merged them into a single phoneme.
I have made a quick test removing the áéíóú phonemes (and using aeiou instead), and I think that the recognition has been better (the test is not very scientific; rather, it is very subjective).
Maybe in the future, when we have hours and hours of audio, the speech recognition engine will be able to tell á from a. So for now I'm thinking of removing the accentuated vowel phonemes. To run the test I searched and replaced " aa " with " a ", " ee " with " e "... Is there any way of having a dict file with the correct pronunciation and telling HTK that a and aa are the same phoneme?
Thanks
Hi Ubanov,
>I think that the recognition has been better (the test is not very
>scientific; rather, it is very subjective).
That is good enough at this point... the important thing is to get the speech, phoneme improvements can always be made at a later date.
>Is there any way of having a dict file with the correct pronunciation and
>telling HTK that a and aa are the same phoneme?
There might be, but I have not tried this myself... If you look at the tiedlist table (from Step 10 of the Tutorial), there are mappings from "logical" triphones to "physical" phones/triphones. You might try adding an entry there to map from "aa" to "a", and see how that works. However, with this approach you would need to train and recognize with 2 different pronunciation dictionaries. The training one would only use the "a" phoneme, but the recognition pronunciation dictionary would have words containing the proper "a" or "aa" phoneme... seems like more work than it is worth... :)
I think you should just pick the one phoneme that will have the most occurrences in your pronunciation dictionary ("aa" or "a"?), and use that as the phoneme to represent the similar sound. Then, if you decide to split them up at a later date, you only have to change the ones that have the least number of occurrences.
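To see which of the two phonemes occurs more often, you can simply count them in the dictionary. A minimal sketch, assuming an HTK-style dictionary with lines like "WORD [WORD] p1 p2 ..."; the function name and the toy entries are hypothetical, and the phoneme strings are illustrative rather than real Festival output.

```python
from collections import Counter


def phoneme_counts(dict_lines):
    """Count phoneme occurrences over all dictionary entries.

    Each line is assumed to look like 'WORD [OUTSYM] p1 p2 ...',
    with the bracketed output symbol being optional.
    """
    counts = Counter()
    for line in dict_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        phones = fields[1:]  # drop the word itself
        if phones and phones[0].startswith("[") and phones[0].endswith("]"):
            phones = phones[1:]  # drop the optional output symbol
        counts.update(phones)
    return counts


# Illustrative entries only.
toy_dict = [
    "ESTA [ESTA] e s t a",
    "ESTÁ [ESTÁ] e s t aa",
    "BAJO [BAJO] b aa j o",
]
counts = phoneme_counts(toy_dict)
print(counts["a"], counts["aa"])
```

Whichever of the two has the higher count is the one to keep as the shared symbol, so that a later split touches the fewest entries.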
Don't get too hung up on the grapheme representation (letters in a word) vs the phoneme representation (the distinct sounds that make up a word)...
Remember, simply because a word is spelled a certain way, does not mean you have to use those exact letters to represent the sounds in the word (it's easier to figure out what the sound might be if you use them, but is not necessary...). A phoneme can be represented by any 2 character sequence (alpha in first position, alphanumeric in second position) - HTK/Julius does not care what those 2 characters might be, but you have to be consistent in using that sequence to identify that sound.
If you want 'a' and 'aa' to be the same phoneme for now (until you get a better idea of how having separate phonemes might help recognition...) you could also pick a third, independent 2 character sequence (that is not otherwise being used) to represent that sound (like "a1").
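One way to keep your options open is to leave the full dictionary (with both 'a' and 'aa') as the master copy and generate the merged dictionary from it with a small script. This is just a sketch of that idea, not something HTK provides: it assumes lines like "WORD [WORD] p1 p2 ...", the function name and merge map are hypothetical, and the output is re-joined with single spaces.

```python
def remap_dict_line(line, merge_map):
    """Rewrite the phonemes of one dictionary line using merge_map.

    The word and the optional bracketed output symbol are left untouched;
    only the phoneme tokens after them are mapped.
    """
    fields = line.split()
    if not fields:
        return line  # leave blank lines as they are
    out = [fields[0]]
    phones = fields[1:]
    if phones and phones[0].startswith("[") and phones[0].endswith("]"):
        out.append(phones[0])
        phones = phones[1:]
    out.extend(merge_map.get(p, p) for p in phones)
    return " ".join(out)


# Example merge map: fold the accentuated vowels into the plain ones.
merge_map = {"aa": "a", "ee": "e"}
master = ["ESTÁ [ESTÁ] e s t aa", "BAJO [BAJO] b aa j o"]
merged = [remap_dict_line(line, merge_map) for line in master]
print("\n".join(merged))
```

Because it works on whole tokens, this also catches a phoneme at the end of a line, which a plain search and replace of " aa " would miss. To use a third symbol instead, map both phonemes to it, e.g. {"a": "a1", "aa": "a1"}.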
Hope that helps,
Ken