German

Ralph's TTS voice
User: nsh
Date: 2/15/2008 3:59 pm
Views: 27798
Rating: 38

Hey Ralf, I've checked some of your recordings; there is too much speech from you, I suppose :) The model will be overtrained if it contains only your voice.

But there is an amazing advantage to the large database: we can easily create a German TTS voice for festival. What do you think about it?

 

you can use my voice for TTS
User: ralfherzog
Date: 2/15/2008 5:13 pm
Views: 444
Rating: 84
Hello nsh,

Thanks for the feedback.  I don't know how much speech is necessary, or when it becomes too much.  Do you think I should stop submitting to the VoxForge project?  I am planning to submit many more prompts.  At the moment, I am preparing more than one hundred zip files in the German language (each zip file contains 99 sentences).  That means that over the next few weeks or months I will submit more than 100 × 99 = 9,900 sentences.

I hope that other people will use some of my prompts to submit their own speech.

From my point of view, it is better to have a lot of prompts.  Developers of speech recognition software should be able to make a selection.  And a bit of redundancy should be helpful to stimulate the process of developing the free speech recognition software.

Of course, redundancy shouldn't result in overtraining.  But where is the limit? It should be possible to eliminate the redundancy.

If you want, you can create a "German TTS voice for festival."  My speech is licensed under the GPL.  So just use it for TTS.

Greetings, Ralf
Re: Ralph's TTS voice
User: Robin
Date: 2/16/2008 3:33 am
Views: 308
Rating: 37

That's a good idea! I've always thought it was a shame there are not so many voices out there. I don't really use TTS, but it's great for people who have bad eyes or no voice.

How many hours are (more or less) necessary to make a decent voice for festival?

Robin 

covering all nodes of a language
User: ralfherzog
Date: 2/16/2008 4:55 am
Views: 2976
Rating: 35
Hello Robin,

This is a very good question.  In my opinion, it is helpful to think about the concept of a node.  A node has a similar meaning to a vertex in graph theory.  

This is what I want to say:

One phoneme = one node
One word = one node
One sentence = one node

A language like English or German (Dutch should be the same) has the following structure (roughly estimated):

44 phonemes = 44 nodes
400,000 words = 400,000 nodes
One million sentences = one million nodes

To get good results you can try to cover all nodes.  When I create my prompts, I always think about the concept of nodes/vertices as the fundamental unit of a graph.  If you cover every node/vertex of the specific language, this should be enough.  It is necessary to create some redundancy to cover all nodes.  It is not possible to cover all nodes without the creation of some redundancy.

I cannot say how many hours you need to achieve the goal.  You should submit as much as you think is necessary to cover all nodes.

Maybe the following thought is a good approximation: if you dictated all the articles in Wikipedia for a given language (Wikipedia is not licensed under the GPL), you would cover almost all nodes.  Or, if you dictated all the sentences spoken in your favorite TV movies, series, and talk shows, you would cover almost all nodes.

Language has a structure similar to a graph in graph theory.  At least this is my opinion.  I don't know how a TTS application like festival works.  I have my own understanding of language as a graph.  And I couldn't say how many hours you would need to cover all nodes of the graph.  This depends on your selection of the sentences that you submit.  Maybe you need more sentences.  Or maybe you need fewer sentences to achieve the goal.  You don't need many sentences to cover all phonemes.  But for a good representation of the language you need much more than just a recording of each phoneme.  You have to leave the layer of the phonemes and get to the layer of the words.  And words are a part of sentences.  So to cover 400,000 words you would need, let's say, one million sentences.

This is my approach.  I would be interested to know if there are similar or different opinions about this topic.

Greetings, Ralf
Re: covering all nodes of a language
User: kmaclean
Date: 2/19/2008 8:29 pm
Views: 2163
Rating: 38

Hi Ralf,

Here are my two cents ... 

I agree with you when you say that the best speech corpus would be a large speech corpus containing both read and transcribed spontaneous speech from hundreds/thousands of people uttering 400,000 different words and 1 million different sentences.  However, the problem is cost - it is very costly and time consuming to collect such data.

As a shortcut, we can think of the problem in terms of getting good monophone and *triphone* coverage from as many people as possible.  The original CMU dictionary (the source of the current VoxForge pronunciation dictionary) has close to 130,000 word pronunciations, 43 phonemes, and close to 6000 triphones.  If we can get good coverage of these 6000 triphones (or however many there might be in the target language), then we might reach our objective of a reasonably good acoustic model without needing to worry about complete coverage of all word/sentence combinations in the target language.
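
To make the idea of monophone and triphone coverage concrete, here is a minimal Python sketch. It assumes a toy, hand-written dictionary in the CMU style (word mapped to a phoneme list) and counts only word-internal triphones; cross-word contexts are ignored for simplicity. The names `TOY_DICT`, `word_triphones`, and `coverage` are made up for this illustration and are not part of any VoxForge, HTK, or Sphinx tool.

```python
from collections import Counter

# Hypothetical toy pronunciation dictionary (word -> phoneme list).
TOY_DICT = {
    "cat": ["K", "AE", "T"],
    "cats": ["K", "AE", "T", "S"],
    "act": ["AE", "K", "T"],
}


def word_triphones(phones):
    """Return the word-internal triphones as (left, center, right) tuples."""
    return [tuple(phones[i - 1:i + 2]) for i in range(1, len(phones) - 1)]


def coverage(prompts, dictionary):
    """Count the monophones and word-internal triphones seen in the prompts."""
    monophones, triphones = Counter(), Counter()
    for prompt in prompts:
        for word in prompt.lower().split():
            phones = dictionary.get(word)
            if phones is None:
                continue  # words missing from the dictionary are skipped here
            monophones.update(phones)
            triphones.update(word_triphones(phones))
    return monophones, triphones


mono, tri = coverage(["cat act", "cats"], TOY_DICT)
print(len(mono), "distinct phonemes,", len(tri), "distinct triphones")
```

Comparing the number of distinct triphones seen against the total in the dictionary gives a rough coverage figure for a set of prompts.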

Hope that helps, 

Ken

Re: covering all nodes of a language
User: ralfherzog
Date: 2/20/2008 2:33 am
Views: 303
Rating: 32
Hello Ken,

I would like to have a German pronunciation dictionary that is comparable to the CMU dictionary.  Timo was able to use some of my prompts to produce a first version of the German pronunciation dictionary that contains about 5,088 words.

If I submitted more prompts in the German language, he (or someone else) could use those prompts to create a second version of the dictionary with a larger vocabulary.

Due to copyright concerns, I think it is best to build such a German pronunciation dictionary from scratch using only prompts that are licensed under the GPL.

To build such a dictionary, it wouldn't be necessary to have hundreds or thousands of people each submitting 400,000 words.

Greetings, Ralf
Re: covering all nodes of a language
User: nsh
Date: 2/20/2008 11:07 am
Views: 217
Rating: 33

Heh, let me repeat that it looks like a deeply wrong idea to get all triphones; the current senone tying technique allows you to get effective recognition without good coverage. And the biggest problem is that rare triphones give you zero improvement in accuracy.

I'd actually like to ask for help: in the acoustic model build logs (inside the acoustic model archive) you'll see a lot of errors due to not reaching the final state. It means that the transcription is out of sync with the dictionary and recordings. Can you please check the dictionary, transcription and recordings for those particular prompts and find the reason for the misalignment?

P.S. Thanks to Ken for uploading it. There was a request about sphinx4 model too.

 

Re: covering all nodes of a language
User: kmaclean
Date: 2/20/2008 2:17 pm
Views: 279
Rating: 36

Hi Ralf/nsh,

Ralf: Sorry, I was thinking speech recognition when I replied to your posting of covering all nodes of a language (I forgot that Robin's original question was about Text-to-Speech). 

From a speech recognition context, I just wanted to save you some work by making sure that you did not attempt to create submissions for every combination of words.  However, if your goal is to create a pronunciation dictionary by recording many different prompts, and getting the added benefit of actual speech for these prompts, then it makes sense.

nsh said:

>let me repeat that it looks like a deeply wrong idea to get all triphones, current
>senone tying technique allows you to get effective recognition without good
>coverage. And the biggest problem is that rare triphones give you zero
>improvement in the accuracy.

Thanks for this clarification.  My assumption that a good acoustic model (for speech recognition) needs to be trained from recordings of words containing all triphones was wrong.  Therefore, the key is to get recordings of words that contain the most common triphones, and to use "tied-state triphone" models (which I think is HTK terminology for the "senone tying" technique that Sphinx uses...) to cover the rare triphones.

I'm wondering if HTK's HDMan command can provide triphone counts (in a similar way that it provides phoneme counts), so we can then create prompts that might give us the "most bang for our buck".  I'm thinking we would run it against a large database to get these triphone counts (it could even be proprietary, since we are only looking for the counts), and then generate a list of words (from this same database) that cover these common triphones, so that Ralf (and others creating prompts for new languages) could use these words in their prompts.

... I'll put it on my todo list :)
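
Independently of whether HDMan can report such counts, the idea can be sketched in plain Python. The following is only an illustration under assumptions: the dictionary and word frequencies are invented toy data, only word-internal triphones are counted, and the greedy selection is one simple way to pick words, not something taken from HTK or Sphinx. All function and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: pronunciations and how often each word occurs in a corpus.
TOY_DICT = {
    "cat": ["K", "AE", "T"],
    "cats": ["K", "AE", "T", "S"],
    "act": ["AE", "K", "T"],
    "tacks": ["T", "AE", "K", "S"],
}
WORD_FREQ = {"cat": 50, "cats": 20, "act": 10, "tacks": 1}


def word_triphones(phones):
    """Return the word-internal triphones as (left, center, right) tuples."""
    return [tuple(phones[i - 1:i + 2]) for i in range(1, len(phones) - 1)]


def triphone_counts(word_freq, dictionary):
    """Weight each word's triphones by how often the word occurs."""
    counts = Counter()
    for word, freq in word_freq.items():
        for tri in word_triphones(dictionary.get(word, [])):
            counts[tri] += freq
    return counts


def pick_words(dictionary, counts, top_n, max_words):
    """Greedily pick words that together cover the top_n most frequent triphones."""
    wanted = {tri for tri, _ in counts.most_common(top_n)}
    chosen = []
    while wanted and len(chosen) < max_words:
        # Take the word that covers the most still-uncovered triphones.
        best = max(dictionary, key=lambda w: len(wanted & set(word_triphones(dictionary[w]))))
        gained = wanted & set(word_triphones(dictionary[best]))
        if not gained:
            break  # no remaining word helps
        chosen.append(best)
        wanted -= gained
    return chosen


counts = triphone_counts(WORD_FREQ, TOY_DICT)
words = pick_words(TOY_DICT, counts, top_n=3, max_words=10)
print(counts.most_common(3))
print(words)
```

With real data, `WORD_FREQ` would come from the large text database and the resulting word list could be handed to people writing prompts.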

Ken 

Zipf's law; German acoustic model
User: ralfherzog
Date: 2/20/2008 5:10 pm
Views: 407
Rating: 37
Hello nsh,

You are probably right when you say that "rare triphones give you zero improvement in the accurac[...]y."  This is because of Zipf's law, which is applicable not only to words but probably also to triphones.  So to get a good recognition rate, you might need just those triphones in the model that occur most frequently.

Thanks for building the German acoustic model.  And thanks, Ken, for adding it to the VoxForge downloads.  This is what I wanted: a first version of the German acoustic model.  I wanted to have some progress, now I have it.  So thanks a lot for your work, you guys.

I took a look at the archive "voxforge-de.tar.gz." When I open the file "voxforge_de_sphinx.html", I can see that there are a lot of warnings.  For example: "WARNING: This phone (EI) occurs in the dictionary ([...]voxforge-de/etc/voxforge_de_sphinx.dic), but not in the phonelist"

Or another example: "WARNING: This word: amerika was in the transcript file, but is not in the dictionary ( das zeigen berichte aus amerika  )."

Obviously, the phonelist has to be completed.  And the dictionary is missing a lot of words.  But I don't know what the reason for the misalignment is.  It would be good if Timo (or someone else who is familiar with scripts) would take a look at it.  I am sorry that I don't have the knowledge to solve this problem.  I would help if I knew what to do.  For the moment, we have to live with those warnings.
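
The two kinds of warnings quoted above amount to simple set differences, which could be checked with a small script before training. This is a rough Python sketch with invented in-memory data; it is not how the Sphinx training scripts perform the check, the phone symbols are made up for illustration, and `check_consistency` is a hypothetical name.

```python
def check_consistency(phonelist, dictionary, transcripts):
    """Return (phones missing from the phonelist, words missing from the dictionary)."""
    dictionary_phones = {p for phones in dictionary.values() for p in phones}
    transcript_words = {w for line in transcripts for w in line.split()}
    missing_phones = sorted(dictionary_phones - set(phonelist))
    missing_words = sorted(transcript_words - set(dictionary))
    return missing_phones, missing_words


# Hypothetical toy data, loosely modelled on the warnings quoted above.
phonelist = ["D", "A", "S", "SIL"]
dictionary = {"das": ["D", "A", "S"], "ein": ["EI", "N"]}
transcripts = ["das zeigen berichte aus amerika"]

missing_phones, missing_words = check_consistency(phonelist, dictionary, transcripts)
print("phones not in phonelist:", missing_phones)
print("words not in dictionary:", missing_words)
```

The two lists show which phones would need to be added to the phonelist and which words still need pronunciations.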

Hello Ken,

You don't have to be sorry.  When I submit the prompts, I have several targets.  One target is to cover all nodes.  But I always have in mind that I have to respect Zipf's law.  And both targets tend to conflict.  It is very difficult to cover all nodes and respect Zipf's law at the same time.  The main target is to get a prototype of the German speech recognition engine.

It is no problem to create those prompts. It is a lot of fun to use NaturallySpeaking to create prompts in English and in German.  I just hope that you have enough webspace to store those prompts.

Greetings, Ralf
Re: Zipf's law; German acoustic model
User: kmaclean
Date: 2/29/2008 12:49 pm
Views: 464
Rating: 86

Hi Ralf,

Thanks for bringing up Zipf's law.  For others (like me) who have never heard of it, here is an excerpt from Wikipedia:

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc. For example, in the Brown Corpus "the" is the most frequently occurring word, and all by itself accounts for nearly 7% of all word occurrences (69971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36411 occurrences), followed by "and" (28852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
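
The counts quoted in the excerpt can be compared with the idealized form of the law, in which the word at rank r occurs about 1/r as often as the top word. This is a small Python sketch using only the three Brown Corpus counts given above; `zipf_prediction` is a name chosen for the illustration.

```python
# Counts quoted in the excerpt above (Brown Corpus), in rank order.
observed = {"the": 69971, "of": 36411, "and": 28852}


def zipf_prediction(top_count, rank):
    """Idealized Zipf's law: frequency is inversely proportional to rank."""
    return top_count / rank


top = observed["the"]
for rank, (word, count) in enumerate(observed.items(), start=1):
    predicted = zipf_prediction(top, rank)
    # A ratio near 1.0 means the observed count is close to the idealized value.
    print(f"{rank} {word:>4} observed={count} predicted={predicted:.0f} ratio={count / predicted:.2f}")
```

The observed counts follow the 1/r trend only approximately, which is expected since Zipf's law is an empirical regularity rather than an exact rule.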

Amazing stuff,

thanks,

Ken 

P.S.  re: "I just hope that you have enough webspace to store those prompts."

No worries, disk space gets cheaper every year - not sure if it follows Moore's Law, but it must be pretty close :)
