VoxForge
To add a missing word (as displayed in your HDMan log - dlog) to the VoxForge Lexicon, you need to look at the pronunciation of similar words in the dictionary, and create a new pronunciation entry for your word based on these similar words.
For example, if you want to add the word "winward", you would look up words that are similar, such as:
WINWOOD [WINWOOD] w ih n w uh d
In this case, this gives us the pronunciation for the "win" in the word "winward". Next, we look for words that contain "ward" in the dictionary, such as:
WOODWARD [WOODWARD] w uh d w er d
WARD [WARD] w ow r d
Notice that although the words "woodward" and "ward" contain the same sequence of letters (ward), they are pronounced differently - they have different phoneme sequences. Next you need to make a judgment call based on your knowledge of your English dialect (you might also want to listen to the actual audio passage that contains the word, but this could take too much time for each and every word you are unsure of). For me, the way I pronounce the word part "ward" in "winward" is closer to the sounds I make in "woodward" than in the word "ward". Therefore, the final pronunciation dictionary entry I would use would look like this:
WINWARD [WINWARD] w ih n w er d
You then need to add this word to your version of the VoxForge Lexicon in *Alphabetical* sequence. You need to repeat these steps for all the "missing words" in your eText. It's a little tedious when you perform this process for the first time, but as you get familiar with the words and phonemes, it goes much quicker.
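The manual steps above can be sketched in a few lines of Python. This is only an illustration: the helper names make_entry and insert_sorted are made up for this example, the three-line lexicon is a stand-in for the real VoxForge Lexicon, and the ordering is plain Python string comparison, which may not match what HDMan expects for words containing non-alphanumeric characters.

```python
import bisect

def make_entry(word, phonemes):
    """Build a line in the 'WORD [WORD] phonemes' layout used in the examples above."""
    w = word.upper()
    return f"{w} [{w}] {' '.join(phonemes)}"

def insert_sorted(lexicon_lines, entry):
    """Return a new list with entry placed by plain string comparison of the headword.

    Assumption: lexicon_lines is already sorted the same way.
    """
    keys = [line.split()[0] for line in lexicon_lines]
    pos = bisect.bisect_left(keys, entry.split()[0])
    return lexicon_lines[:pos] + [entry] + lexicon_lines[pos:]

# "win" is borrowed from WINWOOD, "ward" from WOODWARD (not from WARD)
win = "w ih n".split()
ward = "w er d".split()
entry = make_entry("winward", win + ward)

# Tiny stand-in for the real lexicon file
lexicon = [
    "WARD [WARD] w ow r d",
    "WINWOOD [WINWOOD] w ih n w uh d",
    "WOODWARD [WOODWARD] w uh d w er d",
]
updated = insert_sorted(lexicon, entry)
print("\n".join(updated))
```

The new entry lands between WARD and WINWOOD. The judgment call about which similar word to borrow from still has to be made by you; the script only handles the formatting and placement.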
Start Festival
$ festival
From the Festival command line, there are a series of "lex" commands that can help you determine the phonemes contained in a word that is not included in the VoxForge dictionary, and as an added bonus, you can actually listen to how Festival pronounces the word to get a better feel for the phonemes.
First, find out which lexicons (i.e. pronunciation dictionaries and rules) are included in your distribution of Festival using the "lex.list" command as follows:
festival> (lex.list)
("english_poslex" "cmu")
Since VoxForge is based on the cmu dictionary, we can use Festival to determine the phonemes of an unknown word, using Festival's dictionary and pronunciation rules (see here for Festival's phone list).
Festival (rel 1.95) usually uses the "cmu" lexicon by default. To make sure that you are using this dictionary, use the following command:
festival> (lex.select "cmu")
Next, to determine the pronunciation of a word use the "lex.lookup" command as follows:
festival> (lex.lookup "internet")
("internet" nil (((ih n t) 1) ((er n) 0) ((eh t) 1)))
Festival will list the phonemes included in the word, grouped into syllables, and also includes a number after each syllable (these indicate the "lexical stress" of that syllable). Ignore the parentheses and numbers, and you have Festival's view of the phonemes that make up the word you entered. Therefore, for the word "Internet", Festival says its phonemes are: "ih n t er n eh t".
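If you have many words to look up, stripping the parentheses and stress numbers by hand gets tedious. The following is a minimal Python sketch of that step, assuming the lex.lookup output has the same shape as the "internet" example above; the function name festival_to_phonemes is made up for this illustration and is not part of Festival or VoxForge.

```python
import re

def festival_to_phonemes(lookup_output):
    """Pull the phonemes out of a lex.lookup result, dropping stress numbers.

    Assumes each syllable is printed as ((p1 p2 ...) N), as in the example above.
    """
    syllables = re.findall(r"\(\(([^()]*)\)\s*\d+\)", lookup_output)
    phonemes = []
    for syllable in syllables:
        phonemes.extend(syllable.split())
    return " ".join(phonemes)

result = '("internet" nil (((ih n t) 1) ((er n) 0) ((eh t) 1)))'
phones = festival_to_phonemes(result)
print(phones)

# Format it in the same layout as the lexicon entries shown earlier
word = "internet".upper()
print(f"{word} [{word}] {phones}")
```

You should still compare the result against similar words already in the VoxForge Lexicon before adding it.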
Create a new file called MissingWords, and copy into it the missing words listed in the dlog log file from the HDMan run.
Next, run the MissingWordsCleanup.pl script as follows:
$ perl ./MissingWordsCleanup.pl
This will create a good first draft of the pronunciations for the missing words - in a file called MissingWords_out. You still need to confirm these pronunciations to make sure they are OK. You can do this by looking at similar groups of letters in the missing words, and looking up the pronunciations for these groups in other known words - if they match, then use what Festival recommends. If they don't match, you need to make a judgement call based on your knowledge of English.
In the current example, once all the missing words have been added, your VoxForge Lexicon should look like this: VoxForgeDict.
Once you finish adding all your words, re-run the HDMan command:
$ HDMan -A -D -T 1 -m -w wlist -i -l dlog dict VoxForgeDict
And review the HDMan log output (i.e. dlog) again to make sure that you did not miss any other words.
Note: One common error is to put the new entries in the Lexicon
file in the wrong sort order. You might have to experiment with
word placement (especially with words containing non-alphanumeric
characters) to get it so that HDMan will run correctly.
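Before re-running HDMan, a quick check can point you at the first entry that looks out of place. The Python sketch below is an illustration only: the function name first_out_of_order is made up here, and it compares headwords with plain Python string comparison, which is an assumption and may differ from the order HDMan actually expects (especially for words containing non-alphanumeric characters). Treat its result as a hint, not as proof that HDMan will accept the file.

```python
def first_out_of_order(lexicon_lines):
    """Return (line_number, previous_word, word) for the first misplaced entry, or None."""
    previous = None
    for number, line in enumerate(lexicon_lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        word = line.split()[0]
        if previous is not None and word < previous:
            return (number, previous, word)
        previous = word
    return None

# Stand-in lexicon where WINWARD was added after WINWOOD by mistake
sample = [
    "WARD [WARD] w ow r d",
    "WINWOOD [WINWOOD] w ih n w uh d",
    "WINWARD [WINWARD] w ih n w er d",
    "WOODWARD [WOODWARD] w uh d w er d",
]
problem = first_out_of_order(sample)
print(problem)
```

To check a real file, read its lines into a list first and pass that list to the function.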
The HDMan command will create a dictionary file called: dict. Your dict file is essentially all the words in your wlist file with added pronunciation information.