Acoustic Model Discussions

creating new model with the use of Germany VoxForge model
User: johnyjj2
Date: 11/11/2009 10:08 am
Views: 26934
Rating: 10

Hello :-)!

I desperately need to create an acoustic model for my language (which is not supported here on VoxForge). My vocabulary is very small: it contains only about fifteen words.

I am following this tutorial http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html and unfortunately I cannot use the Sphinx Wiki because it has technical difficulties all the time.

I find this tutorial somewhat insufficient, so I downloaded the files of the VoxForge German model. Based on this model and the tutorial I created a similar structure of files and directories for my language.

I know I ask for much but I would be really grateful if somebody experienced can guide me through the process of training acoustic model. I've got little training data and I'm going to have much more, but first of all I need to know how to follow this whole process with the simplest set of data - just to make sure that I can do it.

Results of my work are here: http://www.speedyshare.com/743133979.html (if the link expires, write to me by e-mail).

 

  1. AcousticModels\etc\feat.params -> to be created by SphinxTrain
  2. AcousticModels\etc\prompts -> am I right that I need to create several short "sentences", e.g. "seven five two next" or "two one next five" if my vocabulary contains numbers and some additional words like "next"?
  3. AcousticModels\etc\sphinx_decode.cfg & sphinx_train.cfg -> I simply copied those from the German model
  4. AcousticModels\etc\voxforge_pl_sphinx.dic -> my dictionary, simply "TRZY TRZY" because the dictionary is very small; with spaces and tabs in the same way as in the German one
  5. AcousticModels\etc\voxforge_pl_sphinx.filler -> copied from the German model
  6. AcousticModels\etc\voxforge_pl_sphinx.lm -> to be created by SphinxTrain
  7. AcousticModels\etc\voxforge_de_sphinx.phone -> words + SIL; I wasn't sure whether to create words with lower-case or upper-case letters, I decided to use lower-case for words and upper-case for SIL
  8. AcousticModels\etc\voxforge_de_sphinx.vocal -> list of words
  9. AcousticModels\etc\voxforge_pl_sphinx_full.fileids -> rafal-20091111-a & rafal-20091111-b
  10. AcousticModels\etc\voxforge_pl_sphinx_full.transcription -> again by analogy with the German model
  11. AcousticModels\etc\voxforge_pl_sphinx_test.fileids -> new set of data (now for anonymous, not rafal)
  12. AcousticModels\etc\voxforge_pl_sphinx_test.transcription -> as above
  13. AcousticModels\etc\voxforge_pl_sphinx_train.fileids -> the same as voxforge_pl_sphinx_full.fileids
  14. AcousticModels\etc\voxforge_de_sphinx_train.transcription -> the same as voxforge_de_sphinx_full.transcription
  15. AcousticModels\model_parameters\voxforge_pl_sphinx.cd_cont_3000 -> I don't know what this name of subfolder means, but I created the same name. I think 'cont' stands for continuous. I don't know which I should choose for my solution - continuous or semi-cont., but I decided for cont.
  16. AcousticModels\model_parameters\voxforge_pl_sphinx.cd_cont_3000\feat.params -> to be created by SphinxTrain
  17. AcousticModels\model_parameters\voxforge_pl_sphinx.cd_cont_3000\mdef -> I think it is also to be created by SphinxTrain, am I right?
  18. AcousticModels\model_parameters\voxforge_pl_sphinx.cd_cont_3000\means -> as above
  19. AcousticModels\model_parameters\voxforge_pl_sphinx.cd_cont_3000\mixture_weights -> as above
  20. AcousticModels\model_parameters\voxforge_pl_sphinx.cd_cont_3000\noisedict -> the only file which I copied to this subdirectory from the German model; without changes
  21. AcousticModels\model_parameters\voxforge_pl_sphinx.cd_cont_3000\transition_matrices -> to be created by SphinxTrain
  22. AcousticModels\model_parameters\voxforge_pl_sphinx.cd_cont_3000\variances -> to be created by SphinxTrain
  23. AcousticModels\results\voxforge_de_sphinx-1-1.match -> the same as etc\voxforge_de_sphinx_test.transcription but without <s> and </s>
  24. AcousticModels\results\voxforge_de_sphinx.align -> to be created by SphinxTrain
  25. AcousticModels\results\voxforge_de_sphinx.match -> the same as voxforge_de_sphinx-1-1.match in this directory; I guess I can delete voxforge_de_sphinx-1-1.match and leave only voxforge_de_sphinx.match
  26. AcousticModels\test\sphinx3-test -> copied from the German model; the only thing I changed is "de" to "pl"
  27. AcousticModels\test\test.ctl -> simply copied from the German model
  28. AcousticModels\test\test.fsg -> to be created by SphinxTrain
  29. AcousticModels\test\test.gram -> copied from the German model; I replaced the German digits with my digits; what about the other words which are in my dictionary? I guess I can omit them here; why is the order of digits so weird?
  30. AcousticModels\test\test.wav -> original wav file contains spoken "drei neun zwei funf funf neun" (392559); I'll create the same file but with my digits in "Accessories->Entertainment->Sound Recorder"; I guess it is not the proper way of creating an audio file because it may require a particular sampling frequency or something like that; so how do I record it properly?
  31. AcousticModels\espeak2phones.pl -> I think I don't need this script so I didn't copy it
  32. AcousticModels\traintest -> I copied it and changed all "de" to "pl"
  33. Audio\Main\8kHz_16bit\rafal-20091111-a\etc\audiofile_details -> I made it similar to the German one; I see that the parameters of the recorded wav are written here; for my files those will be different, but I'm somewhat lost in the sophisticated directory structure of the Audio subdirectory
  34. Audio\Main\8kHz_16bit\rafal-20091111-a\etc\PROMPTS -> by analogy with the German file and according to my data in voxforge_de_sphinx_full.transcription
  35. Audio\Main\8kHz_16bit\rafal-20091111-a\etc\prompts-original -> my data are not sentences, but I wrote them here like "Two one.", i.e. as if they were sentences
  36. Audio\Main\8kHz_16bit\rafal-20091111-a\wav\pl01-001.wav -> I'll create it with default wav recorder in Windows
  37. Audio\Main\8kHz_16bit\rafal-20091111-a\wav\pl01-002.wav
  38. Audio\Main\16kH_16bit\rafal-20091111-b -> for the moment I ignored it in order to avoid unnecessary work - first of all I'd like you to answer my post, please, and say what I did right and what wrong, what I should do etc.; I only created two empty directories, etc and wav, here
  39. Audio\MFCC\8kHz_16bit\MFCC_0_D\rafal-20091111-a\etc -> I copied three files from Audio\Main\8kHz_16bit\rafal-20091111-a\etc; I also copied README.txt and didn't make any changes to this file; it is only an info file so it doesn't matter what is inside; I'll fill it with proper information when I know what to write here
  40. Audio\MFCC\8kHz_16bit\MFCC_0_D\rafal-20091111-a\mfc -> files pl01-001.mfc and pl01-002.mfc are to be created by SphinxTrain
  41. Audio\Original\44.1kHz_16bit\openpento-20091111-1_3\etc -> mysterious directory with unknown purpose
  42. Audio\Original\48kHz_16bit\rafal-20091111-a\etc -> files copied from here AcousticModels\etc
  43. Lexicon\dewik.output -> I think this file is unnecessary
  44. Lexicon\dewik.rawpron -> created by analogy with the German file
  45. Lexicon\dewik.rawpronSortedForms -> another mysterious file, probably not needed in my case (very small dictionary); or maybe created by SphinxTrain
  46. Lexicon\dewik.rawpronSortedForms.current -> as above
  47. Lexicon -> I ignored those five files (vox... with extensions pr0n, xml)
  48. Scripts -> at this moment I decided not to copy those

 

Can I simply record those wav files with the default Windows recorder? (I guess not.) Is it OK to train my model as cont., not semi-cont.?
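Whatever recorder is used, the properties of the resulting file can be checked before training. The following is a minimal sketch using Python's standard `wave` module. The target values (16 kHz, 16-bit, mono) are an assumption made for illustration and should be matched to the settings of the actual training setup; `check_wav` and `make_silence` are hypothetical helper names.

```python
import io
import wave

# Assumed target format; match these to the settings of your own training setup.
EXPECTED_RATE = 16000    # samples per second (assumption)
EXPECTED_WIDTH = 2       # bytes per sample, i.e. 16-bit
EXPECTED_CHANNELS = 1    # mono


def check_wav(source):
    """Return a list of problems found in a WAV file (empty list means OK).

    `source` may be a file path or a binary file-like object.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_WIDTH:
            problems.append(f"sample width is {8 * wav.getsampwidth()} bits")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels instead of mono")
    return problems


def make_silence(rate, width, channels, seconds=0.1):
    """Build an in-memory WAV of silence, used here only to exercise check_wav."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * int(rate * seconds) * width * channels)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(check_wav(make_silence(16000, 2, 1)))   # matches the assumed format
    print(check_wav(make_silence(44100, 2, 2)))   # wrong rate and channel count
```

In practice `check_wav` would be called with the path of each recorded file; a file reporting problems can be resampled or re-recorded before feature extraction.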


I think that by creating the above structure of files and directories I have finished "data preparation" (http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#02); however, I'm not sure about the feature files and the control file. Now I guess I need to follow (http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#20), i.e. mk_model_def. I enter the SphinxTrain directory (built in Visual Studio C++) and search for files named "mk_model_def" but nothing is found. I look for similar files manually and I see SphinxTrain\bin\Debug\mk_mdef_gen.exe. When I run it in Command Prompt I see flags similar to those from the tutorial, but not the same. I guess the real work begins at this moment.

Greetings :-)!

--- (Edited on 11/11/2009 10:08 am [GMT-0600] by johnyjj2) ---

Re: creating new model with the use of Germany VoxForge model
User: nsh
Date: 11/11/2009 4:12 pm
Views: 324
Rating: 9

> I know I ask for much but I would be really grateful if somebody experienced can guide me through the process of training acoustic model.

Gratitude is not enough. I see you've made many mistakes already, partly because you are reading the wrong documentation and partly because of a lack of understanding. For example, you need to try the an4 tutorial first, and you need to try Linux because training on Windows is basically unsupported. And many more things.

Instead of taking small steps you want to get answers to 50 questions at once. This will not work, simply because we don't consult for free. We need your submissions on VoxForge, we need Polish recordings. We need contributions. Otherwise it's not so interesting to write so many answers.

 

--- (Edited on 11/12/2009 01:12 [GMT+0300] by nsh) ---

Re: creating new model with the use of Germany VoxForge model
User: johnyjj2
Date: 11/11/2009 5:33 pm
Views: 120
Rating: 6

Hello!

This is the answer which I hoped to receive :-). Of course I know nothing is free in this world, so I understand that your help also requires something from me, in this case creating a free, open-source acoustic model for the Polish language.

You say that I read improper documentation. Where is the proper one? I've got Ubuntu 9.10 and SphinxTrain installed there.

I didn't expect to get answers to fifty questions :-). I just wanted to show what I already have tried to do so that I can hear some general guidelines like those which you gave me in your answer :-).

My application requires a very small dictionary, about fifteen words. However, I think this topic (speech recognition) is interesting and somewhat state-of-the-art, so I'd like to understand it better, even after finishing my project. I'd like you to say what exactly you expect from me. Creating a model for the whole language is impossible for one person. So I guess my work would be just creating a backbone for the Polish language. I can, for example, create a file similar to this one http://www.speech.cs.cmu.edu/cgi-bin/cmudict and also a list of some words (a dictionary), which contains the words and the way they can be written with the use of those phonemes. For my application this is not needed, because I can simply treat all of my words as phonemes (even if those are words, not phonemes) due to fact that I require so little dictionary. Later, of course, I can begin creating that kind of dictionary, but again I think it is something which cannot be done by just one person. This is why I talk about creating a backbone, so that other users who need the Polish language can help to improve it. I can record myself reading sentences in Polish and ask some of my friends or family to do the same. Write what exactly you expect me to do.

Greetings :-)!

--- (Edited on 11/11/2009 5:33 pm [GMT-0600] by johnyjj2) ---

Re: creating new model with the use of Germany VoxForge model
User: nsh
Date: 11/11/2009 6:10 pm
Views: 182
Rating: 9

> Where is the proper one?

http://www.speech.cs.cmu.edu/sphinx/tutorial.html

> I just wanted to show what I already have tried to do so that I can hear some general guidelines like those which you gave me in your answer :-).

Asking a precise and detailed question increases your chances of getting an answer. A single question. Read this for more info:

http://catb.org/~esr/faqs/smart-questions.html

> For my application this is not needed, because I can simply treat all of my words as phonemes (even if those are words, not phonemes) due to fact that I require so little dictionary

No, you can't, because of variable word length. Because words differ in length, it's better to have a different number of states per word, while Sphinx only allows a fixed number of states per phone. A proper example of a dictionary for a small vocabulary system is located in sphinxtrain/templates/tidigits. The other files there also show an example of a good setup for a small vocabulary system.
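To illustrate the point about word length: when each word is spelled out as a sequence of phones, longer words automatically receive more phones and therefore more states. Below is a minimal sketch of generating dictionary lines and a phone list from such a lexicon. The three words and all phone symbols are placeholders invented for this example, not a vetted Polish phone set and not the contents of the tidigits template.

```python
# Hypothetical pronunciations; the phone symbols are placeholders for illustration only.
LEXICON = {
    "DWA": ["D", "V", "A"],
    "TRZY": ["T", "SH", "Y"],
    "PIEC": ["P", "J", "E", "N", "CH"],
}


def dictionary_lines(lexicon):
    """One line per word: the word followed by its space-separated phones."""
    return [f"{word} {' '.join(phones)}" for word, phones in sorted(lexicon.items())]


def phone_list(lexicon):
    """Every distinct phone used in the lexicon, plus SIL for silence."""
    phones = {phone for pronunciation in lexicon.values() for phone in pronunciation}
    phones.add("SIL")
    return sorted(phones)


if __name__ == "__main__":
    print("\n".join(dictionary_lines(LEXICON)))
    print(phone_list(LEXICON))
```

Deriving the phone list from the dictionary, rather than writing both by hand, keeps the two files consistent: every phone used in a pronunciation is guaranteed to be listed.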

 

--- (Edited on 11/12/2009 03:10 [GMT+0300] by nsh) ---

Polish pronunciation dictionary
User: ralfherzog
Date: 11/12/2009 3:58 am
Views: 386
Rating: 10

Hello johnyjj2!

It seems that your native language is Polish. Well, I am trying to create a Polish pronunciation dictionary with 300,000 words. My question is: which eSpeak phonemes correspond to which Polish IPA phonemes? If you are interested, you can help me with the improvement of Ralf's Polish dictionary. As soon as the major issues are fixed, we can try to convert this dictionary into Sphinx format.

Regards,

Ralf

--- (Edited on 2009-11-12 3:58 am [GMT-0600] by ralfherzog) ---

Re: Polish pronunciation dictionary
User: johnyjj2
Date: 11/13/2009 5:26 pm
Views: 185
Rating: 9

RalfHerzog: I created the topic here: https://sourceforge.net/projects/espeak/forums/forum/538922/topic/3458345

Nsh: OK, I followed the whole installation process as explained in the tutorial. The last four steps were as follows:
[in "How to perform a preliminary training run"]
1. tutorial/an4$ perl scripts_pl/make_feats.pl -ctl etc/an4_train.fileids
2. tutorial/an4$ RunAll.pl
[in "How to perform a preliminary decode"]
3. tutorial/an4$ perl scripts_pl/make_feats.pl -ctl etc/an4_test.fileids
4. tutorial/an4$ perl scripts_pl/decode/slave.pl
I don't know why make_feats.pl must be executed twice. As I read in [3] it computes MFCCs. What does it use to compute those MFCCs? I guess wav files.

So I follow "perl scripts_pl/copy_setup.pl -task pl1" and it creates the pl1 directory (several megabytes smaller than an4). I see that before "You must go through the following steps in sequence" it says that the "tutorial exercise begins with training the system using the MFCC feature files that you have already computed during your preliminary run". But I computed those for an4, so I guess I need to create my wav files before recomputing the MFCCs. I think MFCC computation requires wav files, so I enter tutorial/pl1/wav. I guess I can simply erase those two directories and create my wav files in a similar way. But when I enter tutorial/pl1/wav/an4_clstk/fash (fash is one of many directories) I see some an[...].sph and cen[...].sph files. Those are not wav files and I cannot even see their content in Gedit, Document Viewer or OpenOffice. So how to create those wav files in order to recompute MFCCs and follow those three steps of "How to train, and key training issues".

Greetings :-)!

 

--- (Edited on 11/13/2009 5:26 pm [GMT-0600] by johnyjj2) ---

Re: Polish pronunciation dictionary
User: nsh
Date: 11/14/2009 7:35 am
Views: 117
Rating: 7

> I don't know why make_feats.pl must be executed twice.


The first run extracts features from the training files, the second from the testing files. Each acoustic database has those two sets. Training files are used to estimate the acoustic model parameters; testing files are used to estimate the accuracy of the trained model.
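As an illustration of keeping the two sets separate, here is a minimal sketch that divides a list of file IDs into a training list and a testing list. The function name, the one-in-five ratio and the example IDs are assumptions made for this example; holding out whole speakers instead of individual utterances is another reasonable choice.

```python
def split_fileids(fileids, test_every=5):
    """Send every `test_every`-th file ID to the test set and the rest to training."""
    train, test = [], []
    for index, fileid in enumerate(sorted(fileids)):
        if index % test_every == test_every - 1:
            test.append(fileid)
        else:
            train.append(fileid)
    return train, test


if __name__ == "__main__":
    # Hypothetical file IDs, following the directory naming used earlier in the thread.
    ids = [f"rafal-20091111-a/pl01-{number:03d}" for number in range(1, 11)]
    train_ids, test_ids = split_fileids(ids)
    print(len(train_ids), "training files,", len(test_ids), "testing files")
```

The important property is that no utterance appears in both lists; otherwise the accuracy measured on the test set would be overly optimistic.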

> Those are not wav files and I cannot even see their content in Gedit, Document Viewer or OpenOffice.


Sph (NIST SPHERE) is also an audio file format. It's used by most commercial databases and by an4 as well. You can convert the files to wav with sox.
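A minimal sketch of preparing such a conversion in bulk is shown below. It only builds the command lines, of the form `sox input.sph output.wav`, and assumes that the installed sox build can read .sph files. The helper name and the example file name are hypothetical.

```python
from pathlib import PurePosixPath


def sox_command(sph_path, wav_dir):
    """Build the argument list for converting one .sph file into a .wav file."""
    source = PurePosixPath(sph_path)
    target = PurePosixPath(wav_dir) / (source.stem + ".wav")
    return ["sox", str(source), str(target)]


if __name__ == "__main__":
    # Hypothetical input file; substitute the real .sph paths from your wav directory.
    command = sox_command("wav/an4_clstk/fash/example.sph", "converted")
    print(" ".join(command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` once the output directory exists.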

> So how to create those wav files in order to recompute MFCCs and follow those three steps of "How to train, and key training issues".

You can record audio files with audacity.

http://www.voxforge.org/home/submitspeech/windows/step-2

In etc/sphinx_train.cfg you need to change the format of the input then:

$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
$CFG_WAVFILE_EXTENSION = 'wav';
$CFG_WAVFILE_TYPE = 'mswav'; # one of nist, mswav, raw
$CFG_FEATFILES_DIR = "$CFG_BASE_DIR/feat";
$CFG_FEATFILE_EXTENSION = 'mfc';
$CFG_VECTOR_LENGTH = 13;

 

 

--- (Edited on 11/14/2009 16:35 [GMT+0300] by nsh) ---

Re: Polish pronunciation dictionary
User: johnyjj2
Date: 11/14/2009 3:16 pm
Views: 137
Rating: 9

Thank you for the answer.

Is it good idea to create .phone file simply by mapping to http://en.wikipedia.org/wiki/Wikipedia:IPA_for_Polish ? In other words just rewriting those symbols which are in Orthography column with avoiding usage of characters which are not present in English alphabet (writing them in a different manner). Later I would simply create a list of about fifteen words in .dic (each word built from my phone symbols, following the IPA transcriptions given in the Polish Wiktionary for the most popular words) and then ten sets of five to seven words in random order, record them and write their transcriptions.
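The "ten sets of five to seven words in random order" can be generated rather than written by hand. The sketch below does that and formats each prompt as a transcription line. The vocabulary is illustrative, the `pl01` file prefix follows the wav names used earlier in the thread, and the `<s> ... </s> (fileid)` layout is an assumption based on the transcription files shipped with the an4 tutorial data.

```python
import random

# Illustrative vocabulary; replace with the real word list.
VOCABULARY = ["JEDEN", "DWA", "TRZY", "CZTERY", "PIEC", "DALEJ"]


def make_prompts(vocabulary, count=10, min_words=5, max_words=7, seed=0):
    """Generate `count` prompts, each a random sequence of vocabulary words."""
    rng = random.Random(seed)  # fixed seed so the prompts can be regenerated
    prompts = []
    for _ in range(count):
        length = rng.randint(min_words, max_words)
        prompts.append([rng.choice(vocabulary) for _ in range(length)])
    return prompts


def transcription_lines(prompts, prefix="pl01"):
    """Format prompts as '<s> WORDS </s> (fileid)' lines (assumed layout)."""
    return [
        f"<s> {' '.join(words)} </s> ({prefix}-{number:03d})"
        for number, words in enumerate(prompts, start=1)
    ]


if __name__ == "__main__":
    for line in transcription_lines(make_prompts(VOCABULARY)):
        print(line)
```

Generating the prompts and the transcription from the same list avoids mismatches between what was recorded and what the transcription file claims.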

I would also like to edit the code and then run an example application on a mobile phone with the use of PocketSphinx (https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/3445960?message=7750375). Some difficulties which I've had are explained in this topic on SourceForge.

Greetings.

PS There is no reply from Ralf Herzog https://sourceforge.net/projects/espeak/forums/forum/538922/topic/3458345

--- (Edited on 11/14/2009 3:16 pm [GMT-0600] by johnyjj2) ---

How do you address Polish orthography (kreska, kropka, ogonek)?
User: ralfherzog
Date: 11/15/2009 6:20 am
Views: 298
Rating: 8

Hi johnyjj2!

"avoiding usage of characters which are not present in English alphabet"

Well, I want to ask you a question: how do you address Polish orthography (kreska, kropka, ogonek)? Which encoding are you using for your dictionary file? I would suggest that you only use words that are within the range of ASCII (like cmudict).

I got the impression that almost nobody cares about encoding issues.

Let me give you an example: I imported a French dictionary (Sphinx format) into simon. And of course, there were garbage characters. The same problem is likely to occur with the Polish language. This problem occurred with the VoxForge German acoustic model (Sphinx format). So you will probably have to deal with this issue.

Because the encoding issues are difficult to solve, I decided to focus on dictionary development. It is not possible to develop a good acoustic model without a good dictionary. So why focus on acoustic model development without having a good dictionary?

The goal "speech recognition for your native language" is very difficult to solve. Why not define as subgoal "develop a good pronunciation dictionary for the Polish language"?

So my post is about goal-setting. What are your exact goals?

--- (Edited on 2009-11-15 6:20 am [GMT-0600] by ralfherzog) ---

Re: How do you address Polish orthography (kreska, kropka, ogonek)?
User: johnyjj2
Date: 11/15/2009 5:16 pm
Views: 414
Rating: 9

Thanks for the answer!

About the encoding of Polish letters on my computer: when I run Notepad in Windows and save a text file, I see four encoding options. The default one, which I use, is ANSI, but it also works with Unicode, Unicode big endian and UTF-8. Which one should I choose for my Sphinx files?

> So my post is about goal-setting. What are your exact goals?

I like talking about goals :-). I've got a project which I need to finish in December (more details here: http://forum.skype.com/index.php?showtopic=464711). First of all I need to create a small acoustic and language model so that I can build an application with Sphinx4/PocketSphinx which can recognize about fifteen words. I know encoding can be an important issue. I have run into difficulties with file encoding several times in my short career in IT, and I can say that such a simple thing as choosing the proper encoding from the list at the bottom of the "Save As" window in Notepad can save a lot of time which would otherwise be wasted :-). But if my application requires only fifteen words, I can simply write "piec" instead of "pięć" and not bother with encoding.
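Writing "piec" instead of "pięć" can be done mechanically. The following is a minimal sketch using Python's standard `unicodedata` module; the function name is hypothetical. Note that "ł" has no Unicode decomposition, so it is handled by an explicit mapping.

```python
import unicodedata

# The stroke in "ł" is not a combining mark, so it needs an explicit mapping.
EXTRA = str.maketrans({"ł": "l", "Ł": "L"})


def fold_to_ascii(text):
    """Replace Polish letters with their closest unaccented ASCII letters."""
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA))
    # Drop the combining marks (kreska, kropka, ogonek) left by the decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for word in ["pięć", "sześć", "żółw"]:
        print(word, "->", fold_to_ascii(word))
```

For a fifteen-word vocabulary this is harmless, but in a larger dictionary folding can make distinct words collide, so the folded forms should be checked for duplicates.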

I see encoding is important for you in order to create a complete model for Polish. It is not for me. However, in the course of developing my application, I found this topic (all this speech recognition and so on) more interesting than I thought :-). I also stated that I'm gonna create a backbone of the Polish language on VoxForge in exchange for help from Nsh. In other words my first goal is to create a very simple acoustic and language model for Polish with about fifteen words, then to finish my application and later (or in the meantime :-)) help you with bringing Polish into Simon and creating a backbone of Polish for VoxForge :-).

So that's about goals :-P. And to answer the question which I asked :-D (i.e. "Is it good idea to create .phone file simply by mapping to http://en.wikipedia.org/wiki/Wikipedia:IPA_for_Polish ?"): I guess I found a better way when skimming the links which you gave me in your post. I think I should combine Wikipedia:IPA_for_Polish with this http://en.wikipedia.org/wiki/Arpabet . By the way, let me remind you that Nsh suggested that I cannot simply treat all of my words as phonemes because of variable word length.

Greetings :-)!

PS Where can I find comprehensive info on how to edit/build/run PocketSphinx applications? (See: https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/3445960?message=7750375).

--- (Edited on 11/15/2009 5:16 pm [GMT-0600] by johnyjj2) ---
