VoxForge
Hello :-)!
I'm in desperate need of creating an acoustic model for my language (which is not supported here on VoxForge). I've got a very small vocabulary, only about fifteen words.
I'm following this tutorial http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html and unfortunately I cannot use the Sphinx wiki because it has technical difficulties all the time.
I find this tutorial somewhat insufficient, so I downloaded the files from the VoxForge German model. Based on this model and the tutorial I created a similar structure of files and directories for my language.
I know I'm asking for a lot, but I would be really grateful if somebody experienced could guide me through the process of training an acoustic model. I've got little training data and I'm going to have much more, but first of all I need to know how to follow the whole process with the simplest set of data, just to make sure I can do it.
Results of my work are here: http://www.speedyshare.com/743133979.html (if the link expires, write to me at [email protected]).
Can I simply record those wav files with the default Windows sound recorder? (I guess not.) Is it OK to train my model as continuous rather than semi-continuous?
I think that by creating the above structure of files and directories I've finished the "data preparation" stage (http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#02), though I'm not sure about the feature files and the control file. Now I guess I need to follow (http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#20), i.e. mk_model_def. I entered the SphinxTrain directory (built in Visual Studio C++) and searched for files named "mk_model_def", but found none. Looking for similar files manually, I found SphinxTrain\bin\Debug\mk_mdef_gen.exe. When I run it in the Command Prompt I see flags similar to those from the tutorial, though not quite the same. I guess the real work begins at this moment.
Greetings :-)!
--- (Edited on 11/11/2009 10:08 am [GMT-0600] by johnyjj2) ---
> I know I'm asking for a lot, but I would be really grateful if somebody experienced could guide me through the process of training an acoustic model.
Gratitude is not enough. I see you've already made many mistakes, partly because you are reading the wrong documentation, partly because of a lack of understanding. For example, you need to try the an4 tutorial first, and you need to try Linux, because Windows training is basically unsupported. And many more things.
Instead of trying small steps you want answers to 50 questions at once. That will not work, simply because we don't consult for free. We need your submissions on VoxForge, we need Polish recordings, we need contributions. Otherwise it's not very interesting to write so many answers.
--- (Edited on 11/12/2009 01:12 [GMT+0300] by nsh) ---
Hello!
This is the answer I hoped to receive :-). Of course I know nothing in this world is free, so I understand that your help also requires something from me, in this case creating a free, open-source acoustic model for the Polish language.
You say that I'm reading the wrong documentation. Where is the proper one? I've got Ubuntu 9.10 with SphinxTrain installed.
I didn't expect answers to fifty questions :-). I just wanted to show what I've already tried so that I could get some general guidelines like the ones you gave me in your answer :-).
My application requires a very small dictionary, about fifteen words. However, I think this topic (speech recognition) is interesting and somewhat state-of-the-art, so I'd like to understand it better even after finishing my project. I'd like you to say what exactly you expect from me. Creating a model for the whole language is impossible for one person, so I guess my work would be just creating a backbone for the Polish language. I can, for example, create a file similar to this one http://www.speech.cs.cmu.edu/cgi-bin/cmudict and also a list of some words (a dictionary) that contains the words and the way they are written using those phonemes. For my own application this is not needed, because I can simply treat each of my words as a phoneme (even though they are words, not phonemes), since I require such a small dictionary. Later, of course, I can begin creating that kind of dictionary, but again, I think it is something that cannot be done by just one person. This is why I'm talking about creating a backbone, so that other users who need Polish can help to improve it. I can record myself reading sentences in Polish and ask some of my friends or family to do the same. Write what exactly you expect me to do.
Greetings :-)!
--- (Edited on 11/11/2009 5:33 pm [GMT-0600] by johnyjj2) ---
> Where is the proper one?
http://www.speech.cs.cmu.edu/sphinx/tutorial.html
> I just wanted to show what I've already tried so that I could get some general guidelines like the ones you gave me in your answer :-).
Asking a precise and detailed question increases your chances of getting an answer. A single question. Read this for more info:
http://catb.org/~esr/faqs/smart-questions.html
> For my own application this is not needed, because I can simply treat each of my words as a phoneme (even though they are words, not phonemes), since I require such a small dictionary
No, you can't, because of variable word length. Due to the variable length it's better to have a different number of states per word, while Sphinx only allows a fixed number of states per phone. A proper example of a dictionary for a small-vocabulary system is located in sphinxtrain/templates/tidigits. The other files there also show an example of a good setup for a small-vocabulary system.
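To illustrate the idea with CMUdict-style phones (these lines are just an illustration, not copied from the tidigits template): in the dictionary each word is spelled out as a sequence of phones, so a longer word automatically gets more states:
ONE    W AH N
TWO    T UW
THREE  TH R IY
With fifteen words the phone set can stay small, but you still break the words down into phones rather than treating each word as one unit.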
--- (Edited on 11/12/2009 03:10 [GMT+0300] by nsh) ---
Hello johnyjj2!
It seems that your native language is Polish. Well, I am trying to create a Polish pronunciation dictionary with 300,000 words. My question is: which eSpeak phonemes correspond to which Polish IPA phonemes? If you are interested, you can help me with the improvement of Ralf's Polish dictionary. As soon as the major issues are fixed, we can try to convert this dictionary into Sphinx format.
Regards,
Ralf
--- (Edited on 2009-11-12 3:58 am [GMT-0600] by ralfherzog) ---
RalfHerzog: I created the topic here: https://sourceforge.net/projects/espeak/forums/forum/538922/topic/3458345
Nsh: OK, I followed the whole installation process as explained in the tutorial. The last four steps were as follows:
[in "How to perform a preliminary training run"]
1. tutorial/an4$ perl scripts_pl/make_feats.pl -ctl etc/an4_train.fileids
2. tutorial/an4$ RunAll.pl
[in "How to perform a preliminary decode"]
3. tutorial/an4$ perl scripts_pl/make_feats.pl -ctl etc/an4_test.fileids
4. tutorial/an4$ perl scripts_pl/decode/slave.pl
I don't know why make_feats.pl must be executed twice. As I read in [3] it computes MFCCs. What does it use to compute those MFCCs? I guess wav files.
So I ran "perl scripts_pl/copy_setup.pl -task pl1" and it created a pl1 directory (several megabytes smaller than an4). I see that before "You must go through the following steps in sequence" it says that the "tutorial exercise begins with training the system using the MFCC feature files that you have already computed during your preliminary run". But I computed those for an4, so I guess I need to create my own wav files before recomputing the MFCCs. I think MFCC computation requires wav files, so I went into tutorial/pl1/wav. I guess I can simply erase those two directories and create my own wav files in a similar way. But when I enter tutorial/pl1/wav/an4_clstk/fash (fash is one of many directories) I see some an[...].sph and cen[...].sph files. Those are not wav files and I cannot even see their content in Gedit, Document Viewer or OpenOffice. So how do I create those wav files in order to recompute the MFCCs and follow the three steps of "How to train, and key training issues"?
Greetings :-)!
--- (Edited on 11/13/2009 5:26 pm [GMT-0600] by johnyjj2) ---
> I don't know why make_feats.pl must be executed twice.
The first run extracts features from the training files, the second from the testing files. Each acoustic database has those two sets. Training files are used to estimate the acoustic model parameters; testing files are used to estimate the recognition accuracy.
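In other words, roughly:
perl scripts_pl/make_feats.pl -ctl etc/an4_train.fileids   # features for the training set
perl scripts_pl/make_feats.pl -ctl etc/an4_test.fileids    # features for the test set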
> Those are not wav files and I cannot even see their content in Gedit, Document Viewer or OpenOffice.
Sph is also an audio file format. It's used by most commercial databases, and by an4 as well. You can convert the files to wav with sox.
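For example, something along these lines should work (assuming they are plain, uncompressed NIST SPHERE files; shorten-compressed ones would need sph2pipe first):
for f in *.sph; do sox "$f" "${f%.sph}.wav"; done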
> So how do I create those wav files in order to recompute the MFCCs and follow the three steps of "How to train, and key training issues"?
You can record the audio files with Audacity.
http://www.voxforge.org/home/submitspeech/windows/step-2
Then in etc/sphinx_train.cfg you need to change the format of the input:
$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
$CFG_WAVFILE_EXTENSION = 'wav';
$CFG_WAVFILE_TYPE = 'mswav'; # one of nist, mswav, raw
$CFG_FEATFILES_DIR = "$CFG_BASE_DIR/feat";
$CFG_FEATFILE_EXTENSION = 'mfc';
$CFG_VECTOR_LENGTH = 13;
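The names listed in the etc/*_train.fileids file then have to match the paths of the wav files relative to that wav directory (without the extension), and every line of the etc/*_train.transcription file ends with the matching utterance name in parentheses, for example (file names here are made up):
speaker1/utt001
speaker1/utt002
and in the transcription:
<s> JEDEN DWA </s> (utt001)
<s> TRZY </s> (utt002)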
--- (Edited on 11/14/2009 16:35 [GMT+0300] by nsh) ---
Thank you for the answer.
Is it a good idea to create the .phone file simply by mapping from http://en.wikipedia.org/wiki/Wikipedia:IPA_for_Polish ? In other words, just rewriting the symbols in the Orthography column while avoiding characters that are not present in the English alphabet (writing them in a different manner). Later I would simply create a list of about fifteen words in the .dic file (merging phones with my phone symbols, as seen in the Polish Wiktionary for the most popular words with IPA transcriptions), and then ten sets of five to seven words in random order, record them and write their transcriptions.
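For example, something like this, with phone symbols I would just make up as ASCII approximations (so not a vetted Polish phone set), one symbol per line in the .phone file (a, e, i, o, u, y, p, b, t, d, k, g, f, v, s, z, sz, cz, m, n, l, r, j, w, SIL) and entries like these in the .dic file:
JEDEN   j e d e n
DWA     d v a
TRZY    t sz y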
I would also like to edit the code and then run an example application on a mobile phone using PocketSphinx (https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/3445960?message=7750375). Some of the difficulties I've run into are explained in that topic on SourceForge.
Greetings.
PS There is no reply from Ralf Herzog https://sourceforge.net/projects/espeak/forums/forum/538922/topic/3458345
--- (Edited on 11/14/2009 3:16 pm [GMT-0600] by johnyjj2) ---
Hi johnyjj2!
"avoiding usage of characters which are not present in English alphabet"
Well, I want to ask you a question: how do you address Polish orthography (kreska, kropka, ogonek)? Which encoding are you using for your dictionary file? I would suggest that you only use characters within the ASCII range (like cmudict does).
I got the impression that almost nobody cares about encoding issues.
Let me give you an example: I imported a French dictionary (Sphinx format) into simon. And of course, there were garbage characters. The same problem is likely to occur with the Polish language. This problem occurred with the VoxForge German acoustic model (Sphinx format). Of course, you can imagine that you probably will have to deal with this issue.
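If you run into that, converting the whole file to one known encoding usually helps, e.g. with iconv (just a sketch; the actual source encoding of your file may differ):
iconv -f ISO-8859-2 -t UTF-8 polish.dic > polish-utf8.dic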
Because the encoding issues are difficult to solve, I decided to focus on dictionary development. It is not possible to develop a good acoustic model without a good dictionary. So why focus on acoustic model development without having a good dictionary?
The goal "speech recognition for your native language" is very difficult to achieve. Why not define a subgoal: "develop a good pronunciation dictionary for the Polish language"?
So my post is about goal-setting. What are your exact goals?
--- (Edited on 2009-11-15 6:20 am [GMT-0600] by ralfherzog) ---
Thanks for the answer!
About the encoding of Polish letters on my computer: when I run Notepad in Windows and save a text file, I see four encoding options. The default one, which I use, is ANSI, but it also works with Unicode, Unicode big endian and UTF-8. Which one should I choose for my Sphinx files?
> So my post is about goal-setting. What are your exact goals?
I like talking about goals :-). I've got a project which I need to finish in December (more details here: http://forum.skype.com/index.php?showtopic=464711). First of all I need to create a small acoustic and language model so that I can build an application with Sphinx4/PocketSphinx that can recognize about fifteen words. I know encoding can be an important issue. I have encountered difficulties with file encodings several times in my short career in IT, and I can say that such a simple thing as saving a text file with the proper encoding chosen from the list at the bottom of the "save as" window in Notepad can save a lot of time that would otherwise be wasted :-). But if my application requires only fifteen words, I can simply write "piec" instead of "pięć" and not bother with encoding.
I see encoding is important for you in order to create a complete model for Polish. It is not for me. However, while developing my application I have found this topic (all this speech recognition and so on) more interesting than I expected :-). I also said that I'm going to create a backbone of the Polish language on VoxForge in exchange for help from Nsh. In other words, my first goal is to create a very simple acoustic and language model for Polish with about fifteen words, then to finish my application, and later (or in the meantime :-)) help you with getting Polish into simon and creating a backbone of Polish for VoxForge :-).
So that's it about goals :-P. And to answer the question I asked :-D (i.e. "Is it a good idea to create the .phone file simply by mapping from http://en.wikipedia.org/wiki/Wikipedia:IPA_for_Polish ?"), I think I found a better way while skimming the links you gave me in your post. I think I should combine Wikipedia:IPA_for_Polish with http://en.wikipedia.org/wiki/Arpabet . By the way, let me remind you that Nsh pointed out I cannot simply treat all of my words as phonemes because of variable word length.
Greetings :-)!
PS Where can I find comprehensive info on how to edit/build/run PocketSphinx applications? (See: https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/3445960?message=7750375.)
--- (Edited on 11/15/2009 5:16 pm [GMT-0600] by johnyjj2) ---