VoxForge
Hi, I'm interested in helping VoxForge, but my English is poor...
Is there an Italian section with instructions?
Thanks!
I hope VoxForge will be a great project.
--- (Edited on 6/4/2007 7:51 am [GMT-0500] by topocheparla) ---
Hi,
To answer your question first: no, there's no Italian section yet. VoxForge was started in the English language and is still quite young.
However, this seems to be a recurring issue (and for good reasons), so I think it wouldn't hurt to talk about how to eventually add other languages. I think it would be great to have a recipe that explains all the requirements for adding other languages to the project.
People whose interest lies more in, for instance, Italian (you're not the first one to mention Italian!) could then already do some preparatory work. By the time Italian truly gets added, some of the work will already have been done!
Things that should be in this recipe for sure would be:
Where to start depends on your personal skills. Some work might already have been done at, for instance, a university where they do research on phonetics. So it's wise not to start on a word list immediately, but to search for an existing one first!
There is also a lot of info on the VoxForge website (especially in the dev section).
Obviously, officially adding another language is in the end a decision for Ken (the project founder), since it requires a lot of work in the background!
Robin
--- (Edited on 6/5/2007 4:56 am [GMT-0500] by Robin) ---
Hi, I'm Italian too. Reading the tutorial on making my own acoustic model, I don't understand how to create the statistical representation of phonemes.
It's clear how to make the grammar file and follow the other tutorial steps, but not how to create the acoustic model.
I would like to create a simple acoustic model for Italian words. Is that possible?
I'm a programmer studying at the University of Bologna. I'm preparing my degree thesis on speech recognition, and I have to get something working on Italian words.
Thanks
Manuel
Hi, I'm Italian too. I'm a programmer studying at the University of Bologna. I'm interested in making an acoustic model with Italian phonemes, but after reading the tutorial I don't understand how to do it.
The problem is how to create a statistical representation of phonemes.
Thanks
Hi Manuel,
>I don't understand how can I create statistical representation of phonemes.
The HTK toolkit lets you train your HMM-based phoneme models automatically, but you need transcribed speech for this to work.
Steps:
In English, the steps look like this:
(in the VoxForge tutorial, we actually skipped this step because all the required phones are already included in the pronunciation dictionary)
The VoxForge phone set (which actually originated from the CMU phone set) is as follows:
Phoneme  Example  Translation
-------  -------  -----------
AA       odd      AA D
AE       at       AE T
AH       hut      HH AH T
AO       ought    AO T
AW       cow      K AW
AY       hide     HH AY D
B        be       B IY
CH       cheese   CH IY Z
D        dee      D IY
DH       thee     DH IY
EH       Ed       EH D
ER       hurt     HH ER T
EY       ate      EY T
F        fee      F IY
G        green    G R IY N
HH       he       HH IY
IH       it       IH T
IY       eat      IY T
JH       gee      JH IY
K        key      K IY
L        lee      L IY
M        me       M IY
N        knee     N IY
NG       ping     P IH NG
OW       oat      OW T
OY       toy      T OY
P        pee      P IY
R        read     R IY D
S        sea      S IY
SH       she      SH IY
T        tea      T IY
TH       theta    TH EY T AH
UH       hood     HH UH D
UW       two      T UW
V        vee      V IY
W        we       W IY
Y        yield    Y IY L D
Z        zee      Z IY
ZH       seizure  S IY ZH ER
So you need to create a similar phone list for Italian (the IPA web site can help in this regard, or maybe another speech recognition project in Italian).
For each word in your training set (i.e. the sentences you used to prompt your users who submitted speech for your speech corpus) you need its pronunciation using phonemes. Here is a portion of the VoxForge pronunciation dictionary:
AARP [AARP] ey ey aa r p iy
ABA [ABA] ey b iy ey
ABACK [ABACK] ax b ae k
ABACUS [ABACUS] ae b ax k ax s
ABALON [ABALON] ae b ax l aa n
ABALONE [ABALONE] ae b ax l ow n iy
ABANDON [ABANDON] ax b ae n d ih n
ABANDONED [ABANDONED] ax b ae n d ih n d
ABANDONING [ABANDONING] ax b ae n d ih n ih ng
ABBREVIATED [ABBREVIATED] ax b r iy v iy ey t ih d
ABBREVIATION [ABBREVIATION] ax b r iy v iy ey sh ih n
ABBY [ABBY] ae b iy
ABC [ABC] ey b iy s iy
ABC'S [ABC'S] ey b iy s iy z
ABCS [ABCS] iy b iy s iy z
ABDOMINALS [ABDOMINALS] ae b d aa m ih n ax l z
ABDUCTING [ABDUCTING] ae b d ah k t ih ng
ABDUCTION [ABDUCTION] ae b d ah k sh ih n
Note that the words are in upper case, the return word is also in upper case and in brackets, and the phones are in lower case.
You need to do the same in Italian, for each word in your training set.
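As a rough illustration of the format described above, here is a minimal Python sketch that writes entries as `WORD [WORD] phones`, with the word in upper case and the phones in lower case. The function names and the two Italian example entries are hypothetical, and the phone symbols are placeholders rather than an established Italian phone set. The lines are sorted because HTK tools are usually run on a sorted dictionary.

```python
def format_dict_entry(word, phones):
    """Return one dictionary line: WORD [WORD] phone phone ..."""
    head = word.upper()                          # word and bracketed word in upper case
    body = " ".join(p.lower() for p in phones)   # phones in lower case
    return f"{head} [{head}] {body}"


def build_dictionary(pronunciations):
    """Format every entry of a {word: [phones]} mapping and sort the lines."""
    return sorted(format_dict_entry(w, p) for w, p in pronunciations.items())


# Hypothetical entries for illustration only; the phone symbols are
# placeholders, not a vetted Italian phone set.
example = {
    "ciao": ["ch", "a", "o"],
    "casa": ["k", "a", "z", "a"],
}

dictionary_lines = build_dictionary(example)
for line in dictionary_lines:
    print(line)
```

The real pronunciations would come from your own Italian phone set or from an existing Italian lexicon, not from this sketch.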
Training the acoustic model, in this context, means that you use the HTK toolkit to generate statistical representations for each phone, based on the words in your training set. In English, your HMMs would look something like this:
~h "b"
<BEGINHMM>
<NUMSTATES> 5
<STATE> 2
<MEAN> 25
-9.124349e-01 6.825594e+00 4.190366e+00 6.915018e+00 6.278219e+00 6.211351e+00 6.080202e+00 8.280239e-01 7.751886e-01 1.188034e-01 -2.286278e+00 -2.037417e+00 -5.154014e-02 -1.411842e-01 1.359426e-01 7.536004e-02 1.828612e-02 1.083132e-01 8.064213e-02 6.554011e-02 5.534951e-03 -3.300069e-02 -1.040055e-02 1.726186e-01 1.074358e-01
<VARIANCE> 25
6.946013e+00 9.476726e+00 6.426389e+00 8.900808e+00 8.562872e+00 5.247358e+00 8.789542e+00 9.086433e+00 9.272338e+00 1.021655e+01 8.668521e+00 1.017453e+01 9.018427e-01 1.225605e+00 1.132353e+00 1.225746e+00 1.055387e+00 9.162133e-01 9.871734e-01 1.061771e+00 1.182593e+00 1.325286e+00 1.340984e+00 9.980333e-01 5.850468e-01
<GCONST> 7.204273e+01
<STATE> 3
<MEAN> 25
1.670979e+00 2.505412e+00 3.361752e+00 2.959995e+00 2.192761e+00 2.234684e+00 4.598285e-01 6.712853e-02 -7.422704e-01 -1.477473e+00 -1.300686e+00 -8.829353e-01 2.932750e+00 -1.085336e+00 1.465379e-01 -1.024826e+00 -9.668781e-01 -2.956798e+00 -3.674928e+00 -6.180806e-01 -1.165014e+00 -1.551422e+00 1.459589e-01 -1.145165e-02 3.425349e+00
<VARIANCE> 25
2.775954e+01 2.442891e+01 9.882823e+00 2.289949e+01 2.621673e+01 3.309447e+01 4.353169e+01 1.994825e+01 2.369977e+01 2.078222e+01 1.078901e+01 1.184826e+01 1.814732e+00 4.001577e+00 2.052232e+00 3.576971e+00 5.154440e+00 6.247412e+00 4.224275e+00 3.561308e+00 4.634731e+00 1.263823e+00 2.618247e+00 2.138073e+00 1.512457e+00
<GCONST> 9.653378e+01
<STATE> 4
<MEAN> 25
1.058882e+01 1.385496e+00 8.322063e-01 1.207590e+00 1.215214e+00 -7.297173e+00 -8.178091e+00 2.753822e-01 -3.762378e+00 -6.590958e+00 -1.468036e+00 -2.938320e+00 2.796497e-01 -2.095785e-01 -1.001576e-01 1.865974e-02 -5.384719e-02 -6.179357e-01 -4.035245e-01 4.215330e-02 -2.601456e-01 -1.829550e-01 -2.622822e-02 -2.242988e-01 2.178501e-01
<VARIANCE> 25
1.652969e+01 4.435868e+01 1.719629e+01 6.380357e+01 7.536614e+01 6.076683e+01 5.961767e+01 3.608961e+01 4.442945e+01 1.993280e+01 4.157676e+01 2.804121e+01 2.284771e+00 2.194077e+00 1.651372e+00 2.075975e+00 2.312554e+00 5.300534e+00 3.836717e+00 2.152288e+00 2.561902e+00 1.781796e+00 2.014969e+00 1.707738e+00 2.076164e+00
<GCONST> 1.004418e+02
<TRANSP> 5
0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
0.000000e+00 8.082747e-01 1.917253e-01 0.000000e+00 0.000000e+00
0.000000e+00 0.000000e+00 6.367275e-01 3.632726e-01 0.000000e+00
0.000000e+00 0.000000e+00 0.000000e+00 7.520868e-01 2.479133e-01
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
<ENDHMM>
~h "d"
<BEGINHMM>
<NUMSTATES> 5
<STATE> 2
<MEAN> 25
-1.067137e+00 4.886644e+00 2.682094e+00 6.750027e+00 6.457639e+00 6.229094e+00 5.297256e+00 -2.129066e-01 3.815716e-01 -7.126016e-01 -2.884563e+00 -1.832386e+00 7.712548e-03 -7.304223e-01 2.831668e-01 -3.501370e-01 -7.342540e-01 -2.799944e-01 4.564904e-02 2.276214e-01 1.384630e-01 4.671212e-02 -1.844966e-01 -2.142331e-01 7.479197e-01
<VARIANCE> 25
8.934610e+00 1.320769e+01 1.053300e+01 1.804137e+01 1.219705e+01 1.129104e+01 1.721161e+01 1.467160e+01 1.430175e+01 1.481090e+01 1.149455e+01 9.083491e+00 1.831863e+00 3.232245e+00 1.539536e+00 1.744226e+00 2.540962e+00 2.710148e+00 2.181852e+00 2.404683e+00 2.769586e+00 1.280586e+00 1.451528e+00 1.790569e+00 3.939657e+00
<GCONST> 8.637133e+01
<STATE> 3
<MEAN> 25
2.718689e+00 -2.744554e+00 7.256757e-02 1.812361e+00 1.016949e-01 -2.560019e-01 -1.885446e+00 -4.865013e+00 -4.525404e+00 -2.596621e+00 -1.807474e+00 -1.480970e+00 1.222863e+00 -5.446100e-01 5.466800e-01 -1.001800e+00 -7.867664e-01 -1.223161e+00 -2.112964e+00 -1.139215e+00 -1.483523e+00 -8.174815e-01 -1.465670e-01 -4.309444e-01 2.095388e+00
<VARIANCE> 25
2.393951e+01 3.441933e+01 2.272727e+01 3.303073e+01 2.547261e+01 2.950558e+01 4.406739e+01 4.921661e+01 6.163840e+01 3.156588e+01 1.728885e+01 2.407177e+01 3.362633e+00 5.228514e+00 3.342825e+00 3.542599e+00 4.699482e+00 3.152497e+00 5.631856e+00 5.698840e+00 4.839462e+00 2.097089e+00 1.823990e+00 1.847656e+00 7.886878e+00
<GCONST> 1.042911e+02
<STATE> 4
<MEAN> 25
3.030438e+00 -2.106693e+00 2.608706e+00 8.074592e-02 8.320825e-01 -8.720042e-01 -4.455779e+00 -3.824380e+00 -3.882696e+00 -1.690570e+00 -1.894887e+00 -2.615440e+00 2.946242e-01 3.876723e-02 4.528299e-01 -6.694716e-01 5.406591e-01 -8.197967e-01 -1.044559e+00 9.537272e-01 -1.756284e-01 -9.122517e-02 9.268219e-01 3.083803e-01 1.007540e+00
<VARIANCE> 25
4.675860e+01 3.011730e+01 3.514589e+01 5.922066e+01 4.235344e+01 2.218645e+01 5.816761e+01 3.788612e+01 2.974471e+01 1.639678e+01 1.083809e+01 2.301572e+01 3.725070e+00 4.032299e+00 4.137799e+00 4.301898e+00 5.162062e+00 4.180051e+00 9.591230e+00 6.350677e+00 9.550439e+00 4.142642e+00 2.124282e+00 3.202255e+00 4.503259e+00
<GCONST> 1.069599e+02
<TRANSP> 5
0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
0.000000e+00 6.490452e-01 3.509548e-01 0.000000e+00 0.000000e+00
0.000000e+00 0.000000e+00 5.191061e-01 4.808940e-01 0.000000e+00
0.000000e+00 0.000000e+00 0.000000e+00 2.762414e-01 7.237586e-01
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
<ENDHMM>
[...]
Note that each line starting with "~h" marks the start of the statistical description of an HMM for a particular phone.
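To get a feel for the structure of the HMM definitions above, here is a small Python sketch that reads text in that layout, collects the phone name from each "~h" line, and pulls out the transition matrix that follows `<TRANSP>`. It is only an illustration, not part of HTK: the function name `parse_hmms` is made up, and the sample keeps only the header and transition matrix of the "b" model shown above (the mean and variance lines are left out for brevity).

```python
def parse_hmms(text):
    """Map each HMM name to its transition matrix (a list of rows of floats)."""
    models = {}
    name = None
    rows_left = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("~h"):
            # The phone name is the quoted string after ~h
            name = line.split('"')[1]
            models[name] = []
        elif line.startswith("<TRANSP>"):
            # The number after <TRANSP> is the size of the square matrix
            rows_left = int(line.split()[1])
        elif rows_left > 0 and name is not None:
            models[name].append([float(x) for x in line.split()])
            rows_left -= 1
    return models


# Shortened copy of the "b" model above: means and variances omitted.
sample = '''~h "b"
<BEGINHMM>
<NUMSTATES> 5
<TRANSP> 5
0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
0.000000e+00 8.082747e-01 1.917253e-01 0.000000e+00 0.000000e+00
0.000000e+00 0.000000e+00 6.367275e-01 3.632726e-01 0.000000e+00
0.000000e+00 0.000000e+00 0.000000e+00 7.520868e-01 2.479133e-01
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
<ENDHMM>
'''

models = parse_hmms(sample)
for phone, matrix in models.items():
    # Rows for the entry and emitting states should each sum to about 1;
    # the last row belongs to the exit state and is all zeros.
    print(phone, [round(sum(row), 3) for row in matrix])
```

The same loop would list every phone in a full definitions file, which is a quick way to check that each phone in your phone set ended up with a model.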
You can do this for Italian, and for most other languages.
So, to create an Italian acoustic model:
- create an Italian phone set,
- create an Italian pronunciation dictionary for the words in your training set,
- generate acoustic models using the process described in the VoxForge Tutorial.
This will allow you to create monophone acoustic models (up to step 8).
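Before starting training, it may be worth checking that the first two items in the list above agree with each other: every word in the prompts needs a dictionary entry, and every phone used in the dictionary should be in your phone set. The Python sketch below shows one way to do that check. The function names are hypothetical, the dictionary lines are taken from the VoxForge excerpt earlier in this post, and the prompt sentence is invented for the example. Punctuation handling is deliberately left out.

```python
def load_dictionary(lines):
    """Parse 'WORD [WORD] p1 p2 ...' lines into a {WORD: [phones]} mapping."""
    entries = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        entries[parts[0]] = parts[2:]
    return entries


def missing_words(prompt_sentences, entries):
    """Return prompt words (upper-cased) that have no dictionary entry."""
    words = {w.upper() for s in prompt_sentences for w in s.split()}
    return sorted(words - set(entries))


def phone_set(entries):
    """Return the sorted set of phones actually used in the dictionary."""
    return sorted({p for phones in entries.values() for p in phones})


dict_lines = [
    "ABACK [ABACK] ax b ae k",
    "ABBY [ABBY] ae b iy",
]
entries = load_dictionary(dict_lines)

# Invented prompt, only to show a word that is missing from the dictionary.
print(missing_words(["Abby abandoned"], entries))
print(phone_set(entries))
```

Any word reported as missing needs a pronunciation added to the dictionary, and any phone that is not in your phone list points to a typo or to a gap in the list.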
To create tied-state triphone acoustic models, you will need to create 'questions' (see the tree.hed script in step 10). I just used the one included with the HTK toolkit, and am not familiar with creating one for another language.
Hope this helps,
Ken
And, btw, an Italian phoneset and dictionary are both available from the Italian Festival project:
http://www.pd.istc.cnr.it/TTS/It-FESTIVAL.htm
Of course they are synthesis-oriented, but for a start that's not a big deal.
>First of all you need a large collection of Italian texts (100 Mb for example). Do you have such a big collection?
What exactly do you mean by Italian texts?
Can I find it on Festival Project's web site?
Thanks
Manuel
Texts are just texts: books, newspapers and so on. Ideally they should be free, but copyrighted texts are also acceptable. They are needed to build the language model, which is only required for decoding, not for training.
Once you have the texts, put them somewhere so I can download them.
To be honest, it seems easier to me to train a Sphinx model than an HTK one, though Ken will probably correct me. So if you install sphinx3 and SphinxTrain, I can help you with the Italian setup. We have a dictionary and a phoneset. You just need to record a small text (say, 200 utterances in WAV files). Then we'll build the acoustic model.