VoxForge
Hm, what version are you using exactly? I prefer the latest one, either from the nightly build or from an svn checkout.
Where did you get that slave.pl? It's not part of the archive. Actually, model testing is not so easy right now, partly because of insufficient data and partly because there is no language model yet, since we have no text collection.
It's actually possible to test the acoustic model with a finite state grammar, but it's much better to build a language model first.
Well, I've updated the scripts with a language model and updated the prompts for the full phoneset. Now you can test recognition; it even has non-zero WER, but of course quality is _very_ poor.
New files are available at:
http://www.gigasize.com/get.php/-1099746482/voxforge_it_sphinx.tar.gz
It probably makes sense to commit them to VoxForge and start the audio update. We can create a similar bootstrap structure for Spanish, French, Welsh, Dutch, Polish and Czech, btw.
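For anyone not familiar with the WER (word error rate) figure mentioned above: it is usually computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch in Python follows; the function name and the Italian test strings are made up for illustration and are not part of the Sphinx scripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    print(word_error_rate("uno due tre quattro", "uno sue tre"))
```

Note that WER can exceed 1.0 when the recognizer inserts many extra words, so it is an error ratio rather than a true percentage.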
Hi nsh,
I've created a dev site for Italian at:
http://www.dev.voxforge.org/projects/it
http://www.dev.voxforge.org/svn/it
I've also uploaded the Sphinx Acoustic Model creation scripts to the svn trunk for this repository.
If you create the same for the other languages you mentioned, I'll create repositories for the remaining ones: Spanish, French, Welsh, Polish and Czech. Dutch and English already have repositories, but whenever you have a chance, Sphinx AM creation scripts for those too would be greatly appreciated.
thanks,
Ken
Did you sort this out eventually?
I also don't have it in my install, and I can't find any online download for SphinxTrain other than the nightly build at sourceforge.
I'm stuck.
Please can someone throw me a line? I really want to get this up and running.
Thanks
Sam
Hi Manuel,
>I don't understand how can I create statistical representation of phonemes.
The HTK toolkit lets you train your HMM-based phoneme models automatically, but you need transcribed speech for this to work.
Steps:
In English, the steps look like this:
(in the VoxForge tutorial, we actually skipped this step because all the required phones are already included in the pronunciation dictionary)
The VoxForge phone set (which actually originated from the CMU phone set) is as follows:
Phoneme Example Translation
------- ------- -----------
AA      odd     AA D
AE      at      AE T
AH      hut     HH AH T
AO      ought   AO T
AW      cow     K AW
AY      hide    HH AY D
B       be      B IY
CH      cheese  CH IY Z
D       dee     D IY
DH      thee    DH IY
EH      Ed      EH D
ER      hurt    HH ER T
EY      ate     EY T
F       fee     F IY
G       green   G R IY N
HH      he      HH IY
IH      it      IH T
IY      eat     IY T
JH      gee     JH IY
K       key     K IY
L       lee     L IY
M       me      M IY
N       knee    N IY
NG      ping    P IH NG
OW      oat     OW T
OY      toy     T OY
P       pee     P IY
R       read    R IY D
S       sea     S IY
SH      she     SH IY
T       tea     T IY
TH      theta   TH EY T AH
UH      hood    HH UH D
UW      two     T UW
V       vee     V IY
W       we      W IY
Y       yield   Y IY L D
Z       zee     Z IY
ZH      seizure S IY ZH ER
So you need to create a similar phone list for Italian (the IPA web site can help in this regard, or maybe another Italian speech recognition project).
For each word in your training set (i.e. the sentences you used to prompt your users who submitted speech for your speech corpus) you need its pronunciation using phonemes. Here is a portion of the VoxForge pronunciation dictionary:
AARP [AARP] ey ey aa r p iy
ABA [ABA] ey b iy ey
ABACK [ABACK] ax b ae k
ABACUS [ABACUS] ae b ax k ax s
ABALON [ABALON] ae b ax l aa n
ABALONE [ABALONE] ae b ax l ow n iy
ABANDON [ABANDON] ax b ae n d ih n
ABANDONED [ABANDONED] ax b ae n d ih n d
ABANDONING [ABANDONING] ax b ae n d ih n ih ng
ABBREVIATED [ABBREVIATED] ax b r iy v iy ey t ih d
ABBREVIATION [ABBREVIATION] ax b r iy v iy ey sh ih n
ABBY [ABBY] ae b iy
ABC [ABC] ey b iy s iy
ABC'S [ABC'S] ey b iy s iy z
ABCS [ABCS] iy b iy s iy z
ABDOMINALS [ABDOMINALS] ae b d aa m ih n ax l z
ABDUCTING [ABDUCTING] ae b d ah k t ih ng
ABDUCTION [ABDUCTION] ae b d ah k sh ih n
Note that the words are in upper case, the output symbol (the word the recognizer returns) is also in upper case and in square brackets, and the phones are in lower case.
You need to do the same in Italian, for each word in your training set.
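Once you have a phone list and a dictionary in this layout, it is worth checking that every pronunciation only uses phones from your list. Here is a rough Python sketch of such a check. It assumes one entry per line in the "WORD [OUTPUT] phone phone ..." layout shown above; the function names and the small phone set in the example are my own illustration, not part of HTK or the VoxForge scripts.

```python
import re

# WORD, then the bracketed output symbol, then the space-separated phones.
DICT_LINE = re.compile(r"^(\S+)\s+\[([^\]]*)\]\s+(.+)$")


def parse_dict_line(line):
    """Split one dictionary line into (word, output_symbol, [phones])."""
    match = DICT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    word, output_symbol, phones = match.groups()
    return word, output_symbol, phones.split()


def unknown_phones(lines, phone_set):
    """Map each word to the phones in its pronunciation that are not in phone_set."""
    problems = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, _output_symbol, phones = parse_dict_line(line)
        extra = [p for p in phones if p not in phone_set]
        if extra:
            problems[word] = extra
    return problems


if __name__ == "__main__":
    entries = ["ABBY [ABBY] ae b iy", "ABACK [ABACK] ax b ae k"]
    # Illustrative phone set that deliberately leaves out "ax".
    print(unknown_phones(entries, {"ae", "b", "iy", "k"}))
```

An empty result means every phone used in the dictionary is in your phone list; anything else points at entries to fix before training.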
In this context, creating a statistical representation of phonemes means that you use the HTK toolkit to generate a statistical model (an HMM) for each phone, based on the words in your training set. In English, your HMMs would look something like this:
~h "b"
<BEGINHMM>
<NUMSTATES> 5
<STATE> 2
<MEAN> 25
-9.124349e-01 6.825594e+00 4.190366e+00 6.915018e+00 6.278219e+00 6.211351e+00 6.080202e+00 8.280239e-01 7.751886e-01 1.188034e-01 -2.286278e+00 -2.037417e+00 -5.154014e-02 -1.411842e-01 1.359426e-01 7.536004e-02 1.828612e-02 1.083132e-01 8.064213e-02 6.554011e-02 5.534951e-03 -3.300069e-02 -1.040055e-02 1.726186e-01 1.074358e-01
<VARIANCE> 25
6.946013e+00 9.476726e+00 6.426389e+00 8.900808e+00 8.562872e+00 5.247358e+00 8.789542e+00 9.086433e+00 9.272338e+00 1.021655e+01 8.668521e+00 1.017453e+01 9.018427e-01 1.225605e+00 1.132353e+00 1.225746e+00 1.055387e+00 9.162133e-01 9.871734e-01 1.061771e+00 1.182593e+00 1.325286e+00 1.340984e+00 9.980333e-01 5.850468e-01
<GCONST> 7.204273e+01
<STATE> 3
<MEAN> 25
1.670979e+00 2.505412e+00 3.361752e+00 2.959995e+00 2.192761e+00 2.234684e+00 4.598285e-01 6.712853e-02 -7.422704e-01 -1.477473e+00 -1.300686e+00 -8.829353e-01 2.932750e+00 -1.085336e+00 1.465379e-01 -1.024826e+00 -9.668781e-01 -2.956798e+00 -3.674928e+00 -6.180806e-01 -1.165014e+00 -1.551422e+00 1.459589e-01 -1.145165e-02 3.425349e+00
<VARIANCE> 25
2.775954e+01 2.442891e+01 9.882823e+00 2.289949e+01 2.621673e+01 3.309447e+01 4.353169e+01 1.994825e+01 2.369977e+01 2.078222e+01 1.078901e+01 1.184826e+01 1.814732e+00 4.001577e+00 2.052232e+00 3.576971e+00 5.154440e+00 6.247412e+00 4.224275e+00 3.561308e+00 4.634731e+00 1.263823e+00 2.618247e+00 2.138073e+00 1.512457e+00
<GCONST> 9.653378e+01
<STATE> 4
<MEAN> 25
1.058882e+01 1.385496e+00 8.322063e-01 1.207590e+00 1.215214e+00 -7.297173e+00 -8.178091e+00 2.753822e-01 -3.762378e+00 -6.590958e+00 -1.468036e+00 -2.938320e+00 2.796497e-01 -2.095785e-01 -1.001576e-01 1.865974e-02 -5.384719e-02 -6.179357e-01 -4.035245e-01 4.215330e-02 -2.601456e-01 -1.829550e-01 -2.622822e-02 -2.242988e-01 2.178501e-01
<VARIANCE> 25
1.652969e+01 4.435868e+01 1.719629e+01 6.380357e+01 7.536614e+01 6.076683e+01 5.961767e+01 3.608961e+01 4.442945e+01 1.993280e+01 4.157676e+01 2.804121e+01 2.284771e+00 2.194077e+00 1.651372e+00 2.075975e+00 2.312554e+00 5.300534e+00 3.836717e+00 2.152288e+00 2.561902e+00 1.781796e+00 2.014969e+00 1.707738e+00 2.076164e+00
<GCONST> 1.004418e+02
<TRANSP> 5
0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
0.000000e+00 8.082747e-01 1.917253e-01 0.000000e+00 0.000000e+00
0.000000e+00 0.000000e+00 6.367275e-01 3.632726e-01 0.000000e+00
0.000000e+00 0.000000e+00 0.000000e+00 7.520868e-01 2.479133e-01
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
<ENDHMM>
~h "d"
<BEGINHMM>
<NUMSTATES> 5
<STATE> 2
<MEAN> 25
-1.067137e+00 4.886644e+00 2.682094e+00 6.750027e+00 6.457639e+00 6.229094e+00 5.297256e+00 -2.129066e-01 3.815716e-01 -7.126016e-01 -2.884563e+00 -1.832386e+00 7.712548e-03 -7.304223e-01 2.831668e-01 -3.501370e-01 -7.342540e-01 -2.799944e-01 4.564904e-02 2.276214e-01 1.384630e-01 4.671212e-02 -1.844966e-01 -2.142331e-01 7.479197e-01
<VARIANCE> 25
8.934610e+00 1.320769e+01 1.053300e+01 1.804137e+01 1.219705e+01 1.129104e+01 1.721161e+01 1.467160e+01 1.430175e+01 1.481090e+01 1.149455e+01 9.083491e+00 1.831863e+00 3.232245e+00 1.539536e+00 1.744226e+00 2.540962e+00 2.710148e+00 2.181852e+00 2.404683e+00 2.769586e+00 1.280586e+00 1.451528e+00 1.790569e+00 3.939657e+00
<GCONST> 8.637133e+01
<STATE> 3
<MEAN> 25
2.718689e+00 -2.744554e+00 7.256757e-02 1.812361e+00 1.016949e-01 -2.560019e-01 -1.885446e+00 -4.865013e+00 -4.525404e+00 -2.596621e+00 -1.807474e+00 -1.480970e+00 1.222863e+00 -5.446100e-01 5.466800e-01 -1.001800e+00 -7.867664e-01 -1.223161e+00 -2.112964e+00 -1.139215e+00 -1.483523e+00 -8.174815e-01 -1.465670e-01 -4.309444e-01 2.095388e+00
<VARIANCE> 25
2.393951e+01 3.441933e+01 2.272727e+01 3.303073e+01 2.547261e+01 2.950558e+01 4.406739e+01 4.921661e+01 6.163840e+01 3.156588e+01 1.728885e+01 2.407177e+01 3.362633e+00 5.228514e+00 3.342825e+00 3.542599e+00 4.699482e+00 3.152497e+00 5.631856e+00 5.698840e+00 4.839462e+00 2.097089e+00 1.823990e+00 1.847656e+00 7.886878e+00
<GCONST> 1.042911e+02
<STATE> 4
<MEAN> 25
3.030438e+00 -2.106693e+00 2.608706e+00 8.074592e-02 8.320825e-01 -8.720042e-01 -4.455779e+00 -3.824380e+00 -3.882696e+00 -1.690570e+00 -1.894887e+00 -2.615440e+00 2.946242e-01 3.876723e-02 4.528299e-01 -6.694716e-01 5.406591e-01 -8.197967e-01 -1.044559e+00 9.537272e-01 -1.756284e-01 -9.122517e-02 9.268219e-01 3.083803e-01 1.007540e+00
<VARIANCE> 25
4.675860e+01 3.011730e+01 3.514589e+01 5.922066e+01 4.235344e+01 2.218645e+01 5.816761e+01 3.788612e+01 2.974471e+01 1.639678e+01 1.083809e+01 2.301572e+01 3.725070e+00 4.032299e+00 4.137799e+00 4.301898e+00 5.162062e+00 4.180051e+00 9.591230e+00 6.350677e+00 9.550439e+00 4.142642e+00 2.124282e+00 3.202255e+00 4.503259e+00
<GCONST> 1.069599e+02
<TRANSP> 5
0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
0.000000e+00 6.490452e-01 3.509548e-01 0.000000e+00 0.000000e+00
0.000000e+00 0.000000e+00 5.191061e-01 4.808940e-01 0.000000e+00
0.000000e+00 0.000000e+00 0.000000e+00 2.762414e-01 7.237586e-01
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
<ENDHMM>
[...]
Note that each line starting with "~h" marks the start of the statistical description of an HMM for a particular phone.
You can do this for Italian, and for most other languages.
To create an Italian acoustic model:
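To make the numbers in the listing a bit less opaque, here is a small Python sketch that lists the phone names from the "~h" lines and recomputes a state's GCONST from its variance vector. As far as I understand the HTK format, for a diagonal-covariance Gaussian GCONST is n*ln(2*pi) plus the sum of the logs of the variances; treat that as my reading rather than an authoritative definition. The function names are my own, and the variance values are copied from state 2 of "b" above.

```python
import math


def hmm_names(definitions: str):
    """Return the phone names found on '~h "name"' lines, in file order."""
    names = []
    for line in definitions.splitlines():
        line = line.strip()
        if line.startswith("~h"):
            names.append(line.split('"')[1])
    return names


def gconst(variances):
    """n*ln(2*pi) + sum(ln(variance)) for a diagonal-covariance Gaussian."""
    n = len(variances)
    return n * math.log(2 * math.pi) + sum(math.log(v) for v in variances)


# Variance vector of state 2 of the "b" model, copied from the listing above.
B_STATE2_VARIANCE = [float(x) for x in (
    "6.946013e+00 9.476726e+00 6.426389e+00 8.900808e+00 8.562872e+00 "
    "5.247358e+00 8.789542e+00 9.086433e+00 9.272338e+00 1.021655e+01 "
    "8.668521e+00 1.017453e+01 9.018427e-01 1.225605e+00 1.132353e+00 "
    "1.225746e+00 1.055387e+00 9.162133e-01 9.871734e-01 1.061771e+00 "
    "1.182593e+00 1.325286e+00 1.340984e+00 9.980333e-01 5.850468e-01"
).split()]

if __name__ == "__main__":
    print(hmm_names('~h "b"\n<BEGINHMM>\n<ENDHMM>\n~h "d"\n<BEGINHMM>'))
    # Compare with the <GCONST> 7.204273e+01 line of that state.
    print(gconst(B_STATE2_VARIANCE))
```

The recomputed value should land very close to the GCONST stored for that state, which is a handy sanity check that a model file has not been truncated or hand-edited incorrectly.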
- create an Italian phone set,
- create an Italian pronunciation dictionary for the words in your training set,
- generate acoustic models using the process described in the VoxForge Tutorial.
This will allow you to create monophone acoustic models (up to step 8).
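Before starting the training steps, it helps to confirm that every word in your prompts has an entry in the pronunciation dictionary. The sketch below is one way to do that in Python. It assumes each prompt line starts with an utterance id followed by the words spoken, and that dictionary words are upper case as in the excerpt above; the ids and Italian words in the example are hypothetical.

```python
def missing_words(prompt_lines, dictionary_words):
    """Return the sorted prompt words that have no dictionary entry.

    Assumes the first token on each prompt line is an utterance id and the
    remaining tokens are the words that were spoken.
    """
    missing = set()
    for line in prompt_lines:
        tokens = line.split()
        for word in tokens[1:]:  # skip the utterance id
            word = word.upper()  # dictionary entries are upper case
            if word not in dictionary_words:
                missing.add(word)
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical prompts and dictionary, for illustration only.
    prompts = ["it-0001 CIAO MONDO", "it-0002 ciao a tutti"]
    print(missing_words(prompts, {"CIAO", "MONDO", "A"}))
```

Any word this reports needs a pronunciation added to the dictionary before its recordings can be used for training.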
To create tied-state triphone acoustic models, you will need to create 'questions' (see the tree.hed script in step 10). I just used the one included with the HTK toolkit, and am not familiar with creating one for another language.
Hope this helps,
Ken