VoxForge
> I have created a new, updated German audio model for CMU Sphinx (http://goofy.zamia.org/voxforge/de/) - this is based on our submission rating/tagging effort. Currently I am still polishing the model, but once it is ready for use I will open a new thread about it in the German forum. Would you be willing to host the new model somewhere on the official VoxForge servers?
hey - thanks for all the helpful advice, very much appreciated as I am still new to the field.
> For the models it's also better to provide test results so others could reproduce them and estimate model accuracy.
Results are included in the tarball now. I have also uploaded some more statistics here:
http://goofy.zamia.org/voxforge/de/audio-stats.txt
which is basically the result of running all submissions through pocketsphinx_batch again and calculating the word errors per user.
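For anyone who wants to reproduce this kind of per-user statistic, here is a minimal sketch of the calculation. It is not the script that produced audio-stats.txt. It assumes you already have (user, reference transcript, decoder hypothesis) triples; the function names and the input layout are made up for illustration.

```python
from collections import defaultdict


def word_errors(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # reference word missing from hypothesis
                cur[j - 1] + 1,          # extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = cur
    return prev[-1]


def wer_per_user(results):
    """results: iterable of (user, reference_text, hypothesis_text).

    Returns {user: word error rate}, pooled over all of that user's
    utterances (total errors / total reference words).
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for user, ref, hyp in results:
        ref_tokens = ref.lower().split()
        errors[user] += word_errors(ref_tokens, hyp.lower().split())
        words[user] += len(ref_tokens)
    return {u: errors[u] / words[u] for u in words if words[u] > 0}
```

Pooling errors and words per user, rather than averaging per-utterance rates, keeps short utterances from dominating the number.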
> It's better to use sphinxbase/sphinxtrain trunk to train such a model; it creates significantly more accurate models, which are noise-robust too.
I never realized the released versions are that old while SVN is so busy. Do you happen to know if they're planning to have a fresh release anytime soon? Anyway, I have downloaded, compiled and installed SVN trunk, so the next model will be built using those versions. Will the model I generate still work in the latest stable pocketsphinx release?
> It's also better to train models with LDA/MLLT transform, they usually are 20% more accurate.
This is a great result, congratulations. I still suggest you exclude your recordings and ralf's recordings from the test set. Use only other speakers in the test set. You can still use yourself in the training set. That will probably give you less attractive numbers, but it will be an honest estimate and, more importantly, you will be able to optimize the model parameters properly.
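A rough sketch of such a split, in case it helps. It assumes submission names follow the "<speaker>-<date>-<id>" pattern seen in VoxForge uploads (e.g. guenter-20140204-afn); the function names and the speaker names in the usage below are hypothetical.

```python
def speaker_of(submission):
    # Assumption: names look like "<speaker>-<date>-<id>"; split from the
    # right so a speaker name that itself contains "-" stays intact.
    return submission.rsplit("-", 2)[0]


def split_by_speaker(submissions, held_out, never_test=frozenset()):
    """Return (train, test) lists with no speaker present in both.

    held_out:   speakers whose recordings go only into the test set
    never_test: speakers who must stay in training (e.g. the main contributors)
    """
    overlap = set(held_out) & set(never_test)
    if overlap:
        raise ValueError("speakers cannot be held out for testing: %s"
                         % sorted(overlap))
    train, test = [], []
    for name in submissions:
        target = test if speaker_of(name) in held_out else train
        target.append(name)
    return train, test
```

For example, `split_by_speaker(names, held_out={"anna"}, never_test={"guenter", "ralf"})` would keep every recording by guenter and ralf in training and test only on anna, who then never appears in training.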
Currently a lot of recorded and transcribed data is available online, such as TED talks, podcasts and LibriVox books. This data is more useful in practice than the VoxForge recordings. So the biggest return would come from alignment algorithms such as the long audio alignment in sphinx4, not from VoxForge. It's far better to have 3000 hours than 30 hours.
> they're planning to have a fresh release anytime soon?
Yes, I'm preparing a fresh release now.
> the model I generate still work in the latest stable pocketsphinx release?
The models will work, but it's better to use the new code. The updated decoder also has a few critical features.
ken,
I have uploaded another bunch of files to the VoxForge FTP server (which should all contain an updated license file :) ) - would you add them to the German audio corpus?
guenter-20140204-afn.tgz
guenter-20140204-afq.tgz
guenter-20140204-ftr.tgz
guenter-20140204-ofp.tgz
guenter-20140204-xck.tgz
guenter-20140205-afn.tgz
guenter-20140205-afq.tgz
guenter-20140205-qah.tgz
guenter-20140205-xck.tgz
guenter-20140206-afn.tgz
guenter-20140206-ftr.tgz
guenter-20140206-qah.tgz
guenter-20140206-xck.tgz
guenter-20140207-afn.tgz
guenter-20140207-afq.tgz
guenter-20140207-ftr.tgz
guenter-20140207-ofp.tgz
guenter-20140207-qah.tgz
guenter-20140207-vau.tgz
guenter-20140207-xck.tgz
guenter-20140208-ftr.tgz
guenter-20140208-qah.tgz
guenter-20140209-afn.tgz
guenter-20140209-ftr.tgz
guenter-20140209-qah.tgz
guenter-20140209-xck.tgz
guenter-20140211-afn.tgz
guenter-20140211-afq.tgz
guenter-20140211-ftr.tgz
guenter-20140211-ofp.tgz
guenter-20140211-qah.tgz
guenter-20140211-vau.tgz
guenter-20140211-xck.tgz
guenter-20140212-qah.tgz
guenter-20140213-ftr.tgz
guenter-20140213-qah.tgz
guenter-20140213-xck.tgz
guenter-20140214-afn.tgz
guenter-20140214-afq.tgz
guenter-20140214-ftr.tgz
guenter-20140214-ofp.tgz
guenter-20140214-qah.tgz
guenter-20140214-xck.tgz
guenter-20140215-qah.tgz
guenter-20140217-afn.tgz
guenter-20140217-ftr.tgz
guenter-20140217-qah.tgz
guenter-20140217-xck.tgz
guenter-20140218-ftr.tgz
guenter-20140218-qah.tgz
guenter-20140218-xck.tgz
guenter-20140224-ftr.tgz
guenter-20140224-qah.tgz
guenter-20140309-qah.tgz
guenter-20140310-ftr.tgz
guenter-20140310-qah.tgz
> That will probably give you less attractive numbers, but it will be an honest estimate
I agree. To put those numbers in perspective: we are testing on interviews at conventions. When we test a new video, we only add the missing words to the decoding dictionary.
We are hovering around word error rates of 90%.
This number seems high, but you have to take into account that convention videos are among the worst possible candidates.
The audio is hardly "clean", even if you use a handheld microphone. Add to that a completely unknown speaker, a possible accent, no clear sentence structure since it's an interview, etc.
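The "add only the missing words" step mentioned above could look roughly like this. It is only a sketch: it assumes a Sphinx-style text dictionary with one "word phone phone ..." entry per line and alternate pronunciations marked as "word(2)", and the function names are invented here. The pronunciations for the missing words still have to be written separately.

```python
def load_dictionary_words(lines):
    """Collect the words (not the pronunciations) from dictionary lines."""
    words = set()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        # Assumption: alternates look like "welt(2)"; strip the "(n)" suffix.
        if word.endswith(")") and "(" in word:
            word = word[:word.index("(")]
        words.add(word.lower())
    return words


def missing_words(transcript, dictionary_words):
    """Words in the transcript that the dictionary lacks, in order of
    first appearance and without duplicates."""
    missing = []
    for token in transcript.lower().split():
        if token not in dictionary_words and token not in missing:
            missing.append(token)
    return missing
```

With a file on disk this would be used as `missing_words(text, load_dictionary_words(open(path, encoding="utf-8")))`, where `path` and `text` are whatever dictionary and transcript you are working with.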