VoxForge
I am assuming that the more 'good' data we get, the better, even after the 140 goal is reached.
Some of my (and perhaps others') ideas for sources of speech/text are probably far from ideal. Consider someone reading aloud: they misread a word, stumble, and run some words together enough to confuse the algorithm that breaks the stream into words. Not to mention adding in some commentary.
This data should be tagged as "needs validation", and then we should create some way for helpful people to validate it, and maybe even edit it. For editing, I considered altering the audio, but I think the easiest approach will be to sync the text to the audio, even if the text is the original work being read/recorded.
For the sync problem, subtitle formats include timing information that lets the player know that 25 seconds into the sound relates to 25 seconds into the text (or something like that), which seems like it might help. (But I am not even sure there is a problem yet. :)
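As an illustration of the kind of timing data subtitle formats carry: the SubRip (.srt) format stores numbered cues, each with a start and end timestamp and the text spoken in that interval. The helper functions below are a hypothetical sketch (not anything from VoxForge), just showing how a text span could be pinned to a point in the audio:

```python
def srt_timestamp(seconds):
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index, start, end, text):
    """Build one SubRip cue: sequence number, time range, then the text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

# Example: tie a phrase of the transcript to seconds 25.0-27.5 of the recording
print(srt_cue(1, 25.0, 27.5, "and the text that was read here"))
```

A validation tool could present each cue next to its slice of audio, so a reviewer only has to confirm or fix short aligned chunks rather than the whole recording.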
Once it has been validated by a human, then it can be accepted into the master database.
--- (Edited on 6/17/2008 3:17 pm [GMT-0500] by CarlFK) ---
Hi CarlFK,
>... then create some way for helpful people to validate, and maybe even edit it
We've discussed something like this on the ideas page:
Though I like your idea of giving users the ability to tag the audio for follow-up (if someone notices a problem in the transcription but does not want, or have time, to fix the problem themselves).
>for editing, I was considering altering the audio, but I think the easiest will
>be to sync the text to the audio, even if the text is the original work being
>read/recorded.
Agree. The audio should be left unchanged as much as possible, though I have seen instances where there are 'pops' and 'clicks' in the audio that might need to be removed (because they are in the middle of an utterance...).
>For the sync problem,...
As long as we have text that matches 98-99% of the spoken audio, we can segment the audio file (so it can be used for acoustic model training) with the current VoxForge acoustic model, using a process called "forced alignment" (see this page for more info) to get approximate time stamps for each word, and then using a script to segment the speech audio file based on pauses in the speech.
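The pause-based segmentation step can be sketched roughly as below. This is a toy illustration, not the actual VoxForge script: the frame size, energy threshold, and pause length are made-up assumptions, and a real pipeline would work on WAV data rather than a plain list of samples.

```python
import math

def split_on_pauses(samples, rate, frame_ms=20, threshold=0.02, min_pause_frames=5):
    """Split a mono signal (floats in [-1, 1]) into utterance chunks wherever
    the per-frame RMS energy stays below `threshold` for at least
    `min_pause_frames` consecutive frames (i.e. a pause in the speech)."""
    frame_len = max(1, int(rate * frame_ms / 1000))
    segments, current, silent_run = [], [], 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < threshold:
            silent_run += 1
            # a long enough run of quiet frames ends the current utterance
            if current and silent_run >= min_pause_frames:
                segments.append(current)
                current = []
        else:
            silent_run = 0
            current.extend(frame)
    if current:
        segments.append(current)
    return segments
```

Each returned chunk would then be paired with the words whose forced-alignment time stamps fall inside it, giving short utterance/transcript pairs suitable for training.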
Once we have acoustic models that are good enough for dictation (trained with 1000+ hours of good-quality speech), then we could theoretically use them to segment a non-transcribed speech audio file.
Ken
--- (Edited on 6/18/2008 1:54 pm [GMT-0400] by kmaclean) ---