VoxForge
Hello,
I am posting to share some thoughts regarding ASR research and the planned 1.0 release of the VoxForge corpus.
The goal of VoxForge is to create speech corpora for use by the FOSS community. But these corpora may also be of use to the ASR research community, who are publishing hundreds of papers a year on the advancement of speech recognition technology. It's a win for the VoxForge project if ASR researchers start using the corpus, because researchers will learn about ways to improve performance on the VoxForge corpus and the VoxForge project may be able to incorporate these ideas. (I say "may" because in some cases the technical complexity may be too high or the idea too far a departure from current ways of doing things.)
The VoxForge project can encourage this by making the corpus easier to use in research. One way to do so is to publish a division of the corpus sentences into training and test sets for researchers to use. This way different researchers can use the corpus in their papers and the results will be comparable since the training / test division will be consistent. The best practice is typically to have three sets: a training set, a development test set and an evaluation test set. The development test set is used for tweaking things and the evaluation test set is used for final evaluation on data that was not seen during the tweaking process. For example, I have seen the 10-hour OGI Numbers corpus divided into 3/5 training set, 1/5 development set and 1/5 evaluation set. For VoxForge, perhaps the ratios should be different since the corpus will be larger. The current standard in ASR research is speaker-independent ASR, which means that there is a completely different set of human speakers in each of the three sets.
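As an illustration of the speaker-independent division described above, here is a minimal sketch in Python (the speaker IDs are made up, and the 3/5 : 1/5 : 1/5 ratios simply follow the OGI Numbers example; the actual ratios for VoxForge would need to be decided):

```python
import random

def split_speakers(speakers, train=0.6, dev=0.2, seed=0):
    """Partition a list of speaker IDs into disjoint train/dev/eval sets.

    Splitting by speaker (not by utterance) ensures no speaker appears
    in more than one set, as required for speaker-independent ASR.
    """
    rng = random.Random(seed)   # fixed seed so the published split is reproducible
    shuffled = speakers[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_dev = int(n * dev)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

# Hypothetical speaker IDs:
speakers = ["spk%03d" % i for i in range(100)]
train_set, dev_set, eval_set = split_speakers(speakers)
print(len(train_set), len(dev_set), len(eval_set))  # 60 20 20
```

Publishing the resulting lists (rather than just the seed and script) would make the division fully stable across software versions.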
Making scripts and configuration files available for training and testing ASR systems using the VoxForge corpus (along the lines of Keith Vertanen's HTK scripts for Wall Street Journal) could also be a big help for researchers because it would help them to get started quickly. Also, these scripts could serve as a common starting point in different research efforts, which would make it easier to compare results in different papers to judge the usefulness of different ideas.
Why would researchers choose to use the VoxForge corpus rather than one of the many existing non-free corpora? Well, I expect the vast majority of researchers would stick with other corpora, but I hope that some researchers would choose to switch to the VoxForge corpus for the following reasons:
1. It's free. This lowers their costs and makes it easier for other people to reproduce their work and build on their work (a point nsh made on the HTK mailing list).
2. If they improve ASR performance on the VoxForge corpus, their ideas may be adopted by the VoxForge project and thus improve FOSS speech recognition. I remember Prof. Bryan Pellom commenting on the OSSRI mailing list that he would feel extra motivated in his ASR work if he knew it was contributing to speech-enabling the Linux desktop.
There are often conference papers published (e.g., at ICASSP, INTERSPEECH or LREC) announcing the availability of new speech corpora to the ASR research community. So it might be worthwhile to publish a conference paper about VoxForge when corpus version 1.0 is released, to get more attention. There are significant travel and registration costs associated with going to a conference, but if one of the paper co-authors is already going to the conference then it doesn't cost anything extra to present one more paper.
Regards,
David
--- (Edited on 9/3/2008 9:42 pm [GMT-0500] by DavidGelbart) ---
Hi David,
Thanks for your post!
Once we have release 1.0 completed (we still have a way to go... :) ), we can look at alternative uses for the VoxForge corpus, such as an academic version of the corpus (we will likely need your help...).
It seems that the key to creating a good academic speech recognition corpus is that all the speech contained therein must be of consistent audio quality (i.e. all noisy speech or all clean speech, no mixing...) and of a single, broad dialect class (e.g. all American English, with no mixing in of British, Indian or Australian English). Is this the case?
In addition, it seems that once an academic speech corpus is created, it must remain relatively unchanged, so that accurate performance comparisons (using novel training and/or decoding methods) can be made. Is this correct? This might be difficult in a FOSS context, where continuous improvement rules the day.
Ken
--- (Edited on 9/5/2008 10:35 am [GMT-0400] by kmaclean) ---
Well, I'm all for participating in Interspeech next year; Australia is too far, but Brighton is a good place :)
There is one small problem that I see here. We have already tried to participate in a few conferences, such as TSD 2008, and I gave a report at a local Russian conference. The academic world is very resistant to newcomers; without enough authority, it's hard to get in. So help in publicizing us from some authoritative person would be useful, probably through co-authoring.
The interesting points I see that we could present are: the multilingual database, some estimates of our performance, and training and adaptation differences related to our diversity. Traditional databases are quite balanced; ours is not, so there could be interesting effects.
Ken: regarding continuous updates, that's not an issue. We can certainly make snapshots that can be used as a reference.
--- (Edited on 9/7/2008 4:58 am [GMT-0500] by nsh) ---
Besides what nsh mentioned, in a conference paper I think it would be interesting to describe how the VoxForge data has been collected in a web-based, community-driven, volunteer-based way. This is quite different from (and cheaper than) usual data collection practices. If you want to emphasize that point you could title the paper something like "VoxForge: a free, community-driven speech corpus".
ISCA's (www.isca-speech.org) newsletter "ISCApad" is a good place for announcements since it has a regular section about new corpora (a conference paper isn't required to put an announcement in the ISCApad).
The VoxForge audio is under GPL. I am not sure how the GPL is interpreted in this case, but if the correct interpretation is that the VF audio can only be used with speech recognition engines that meet the FSF's definition of free software, then I guess that may rule out HTK and many other software packages used in the ASR research community. Another obstacle to popularizing VoxForge in the research community is that there are so many corpora out there already.
"It seems that the key to creating a good academic speech recognition corpus is that all the speech contained therein must be of consistent audio quality (i.e. all noisy speech or all clean speech, no mixing...) and of a single, broad dialect class (e.g. all American English, with no mixing in of British, Indian or Australian English). Is this the case?"
Not necessarily, but it's true that it's often useful to make these kinds of choices in corpus design. It might indeed be useful to divide the speech into categories ahead of time so that the speech inside the categories is more consistent. The questions around whether and how to do that, and how to divide the data into train and test sets (and whether to leave anything out) are subtle. I think the best approach is to find academic user(s) who are interested in using the corpus and get their input on these questions. It might be useful to define subsets of the full corpus, e.g., an American English subset. This way if people don't want to deal with accent issues they can use the subset, and otherwise they can use all the data. Similarly there could be a subset with no telephone/IVR speech, or only telephone/IVR speech.
--- (Edited on 1/30/2009 7:25 pm [GMT-0600] by DavidGelbart) ---
Hi David,
>The VoxForge audio is under GPL. I am not sure how the GPL is
>interpreted in this case, but if the correct interpretation is that the VF
>audio can only be used with speech recognition engines that meet the
>FSF's definition of free software, then I guess that may rule out HTK and
>many other software packages used in the ASR research community.
The general rule is that the GPL does not limit the *use* of anything covered by the GPL license. It only covers *distribution*.
Therefore, anyone can download the VoxForge corpus and create their own acoustic models and use it with any speech recognition engine, proprietary or otherwise.
If they then turned around and tried to *distribute* (e.g. sell, give away, ...) the acoustic model along with the speech recognition engine, then they would have to give the recipient (or purchaser) access to the source code for both the acoustic model (the source audio in this case) and the speech recognition engine.
>think the best approach is to find academic user(s) who are
>interested in using the corpus and get their input on these questions.
That is an angle I had not thought of...
thanks again for the feedback,
Ken
--- (Edited on 2/4/2009 11:37 am [GMT-0500] by kmaclean) ---
Hello!
"define subsets of the full corpus, e.g., an American English subset"
Yes, this would be pretty useful. In my opinion, the following approach should be considered:
- convert the VoxForge prompts into an SSML document. This SSML document would contain audio elements (with a src attribute) and, of course, the xml:lang attribute. The whole document can cover different English dialects (British English, American English, Australian English, Canadian English, etc.).
- you can then use the lang() function in XPath to extract the preferred dialect.
Example:
American English: xml:lang="en-US"
British English: xml:lang="en-GB"
The VoxForge corpus would be much more useful if we employed SSML in combination with XPath. I have prepared some audio files with transcriptions as a first step in this XML direction.
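A minimal sketch of this idea in Python (the SSML snippet, element names, and file names here are invented for illustration; the standard-library ElementTree has no XPath lang() function, so the case-insensitive dialect match that lang() performs is done by hand):

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

ssml = """<speak>
  <p xml:lang="en-US"><audio src="us_001.wav"/></p>
  <p xml:lang="en-GB"><audio src="gb_001.wav"/></p>
</speak>"""

root = ET.fromstring(ssml)

def audio_for_dialect(root, lang):
    """Collect audio src attributes under elements whose xml:lang matches
    the requested dialect, case-insensitively and including sublanguages
    (asking for "en" matches "en-US"), like XPath's lang() function."""
    wanted = lang.lower()
    srcs = []
    for elem in root.iter():
        tag_lang = elem.get(XML_LANG, "").lower()
        if tag_lang == wanted or tag_lang.startswith(wanted + "-"):
            srcs.extend(a.get("src") for a in elem.iter("audio"))
    return srcs

print(audio_for_dialect(root, "en-US"))  # ['us_001.wav']
print(audio_for_dialect(root, "en"))     # ['us_001.wav', 'gb_001.wav']
```

With a proper XPath 1.0 processor, the same selection would be simply //audio[lang("en-US")].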
Regards, Ralf
--- (Edited on 2009-02-05 4:24 am [GMT-0600] by ralfherzog) ---