VoxForge
In speech recognition research the general view is that ‘there is no data like more data’. However, this may not always be true. Research in the ESPRIT Project SAM has shown that clever use of a small data set can be more efficient in training and testing isolated word ASR systems than large databases.
...
Therefore, there seems to be room for a fundamental reassessment of the claim that more data is always better, no matter what.
The paper then goes on to describe their approaches to "optimal selection of speech data from a database for efficient training of ASR systems".
Although this paper discusses *isolated* word recognition, the principle presumably extends to *continuous* speech recognition as well (which is what we are interested in). If so, it matters that the community has some way to edit the text of the VoxForge corpus and to flag submissions for removal, so as to help improve recognition results.
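To make the idea a bit more concrete, here is a rough sketch of what community editing and flagging could look like. Everything in it is an assumption for illustration only: the `Submission` structure, the function names, and the two-flag threshold are hypothetical and do not describe how VoxForge actually stores or filters submissions.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """One recorded prompt and its transcription (hypothetical structure)."""
    submission_id: str
    transcript: str
    # Names of the users who flagged this submission; a set so that
    # repeated flags from the same user count only once.
    flagged_by: set = field(default_factory=set)


def correct_transcript(sub: Submission, new_text: str) -> None:
    """Replace the transcription with a community-corrected version."""
    sub.transcript = new_text.strip()


def flag_for_removal(sub: Submission, user: str) -> None:
    """Record that `user` thinks this submission should be removed."""
    sub.flagged_by.add(user)


def select_for_training(subs, flag_threshold=2):
    """Keep submissions flagged by fewer than `flag_threshold` distinct users.

    The threshold of 2 is an arbitrary assumption, not a tested value.
    """
    return [s for s in subs if len(s.flagged_by) < flag_threshold]
```

Requiring more than one distinct user before a submission is dropped is just one possible design choice; it keeps a single mistaken (or malicious) flag from removing otherwise good audio from the training set.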
--- (Edited on 5/30/2008 1:03 pm [GMT-0400] by kmaclean) ---