VoxForge
I have been looking at speech recognition and OCR lately. I also read that language models can help speech recognition engines determine the most likely result from ambiguous input.
It occurred to me that OCR is similar: the engine guesses letters and then has to decide which word was intended, weighing what it thinks it "saw" against how likely each candidate word actually is.
Grammar checkers in word processors must also estimate how likely the entered text is.
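The common mechanism might be clearer with a quick sketch. This is only my illustration of the noisy-channel idea, with made-up words and probabilities, not code from any real engine: the recogniser ranks candidate words by combining how well each one explains the ambiguous input with how probable the word is under a language model.

```python
import math

# Hypothetical unigram language model: word -> prior probability.
# These numbers are invented for illustration.
language_model = {"cat": 0.004, "cot": 0.0005, "cut": 0.002}

# Hypothetical observation scores: how well each candidate explains what
# the engine "saw" (OCR) or "heard" (speech), e.g. c-?-t with an unclear vowel.
observation_likelihood = {"cat": 0.30, "cot": 0.45, "cut": 0.25}

def best_word(candidates):
    """Noisy-channel rule: argmax of P(word) * P(observation | word)."""
    return max(
        candidates,
        key=lambda w: math.log(language_model[w])
                      + math.log(observation_likelihood[w]),
    )

print(best_word(["cat", "cot", "cut"]))  # "cat": the prior outweighs the raw evidence
```

The same scoring rule would serve all three applications; only the source of the observation scores changes.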
This may be an off-the-wall suggestion, but would it be sensible for the FSF to try to develop a GPLv3 language model that could be used for all three? I would have thought that a good language model is important to VoxForge's aims (the ability to create speech recognition apps without the need for commercial resources).
--- (Edited on 3/28/2008 9:27 pm [GMT-0500] by Luna-Tick) ---
I would think that the language models will be a bit different, because the tasks differ (guessing a letter given a context of neighbouring letters vs. guessing a word given a context of neighbouring words).
However, both could be built from the same GPL text corpus. I also think that models over smaller units (letters rather than words) would need less text to achieve decent results.
The Dasher project, for instance, uses training texts of about 0.5-1 MB per language (the legal status of their texts is not entirely clear, by the way, so if we collect texts that are definitely GPL, they would be very interested). That is enough for good results because the task is guessing the next couple of letters, not the next couple of words.
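To see why so little text suffices, here is a minimal sketch of a character trigram predictor; it is my own toy illustration, not Dasher's actual model, and the corpus line is a stand-in. The space of two-letter contexts is tiny compared with the space of word contexts, so even a small corpus sees most contexts often enough to give usable statistics.

```python
from collections import Counter, defaultdict

def train_char_trigrams(text):
    """Count, for each two-letter context, which letter follows it."""
    counts = defaultdict(Counter)
    for i in range(len(text) - 2):
        counts[text[i:i + 2]][text[i + 2]] += 1
    return counts

def predict_next(counts, context):
    """Rank letters by how often they followed the last two characters."""
    return [ch for ch, _ in counts[context[-2:]].most_common(3)]

corpus = "the quick brown fox jumps over the lazy dog. " * 100  # toy stand-in
model = train_char_trigrams(corpus)
print(predict_next(model, "th"))  # ['e'] on this toy corpus
```

A word-level model over the same corpus would need to have seen every plausible word pair, which is why word models demand far more text.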
Robin
--- (Edited on 3/29/2008 10:10 am [GMT-0500] by Robin) ---