VoxForge
Here is a transcript of an e-mail I have sent to the following just now:
CC'd to:
Hi Ken!
My name is Sam. I'm working on redesigning HCI, both at work and in my free time.
I'm concerned that GNU/Linux doesn't as yet have any support for continuous speech recognition. I think that within the next 10 years there will be a major shift towards speech recognition, and I think it is important that GNU/Linux is not left out.
I am in touch with the sphinx development community, including David Huggins Daines, the current sphinx maintainer. David confirmed that the problem is that there is no sizeable speech database (lots of .wav phrases together with their associated .txt), and hence sphinx cannot generate a decent voice model.
I am also in touch with Richard Stallman, founder of the GNU Project and the FSF. He is willing to publicise VoxForge to the FSF community. This will reach a lot of people, and there are a lot of Linux users who really want to contribute but don't know how to code. This could generate a lot of traffic for VoxForge.
Can VoxForge handle it?
We should discuss this before throwing the gates open. Here are three things that hit me straight away:
Firstly, to get people to contribute, it is important to have some simple feedback system. It is the difference between, on the one hand, laying a brick that disappears while being told that one day a castle will appear, and, on the other, laying a brick on a partially built castle.
Is there any chance you could include a usage graph on the website main page?
x-axis: each pixel is one day
y-axis: the number of phrases contributed that day
And a thermometer! You know like those thermometers they use at fundraisers? This is how much we need to make a continuous speech recogniser, this is how much we have got... The key is to create a 'we can make it happen' vibe... Once people see the thermometer is starting to heat up I'm sure there will be a lot of people who put hours of effort in.
Secondly, I just tried it out, imagining I'm a linux fan who has just seen an article by Richard Stallman in a linux magazine. I log onto Voxforge.org ...
I didn't get very far. A dialogue box appeared telling me 'the page you're viewing requires java. More information is available on the Microsoft website.' And that was it.
The page it takes me to has a link saying 'Information on the Java Security Warning pop-up', and this is quite a long page with a lot of information. It doesn't offer any solution to my problem; I have not been presented with any option to download the java virtual machine and.
So I know better than to try to get anything meaningful out of the Microsoft website! I go to Google, put in 'download java virtual machine vista'... And have to take it from there.
But this is going to put off a lot of people, maybe >90%. Is it possible for the browser to ascertain whether Java is installed or not, and if it isn't, offer a link straight to downloading the appropriate Java virtual machine executable for the operating system that person is using? i.e. minimize the number of clicks and the amount of reading required...
The third issue is the phrases themselves: where do you get them from? Nickolai (a major sphinx contributor, cc'd) and I were discussing ways to make entering speech more fun, so people would be encouraged to do it.
A few ideas:
Of course you may need several people to speak the same phrases. If this is true, these ideas could be adapted: you could have a pool of 'this is what the last hundred visitors chose to read out', and next to each one, a number which represents how many people have read from that source. So you can either click something existing, or choose something new. This could be a lot of fun - who knows what songs / movies / literature / jokes people are going to put up?
Sam
--- (Edited on 5/6/2008 11:40 am [GMT-0500] by Visitor) ---
Hi Sam,
Thanks for the great post, my replies follow:
> ...This could generate a lot of traffic for VoxForge.
>...Can VoxForge handle it?
Most likely not :)
The server can handle it no problem, however my ISP throttles the VoxForge web site (8mbit down; 1mbit up). However, we do have subsets of this website on SourceForge and GoogleCode that are more able to tolerate high traffic volumes.
There is also another bottleneck ... on the upload side. The web host I use (where any SpeechSubmission app audio is uploaded) is optimized for downloads rather than uploads. Some users have complained about upload times.
I just received an account with Google App Engine, and was going to look into that to see if it might be used to improve upload capacity.
>Is there any chance you could include a usage graph on the website main page?
Yes, we could create a usage/contributions graph.
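As a rough sketch of the data behind such a graph (not anything that exists on the site today): assuming each submitted phrase is stored with the date it was uploaded, the y-axis values are just a per-day count, with zero-submission days filled in so that each pixel on the x-axis really is one day. The function name and the input format are assumptions made up for this illustration.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(submission_dates, start, end):
    """Return one (day, count) pair per calendar day from start to end inclusive.

    submission_dates is a hypothetical list with one datetime.date per
    contributed phrase. Days with no submissions are reported as 0.
    """
    per_day = Counter(submission_dates)
    n_days = (end - start).days + 1
    series = []
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        series.append((day, per_day[day]))
    return series


if __name__ == "__main__":
    uploads = [date(2008, 5, 6), date(2008, 5, 6), date(2008, 5, 8)]
    for day, count in daily_counts(uploads, date(2008, 5, 6), date(2008, 5, 8)):
        print(day.isoformat(), count)
```

The resulting list can be handed to whatever plotting tool the site ends up using.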
>And a thermometer! You know like those thermometers they use at
>fundraisers? This is how much we need to make a continuous speech
>recogniser, this is how much we have got...
OK. But note that our current goal is for the creation of command and control acoustic models. Sphinx uses 140 hours of speech in their current acoustic model, so that is our current target - we are trying to "walk before we run" :). My understanding is that dictation applications require 1000+ hours of speech. Regardless, the 140 hours of speech can be used towards the 1000 hours required for dictation.
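For what it's worth, the thermometer arithmetic is simple once the target is fixed. A minimal sketch, assuming the 140-hour command-and-control target mentioned above and a hypothetical running total of hours collected (the function name, bar width, and text rendering are all made up for this example):

```python
def thermometer(hours_collected, target_hours=140.0, width=40):
    """Render a text progress bar of collected speech against a target.

    target_hours defaults to the 140 hours mentioned above; hours_collected
    is whatever running total the site would track (hypothetical input).
    """
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    # Clamp to [0, 1] so the bar never overflows once the target is passed.
    fraction = min(max(hours_collected / target_hours, 0.0), 1.0)
    filled = round(fraction * width)
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {fraction:.0%} of {target_hours:g} h"


if __name__ == "__main__":
    print(thermometer(35, 140, width=20))
```

The same function could be called with target_hours=1000 to show progress towards the dictation figure instead.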
>A dialogue box appeared telling me 'the page you're viewing requires java.
>More information is available on the Microsoft website.'
I think that is a message generated from your browser and O/S. I don't recall ever creating anything that pointed to an MS site ... there is no need, since everything a user needs is on the Sun site.
I think that you are looking for this link:
which is also on the Read page.
>It doesn't offer any solution to my problem; I have not been presented with
>any option to download the java virtual machine [...]
All this info is on the Java and Speech Submission Applet Troubleshooting Guide. I might need to put the link at the *top* of the Read page, rather than part way down. I thought it would be easier to find if it were closer to the actual applet on the page ... maybe not :)
>The third issue is the phrases themselves: where do you get them from?
From the FestVox project: the CMU_ARCTIC database. These are phonetically balanced prompts used for creating text-to-speech voices.
>A few ideas: ...
The main issue with sourcing new text is to determine who owns Copyright (if any ...) in the text being read. Project Gutenberg might be better in this regard.
>1. To speak something out loud is a great aid to learning. Maybe we can find
>some resource of historical & scientific facts
A future release of the Speech Submission Java Applet will allow users to submit the text of their choice - we will still need to figure out how to confirm a submitted text is free from Copyright restrictions.
>song lyrics (may be a bad idea because people would sing instead of
>speak.. But maybe that would be OK??)
I'm not sure how singing might affect things (other than maybe being painful to review ... for my singing at least :) ), but the general rule is that you train with the same audio you want to recognize - so it may be that such audio might only be useful if we want to recognize singing voices.
>Movie scripts. I swear, if you find a good movie script (like Star Trek IV)
>you will get people who read through the entire movie.
These would likely have Copyright issues - any recording of the dialog of a Copyrighted movie would be considered a derivative work. Might be able to find some open source movie scripts that people might be interested in.
>Making Voice-books. people can kill two birds with one stone - they can
>read in a document, or a chapter from a book, creating a Voice-book while
>adding to the database.
We already do this in association with LibriVox. See these links:
We also did a project with MojoMove411 for short readings (poetry & prose). This is also mentioned on the VoxForge home page (under the "How Can You Help?" section).
The difficulty, up until the past few weeks, was the post-processing required on an audiobook. To train an acoustic model with an audiobook, the speech needs to be segmented into sentences 10-15 words long. This process was semi-automated for the longest time. I have almost completed a script that will do this automatically (or much more automatically than was done in the past), with the help of HTK and Sequitur G2P. This page summarizes the current process: Automated Audio Segmentation Using Forced Alignment (Draft).
I currently have a backlog of audiobook chapters that I need to process - these are all listed at the bottom of this page. That is my current focus.
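To illustrate just the segmentation step (this is not the actual VoxForge script, and it does not call HTK or Sequitur G2P): assume forced alignment has already produced a start and end time for every word. One simple way to get 10-15 word segments is then to cut at the longest pause that falls inside that window. The function name and the tuple layout below are assumptions for the sake of the example.

```python
def segment_words(words, min_words=10, max_words=15):
    """Group time-aligned words into segments of min_words..max_words words.

    words is a list of (text, start_sec, end_sec) tuples in time order, as
    might be read from a forced-alignment output (hypothetical format).
    Each cut is placed at the longest silence between two words inside the
    allowed window. The final segment keeps whatever is left over, so it
    may be shorter than min_words.
    """
    segments = []
    i = 0
    n = len(words)
    while i < n:
        if n - i <= max_words:
            segments.append(words[i:])
            break
        # k = number of words in this segment; the pause considered is the
        # gap between the end of word k-1 and the start of word k.
        best_k = max(
            range(min_words, max_words + 1),
            key=lambda k: words[i + k][1] - words[i + k - 1][2],
        )
        segments.append(words[i:i + best_k])
        i += best_k
    return segments


if __name__ == "__main__":
    demo = []
    t = 0.0
    for idx in range(32):
        demo.append((f"w{idx}", t, t + 0.3))
        # Long pauses after the 12th and 24th words, short gaps elsewhere.
        t += 0.3 + (0.8 if idx in (11, 23) else 0.1)
    for seg in segment_words(demo):
        print(len(seg), seg[0][1], seg[-1][2])
```

Each segment's first start time and last end time give the points at which to cut the audio file.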
>Have two text boxes: URL[ ], starting from [ ]
>So if I put in URL[http://www.chordie.com/chord.pere/www.ultimate-guitar.com/print.php?what=tab&id=456256],
>starting from [I'm afraid]
>It starts presenting text from this location, one sentence at a time. hit spacebar to advance.
This is a very interesting idea, and is similar to one that "V" proposed last month.
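A minimal sketch of the "starting from" part of that idea, assuming the page text has already been fetched and stripped of markup (the fetching, the copyright check, and the spacebar handling are all left out). The function name and the naive sentence splitting on ., ! and ? are assumptions for illustration only.

```python
import re


def sentences_from(text, start_phrase):
    """Return the sentences of text, beginning at the first start_phrase.

    Sentences are split naively after '.', '!' or '?' followed by
    whitespace, which is good enough for a sketch but will mis-split
    abbreviations such as 'Dr. Smith'.
    """
    pos = text.find(start_phrase)
    if pos == -1:
        raise ValueError(f"start phrase not found: {start_phrase!r}")
    tail = text[pos:].strip()
    parts = re.split(r"(?<=[.!?])\s+", tail)
    return [p for p in parts if p]


if __name__ == "__main__":
    page = "Intro line. I'm afraid of the dark. Are you? Yes!"
    for prompt in sentences_from(page, "I'm afraid"):
        print(prompt)  # a real applet would wait for the spacebar here
```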
Thanks for the ideas. As I said, my current focus is on processing the LibriVox audio. Next I plan to look at improving scalability of the Submission app using Google App Engine, and then updating the Speech Submission app to allow for more than 10 submissions at a time, and maybe allowing users to submit their own text. Might be a good opportunity to update the prompts.
I can take a look at your recommendations for a usage/contributions graph and a thermometer after that.
If you have any suggestions/ideas about improving bandwidth/capacity, please let me know,
thanks for your interest in VoxForge!
Ken
--- (Edited on 5/6/2008 3:06 pm [GMT-0400] by kmaclean) ---
From my communication with RMS, it looks like the FSF would be happy to provide a suitable server that can cope with the bandwidth.
Let's move this to e-mail, and post when it's resolved!
As regards copyright problems, I think you're worrying about unnecessary things. You're in the same situation that YouTube is in. By allowing people to upload anything there is a possibility of copyright breach. So all you would have to do is mirror their policy on removing copyrighted material, and you'll be in the clear. I'm pretty sure that this policy is that they remove a video as soon as it is drawn to their attention that it is in breach of copyright. As this entire endeavour is a community service, I doubt such a situation would ever arise.
Is it possible to check from the webpage whether a user has Java enabled/installed, and link them straight to the download if necessary? At the moment anyone coming to this site, I guess, is pretty dedicated, and will take the time to sort out technology problems. But if we're going to advertise to a larger group of people, many of whom won't have Java, it may pay to make things run as smoothly as possible.
I have a couple of other ideas for getting content on to the database. The first idea is to build a decent front-end for Vista speech-recognition, which captures phrases and sends them to the database. The current front end is pretty unusable. If someone builds a good front end I think a lot of people would use it. I will have a look at this next week.
The second idea is only of any use once you have enough hours of training. The idea is a remote server-based continuous speech recogniser. Users have a satellite utility. When they speak, the satellite compresses the audio and sends it to the server, which performs speech recognition and returns ASCII text. If the user confirms this on the satellite, confirmation is sent to the server, which adds the phrase to the database.
Sam
--- (Edited on 5/7/2008 10:14 am [GMT-0500] by Visitor) ---
Hi Sam,
>As regards copyright problems, I think you're worrying about unnecessary things. you're in the same situation that YouTube is in.
YouTube/Google has much deeper pockets and a top-notch legal team to deal with spurious Copyright claims/actions. Unfortunately, I don't ...
For this reason we will continue using a conservative approach to dealing with Copyright concerns.
> is it possible to check from the webpage whether a user has Java enabled/ installed, and link them straight to the download if necessary?
If the Java applet doesn't load, the user is (supposed to be ...) presented with a message and link to the Java Troubleshooting page, where they can download Java. The relevant html code is as follows:
<table width="550" style="background-color: rgb(255, 195, 1);" height="24" border="1"><tbody>
<tr><td> The VoxForge Speech Submission Applet should appear here. Please see the VoxForge <a color="ffffff" href="http://www.voxforge.org/home/read2/java">Java Troubleshooting Guide</a> to determine if you have Java installed on your PC - this is required in order to use the audio recorder.<br></td></tr></tbody>
</table>
If this does not work, then I'll have to do some additional testing/troubleshooting to figure out why. Which O/S and browser were you having problems on?
Ken
--- (Edited on 5/8/2008 9:32 am [GMT-0400] by kmaclean) ---
Hi Sam,
Thanks for the ideas!
> the first idea is to build a decent front-end for Vista speech-recognition, which captures phrases and sends them to the database.
I am a bit hesitant to spend time creating something for Vista ... there is plenty of work to be done in Linux to get FOSS speech recognition going :)
> the idea is a remote server based continuous speech recogniser.
Added this to the ideas page.
thanks,
Ken
--- (Edited on 5/8/2008 11:27 am [GMT-0400] by kmaclean) ---