GSOC 2011 - student showing interest in a Simon project to help collect speech for VoxForge

Audio and Prompts Discussions

User: kmaclean
Date: 4/15/2011 7:27 pm

Views: 8132
Rating: 16

Ahel asks:

Hi!
I intend to apply for this idea related to Voxforge: briefly, Peter (my mentor) observed that everyone who installs simon trains the vocabulary recognition for himself, and so on; so it would be cool if simon would envoy semi-automatically the trained audio files that Voxforge is collecting.
May I know what you think about this idea?
I'd also like to discuss with you which approaches, in your opinion, would be best for implementing the technology in support of this idea.
Thank you for your time.
Please feel free to contact me as soon as you wish.

--- (Edited on 4/15/2011 8:27 pm [GMT-0400] by kmaclean) ---

Re: GSOC 2011 - student showing interest in a Simon project to help collect speech for VoxForge

User: kmaclean
Date: 4/15/2011 7:28 pm

Views: 148
Rating: 12

My reply:

Hi Ahel,

Glad to hear that you are interested in helping out with open source speech recognition!

My replies follow:

On Tue, Apr 5, 2011 at 5:50 PM, Ahel ibn Alquivr <[email protected]> wrote:

Hi!
I intend to apply for this idea related to Voxforge: briefly, Peter (my mentor) observed that everyone who installs simon trains the vocabulary recognition for himself, and so on; so it would be cool if simon would envoy semi-automatically the trained audio files that Voxforge is collecting.

When you say "envoy" do you mean send or upload to the VoxForge collection site?

May I know what you think about this idea?

I think it is a great idea, especially for the languages for which we have fewer submissions (i.e. non-English languages). For English, it would be an excellent way to obtain more speech, and to improve the quality of submissions, since, as Peter mentions in one of the comments on the page you linked, Simon can help with filtering out poor recordings.

Some things to think about:

How much speech do we need?

Peter also mentions in the GSOC project description that we need lots more speech to create acoustic models for dictation applications (unfortunately, I have helped spread that misconception...). Though more speech generally is better, there are diminishing returns after a point - and it is not clear where that point is. See the following links:

Nickolay's (CMU Sphinx maintainer) post in this thread: VoxForge Acoustic Models w/ Sphinx 4
Nickolay's blog post here: How to create a speech recognition application for your needs
Arthur Chan's (former CMU Sphinx maintainer) paper here: Do we have a true open source dictation machine?
The CMU Sphinx site's speech corpus size estimates

I have guessed 140 hours of speech is required for a good command and control acoustic model - simply because that is what the main CMU Sphinx acoustic model uses...

Not sure how much is required for dictation, though one would "assume" more would be required. However, as Nickolay clearly argues in his posts, and the papers he links to, more is not necessarily better.

Types of speech

There are many types of speech that can be collected - see this FAQ entry: What is a speech corpus or speech corpora? Command and control can use acoustic models with 'read' speech, and work quite well.

Dictation acoustic models *may* need to be trained with 'spontaneous speech' rather than 'read' speech (or a mix thereof) to be effective. See this thread: how to get more voice samples?

Conclusion

I can see your application being very good at collecting lots of 'read' speech for the applications that people actually use command and control speech recognition for... That in and of itself would be very valuable to the open source community.

I'd also like to discuss with you which approaches, in your opinion, would be best for implementing the technology in support of this idea.

The VoxForge speech submission applet uses Java-based Postlet code for its client uploader, and the PHP server-side code described on the Postlet site.

That should give you a pretty good idea of what is needed on the upload side. Please make sure the URL of the upload site is configurable by the user, so that if we change providers, it is just a simple change of URL.
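As a rough illustration of the configurable-URL request, here is a minimal Python sketch. The INI file layout, the `upload` section, the `url` key and the default address are all assumptions made for this example; none of them come from simon, ssc or Postlet.

```python
import configparser

# Hypothetical fallback address; the thread does not name the real endpoint.
DEFAULT_UPLOAD_URL = "https://upload.example.org/submit"


def load_upload_url(config_path):
    """Return the upload URL from an INI file, or the default if it is unset."""
    # Interpolation is disabled so that '%' characters in a URL are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    # read() silently skips files that do not exist.
    parser.read(config_path)
    return parser.get("upload", "url", fallback=DEFAULT_UPLOAD_URL)
```

With something like this, switching providers means editing one line in the user's config file (for example `url = https://new-host.example.net/up` under `[upload]`) rather than shipping a new build.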

Do you have any other approaches that might work?

If you have any further questions, please let me know,

regards,

Ken

--- (Edited on 4/15/2011 8:28 pm [GMT-0400] by kmaclean) ---

Re: GSOC 2011 - student showing interest in a Simon project to help collect speech for VoxForge

User: kmaclean
Date: 4/15/2011 7:29 pm

Views: 143
Rating: 14

Reply from Ahel:

Hi Ken,

Thank you for your answer, and I'm sorry not to have answered until now.

My replies follow:

On Wed, Apr 6, 2011 at 6:28 PM, Ken MacLean <[email protected]> wrote:

Hi Ahel,

Glad to hear that you are interested in helping out with open source speech recognition!

My replies follow:

On Tue, Apr 5, 2011 at 5:50 PM, Ahel ibn Alquivr <[email protected]> wrote:

Hi!
I intend to apply for this idea related to Voxforge: briefly, Peter (my mentor) observed that everyone who installs simon trains the vocabulary recognition for himself, and so on; so it would be cool if simon would envoy semi-automatically the trained audio files that Voxforge is collecting.

When you say "envoy" do you mean send or upload to the VoxForge collection site?

Sorry for my English mistakes; I meant that, thanks to this feature, simon would send the trained audio files to Voxforge.

May I know what you think about this idea?

I think it is a great idea, especially for the languages for which we have fewer submissions (i.e. non-English languages). For English, it would be an excellent way to obtain more speech, and to improve the quality of submissions, since, as Peter mentions in one of the comments on the page you linked, Simon can help with filtering out poor recordings.

Thank you.

Some things to think about:

I know this will be the not-so-pleasant part.

How much speech do we need?

Peter also mentions in the GSOC project description that we need lots more speech to create acoustic models for dictation applications (unfortunately, I have helped spread that misconception...). Though more speech generally is better, there are diminishing returns after a point - and it is not clear where that point is. See the following links:

Nickolay's (CMU Sphinx maintainer) post in this thread: VoxForge Acoustic Models w/ Sphinx 4

Nickolay's blog post here: How to create a speech recognition application for your needs

Arthur Chan's (former CMU Sphinx maintainer) paper here: Do we have a true open source dictation machine?

The CMU Sphinx site's speech corpus size estimates

Thank you for pointing out the contradictions and problems underlying this approach.
I have to think hard about those concepts.
For sure, it can be interesting for unusual languages and accents.

I have guessed 140 hours of speech is required for a good command and control acoustic model - simply because that is what the main CMU Sphinx acoustic model uses...

Not sure how much is required for dictation, though one would "assume" more would be required. However, as Nickolay clearly argues in his posts, and the papers he links to, more is not necessarily better.

I've seen, yes.

Types of speech

There are many types of speech that can be collected - see this FAQ entry: What is a speech corpus or speech corpora? Command and control can use acoustic models with 'read' speech, and work quite well.

In my situation, for now, I must trust you; but later I'll manage to understand why.

Dictation acoustic models *may* need to be trained with 'spontaneous speech' rather than 'read' speech (or a mix thereof) to be effective. See this thread: how to get more voice samples?

Conclusion

I can see your application being very good at collecting lots of 'read' speech for the applications that people actually use command and control speech recognition for... That in and of itself would be very valuable to the open source community.

Thank you.

I'd also like to discuss with you which approaches, in your opinion, would be best for implementing the technology in support of this idea.

The VoxForge speech submission applet uses Java-based Postlet code for its client uploader, and the PHP server-side code described on the Postlet site.

Thank you, this information will surely be necessary when I work on the client-server interface.

That should give you a pretty good idea of what is needed on the upload side. Please make sure the URL of the upload site is configurable by the user, so that if we change providers, it is just a simple change of URL.

Do you think it's a good idea to make users change a URL by themselves?

Do you have any other approaches that might work?

I thought it could simply be editable in the source, so that when you change providers, a new version can be released with a simple patch containing the new URL posted beforehand by Voxforge.

--- (Edited on 4/16/2011 12:19 pm [GMT-0400] by kmaclean) ---

Re: GSOC 2011 - student showing interest in a Simon project to help collect speech for VoxForge

User: kmaclean
Date: 4/15/2011 7:32 pm

Views: 124
Rating: 12

Email from his mentor, Peter:

[...]

How much speech do we need?
Peter also mentions in the GSOC project description that we need lots more speech to create acoustic models for dictation applications (unfortunately, I have helped spread that misconception...). Though more speech generally is better, there are diminishing returns after a point - and it is not clear where that point is.

Don't worry, I know that this won't automatically turn the Voxforge models into high quality dictation models :)

We have been experimenting with model generation ourselves (partly under professional counsel) so this is not really new for me. I also can't remember saying that this initiative is directly related to dictation (if you read the blog post again, I was merely asserting the quality of the current model).

However, having more samples to play with never _hurts_. It allows us to define higher quality criteria for the submission (and still have enough samples). If the amount of samples is large enough, one can even break down the model according to dialect groups, etc.

And of course gathering samples is very important for non-English languages where the current corpus is very small.

I'd also like to discuss with you which approaches, in your opinion, would be best for implementing the technology in support of this idea.

The VoxForge speech submission applet uses Java-based Postlet code for its client uploader, and the PHP server-side code described on the Postlet site.

Have you looked into ssc / sscd?
We already use it for our sample acquisition ourselves and having sscd and simon compatible would be a huge plus for us.
Would you consider setting up an sscd server for Voxforge?

That should give you a pretty good idea of what is needed on the upload side. Please make sure the URL of the upload site is configurable by the user, so that if we change providers, it is just a simple change of URL.

Don't worry, it will be :)

Best regards,
Peter

--- (Edited on 4/15/2011 8:32 pm [GMT-0400] by kmaclean) ---

Re: GSOC 2011 - student showing interest in a Simon project to help collect speech for VoxForge

User: kmaclean
Date: 4/15/2011 7:35 pm

Views: 153
Rating: 13

Hi Pete/Ahel,

[...]

My replies follow:

Don't worry, I know that this won't automatically turn the Voxforge models into high quality dictation models :)

Just wanted to make sure neither of you assumed more speech was the 'silver bullet' for dictation (which was what I thought when I started VoxForge...)

We have been experimenting with model generation ourselves (partly under professional counsel) so this is not really new for me. I also can't remember saying that this initiative is directly related to dictation (if you read the blog post again, I was merely asserting the quality of the current model).

OK, I guess I assumed that this passage in your blog: "The current Voxforge model for English is quite good for command and control but nowhere near powerful enough for dictation." meant you thought that all that is required for dictation is more speech... no worries, we are on the same page...

However, having more samples to play with never _hurts_. It allows us to define higher quality criteria for the submission (and still have enough samples). If the amount of samples is large enough, one can even break down the model according to dialect groups, etc.

Agree

And of course gathering samples is very important for non-English languages where the current corpus is very small.

Agree

Have you looked into ssc / sscd?

I've seen some documentation you have written on it, but never played with it...

We already use it for our sample acquisition ourselves and having sscd and simon compatible would be a huge plus for us.
Would you consider setting up an sscd server for Voxforge?

depends - can it run on a regular web hoster, or does it need a specialized server instance?

The VoxForge front end (WebGUI CMS) is on a low-bandwidth server, and the VoxForge speech submission applet uploads to PHP code on a web hosting account. This was done on the assumption that the front-end web server's connection (10 mbit down / 1 mbit up) was not fast enough to accommodate speech uploads (along with current web traffic). If the ssc could throttle itself a bit, then maybe it could be installed on the VoxForge server.

--- (Edited on 4/15/2011 8:35 pm [GMT-0400] by kmaclean) ---

Re: GSOC 2011 - student showing interest in a Simon project to help collect speech for VoxForge

User: kmaclean
Date: 4/15/2011 7:36 pm

Views: 147
Rating: 14

Hello everybody,

On 2011-04-08 16:16, Ken MacLean wrote:

We already use it for our sample acquisition ourselves and having sscd and simon compatible would be a huge plus for us.
Would you consider setting up an sscd server for Voxforge?

depends - can it run on a regular web hoster, or does it need a specialized server instance?

Well it has a server component so you need to have _some_ access. It doesn't really need root access but if it doesn't have one it can't bind to any of the lower ports (OS restriction).

Realistically, it should run on a server to which you have at least SSH access. It also requires a MySQL database...
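To make the lower-port restriction concrete, here is a small Python sketch. It assumes the traditional Unix convention that ports below 1024 are reserved for privileged processes (some systems allow this boundary to be changed), and it is not taken from sscd; the function names are made up for illustration.

```python
import socket

# Traditional Unix boundary: ports below this usually need elevated privileges.
PRIVILEGED_PORT_LIMIT = 1024


def is_privileged_port(port):
    """True if the port falls in the range normally reserved for root."""
    return 0 < port < PRIVILEGED_PORT_LIMIT


def can_bind(port, host="127.0.0.1"):
    """Try to bind a TCP socket to (host, port) and report whether it worked."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Covers permission errors as well as "address already in use".
            return False
        return True
```

On a typical shared hosting account, `can_bind(80)` would be expected to fail while a high port such as 8080 may succeed, which is why a server component that lets you choose its port is easier to host.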

The VoxForge front end (WebGUI CMS) is on a server in my basement, and the VoxForge speech submission applet uploads to php code on a 1&1 web hosting account. This was done on the assumption that my home internet connection (10 mbit down/ 1 mbit up) was not fast enough to accommodate speech uploads (along with current web traffic). If the ssc could throttle itself a bit, then maybe it could be installed on the VoxForge server.

Well you can set up QoS to limit the impact of incoming traffic on your web traffic. It really shouldn't be noticeable.

The nice thing about speech submissions is that they will probably be limited by the upload speed of the clients (asymmetric lines have much faster download speeds than upload speeds). So you alone could accommodate 10 other people having the same internet connection you have (10 x 1mbit up = your 10mbit down).
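The arithmetic behind that estimate can be written out as a short Python sketch. It is an idealized calculation that ignores protocol overhead and any other traffic sharing the line; the function names and the 1 MB example file size are assumptions for illustration.

```python
def max_concurrent_uploaders(server_down_mbit, client_up_mbit):
    """Number of clients uploading at full speed that fit in the server's downlink."""
    if client_up_mbit <= 0:
        raise ValueError("client upload speed must be positive")
    return int(server_down_mbit // client_up_mbit)


def upload_seconds(file_megabytes, client_up_mbit):
    """Ideal time for one client to upload a file (8 bits per byte)."""
    return file_megabytes * 8 / client_up_mbit


print(max_concurrent_uploaders(10, 1))  # 10
print(upload_seconds(1, 1))             # 8.0
```

So on a 10 mbit down / 1 mbit up line, ten such clients could in principle upload at once, and each would need about eight seconds per megabyte of audio.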

Maybe you could have a look at ssc/sscd? Both have a manual but you can also ask me if you have any difficulties...

Best regards,
Peter

--- (Edited on 4/15/2011 8:36 pm [GMT-0400] by kmaclean) ---

Re: GSOC 2011 - student showing interest in a Simon project to help collect speech for VoxForge

User: kmaclean
Date: 4/15/2011 7:37 pm

Views: 132
Rating: 12

Well it has a server component so you need to have _some_ access. It doesn't really need root access but if it doesn't have one it can't bind to any of the lower ports (OS restriction).

my 1&1 account does not allow access to the lower ports.

I assume you mean megabit, right?

yes.

Well you can set up QoS to limit the impact of incoming traffic on your web traffic. It really shouldn't be noticeable.

So you alone could accommodate 10 other people having the same internet connection you have (10 x 1mbit up = your 10mbit down).

I agree, in theory - but doesn't seem to work that way in practice for some reason...

Maybe you could have a look at ssc/sscd? Both have a manual but you can also ask me if you have any difficulties...

Will do, after April 23, I will have much more free time.

Noticed that all audio is stored in MySQL (rather than pointers to static files), is there a reason for this? What kind of security does sscd use - has it been used on the Internet or mostly in "behind the firewall" LAN configurations?

thanks,

Ken

--- (Edited on 4/15/2011 8:37 pm [GMT-0400] by kmaclean) ---

Re: GSOC 2011 - student showing interest in a Simon project to help collect speech for VoxForge

User: kmaclean
Date: 4/15/2011 7:37 pm

Views: 486
Rating: 15

Well it has a server component so you need to have _some_ access. It doesn't really need root access but if it doesn't have one it can't bind to any of the lower ports (OS restriction).

my 1&1 account does not allow access to the lower ports.

The port is configurable so that wouldn't be that much of a problem but setting it up on a system where you have full control would probably be easier.

I really don't think that we would get a flood of submissions (and if we really are that overwhelmed with good recordings I'm sure we can arrange for more powerful hardware / more bandwidth - quite possibly during a break of all the celebration :P).

So you alone could accommodate 10 other people having the same internet connection you have (10 x 1mbit up = your 10mbit down).

I agree, in theory - but doesn't seem to work that way in practice for some reason...

Traffic shaping / QoS can make a ton of difference. But again, I really don't think that bandwidth will be the bottleneck (at least not at the beginning).

Maybe you could have a look at ssc/sscd? Both have a manual but you can also ask me if you have any difficulties...

Will do, after April 23, I will have much more free time.

Feel free to contact me for more information!

Noticed that all audio is stored in MySQL (rather than pointers to static files), is there a reason for this? What kind of security does sscd use - has it been used on the Internet or mostly in "behind the firewall" LAN configurations?

It actually only stores the filenames in the database.

sscd doesn't yet provide any kind of security (we use a VPN setup) but adding some shouldn't really be hard. Note that the ssc protocol does not yet allow retrieving samples over the network (only retrieving user data and uploading data). To create models from the data I wrote a couple of scripts that can query the db, retrieve the samples and write appropriate prompts files.
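For readers wondering what such a script might look like, here is a hedged Python sketch. It is not Peter's actual code: it uses the standard-library `sqlite3` module as a stand-in for MySQL so that it is self-contained, and the `samples` table, its `path` and `prompt` columns, and the "file id followed by transcription" line layout of the prompts file are all assumptions.

```python
import sqlite3
from pathlib import Path


def write_prompts(conn, prompts_path):
    """Write one 'file_id transcription' line per sample; return the line count.

    Assumes a hypothetical table samples(path TEXT, prompt TEXT).
    """
    rows = conn.execute(
        "SELECT path, prompt FROM samples ORDER BY path"
    ).fetchall()
    lines = []
    for path, prompt in rows:
        # Use the audio file name without its extension as the file id.
        file_id = Path(path).stem
        lines.append(f"{file_id} {prompt}\n")
    Path(prompts_path).write_text("".join(lines), encoding="utf-8")
    return len(lines)
```

Because the database only holds file names, a real script would also need to copy or link the referenced audio files next to the prompts file before training.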

So yes we would need to adjust the sscd a bit to fit in this use case (also: fields in the database, etc.) and I'm not sure that makes it viable for this GSoC proposal (short timeframe). Anyways, let's see if the project gets selected and if so (or not) then we can make further arrangements.

Best regards,
Peter

--- (Edited on 4/15/2011 8:37 pm [GMT-0400] by kmaclean) ---

Re: GSOC 2011 - student showing interest in a Simon project to help collect speech for VoxForge

User: kmaclean
Date: 4/25/2011 9:42 pm

Views: 3214
Rating: 13

Great news, Ahel's project proposal got accepted!

From the GSOC site:

Integration of Voxforge acoustic model recognition on simon and fully translating simon in Italian.

The main aim of this proposal is a working project that collects audio for acoustic models and then sends it to the Voxforge server by reusing simon code (in particular ssc/sscd). Voxforge is an open source project whose aim is to collect transcribed speech for use in Open Source Speech Recognition Engines. http://www.voxforge.org/ Simon is an open-source speech recognition program that can utilize the models created from the Voxforge data. http://simon-listens.org

--- (Edited on 4/25/2011 10:42 pm [GMT-0400] by kmaclean) ---
