General Discussion

License terms vs. existing databases
User: Visitor
Date: 10/12/2006 1:26 pm
Views: 5828
Rating: 24

I get the impression (please correct me if I'm wrong) that what you want from the GPL is this: if acoustic models are built using both VoxForge recordings and recordings from a second source, and those acoustic models are distributed, then the second-source recordings must also be distributed on request.

I am concerned that this would prevent the incorporation of data from collections such as those of the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA) in VoxForge acoustic models. These collections, by my guess, represent millions or tens of millions of dollars' worth of labor. For English and many other languages, there is enough data in these collections to immediately achieve a level of coverage that I think VoxForge on its own would take years to reach, or might never reach at all.

On the other hand, there are the licensing fees. I am not sure whether you have researched this yet, so I took a quick look at what's available. I did it in a rush, so please don't take what I say next as completely reliable.

The default LDC licenses seem to be a non-profit education and research license, which I guess doesn't fit you, and a for-profit license, which you might find expensive. However, it might be worth looking through the catalog for the license terms attached to the databases available under non-default licenses. (The catalog search on their web site does not seem to work properly, but browsing the catalog seems to work fine.) Also, what if Sun Microsystems, which has funded some of the recent work on Sphinx, wants to license LDC databases to train acoustic models for dictation in OpenOffice? Would your license prevent them from using both LDC databases and VoxForge data?

A quick browse of the ELRA catalog shows that there are some ELRA databases for which a commercial license can be obtained at relatively low cost.


--- (Edited on 10/12/2006 1:26 pm [GMT-0500] by Visitor) ---

Re: License terms vs. existing databases
User: kmaclean
Date: 10/12/2006 8:40 pm
Views: 464
Rating: 28

Hi David,

Excellent questions! 

The reason I chose the GPL is to encourage the open source community to contribute transcribed speech - if you submit something, you know it will always benefit the community. In creating VoxForge, I did not set out to shut out third-party suppliers' speech corpora from the creation of VoxForge Acoustic Models - but with GPL licensing that will likely be the end result.

LDC and ELRA have been around a long time and have contributed greatly toward basic research and to getting Open Source Speech Recognition Engines to where they are today. However, they charge for their speech corpora, and with good reason: transcribing audio is a tedious and mistake-prone exercise, and you have to pay people to do it full time. I want to leverage the open source community by asking many people to contribute a little, rather than paying fewer people to contribute a lot. The GPL helps to encourage this process because people know their contribution will always be available.

My belief is that if we are to truly grow Open Source Speech Recognition, we need free Open Source Speech Corpora. Apache/BSD-style licenses have their place, but in the context of speech recognition, they have not created a self-sustaining open source community - there is not a big enough user base (yet...). My hope is that GPL licensing will improve the situation. It's not perfect, but given our goals, it is the best available choice.

All the best,



--- (Edited on 10/13/2006 9:24 am [GMT-0400] by kmaclean) ---