VoxForge
I would like to get your input on the following.
As far as I understand, the GPL doesn't allow non-free derivative works of a GPL-licensed work.
However, I believe that it is impossible to know whether an acoustic model has been built from VoxForge's audio corpora. Basically, the creation of an acoustic model requires:
pre-processing -> feature vector extraction -> classification
For example, a hidden Markov model is composed of state transition probabilities and of pairs of means and variances for the observation probability distributions...
There's no way to be sure that an acoustic model comes from VoxForge's audio corpora, and the commercial product in question will never have to ship the audio corpora, only the acoustic models... In this respect, how do you control that a commercial product doesn't use your corpora?
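To make this concrete, here is a minimal Python sketch of the idea, not of any real toolkit's training code. It assumes a single one-dimensional Gaussian stands in for one HMM state's observation distribution, and the two "corpora" are made-up numbers rather than real feature vectors. The function name `fit_gaussian` is hypothetical.

```python
from statistics import mean, pvariance


def fit_gaussian(feature_values):
    """Estimate (mean, variance) of a 1-D Gaussian from feature values.

    This stands in for the parameters stored for one HMM state;
    real acoustic models use many multi-dimensional Gaussians.
    """
    return mean(feature_values), pvariance(feature_values)


# Two different, invented sets of feature values (assumption: toy data,
# not real MFCCs or any actual corpus).
corpus_a = [1.0, 2.0, 3.0, 4.0, 5.0]
corpus_b = [1.0, 3.0, 3.0, 5.0]

model_a = fit_gaussian(corpus_a)
model_b = fit_gaussian(corpus_b)

# The "model" holds only summary statistics, so different data
# can end up with exactly the same parameters.
print(model_a, model_b, model_a == model_b)
```

Both toy corpora give the same mean and variance, so the shipped parameters alone cannot tell you which one was used for training.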
Thanks for your input
Mathieu
--- (Edited on 5/28/2008 9:25 am [GMT-0500] by Visitor) ---
Hi Mathieu,
>commercial product in question will never have to ship the audio corpora, only
>the acoustic models... In this respect, how do you control that a commercial
>product doesn't use your corpora?
We will likely have to rely on evidence that is not contained in the acoustic model itself, like disgruntled employees...
I would welcome any input on possible technical solutions to this problem.
Thanks,
Ken
--- (Edited on 5/28/2008 1:45 pm [GMT-0400] by kmaclean) ---
--- (Edited on 5/28/2008 10:43 pm [GMT-0500] by Visitor) ---
A related question:
I'm currently building models that contain both VoxForge and non-VoxForge audio. Is this allowed at all?
I'll happily release the models; I've just been too lazy to do so yet (also because they're not really usable for anything).
Cheers, Timo
--- (Edited on 2008-06-11 13:59 [GMT+0200] by timobaumann) ---
Hi Timo,
The general rule is that you can *use* a GPL'ed work (like VoxForge speech audio & transcription texts) any way you like. However, the moment you *distribute* a GPL'ed work, or derivative works thereof, then the GPL license requires that the *entire* work be distributed under the GPL.
Therefore, in your particular case, the GPL does not prevent you from creating a 'binary' acoustic model from VoxForge and non-Voxforge 'source' audio, and *using* it for your own purposes.
However, if you decide to *distribute* this acoustic model, it is covered by the GPL, and you must make available all the 'source' (VoxForge and non-VoxForge audio, and texts, ...) which was used to create the 'binary' acoustic model.
The FSF FAQ has some information that is helpful with respect to how you might distribute a large corpus of 'source' audio:
Can I put the binaries on my Internet server and put the source on a different Internet site?
The GPL says you must offer access to copy the source code "from the same place"; that is, next to the binaries. However, if you make arrangements with another site to keep the necessary source code available, and put a link or cross-reference to the source code next to the binaries, we think that qualifies as "from the same place".
...
Ken
p.s. I am not a lawyer, and this is not a legal opinion
--- (Edited on 6/11/2008 2:38 pm [GMT-0400] by kmaclean) ---
Hi Ken,
Thanks for that advice. I would probably have done a bad thing by releasing the models then. (It's a shame because the combined model works far better than any of the sub-corpus models, but if that's what the license says...) Well, let's hope that I will be able to reach the same performance with VoxForge-only models someday.
Cheers, Timo
--- (Edited on 2008-06-12 08:29 [GMT+0200] by timobaumann) ---
Came across an interesting thread in one of the Debian mailing lists (legal questions regarding machine learning models: msg#09321) where Mathieu Blondel asks:
[...]
For example, in speech recognition, speech models
are trained from databases of speech and their corresponding annotated
text. The models can then be used to recognize speech. To summarize
the "training" procedure with a black box:
input: data => [ training algorithm] => output: model
As can be seen from the arrows, this is a "one way" transformation,
i.e. it is possible to transform the data into a model but it's not
possible to transform the model back into exactly the same data. The
only possibility for someone to find whether his/her data were used to
create the model is to reproduce exactly the same training conditions
and train the data again to see if the resulting model is the same.
However, two implementations of the same algorithm may differ due to
design choices and algorithms themselves can have several parameters,
so it's not easy to reproduce the exact same training conditions. Even
then, there's no proof that some other data cannot lead to the same
model in some other training conditions.
[...]
My second question is: Given the difficulty to prove what data were
actually used to train a model, how can we prevent non-free software
to use free data such as those of Voxforge?
Josselin Mouette provides a possible solution:
A widely-used technique is to cleverly hide some minor bugs in the data.
If a non-free model shows the same bugs, you can prove the data was used
illegally. Of course this only works if you manage to keep the bugs
secret.
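One way this idea could look in practice is sketched below in Python. This is only an illustration under assumptions, not something VoxForge does: it assumes the "bugs" are made-up words planted in extra transcript prompts, and that the suspect product exposes a word list or pronunciation dictionary derived from those transcripts. A pure acoustic model would not contain words, so detection there would be harder. All function names and the prompt text are hypothetical.

```python
import hashlib
import hmac


def canary_token(secret: bytes, index: int, length: int = 8) -> str:
    """Derive a reproducible, word-like fake token from a secret key."""
    digest = hmac.new(secret, str(index).encode(), hashlib.sha256).digest()
    consonants, vowels = "bdfgklmnprstvz", "aeiou"
    chars = []
    for i in range(length):
        # Alternate consonant/vowel so the token looks like a word.
        alphabet = consonants if i % 2 == 0 else vowels
        chars.append(alphabet[digest[i] % len(alphabet)])
    return "".join(chars)


def plant_canaries(transcripts, secret, count):
    """Return the transcripts plus `count` extra prompts, one canary each."""
    canaries = [canary_token(secret, i) for i in range(count)]
    planted = list(transcripts) + [f"please say {word}" for word in canaries]
    return planted, canaries


def canaries_found(suspect_vocabulary, secret, count):
    """List the secret canaries that appear in a suspect word list."""
    vocabulary = set(suspect_vocabulary)
    expected = (canary_token(secret, i) for i in range(count))
    return [word for word in expected if word in vocabulary]


# Usage with invented data: the key stays private to the corpus owner.
secret_key = b"keep-this-private"
corpus, planted_words = plant_canaries(["hello world"], secret_key, 3)
suspect_words = {word for line in corpus for word in line.split()}
print(canaries_found(suspect_words, secret_key, 3))
```

Because the tokens are derived from a secret key, only the corpus owner can regenerate the list and check for it, which matches the point that the bugs have to stay secret.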
Ken
--- (Edited on 7/5/2009 10:18 pm [GMT-0400] by kmaclean) ---
Hi all,
We are planning to develop a speech recognition engine using HTK for our commercial application. The application will be based on a client-server model with a web interface.
But we are a bit confused about the HTK license model...
Could anyone please tell us whether we can build and use an HTK model and decoder for our application or not? If we change the source code, could any license issue arise in the future? And how would anyone be able to find and judge that code from HTK was used, if we changed and encrypted it?
Thanks...
--- (Edited on 11/9/2010 6:54 am [GMT-0600] by prabhu) ---
>whether we can build and use HTK model and decoder for
>our application or not...
As far as I know, you can do anything you want with acoustic models created using the HTK Toolkit.
With respect to the HTK toolkit itself, it is best to ask on the HTK email list or just look at their license.
Ken
P.S. I am not a lawyer, and this is not legal advice
--- (Edited on 11/9/2010 3:48 pm [GMT-0500] by kmaclean) ---