Sentences from OpenTaal
User: dano
Date: 3/15/2008 12:20 pm
Views: 6549
Rating: 20

The OpenTaal project (at opentaal.org) has collected a huge number of Dutch sentences. OpenTaal is a Dutch project that creates dictionaries, grammar checking and synonym lists, with much more to come. They also created 'Wordsharvester', a small app that collects and counts words from all over the web and can be used for any language, not just Dutch.

Here is the link (more than 300 MB):

http://opentaal.org/opentaalbank/test/zinnen.tgz

They have also made a collection of the most frequently used word combinations (2, 3, 4 and 5 words), but it is currently not accessible due to a MySQL error.
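
As a rough illustration of what counting such word combinations involves, here is a minimal Python sketch that tallies 2- to 5-word combinations in a few sentences. The function name `count_ngrams`, the simple whitespace tokenization and the two sample sentences are assumptions made for illustration only; this is not necessarily how OpenTaal built its collection.

```python
from collections import Counter


def count_ngrams(sentences, n_min=2, n_max=5):
    """Count word combinations of n_min..n_max words within each sentence.

    Hypothetical helper: tokenization is a plain lowercase whitespace split,
    which is far cruder than what a real corpus tool would use.
    """
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(n_min, n_max + 1):
            # Slide a window of n words over the sentence.
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


# Two made-up sample sentences, not taken from the OpenTaal corpus.
sample = ["dit is een zin", "dit is een test"]
counts = count_ngrams(sample)
print(counts.most_common(3))
```

Counting per sentence, rather than over the whole file as one stream, keeps combinations from spanning sentence boundaries, which seems sensible for a corpus of loose sentences.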

I would like to thank OpenTaal very much for the work they have done (and are doing)!

Well, my question is: what exactly can we do with this huge amount of information? How can we put it to use in the best way? Are there any ways of doing this yet?

--- (Edited on 3/15/2008 12:25 pm [GMT-0500] by dano) ---

Re: Sentences from OpenTaal
User: Robin
Date: 3/15/2008 5:17 pm
Views: 299
Rating: 68

Thanks Dano,

I knew of OpenTaal and use their dictionary to make sure all Dutch prompts follow the latest official spelling. Dutch is a bit of a moving target, partly due to government interference with spelling and partly because it is so open to international influences.

I didn't know about these sentences though, and it's a great find. I'll have a look at the content (I won't read it all!) and if it's good we can use it to make our first Dutch language model.

I'm not sure if it can be GPL, I would have to see the content first.

--- (Edited on 3/15/2008 5:17 pm [GMT-0500] by Robin) ---

OpenTaal seems to use the GPL.
User: ralfherzog
Date: 3/15/2008 8:44 pm
Views: 209
Rating: 18

Hello Robin,

Probably, they are using the GPL.

Greetings, Ralf

--- (Edited on 3/15/2008 8:44 pm [GMT-0500] by ralfherzog) ---

Re: OpenTaal seems to use the GPL.
User: Visitor
Date: 3/16/2008 4:31 am
Views: 213
Rating: 20

I think it's not GPL and we can't make it GPL. Though it is not, we can use it for the Dutch language model and make that GPL.

--- (Edited on 3/16/2008 4:32 am [GMT-0500] by Visitor) ---

Re: OpenTaal seems to use the GPL.
User: kmaclean
Date: 3/16/2008 8:23 am
Views: 316
Rating: 22

Hi Visitor, 

Thanks for the post, my comments follow: 

>I think it's not GPL and we can't make it GPL.

Why do you think it is not GPL? I can't read Dutch, but the Google translation seems to say that the project owners think it is ...

Is it because it collected source words from all over the web without permission? I agree that would seem to be problematic under a strict copyright/GPL interpretation. My first reaction on reviewing the OpenTaal site was that they might have problems if they were to publish the source text they've collected from the web for their language model.

I think they try to get around this by telling people that, if they think their copyrights have been infringed, OpenTaal will take their content out of the text corpus.

>Though it is not, we can use it for the Dutch language model and make that GPL.

I agree this approach might be good for the short term (since this is essentially what Google has done with their language model). However, we would still have the same problem we had with acoustic models, where the source text/audio is closed source but the acoustic/language models are freely distributable. I guess we need to think about what we want to do in the short and long term.

For English, I would like to try to have fully GPL acoustic and language models, with all original audio and text Freely available, with no potential Copyright issues.  But I don't think this approach would necessarily be realistic for other languages.

What does everyone think? 

Ken 

--- (Edited on 3/16/2008 9:23 am [GMT-0400] by kmaclean) ---

Re: OpenTaal seems to use the GPL.
User: dano
Date: 3/16/2008 8:41 am
Views: 201
Rating: 17

I think, as Ruud Baars from OpenTaal also said, that the sentences are so far out of context that we don't have to worry about copyright issues when using them in a language model. I think GPL'd text is better, but if we don't have it, I think we can use single sentences for this. OpenTaal did the same with their dictionary and will also do so for their grammar rules.

--- (Edited on 3/16/2008 8:41 am [GMT-0500] by dano) ---

Re: OpenTaal seems to use the GPL.
User: Visitor
Date: 3/16/2008 1:54 pm
Views: 210
Rating: 17

I looked at the text and I'm quite sure there would be copyright issues involved. Even loose sentences get some protection under Dutch copyright law (a slimmed-down version of the 'normal' copyright). That is especially so here: the sentences are a bit mixed, but you can still recognize that some belong together, so one could reconstruct the stories. Quite a challenging puzzle actually!

However, I see no problems in using these texts to make a language model and then licensing that model under a GPL-compatible licence. One that doesn't require making the source code available.

I think we could possibly use the latest version of the LGPL for that purpose (though I only looked into that quickly). Under Dutch copyright law we should be safe then, as I don't think the model will be seen as a derived work. Belgian or Surinamese copyright law won't be much different in this respect (actually, I doubt a judge in any jurisdiction would ever decide differently, since it is a stretch).

At the same time we can continue collecting GPL-text, as in the long run that might be (a bit) better.

R

--- (Edited on 3/16/2008 1:54 pm [GMT-0500] by Visitor) ---

OpenTaal uses the LGPL.
User: Robin
Date: 3/16/2008 2:19 pm
Views: 2587
Rating: 21

I can add that OpenTaal also uses the LGPL (not the GPL) for the spelling lists they produced for OpenOffice.org (OOo), Firefox and Thunderbird.

--- (Edited on 3/16/2008 2:19 pm [GMT-0500] by Robin) ---
