VoxForge
Email from Bilal Ghalib:
Hey guys!
What a sweet project you have, I actually stumbled across it while
trying to see if someone has already implemented an idea I had. I'll
suggest this to you:
DVD closed captioning: I have found a method to extract the captions and
the times at which they occur,
and to use them along with the extracted audio.
9000 hours of DVD audio/text is extracted each year, you not only get
text/speech correlation, you get the times as well.
What do you say?
--- (Edited on 2/4/2008 12:31 pm [GMT-0600] by speechsubmission) ---
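To illustrate the idea in the post above: once a caption's start and end times are known, the matching audio can be cut out of the extracted track. This is only a rough sketch and not Bilal's method. It assumes the DVD audio has already been converted to an uncompressed WAV file, and the function name cut_segment is made up for this example.

```python
import wave


def cut_segment(src_path, dst_path, start_s, end_s):
    """Copy the audio between start_s and end_s (in seconds) to a new WAV file.

    Returns the number of frames written. Assumes src_path is a PCM WAV file.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Convert caption times to frame positions, clamped to the file length.
        first = min(max(int(start_s * rate), 0), total)
        last = min(max(int(end_s * rate), first), total)
        src.setpos(first)
        frames = src.readframes(last - first)
        with wave.open(dst_path, "wb") as dst:
            # Keep the same channel count, sample width and rate as the source.
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)
    return last - first
```

Each caption then yields one short audio file paired with one line of text, which is the text/speech correlation described above.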
My reply:
Hi Bilal,
Sounds very interesting!
However, I don't know about the copyright implications of using closed-captioned text and the audio from the DVD. DVD audio tracks are copy-protected. I am not sure of the status of the closed-captioning text, but I would assume that it is protected too. VoxForge only accepts audio/text that we can redistribute under the GPL.
One possible solution might be to segment the DVD audio and randomly jumble up the segments (of text and matching audio). But I don't know if that might be considered a "derivative" work.
So the conservative approach is to only take speech read from public-domain or out-of-copyright texts, and create acoustic models from that.
If you know of a way around this, please let me know.
I'd like to post this on the VoxForge website to see if anyone might have some more information on this - let me know if this is OK.
thanks,
Ken
--- (Edited on 2/4/2008 12:31 pm [GMT-0600] by speechsubmission) ---
From Bilal:
[...] I've slept on it, and I don't want this to become another idea
that got away. Take it, publish it, and just make sure that people can
contact me: bilal AT modati DOT com if they'd like. I hope that way we can
get some answers on the copyright information, I've just asked the
EFF, who knows, maybe they'll have time to let us know.
I've heard some interesting things about segmentation and copyright,
but I'm no expert and could be wrong. But I believe that a segment under
a certain length is fine. Also, if we are not publishing the straight
audio, but a work made by computationally combining audio and text,
from which the original work cannot be recovered, isn't that OK? Or is
one of your goals to distribute the audio/text for others to
experiment on?
-BG (Really though, your librivox + Gutenberg idea is awesome, but
how are you getting around timing?)
--- (Edited on 2/4/2008 12:33 pm [GMT-0600] by speechsubmission) ---
--- (Edited on 2/4/2008 1:34 pm [GMT-0500] by kmaclean) ---
--- (Edited on 2/4/2008 1:35 pm [GMT-0500] by kmaclean) ---
I am working on a similar project (called MovieTrainer) as part of a university project at TUC (Technical University of Crete), and have used conventional methods for producing DivX/SRT files to extract the audio and subtitle data from DVD motion pictures. We also use already-ripped movies (SRT + AVI files) distributed over the network, which have the additional benefit of having already been authored. (The subs on DVDs are stored as images and have to be OCRed to be retrieved, which can introduce errors, so some basic authoring must take place.)
We are working primarily to prove that such datasets CAN be used for training, so at present we are gathering enough data to train with the CMU Sphinx trainer and compare the results with well-known 'proprietary' databases such as AURORA4 (part of the WSJ0).
Along the way we are producing a GUI to automate the process of extraction and authoring, as well as training/decoding of the data.
I was glad to see people working on similar ground (we are definitely not competing here), and I have a suggestion about licensing:
Why not use "Free" Movies to produce the datasets, such as "Revolution OS", "The Corporation", "Steal This Film" and others? (These also happen to be documentaries, i.e. they contain relatively clean audio.)
I hope we have more results to share with you soon; any suggestions are more than welcome.
--- (Edited on 2/7/2008 11:32 am [GMT-0600] by whoneedselta) ---
--- (Edited on 2/7/2008 11:34 am [GMT-0600] by whoneedselta) ---
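For readers wondering what "gathering data for the Sphinx trainer" might involve on the text side: SphinxTrain reads a file-ID list plus a transcription list whose lines look like `<s> words </s> (utterance_id)`. The sketch below builds both lists from caption text. It is not the MovieTrainer code; the function names, the utterance naming scheme, and the simple lowercase normalization are all assumptions made for illustration.

```python
import re


def normalize(text):
    """Lowercase a caption and keep only letters, digits and apostrophes."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def sphinx_lists(captions, prefix):
    """Return (fileids, transcripts) for a list of caption strings.

    Captions with no usable words (e.g. a lone music symbol) are skipped.
    The utterance IDs (prefix_0001, ...) are a hypothetical naming scheme.
    """
    fileids, transcripts = [], []
    for number, text in enumerate(captions, start=1):
        words = normalize(text)
        if not words:
            continue
        utt_id = f"{prefix}_{number:04d}"
        fileids.append(utt_id)
        transcripts.append(f"<s> {words} </s> ({utt_id})")
    return fileids, transcripts
```

Every word in the transcripts would also have to appear in the pronunciation dictionary used for training, so real caption text usually needs more cleanup than this (numbers, abbreviations, speaker labels).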
Wow, that's pretty sweet. Yeah, I too had to OCR the text out of the DVD. We're totally on the same ground; I'm very interested to see what sort of extraction automation you come up with.
So, let me know if I'm wrong, but I think subtitles on DVDs/TVs are textual, whereas closed captioning is an image that needs to be OCRed.
Also, good point on the licensing side; I was actually thinking of instructional videos. Has anyone looked into Google's new video subtitling features and whether they're accessible? (I bet they're already looking into using them for their own audio translations.)
-bg
--- (Edited on 2/7/2008 1:41 pm [GMT-0600] by Visitor) ---
Hello there bg,
I wrote the extraction script in Python, using mplayer (to get the main DVD title and the sid for English), transcode for the audio, and a bunch of other tools (tccat, subtitle2pgm, pgm2txt, srttool) to extract the subs as SRT.
This article concerning DVD ripping proved very helpful:
http://www.bunkus.org/dvdripping4linux/single/
About your second question (if I understood it correctly): you're right, subtitles on DVDs are all images. What I said in my previous post is that already-ripped DVDs found via BitTorrent etc. have already been corrected (authored) for OCR mistakes by the people who ripped and uploaded them. Others (especially Free Movies) like The Corporation have official (bug-free) subtitles in .srt posted on the net.
I was thinking of contacting the project team of 'The Corporation', because they can also provide the unmixed speech audio (without music etc.) from their recordings. (BTW, they need some support in the great work they're doing; we should all consider donating, myself included.)
My e-mail is: [email protected], if you'd like to contact me to exchange ideas, code, collaborate, etc.
I'm a GNU, GPL, free-as-in-freedom type of programmer myself, so no need for a lot of formalities.
It's a nice place here at VoxForge; a wiki for listing Free Movies and submit/author datasets should do the trick.
FREEDOM OF SPEECH.. RECOGNITION
(how is that for a punch-line ?)
--- (Edited on 2/7/2008 3:44 pm [GMT-0600] by whoneedselta) ---
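Once the subs are in .srt form, the caption times can be read with a few lines of Python. The sketch below is not the extraction script described above. It assumes well-formed SRT cues in the usual layout (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then the text), and the names to_seconds and parse_srt are made up for this example.

```python
import re

# Matches an SRT timestamp such as 00:01:02,500 (a dot is tolerated too).
TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' to seconds as a float."""
    match = TIME.match(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, caption_text) tuples."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            if "-->" in line:
                start, end = line.split("-->", 1)
                # Multi-line captions are joined into one line of text.
                caption = " ".join(l.strip() for l in lines[i + 1:]).strip()
                cues.append((to_seconds(start), to_seconds(end), caption))
                break
    return cues
```

Each tuple gives the time window to cut from the audio track and the text that goes with it.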
Hi whoneedselta,
>a wiki for listing Free Movies [...] should do the trick
I can create such a wiki (the CMS I use has wiki-like functionality), but how might it be different from this page: Possible Audio Sources (which I can give you access to update) on the VoxForgeDev site?
>a wiki for [...] and submit/author datasets should do the trick
Not sure what you mean by this ... do you mean a forum to allow uploading of processed movies (i.e. segmented using closed captioning)?
>FREEDOM OF SPEECH.. RECOGNITION
>(how is that for a punch-line ?)
That is an amazing tag line!!! If you don't mind, I'd like to use it on the VoxForge site.
thanks,
Ken
--- (Edited on 2/11/2008 1:23 pm [GMT-0600] by speechsubmission) ---
Hello Ken,
First of all, PLEASE DO use the tag-line; after all, talking in the forums of VoxForge inspired me to write it!
Now, about the ways in which free movies' audio data-sets can be hosted at VoxForge, I can only suggest a couple of things (you have more experience with such things than I do).
I was thinking of:
a) a place where we can submit free titles (coupled with the URL where they are hosted; Possible Audio Sources is just that, yes)
b) a place where the dvd2data_set and avi_srt2data_set scripts are hosted
c) a place where we can submit ripped data-sets for community authoring, that is to say:
   1) Fix transcription bugs (due to OCR or human error)
   2) Fix timing bugs
   3) Exclude too-noisy/bad captions (music, whispers, etc.)
   4) Mark caption as AUTHORED
d) a place where already ripped AND authored data-sets are uploaded/hosted (this is the Download area of VoxForge)
I am working on all of the above, creating (offline) scripts and GUIs to automate the steps mentioned. These are all GPL'ed of course (with no rocket science involved, just easy-to-use eye-candy scripts). Bg seems to have crafted an extraction tool too.
I'll be happy to submit these if you are interested, and to help where I can with the related services at VoxForge.
So to sum up: points a), b) and d) are just content uploading to the appropriate sections at VoxForge.
Point c) is covered by the offline tools, but an online community-authoring tool would definitely rock!
I feel I've said a lot already;
thank you for your patience.
-wnlt (Nick)
--- (Edited on 2/11/2008 4:37 pm [GMT-0600] by whoneedselta) ---
--- (Edited on 2/11/2008 4:39 pm [GMT-0600] by whoneedselta) ---
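The four authoring steps listed under point c) map naturally onto a small record per caption. The sketch below is only one way this could look and is not taken from Nick's tools; the Caption class, its field names, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    start: float            # seconds into the audio track
    end: float
    text: str
    excluded: bool = False  # step 3: too noisy or otherwise unusable
    authored: bool = False  # step 4: a human has checked this caption


def fix_text(caption, new_text):
    """Step 1: correct a transcription error (OCR or human)."""
    caption.text = new_text.strip()


def fix_timing(caption, start, end):
    """Step 2: correct the caption's time window."""
    if not 0 <= start < end:
        raise ValueError("start must be non-negative and before end")
    caption.start, caption.end = start, end


def exclude(caption):
    """Step 3: drop a caption (music, whispers, etc.) from the data-set."""
    caption.excluded = True


def mark_authored(caption):
    """Step 4: record that the caption has been reviewed."""
    caption.authored = True


def ready_for_upload(captions):
    """Captions that were reviewed and not excluded."""
    return [c for c in captions if c.authored and not c.excluded]
```

An online authoring tool would need the same state per caption, plus some record of who made each change.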
Sounds great! I just spent a few minutes poking around with Google Scholar and I found some papers on the use of closed captions for acoustic model training. Here are the URLs, if you are interested:
http://www.isca-speech.org/archive/interspeech_2005/i05_1673.html
http://www.isca-speech.org/archive/eurospeech_2003/e03_1837.html
http://www.isca-speech.org/archive/interspeech_2006/i06_1660.html
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1326091
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1325953
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1325954
--- (Edited on 2/11/2008 5:45 pm [GMT-0600] by DavidGelbart) ---