Audio and Prompts Discussions

karaoke
User: CarlFK
Date: 12/17/2009 12:10 am
Views: 6866
Rating: 14

I am writing some code for other reasons (recording PyCon, then doing text search in the video) which I think can be used to collect transcribed speech. This is not at all a replacement for http://voxforge.org/home/read but something I can do as part of the PyCon recording process.

2 parts: 

1. Basic slide show presentation, one line of text per slide.  The speaker reads a line and hits 'next.' Both the video and the audio get recorded.  (This is the karaoke-like part, only with a 'next' button.)

2. OCR the video, which produces a subtitle file: text with time stamps or frame numbers.

Now you have transcribed audio.
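
A minimal sketch of how step 2 could turn per-frame OCR output into time-stamped prompts. Everything here is an assumption made for illustration: the input shape (a list of frame number and recognized text pairs), the `Segment` and `frames_to_segments` names, and the default frame rate of 25 fps. It is not the code being written for PyCon.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Segment:
    text: str     # the prompt shown on the slide
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def frames_to_segments(frames: Iterable[Tuple[int, str]],
                       fps: float = 25.0) -> List[Segment]:
    """Collapse consecutive frames showing the same text into one segment.

    `frames` is a hypothetical OCR result: (frame_number, text) pairs in
    increasing frame order. `fps` is an assumed frame rate.
    """
    segments: List[Segment] = []
    current_text = None
    start_frame = last_frame = 0
    for frame_no, text in frames:
        text = " ".join(text.split())  # ignore whitespace differences
        if text != current_text:
            # The slide changed: close the previous segment, if it had text.
            if current_text:
                segments.append(
                    Segment(current_text, start_frame / fps, frame_no / fps))
            current_text, start_frame = text, frame_no
        last_frame = frame_no
    # Close the final segment at the end of the last frame seen.
    if current_text:
        segments.append(
            Segment(current_text, start_frame / fps, (last_frame + 1) / fps))
    return segments
```

Each resulting segment pairs one prompt with the stretch of audio during which it was on screen, which is the "text with time stamps" that the subtitle file would carry.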

FAQ: (for a sample size of 0)

'Why not record the time of the next clicks?'  The simple answer is: then I would have to write a slide show app that does that.

What about OCR problems? Those will be worked out ahead of time: I can put the slide show on autopilot, record it, OCR it, and check that text in == text out, which is the unit test for my video OCR. If there are any problems, I'll either fix the font or get the OCR engine devs to help me out. Once it works, it won't break. We should really call it Digital CR, because there are no optics (scanner, camera, etc.), which is where noise comes in.
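
The "text in == text out" check could look roughly like this. It is only a sketch: the function names are made up for illustration, and it assumes both the slide text and the OCR output are already available as lists of lines.

```python
import difflib
from typing import List


def normalize(line: str) -> str:
    """Collapse runs of whitespace so spacing differences do not count."""
    return " ".join(line.split())


def ocr_mismatches(expected: List[str], recognized: List[str]) -> List[str]:
    """Return a unified diff of slide text vs. OCR text (empty if equal)."""
    want = [normalize(line) for line in expected]
    got = [normalize(line) for line in recognized]
    return list(difflib.unified_diff(want, got, "slides", "ocr", lineterm=""))


if __name__ == "__main__":
    slides = ["the quick brown fox", "jumps over the lazy dog"]
    ocr_out = ["the quick brown fox", "jumps over the 1azy dog"]
    for diff_line in ocr_mismatches(slides, ocr_out):
        print(diff_line)
```

An empty result means the round trip is clean; any diff lines point at the characters the font or the OCR engine is getting wrong.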

I can make this part of the sound check, which means 30 tutorials (which have lots of setup time) and 60 talks (which may not have enough time.)

wa-da-ya-think?

--- (Edited on 12/17/2009 12:10 am [GMT-0600] by CarlFK) ---

Re: karaoke
User: kmaclean
Date: 12/21/2009 3:16 pm
Views: 149
Rating: 16

>I am writing some code for other reasons [...] which I think can be used
>to collect transcribed speech.

[...]

>I can make this part of the sound check, which means 30 tutorials
>(which have lots of setup time) and 60 talks (which may not have
>enough time.)

Very cool... not sure what the background noise will be like in the room (but this is an excellent way to find out...), but it sounds like a promising new source of audio.

Just to be clear, you are video-recording PyCon talks.  Since each presentation requires a sound check, you display short prompts on the big-screen and get the speaker to read them out into the microphone you are using for the recording, and then collect these prompts for addition to the VoxForge (or any other...)  speech corpus.

What I am not clear on is the use of OCR for transcription - since the presenter is reading the prompts you are generating.

Is your role to record the audio of the presentation and the speaker's slides, and only capture the video feed from their laptop (as it goes to the overhead screen) - i.e. you are not actually videotaping the room with a video recorder/camcorder?  And thus, the OCR is used to convert the words in the analog display (which were digitally created on the speaker's laptop - OpenOffice Impress for example) back to digital to create time-stamps of the presentation, and you are going to use this to transcribe and timestamp the prompts and speech?

Regardless, if this works, it might be a good way to introduce VoxForge speech collection to the masses at other conferences too...

thanks,

Ken

--- (Edited on 12/21/2009 4:16 pm [GMT-0500] by kmaclean) ---

Re: karaoke
User: CarlFK
Date: 12/21/2009 7:46 pm
Views: 136
Rating: 13

Sounds like you understand the point of the OCR. 


As for the video recording, I have a scan converter that converts the VGA signal into a .dv stream. That stream gets mixed with another .dv stream coming from a video camera and with the sound coming from the sound board, and the result is saved as .dv onto a laptop.  The screenshot at http://dvswitch.alioth.debian.org/wiki/ should make it all clear.

Question: it isn't .wav audio, but it isn't very compressed either. Is this a problem?

--- (Edited on 12/21/2009 7:46 pm [GMT-0600] by CarlFK) ---

Re: karaoke
User: kmaclean
Date: 12/22/2009 1:17 pm
Views: 133
Rating: 14

>Question: it isn't .wav audio, but it isn't very compressed either. Is this a
>problem?

If it is transcribed and in short sentences (15-25 words), we'll take it.  Segmenting longer speech recordings (like audio books) still takes more time than it should.

We'll indicate the file format in the README, and see if it has any impact on the acoustic models.

Ken

--- (Edited on 12/22/2009 2:17 pm [GMT-0500] by kmaclean) ---

Re: karaoke
User: CarlFK
Date: 12/22/2009 2:23 pm
Views: 217
Rating: 13

I'll set up an ideal environment in my house and do some test runs.  I'll put the results up so we can scrutinize them. It will also make sure the time-coded subtitles (or whatever I end up producing) are in a usable format.


I need to pick some text. I think it will be best if I use the same text for everyone – trying to use all 30 from

http://www.voxforge.org/home/submitspeech/linux/step-1/phoneme

adds steps and potential confusion.



I am also going to try to transcribe their talks; I'm guessing that having a sample of their voice will help. Any of the prompt files better than others?

--- (Edited on 12/22/2009 2:23 pm [GMT-0600] by CarlFK) ---

Re: karaoke
User: kmaclean
Date: 1/6/2010 10:25 pm
Views: 2992
Rating: 15

Sorry, missed this question...

> Any of the prompt files better than others?

You should use the "Dialect Coverage Prompts".

Ken

--- (Edited on 1/6/2010 11:25 pm [GMT-0500] by kmaclean) ---
