Audio and Prompts Discussions

Re: Missing prompts
User: tpavelka
Date: 3/9/2009 7:54 am
Views: 89
Rating: 10

> Would it not be possible to train a speech recognition engine to
> recognise gender, accent and quality of the speech, rather than
> what it was saying?

For this to work you would first need annotated training data, which we do not have here. I do not think anyone is up to listening to the 58 hours of the corpus and classifying each recording into categories.

> It should be possible to cluster the recordings around different
> accents, qualities etc. and then generate different trained
> engines for each one.

I guess the first step would be to train a recognizer on clean and well-pronounced recordings; the problem is how to find them. Such a system should (in theory) give better results on data of similar quality but may perform worse on noisy data. There is always a tradeoff, but if we could separate the data into categories, this tradeoff could be quantified.

> Another thought I had was inspired by the RANSAC algorithm

Never heard of it, but it sounds interesting. Given the size of the corpus, this would take ages, but it might be a way to avoid manual classification.
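
For reference, here is a minimal sketch of RANSAC on a toy line-fitting problem, not on speech data. The function names, the tolerance, and the toy points are all made up for illustration. As I understand the idea, the corpus analogue would be: "fit" means training a model on a small random subset of recordings, and "inliers" are the recordings that model finds consistent with their prompts. That mapping is an assumption, not something that has been tried here.

```python
import random

def fit_line(p, q):
    """Line through two points, returned as (slope, intercept)."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

def ransac_line(points, iterations=200, tolerance=0.5, seed=0):
    """Keep the model, fitted on a random minimal sample, with the most inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        p, q = rng.sample(points, 2)
        if p[0] == q[0]:
            continue  # vertical pair, cannot be written as y = a*x + b
        slope, intercept = fit_line(p, q)
        # Inliers are the points that the candidate model explains well.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (slope, intercept), inliers
    return best_model, best_inliers

# Toy data: 20 points on y = 2x + 1 plus 4 "garbage" points.
good = [(x, 2.0 * x + 1.0) for x in range(20)]
garbage = [(3, 40.0), (7, -25.0), (12, 90.0), (15, -60.0)]
model, inliers = ransac_line(good + garbage)
```

The cost is one model fit per iteration, which is why it would take ages on a 58 hour corpus if each fit means training a recognizer.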

--- (Edited on 3/9/2009 7:54 am [GMT-0500] by tpavelka) ---

Re: Missing prompts
User: nsh
Date: 3/9/2009 9:03 pm
Views: 271
Rating: 8

> Can I spam here? ;-) If you would like to visit Pilsen, we are organizing a conference in September, www.tsdconference.org


Funnily enough, we submitted a paper to exactly this conference last year and it was rejected, mostly due to my bad writing skills and too little material :) That's why I argue that someone should help us promote the corpus.

> My experience is that the acoustic score coming out of the Viterbi algorithm is pretty much useless (unless you have a really big mismatch between transcription and the actual utterance). Results from phoneme only recognizer are a bit better, but (as I have shown in the experiment) not by much.

Yes, I have been proven wrong here. I often used alignment to find bad transcriptions, but I now see it's not the best way.

>  Another thought I had was inspired by the RANSAC algorithm

Thanks to rjmunro for the idea. To me it sounds great not to clean up the data using prior knowledge, but instead to use generic methods that train a good model given the prior knowledge that the data contains garbage. This is similar to problems that other machine learning practitioners face. For example, web-collected data contains garbage by definition; Google and others take this as a precondition and use algorithms that are robust to noise. I quickly searched for articles on such methods in acoustic model training but only found links on cleanup strategies for Bayesian classifiers and so on. It probably makes sense to search more; it should be a common problem.

 

--- (Edited on 3/9/2009 9:03 pm [GMT-0500] by nsh) ---

Re: Missing prompts
User: kmaclean
Date: 3/18/2009 12:56 pm
Views: 2590
Rating: 8

I think the search term you are looking for is: "Lightly Supervised Acoustic Model Training". There is a paper of the same name by Lori Lamel et al., which describes the process as follows:

The basic idea is to use a speech recognizer to automatically transcribe unannotated data, thus generating labelled training data. By iteratively increasing the amount of training data, more accurate acoustic models are obtained, which can then be used to transcribe another set of unannotated data.
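
The loop described in that quote can be sketched roughly as follows. This is a toy under stated assumptions, not the method from the paper: the "acoustic model" is just a per-class mean over 1-D features, and the confidence filter (`min_margin`) is my own addition to keep doubtful automatic labels out of the training set. All names and numbers are hypothetical.

```python
def train(labelled):
    """Per-class mean of 1-D features (a stand-in for a real acoustic model)."""
    sums = {}
    for x, label in labelled:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}

def transcribe(model, x):
    """Return (best label, margin); margin is the distance gap to the runner-up."""
    ranked = sorted(model, key=lambda label: abs(x - model[label]))
    best, second = ranked[0], ranked[1]
    return best, abs(x - model[second]) - abs(x - model[best])

def lightly_supervised(seed, unlabelled, batch_size=4, min_margin=1.0):
    """Label one batch with the current model, keep confident labels, retrain."""
    labelled = list(seed)
    model = train(labelled)
    for start in range(0, len(unlabelled), batch_size):
        for x in unlabelled[start:start + batch_size]:
            label, margin = transcribe(model, x)
            if margin >= min_margin:  # assumed filter, see the note above
                labelled.append((x, label))
        model = train(labelled)  # the improved model labels the next batch
    return model, labelled

# Toy run: two manually labelled items, eight unlabelled ones.
seed = [(0.0, "a"), (10.0, "b")]
unlabelled = [1.0, 9.0, 0.5, 9.5, 5.1, 2.0, 8.0, 4.9]
model, labelled = lightly_supervised(seed, unlabelled)
```

In this run the two ambiguous items (5.1 and 4.9) are left out, and the other six are added with automatic labels.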


 

--- (Edited on 3/18/2009 1:56 pm [GMT-0400] by kmaclean) ---
