Italian acoustic model collaboration

User: kmaclean
Date: 11/8/2007 12:17 pm

Views: 23756
Rating: 47

Email from Gianluca Cacace:

Hi VoxForge TEAM,

I'm Italian and I want to collaborate on this project.
I'm available to "donate" my Italian voice to help build an independent Italian acoustic model for speech recognition. How do I start? I see only English sentences in the "Read" tab on the website menu.

Regards,

Gianluca Cacace

Re: Italian acoustic model collaboration

User: kmaclean
Date: 11/8/2007 12:17 pm

Views: 495
Rating: 49

Hi Gianluca,

Thanks for your interest in Open Source Speech Recognition!

The best way to start is to find some out-of-copyright texts, segment them into 10-15 word "sentences", record one speech audio file for each "sentence", and upload them to VoxForge.

Step 1

Find some Italian public domain texts.

http://www.liberliber.it/home/index.php might be a good source for Italian public domain texts. You can also create your own prompts.

Step 2

Segment the texts you have chosen into 10-15 word "sentences". Put these into one text file, one "sentence" per line. The first column must be the name of the audio file containing the speech (without the ".wav" suffix). For example, a prompt file might contain the following entries:

en16-01 He has to break it into several parts.
en16-02 They have to start in the same way.
en16-03 He is interested in exponentials.

en16-01, en16-02 and en16-03 correspond to three wav files (i.e. en16-01.wav, en16-02.wav and en16-03.wav) containing speech. For example, the en16-01.wav audio file will contain speech corresponding to the prompt line: "He has to break it into several parts".

It does not matter what you call your wav files, as long as the first word of each prompt line corresponds to an actual audio file.
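As a rough illustration of Step 2, here is a minimal Python sketch. It assumes a naive splitter that cuts every 15 words and ignores sentence boundaries, so the output should still be reviewed by hand. The function names and the "it01" prefix are made up for this example and are not part of any VoxForge tool.

```python
from pathlib import Path


def make_prompt_lines(text, prefix, max_words=15):
    """Cut text into chunks of at most max_words words, one prompt line each."""
    words = text.split()
    lines = []
    for n, start in enumerate(range(0, len(words), max_words), start=1):
        chunk = " ".join(words[start:start + max_words])
        # First column is the audio file name without the ".wav" suffix.
        lines.append(f"{prefix}-{n:02d} {chunk}")
    return lines


def missing_audio(prompt_lines, audio_dir):
    """Return the prompt IDs that have no matching .wav file in audio_dir."""
    audio_dir = Path(audio_dir)
    ids = [line.split(maxsplit=1)[0] for line in prompt_lines if line.strip()]
    return [i for i in ids if not (audio_dir / f"{i}.wav").is_file()]
```

Running missing_audio on your prompt lines before uploading is a quick way to catch a prompt line whose first word does not match any recorded file.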

Step 3

Record your audio using Audacity (see Step 2 - Record your Speech with Audacity for details).

Step 4

Next, create your license and readme file, and package everything up into a zip or tar.gz file (see Step 3 - Upload your Speech Files to VoxForge for details), and upload it to the Italian Forum on VoxForge.
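The packaging part of Step 4 could be scripted along these lines. This is only a sketch using Python's standard tarfile module; the directory layout and the function name are assumptions for illustration, and the exact files VoxForge expects are described in the upload instructions, not here.

```python
import tarfile
from pathlib import Path


def package_submission(src_dir, archive_path):
    """Pack every file in src_dir into a gzip-compressed tar archive.

    archive_path should live outside src_dir so the archive does not
    try to include itself.
    """
    src_dir = Path(src_dir)
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(src_dir.iterdir()):
            if path.is_file():
                # Keep one top-level folder inside the archive.
                tar.add(path, arcname=f"{src_dir.name}/{path.name}")
    return archive_path
```

For example, package_submission("it01", "it01.tgz") would bundle the wav files, prompt file, license and readme found in a (hypothetical) "it01" folder.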

Once you get enough speech, we can start looking at creating an Italian pronunciation dictionary.

I'd like to post this thread on a VoxForge Forum, please let me know if this is OK.

Thanks,

Ken

Re: Italian acoustic model collaboration

User: kmaclean
Date: 11/8/2007 12:18 pm

Views: 424
Rating: 48

If you need, I can translate the VoxForge website into Italian, so that many Italians can participate in this project. Anyway, you can post this thread in the forum, it's OK!!! :D

Gianluca Cacace

Re: Italian acoustic model collaboration

User: blue
Date: 11/11/2007 9:28 am

Views: 484
Rating: 60

Is it possible to translate the Java applet and use it for Italian (or other languages)? On Liber Liber there are also free audiobooks with the corresponding text. Is there a way to use this material automatically for VoxForge?

Cheers,

Ste

Re: Italian acoustic model collaboration

User: kmaclean
Date: 11/11/2007 8:57 pm

Views: 2842
Rating: 36

Hi Ste

>Is it possible to translate the Java applet and use it for Italian (or other languages)?

Yes. We are currently working on translating the VoxForge Speech Submission app to Dutch (and modifying the code so that it is easier to add other languages). Dutch was a good candidate language for this because we've received a number of submissions in Dutch.

>On Liber Liber there are also free audiobooks with the corresponding text. Is there a way to use this material automatically for VoxForge?

In order to use an audio book for training acoustic models (for HTK/Julius and Sphinx at least ...), the audio needs to be segmented into 10-15 word files and a text file needs to be created that contains the transcriptions of the contents of each of these files.
This process can be "semi-automated", and is described in this *draft* document:
Automated Audio Segmentation Using Forced Alignment (Draft)
It is "semi-automated" because the pronunciations for out-of-vocabulary words need to be manually reviewed and corrected.
In addition, the actual segmentation of audio requires a reasonably good acoustic model to pick out the silences where the text might be segmented. Therefore if you don't have an Italian acoustic model, you will need to create your own. This will involve hand-segmenting the audio at first so you can create your own preliminary Italian acoustic model. But as you get more audio trained in your acoustic model, this part will become more accurate, and after a point, you will no longer need to hand-segment any speech files. This is not as much a concern for English, since we have an acoustic model that is good enough to perform accurate silence detection.

Ken

Re: Italian acoustic model collaboration

User: DavidGelbart
Date: 1/11/2008 8:09 pm

Views: 461
Rating: 57

"In addition, the actual segmentation of audio requires a reasonably good acoustic model to pick out the silences where the text might be segmented. Therefore if you don't have an Italian acoustic model, you will need to create your own. This will involve hand-segmenting the audio at first so you can create your own preliminary Italian acoustic model. But as you get more audio trained in your acoustic model, this part will become more accurate, and after a point, you will no longer need to hand-segment any speech files. This is not as much a concern for English, since we have an acoustic model that is good enough to perform accurate silence detection."

I wonder how well the English acoustic model would work for doing silence detection on Italian data. I would not be surprised if it turns out to perform well.

Re: Italian acoustic model collaboration

User: kmaclean
Date: 1/14/2008 9:55 pm

Views: 507
Rating: 44

Hi David,

>I wonder how well the English acoustic model would work for doing silence detection on Italian data. I would not be surprised if it turns out to perform well.

I must be missing something here ... please clarify how you might use an English acoustic model to detect silence for another language like Italian.

I've been using HTK's Forced Alignment for silence detection. My understanding of Forced Alignment is that it takes a known string of words (e.g. "my cat is black") and tries to match it to the corresponding section of speech, and in doing so provides you with (estimated) time stamps for the start and end of each of the words. If there is a long pause between two words, then you can write a script to flag this as a "silence".
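To make the "flag a long pause" idea concrete, here is a small Python sketch. It assumes the alignment has already been parsed into (word, start, end) tuples with times in seconds; that input shape, the function name and the 0.3 second threshold are assumptions for illustration, not HTK's actual output format.

```python
def find_pauses(aligned_words, min_pause=0.3):
    """Return (pause_start, pause_end) for each gap of at least min_pause seconds.

    aligned_words is a list of (word, start, end) tuples in time order.
    """
    pauses = []
    for (_, _, prev_end), (_, next_start, _) in zip(aligned_words, aligned_words[1:]):
        if next_start - prev_end >= min_pause:
            pauses.append((prev_end, next_start))
    return pauses


# Made-up time stamps for the example sentence above.
alignment = [("my", 0.0, 0.2), ("cat", 0.25, 0.6), ("is", 1.1, 1.2), ("black", 1.25, 1.7)]
pauses = find_pauses(alignment)  # one pause, between "cat" and "is"
```

Each flagged pause is a candidate place to cut the audio. A real alignment may report silence as its own labelled unit rather than as a gap, in which case the check would look at the duration of those labels instead.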

My understanding of the process is that you would therefore need an acoustic model for the target language, because you are essentially doing speech recognition for a sentence that you know is located somewhere in a segment of speech audio. For example, using an English Acoustic Model to attempt to detect a silence in an Italian text (e.g. "Il mio gatto è nero") would not work, because the phones making up the words are different in English and Italian (at least using the approach that I have been using).

It seems to me, and I may be wrong (and usually am ...), that you need an acoustic model for the target language for the process to work.

thanks,

Ken

Re: Italian acoustic model collaboration

User: DavidGelbart
Date: 1/23/2008 7:29 pm

Views: 466
Rating: 49

Hi Ken,

You could try using your existing English acoustic model for an Italian forced alignment. You could create a dictionary for that by mapping each Italian phone in your Italian dictionary to the closest English phone. I suppose this will perform well enough for segmentation.
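A sketch of what that dictionary mapping might look like in Python follows. The phone symbols on both sides are invented for illustration; a real table would have to use the phone set of the actual Italian dictionary and of the English acoustic model, and choosing the "closest" phone is a manual judgment.

```python
# Hypothetical Italian-to-English phone table (symbols are illustrative only).
ITALIAN_TO_ENGLISH = {
    "a": "aa", "e": "eh", "o": "ow",
    "g": "g", "t": "t", "n": "n", "r": "r",
}


def map_pronunciation(phones, phone_map):
    """Replace each Italian phone with the English phone chosen for it."""
    missing = [p for p in phones if p not in phone_map]
    if missing:
        # Fail loudly so unmapped phones get reviewed rather than dropped.
        raise KeyError(f"no English phone chosen for: {missing}")
    return [phone_map[p] for p in phones]


def map_dictionary(italian_dict, phone_map):
    """Map a whole {word: [phones]} dictionary onto the English phone set."""
    return {word: map_pronunciation(phones, phone_map)
            for word, phones in italian_dict.items()}
```

The resulting dictionary keeps the Italian words but spells them in English phones, which is what forced alignment with the English model would need.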

You could also try a phone-loop approach using your existing English acoustic model. Just now, I did a literature search to find a precise definition of 'phone-loop', and I found 'each phone can follow the previous one with equal probability'. So I think that means running unconstrained recognition with the recognizer dictionary containing only a word for each phone which is defined simply as that phone. That includes a silence 'word' which is defined as the silence 'phone', unless your decoder creates such a silence word implicitly so it doesn't need to be in the dictionary or the language model. The language model would allow the dictionary words in any order. In this case, speech in Italian (or any other language) would get mapped to the closest-sounding phone sequence in the English phoneset. You would not get good word accuracy this way (since there is no real dictionary and no real language model), but I hope performance would be good enough to use for segmentation. The advantage of a phone-loop approach is that it could allow people to use the same models, scripts, and dictionary to do the initial segmentation work for all languages.
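The dictionary half of the phone-loop idea is simple enough to sketch in Python. This assumes a plain "word pronunciation" line format and a silence phone called "sil", both of which are assumptions here; the grammar or language model that lets the words follow each other in any order depends on the recognizer and is not shown.

```python
def phone_loop_dictionary(phones, silence="sil"):
    """Build dictionary lines where every phone is a 'word' pronounced as itself.

    The silence phone is added as a word too, in case the decoder does not
    create one implicitly.
    """
    entries = sorted(set(phones) | {silence})
    return [f"{phone} {phone}" for phone in entries]


# Tiny made-up phone set for illustration.
loop_dict = phone_loop_dictionary(["t", "aa", "t"])
```

Because the dictionary is derived only from the model's phone list, the same script would work unchanged for any language being segmented.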

By the way, is there some kind of speech detection tool that comes with Julius? If it's language-independent, maybe that could be used for the Italian segmentation work. Speech detection (also known as voice activity detection, or VAD) can often be done successfully on non-noisy speech using some simple, language-independent calculations such as energy level.
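As an illustration of the energy-level idea, here is a minimal Python sketch of a frame-based detector. It assumes the audio has already been loaded as a list of samples scaled to the range -1.0 to 1.0; the frame length and the decibel threshold are arbitrary choices that would need tuning on real recordings, and this is not how Julius or any particular tool does it.

```python
import math


def frame_energies_db(samples, frame_len):
    """Mean energy of each non-overlapping frame, in decibels."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_sq = sum(s * s for s in frame) / frame_len
        # Small constant avoids log10(0) on digitally silent frames.
        energies.append(10 * math.log10(mean_sq + 1e-12))
    return energies


def speech_frames(samples, frame_len, threshold_db=-40.0):
    """Mark each frame True (speech) if its energy exceeds the threshold."""
    return [e > threshold_db for e in frame_energies_db(samples, frame_len)]
```

Runs of False frames longer than some minimum duration would then be treated as silences. On noisy recordings a fixed threshold like this tends to break down, which is why it is only suggested for clean speech.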

Regards,
David

Re: Italian acoustic model collaboration

User: DavidGelbart
Date: 1/24/2008 12:03 pm

Views: 6192
Rating: 48

I think I missed a key point when I posted yesterday.

You need to segment the transcript the same way that you segment the audio. I suppose this is much easier to automate if you use forced alignment.
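With forced alignment the transcript can be cut at the same points as the audio, since every word carries its own time stamps. A minimal Python sketch, assuming the alignment is available as (word, start, end) tuples in seconds (an assumed input shape, with made-up numbers and a made-up function name):

```python
def segment_transcript(aligned_words, min_pause=0.3):
    """Group aligned words into segments, cutting at pauses >= min_pause seconds.

    Returns a list of (segment_start, segment_end, text) tuples, so each
    audio cut comes with the matching piece of transcript.
    """
    groups = []
    current = []
    for word, start, end in aligned_words:
        if current and start - current[-1][2] >= min_pause:
            groups.append(current)
            current = []
        current.append((word, start, end))
    if current:
        groups.append(current)
    return [(g[0][1], g[-1][2], " ".join(w for w, _, _ in g)) for g in groups]


# Made-up time stamps for illustration.
aligned = [("il", 0.0, 0.1), ("mio", 0.15, 0.4), ("gatto", 1.0, 1.5),
           ("è", 1.55, 1.6), ("nero", 1.65, 2.0)]
segments = segment_transcript(aligned)
```

With a VAD-only or phone-loop segmentation there are no word time stamps, so lining up the text with the audio cuts would need a separate step.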

Regards,
David
