
Google Summer of Code Ideas Page

Scripts to train acoustic models using audiobooks from Librivox
User: kmaclean
Date: 3/5/2007 4:28 pm
Views: 8483
Rating: 30

Ways to Reuse Speech from Other Open Source/Social Projects

There is more than enough speech on the Internet to create a commercial-quality FOSS speech corpus and acoustic models. The problem is that converting such speech into a format usable for creating acoustic models is a very time-consuming process. Automating our current manual process for segmenting an audiobook (from LibriVox, for example), and applying the same algorithms to other potential sources of speech (audio or video blogs, etc.), would go a long way towards improving FOSS speech recognition.

This project is to create a series of scripts to train acoustic models using audiobooks from Librivox.

The high level steps are as follows:

1) Get a list of speakers and number of hours spoken by each speaker.

2) Write the scripts to download all the audio and text

3) Write scripts to clean up the text so that it matches the audio.   In the first instance this means removing the Project Gutenberg preamble and adding the spoken LibriVox preamble, then looking at what can be done about chapter headings, etc.

4) Build acoustic and language models using one of the following speech Recognition Engines:

  • HTK/Julius
  • Sphinx
  • ISIP 

5) Use an "automated transcription script" to highlight any problems with the transcriptions and, if any are found, go back to step 3 and fix them up.

6) Decide on a sensible split of data between train, eval and test.

7) Make three releases.   The first would be the audio and text (in their original forms), the second the scripts that perform steps 3-5 above (so that others may improve them), and the third the acoustic model release.

8) Complete acoustic model creation scripts for the other speech recognition engines not selected in step 4.
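As a starting point for step 3, the Project Gutenberg boilerplate can be stripped with a small script. Gutenberg e-texts usually bracket the body with marker lines such as "*** START OF THIS PROJECT GUTENBERG EBOOK ... ***", but the exact wording varies between e-texts, so the loose patterns below are an assumption, not a guaranteed rule. A minimal sketch:

```python
import re

# Loose patterns for the usual Gutenberg start/end marker lines.
# The exact wording varies between e-texts, so these are an assumption.
START_RE = re.compile(r"\*\*\*\s*START OF .*GUTENBERG", re.IGNORECASE)
END_RE = re.compile(r"\*\*\*\s*END OF .*GUTENBERG", re.IGNORECASE)

def strip_gutenberg_boilerplate(text):
    """Return only the body text between the Gutenberg start/end markers."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if START_RE.search(line):
            start = i + 1          # body begins after the start marker
        elif END_RE.search(line):
            end = i                # body stops at the end marker
            break
    return "\n".join(lines[start:end]).strip()
```

The spoken LibriVox preamble ("This is a LibriVox recording...") would still have to be prepended to the transcript, and chapter headings handled separately.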


Re: Create script to train up acoustic models on speech audio from Librivox project.
User: Tony Robinson
Date: 3/5/2007 5:09 pm
Views: 247
Rating: 33

Well - who am I to disagree?

One thought I've just had - I've heard of speech recognition companies who have trained up models using HTK and run them under Sphinx (before HDecode was available).   Surely it can't be that hard - certainly the means/variances/transition probs should be trivial to carry over.  VoxForge would seem to be an excellent place to:

1) develop some GPL code to do the acoustic model conversion (let's assume ARPA format LMs)

2) benchmark the ISIP/HTK/Julius/Sphinx recognisers using exactly the same acoustic models, pronunciations and language models.
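As a tiny first step toward (1): in a text-format HTK MMF, each `<MEAN> n` or `<VARIANCE> n` tag line is followed by a line of n floats, so the Gaussian parameters can be pulled out with a few lines of parsing. This only sketches the HTK-reading side; mapping the parameters into Sphinx's model format is a separate job:

```python
def parse_htk_vectors(mmf_text, tag):
    """Collect every vector that follows a given tag (e.g. "<MEAN>" or
    "<VARIANCE>") in a text-format HTK MMF.  The tag line carries the
    vector size; the component values sit on the following line."""
    lines = mmf_text.splitlines()
    vectors = []
    for i, line in enumerate(lines):
        if line.strip().upper().startswith(tag.upper()):
            vectors.append([float(x) for x in lines[i + 1].split()])
    return vectors
```

Transition matrices (`<TRANSP>`) span several lines and would need slightly more care.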

Anyone got any results to share?

Re: Create script to train up acoustic models on speech audio from Librivox project.
User: kmaclean
Date: 3/6/2007 9:14 am
Views: 1606
Rating: 39

One important wrinkle to using MP3 audio from Librivox (or even the WAV audio) is that some (not sure how much) of the speech audio submitted to Librivox has been 'processed' - i.e. the audio has been 'cleaned' with noise removal algorithms, audio level normalization, and/or equalization.

Not sure how this might affect a final acoustic model - the rule of thumb has been to use unaltered speech audio as much as possible. 


Re: Create script to train up acoustic models on speech audio from Librivox project.
User: Tony Robinson
Date: 3/6/2007 1:42 pm
Views: 228
Rating: 42

My gut feeling is that it shouldn't put you off using the data.    Okay, you might find that some audio has been distorted, but you can always throw this away later.



Dr Tony Robinson, CEO Cantab Research Ltd
Phone:  +44 845 009 7530, Fax: +44 845 009 7532

Re: Create script to train up acoustic models on speech audio from Librivox project.
User: kmaclean
Date: 3/9/2007 9:23 am
Views: 522
Rating: 35

Hi Tony,

I created a quick 'sanity test' to compare Acoustic Models trained with wav audio versus mp3 audio.  Basically I took the larger wav directories in the VoxForge corpus, converted them to MP3, and then converted them back to wav.  I then compared Acoustic Models ("AMs") created with the original wav data to AMs trained with the converted mp3 data, to get an idea of any performance differences.
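For anyone reproducing this kind of round-trip test, one simple way to quantify how much the lossy encoding changed the audio is the RMS error between the original and the wav -> mp3 -> wav samples. The sketch below assumes the two decoded sample streams are already time-aligned and equal length (MP3 encoders can add padding, so that alignment may need handling first):

```python
import math

def rms_error(original, roundtripped):
    """Root-mean-square difference between two equal-length PCM sample
    sequences, e.g. the original wav and its wav -> mp3 -> wav round trip.
    0.0 means the audio survived the round trip bit-for-bit."""
    if len(original) != len(roundtripped):
        raise ValueError("sample sequences must be the same length")
    n = len(original)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, roundtripped)) / n)
```

A per-file score like this would make it easy to flag the most distorted recordings for exclusion from training.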

The tests with Julius performed as expected, with a slight degradation in performance when using mp3-based Acoustic Models. 

The tests with HTK are a bit more puzzling, since they show some improvement in performance when using AMs based on mp3 audio.  

Basically I need to use a larger test sample with a more complex grammar to get a better test.  But the use of MP3 audio for AM training looks promising.


MP3s for the training of acoustic models
User: ralfherzog
Date: 3/4/2008 1:56 pm
Views: 2523
Rating: 29
Hello Ken,

I think that Tony is right.  You shouldn't hesitate to use MP3s because this is a very popular format.  

I have just run a very quick "sanity test" with NaturallySpeaking 9.5.  I recorded a sentence with Audacity and exported it to both wav and MP3 format. Then I let NaturallySpeaking transcribe both recordings, and both (wav/MP3) were recognized one hundred percent correctly.

It could even be that there are improvements when you use MP3s; at the least, MP3s should be sufficient.  Millions of people worldwide use the MP3 format.  If MP3 is OK for music, it should, a maiore ad minus, also be good for speech.

Greetings, Ralf

PS: Damn, I just see that this discussion is about one year old. :-(