General Discussion

Google Summer of Code
User: kmaclean
Date: 2/16/2007 1:14 pm
Views: 25770
Rating: 39

What is Google Summer of Code?

Google Summer of Code™ is a program that offers student developers stipends to write code for various open source projects. Google will be working with several open source, free software and technology-related groups to identify and fund several projects over a three month period. Historically, the program has brought together over 1,000 students with over 100 open source projects, to create hundreds of thousands of lines of code. The program, which kicked off in 2005, is now in its third year, following on from a very successful 2006.

VoxForge Summer of Code Application.

VoxForge will apply to the Google Summer of Code program.  Google accepts applications via the GSoC web app from March 5 to March 12, 2007.

Possible Project Ideas:

Project ideas that would contribute directly to the project: 

  1. A script that would take audio book recordings and automatically split the audio into 10-15 second files, with associated transcriptions, for use in the training of Acoustic Models.  It could use uncompressed (e.g. WAV) or lossless compressed (e.g. FLAC) audio from Librivox, the Gutenberg Audio Books project, or Spoken Wikipedia.

  2. Create  Sphinx and ISIP Acoustic Model training scripts and tutorials for incorporation onto VoxForge site.

  3. Test the use of MP3 or Ogg Vorbis audio (or some other lossy compressed audio) from podcasts or on-line books as a way to augment the VoxForge Speech Corpus for the creation of Acoustic Models.  It could use audio from Librivox, the Gutenberg Audio Books project, or Spoken Wikipedia.
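
A minimal sketch of the splitting step in idea 1, assuming word-level timings are already available (for example from a forced aligner, which is not shown here). The function name, the `(word, start, end)` input format and the pause threshold are all hypothetical, chosen only for illustration:

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (word, start_sec, end_sec), assumed input format


def segment_words(words: List[Word], min_len: float = 10.0,
                  max_len: float = 15.0, min_pause: float = 0.3):
    """Greedily group timed words into segments of roughly min_len..max_len seconds.

    A segment is closed at the first pause of at least min_pause seconds once
    it is min_len long, or forcibly before it would exceed max_len.
    """
    segments, current = [], []
    for i, word in enumerate(words):
        # Close the running segment first if this word would push it past max_len.
        if current and word[2] - current[0][1] > max_len:
            segments.append(current)
            current = []
        current.append(word)
        duration = word[2] - current[0][1]
        has_next = i + 1 < len(words)
        if has_next and duration >= min_len and words[i + 1][1] - word[2] >= min_pause:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    # Return (start_sec, end_sec, transcription) for each segment.
    return [(seg[0][1], seg[-1][2], " ".join(w for w, _, _ in seg)) for seg in segments]
```

Each returned tuple gives the cut points for one audio file plus its transcription; the final segment may be shorter than `min_len`.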

Project ideas that would create demand for VoxForge Speech Corpora and/or Acoustic Models:

  1. Create a GPL Written Corpus that can be used for the creation of bigram and trigram Language Models for Dictation Applications.  Sphinx, Julius and HTK all use ARPA format language models.  Might be able to use the SRILM (SRI Language Modeling Toolkit), the CMU-Cambridge Statistical Language Modeling Toolkit, or the Language Modeling tools included in HTK.  
  2. Create a Command and Control Dialog Manager with APIs for Sphinx, Julius and ISIP Speech Recognition Engines and the  Festival Text to Speech System.  Might be able to re-use code from the xVoice project.
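
To illustrate what the written corpus in idea 1 would feed into, here is a small sketch that counts n-grams and computes an unsmoothed (maximum likelihood) bigram probability. It is not how SRILM, the CMU-Cambridge toolkit or HTK work internally; real toolkits add smoothing and back-off and write ARPA files. The function names and the `<s>`/`</s>` padding scheme are assumptions for the example:

```python
from collections import Counter
from typing import Iterable


def ngram_counts(sentences: Iterable[str], n: int) -> Counter:
    """Count n-grams, padding each sentence with <s> start and </s> end markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.lower().split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def mle_bigram_prob(sentences: Iterable[str], prev: str, word: str) -> float:
    """Unsmoothed estimate of P(word | prev) = count(prev, word) / count(prev, *)."""
    bigrams = ngram_counts(list(sentences), 2)
    history = sum(c for (p, _), c in bigrams.items() if p == prev)
    return bigrams[(prev, word)] / history if history else 0.0
```

For a dictation-sized vocabulary most trigrams are never observed, which is why the size of the written corpus matters so much.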

Julius Specific Project ideas (Julius is a Speech Recognition Engine designed for Dictation applications):
  1. Create an Open Source Acoustic and Language Model Training Toolkit that can create HTK format models for use with the Julius Speech Recognition Engine.  Julius uses the HTK Toolkit to create its models.  Although you can download the source for HTK, it has distribution restrictions.
  2. Create Dictation Dialog Manager for Julius on Linux (and APIs for Sphinx/ISIP).  Might be able to re-use code from the xVoice project.

Any other possible projects? Any comments?



--- (Edited on 2/18/2007 11:48 pm [GMT-0500] by kmaclean) ---

Re: Google Summer of Code
User: DooGood
Date: 2/23/2007 11:37 am
Views: 839
Rating: 39
I like the idea of parsing audiobooks.  It seems like a great way to get a lot of samples of clearly spoken audio and the corresponding text.  Good luck.

--- (Edited on 2/23/2007 11:37 am [GMT-0600] by DooGood) ---

Re: Google Summer of Code
User: Tony Robinson
Date: 2/26/2007 11:15 am
Views: 484
Rating: 39

I'd combine 1 and 2 of the "Project ideas that would contribute directly" into one - that of a script to train up acoustic models on the Librivox or Gutenberg Audio Books project.   I looked at Spoken Wikipedia and it seems like there would be much less audio there; there is also an issue tracking down the revision that was spoken, the Wikipedia entry isn't as easy to parse into text, and there is likely to be more inconsistency between the text and the spoken audio.

Correct me if I'm wrong, but I don't think the Project Gutenberg texts allow redistribution of modified versions, so it would have to be a single project that downloaded and trained.

To know whether this is worth doing as an academic exercise to generate a good speech recognition system I'd like to know the total number of hours spoken by each speaker in the two audio sources.   Assuming enough varied audio then this would be a useful speech corpus as it is very clean, so allowing research into new ASR techniques, etc.

The same numbers would be needed to know whether it is worth doing from the point of view of training standard HMM models, however a larger number of speakers is going to be needed to cover a general population.

From looking at the audio links page on this site, these two sources would seem to be the best way to get a lot of data and good models for this "story telling" style. 

Librivox would seem to have about 400 completed items; if only 1/4 were complete books at about 6 hours each, that's 600 hours of audio, which is very respectable.   The big question is how many readers are there?   If there were 100, I'd say that would meet the VoxForge goals for this style of speech; if 10 or 1, then it seems like a useful academic corpus.
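
The estimate above, written out as a back-of-envelope calculation. The function names are made up for the example, and the inputs are the rough guesses from the post, not measured figures:

```python
def estimated_hours(items: int, book_fraction: float, hours_per_book: float) -> float:
    """Total audio if only a fraction of the catalogue items are full books."""
    return items * book_fraction * hours_per_book


def hours_per_reader(total_hours: float, readers: int) -> float:
    """Average audio per reader, assuming the audio is spread evenly."""
    return total_hours / readers


total = estimated_hours(400, 0.25, 6)   # 400 items, 1/4 are books, 6 hours each
for readers in (100, 10, 1):
    print(readers, "readers:", hours_per_reader(total, readers), "hours each")
```

With 100 readers that is about 6 hours per reader; with 10 readers it is about 60 hours each, which is a very different kind of corpus.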

Anyone know how many speakers there are (and what the entropy is)?




--- (Edited on 2/26/2007 11:15 am [GMT-0600] by Tony Robinson) ---

Re: Google Summer of Code
User: kmaclean
Date: 2/27/2007 10:34 pm
Views: 525
Rating: 42

Hi Tony,

Thanks for the recommendation of combining suggested projects 1 & 2 - makes sense.

One major issue that I neglected to mention for all these sources of audio is that most use lossy compressed audio formats (e.g. MP3, Ogg Vorbis, etc.). The plan would be to contact these sites and ask their readers to submit their audio and transcriptions in an uncompressed (or lossless compressed) format.  I've just set up a new page on VoxForge to handle this, called Upload - it allows people to submit large audio files to the VoxForge FTP site using their own FTP client.

This approach of collecting uncompressed audio might address the issue you mentioned of getting the proper version of the Wiki text for a particular Spoken Wikipedia recording, since we ask users to submit the text of their recordings.

With respect to your concerns about the formatting of Wiki text, I agree. Wikipedia can have some unique formatting issues that might pose problems.  It might be more work than it is worth to ensure the text matches the audio, especially if there are better sources.

Having said this, an automated transcription script (could be another Google SoC project ...) that uses the questionable text as a starting point to be verified might be a workable solution.  A test case would be needed to determine if this is possible.

For the Gutenberg Audio Project, I think licensing depends on the source of the speech audio. The Gutenberg Audio page gets its books from the following sources: 

  • – donated some audio texts to Gutenberg.  These recordings are copyright protected, so we can't use them because this is incompatible with GPL. Their approach is to provide free, low quality audio, and charge on a sliding scale for improved quality recordings (going from $3 to $7USD).  All audio uses MP3 based compression.

  • – seems to be using a Creative Commons License (with Attribution, Noncommercial and No Derivative Works conditions). This is incompatible with GPL. What is interesting about these submissions is that they provide 48kHz FLAC (lossless compressed) recordings, in addition to MP3.

  • Librivox – Gutenberg also provides audio books collected by Librivox, so Gutenberg Audio might only be useful as an alternate repository for Librivox audio.

Librivox Audio books are released into the Public Domain. Their readers use 'out of copyright' books from Gutenberg. As far as I can tell from this link, there are 402 LibriVox readers.  Not sure how many of these are English.  I could probably write a script to figure out the ratio of readers to books and/or the total number of hours spoken by each speaker (I think this is what you are asking about when you ask about 'entropy'), if it becomes necessary, since all the required data is on their web site - though it is spread around a bit.
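
A sketch of the kind of script described above, assuming the catalogue has already been scraped into `(speaker, seconds)` records (the scraping itself is not shown, and this record format is hypothetical). It also computes one possible reading of "entropy": the Shannon entropy of how the audio is distributed over speakers, which is an interpretation and may not be exactly what Tony meant:

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def hours_per_speaker(recordings: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum recording durations (given in seconds) into hours per speaker."""
    totals = defaultdict(float)
    for speaker, seconds in recordings:
        totals[speaker] += seconds / 3600.0
    return dict(totals)


def speaker_entropy_bits(hours: Dict[str, float]) -> float:
    """Shannon entropy (in bits) of the share of total audio held by each speaker."""
    total = sum(hours.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for h in hours.values():
        if h > 0:
            p = h / total
            entropy -= p * math.log2(p)
    return entropy
```

`2 ** speaker_entropy_bits(hours)` can be read as an "effective" number of speakers: 402 readers with one reader supplying most of the audio would score far below 402.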

In summary, I think the limiting factor in this is how much uncompressed audio (or lossless compressed) we can get from Librivox.  Librivox recommends readers use the Audacity audio editor (same as VoxForge).  Assuming 10% of Librivox readers would have kept their audio, and are willing to submit it to VoxForge (given the long upload times involved), based on your numbers we might get an additional 60 hours of transcribed speech.  Not much, considering the total.  However, if we can get VoxForge submission procedures documented on the Librivox site, then I think we might get a higher percentage of the new Librivox submissions.  



--- (Edited on 2/27/2007 11:34 pm [GMT-0500] by kmaclean) ---

Re: Google Summer of Code
User: Tony Robinson
Date: 2/28/2007 5:35 am
Views: 490
Rating: 39

Hi Ken,

I think it would be worth discussing the need to work with uncompressed audio.

Firstly, to agree with you, the use of compression is an obvious source of noise, and all obvious sources of noise should be eliminated as far as is practical.

However, I would like to question how much noise 128kbps MP3 adds to an audio recording.   I've listened to three speakers and looked at one with wavesurfer - the speech sounds and looks clear to me.  In one case, the background noise, whilst not excessive, was certainly more noticeable than the MP3 artifacts.

Ideally the way to test this out would be to train and test on the uncompressed audio, train and test on the compressed, and compare error rates.   However, this implies that most of the project is already done.

Another way to look at this is to consider the degree of mismatch between the source/train environment and the target environment.    If the aim is freely redistributable acoustic models, then the target environment is very varied, and it could be that the coding noise is not significant compared with this mismatch.

Of course, if Librivox speakers will upload the original audio to you then that is preferable; however, from a project management point of view I'd hate to be dependent on 400 people who have volunteered for a different cause.

On other issues:

Focussing on Librivox would seem sensible if there is a sufficient variety of speakers, which you seem to think is the case.

My point on copyright was targeted at Project Gutenberg texts; however, rereading the T+C's, they allow the insertion of markup to add timings to the text.   Nevertheless, the whole text (plus markup) must be distributed, not just a lot of 15 second chunks.

You may well find that an "automated transcription script" is very helpful in this project; many ASR sites use such tools to check audio transcriptions or to train on partially transcribed audio.

To summarise, if I were doing this for Cantab Research I would focus only on 128kbps audio from Librivox and:

1) Get a list of speakers and number of hours spoken by each speaker.

2) Write the scripts to download all the audio and text

3) Write scripts to clean up the text so that it matches the audio.   In the first case this would be removing the Gutenberg preamble and adding the spoken Librivox preamble, and looking at what can be done about chapter headings, etc.

4) Build acoustic and language models

5) Use an "automated transcription script" to highlight any problems with the transcriptions, and if so go back to stage 3 and fix them up.

6) Decide on a sensible split of data between train, eval and test.

7) Make three releases.   The first would be the audio and text (in original forms), the second the scripts that perform steps 3-5 above (so that others may improve them), and the third the acoustic model release.
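
Step 3 in the list above might start out something like the following sketch. The marker strings passed in are placeholders: the real Gutenberg header and footer lines vary between etexts and would need to be checked per book, and the spoken preamble text is whatever the reader actually says. All function names are hypothetical:

```python
import re


def strip_between_markers(text: str, start_marker: str, end_marker: str) -> str:
    """Keep only the lines after the first start_marker line and before end_marker."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if start_marker in line:
            start = i + 1
            break
    for i in range(start, len(lines)):
        if end_marker in lines[i]:
            end = i
            break
    return "\n".join(lines[start:end]).strip()


def normalise_for_alignment(text: str) -> str:
    """Lower-case and drop everything except letters, apostrophes and spaces.

    Digits are simply removed here; a real script would need to spell out
    numbers and chapter headings the way the reader speaks them.
    """
    text = re.sub(r"[^a-z' ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def build_transcript(ebook_text: str, spoken_preamble: str,
                     start_marker: str, end_marker: str) -> str:
    """Replace the written preamble with the spoken one and normalise the result."""
    body = strip_between_markers(ebook_text, start_marker, end_marker)
    return normalise_for_alignment(spoken_preamble + " " + body)
```

The output is a single normalised word string, which is the form an alignment or "automated transcription" check in step 5 would typically compare against.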

Hope that helps,



[ speaking for Cantab Research only ]


--- (Edited on 2/28/2007 5:35 am [GMT-0600] by Tony Robinson) ---

--- (Edited on 2/28/2007 5:42 am [GMT-0600] by Tony Robinson) ---

Re: Google Summer of Code
User: kmaclean
Date: 3/1/2007 8:51 pm
Views: 398
Rating: 39

Hi Tony,

Thank you very much for the feedback, it's greatly appreciated.

mp3 audio 

With respect to the use of lossy compressed audio, such as 128kbps mp3, I always assumed that we could not use such audio for the training of Acoustic Models.  It was a rule of thumb I used without really questioning why.  Thanks for pointing out that mp3 could be 'good enough' for our purposes. 

Brough Turner, in a post he made on his blog called Large Speech Corpora, also discussed the use of podcasting audio (which is usually lossy compressed, like mp3) as a possible source of audio data for creating Acoustic Models.  I commented (incorrectly) that the use of lossy compressed audio (such as MP3 or Ogg recordings) was not a good source of audio for training AMs.  However, he rightly noted that mp3 audio, although lossy, is probably better quality speech than what you would find in telephony audio (8kHz sampling rate at 8-bits per sample).  I never really thought that comment through ... Basically, the same logic would apply to audio used for Acoustic Models (which usually use 16kHz sample rates, at 16-bits per sample) for use in Desktop Computer Speech Recognition, which is your point.  I think it is worth exploring, and worth submitting as a Google Summer of Code project.  Hopefully it will get accepted, and we can find a student interested in helping out.
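
For a rough sense of scale, the raw bit rates behind that comparison can be worked out as below. This assumes mono audio, and bit rate is only a crude proxy: it says nothing directly about how much speech-relevant detail a lossy codec keeps. The function name is made up for the example:

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Bit rate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


telephony = pcm_bitrate_kbps(8000, 8)     # 8kHz at 8 bits per sample
desktop = pcm_bitrate_kbps(16000, 16)     # 16kHz at 16 bits per sample
print("telephony:", telephony, "kbps")
print("desktop:  ", desktop, "kbps")
print("128kbps mp3 is", desktop / 128, "x smaller than 16kHz/16-bit mono PCM")
```

So a 128kbps mp3 carries twice the bit rate of telephony audio, and is only a 2:1 reduction relative to 16kHz/16-bit mono PCM.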

Librivox Project Management Issues

I also agree with your point that it would be difficult to get users to submit audio for books they have already completed (10% is optimistic).  I still plan to follow-up with Librivox on the possibility of their asking users to submit audio to VoxForge (we'll see how receptive they are...) as they create new audio books - that way we get the 'best' audio.  In addition the scripts created for processing mp3 audio could be reused for this purpose, so there should be little added burden to collect this audio.

Gutenberg and Librivox Copyright 

Librivox audio is public domain (they use a Creative Commons Public Domain Dedication).  They use ebooks obtained from Project Gutenberg.  Many Project Gutenberg ebooks are also public domain (not all).  To make sure that they only release audio readings of public domain texts, Librivox relies on Project Gutenberg's legal work to assure Copyright status of their books.  

The distribution restrictions (i.e. only being able to distribute the whole text, with markups) you are referring to are with respect to the distribution of an etext using the Project Gutenberg Trademark name, not with respect to the distribution of the eBook itself.   Basically if you want to re-distribute Project Gutenberg e-text with their Trademarked name, you must follow some licensing provisions that include a requirement that the text not be broken up in any way.  If we don't use the Project Gutenberg Name, and delete any reference to it in the text, we can distribute the text in any way we see fit.

For example, for the Herman Melville book 'Typee', the Librivox audio is public domain (it uses the Creative Commons Dedication).  The text of the Gutenberg Typee ebook has the following "license".  It says the following in the intro:

tm etexts, is a "public domain" work distributed by Professor
Michael S. Hart through the Project Gutenberg Association at
Carnegie-Mellon University (the "Project"). Among other
things, this means that no one owns a United States copyright
on or for this work, so the Project (and you!) can copy and
distribute it in the United States without permission and
without paying copyright royalties.

Then it goes on to say: 

Special rules, set forth
below, apply if you wish to copy and distribute this etext
under the Project's "PROJECT GUTENBERG" trademark.

So basically it says that no one owns copyright on the written text of this book in the US (and likely most other jurisdictions), and you can copy and distribute as you please.  But, if you want to copy and distribute the book along with references to the Gutenberg TradeMark, then you need to follow some special rules. 

Further on in the document it says:


You may distribute copies of this etext electronically, or by
disk, book or any other medium if you either delete this
"Small Print!" and all other references to Project Gutenberg,

This clarifies what you need to do if you want to distribute the ebook without any restrictions - basically you delete the 'license' and the Gutenberg trademarks.

It then goes on to elaborate the conditions you must follow if you do want to distribute the text with the Gutenberg trademarks:  

[1] Only give exact copies of it ...
[2] Honor the etext refund and replacement provisions of this
"Small Print!" statement.
[3] Pay a trademark license fee to the Project of 20% of the
net profits ...

Number [1] is the provision I think you were referring to.  

So I think we are OK in this particular example.  However, I'm sure there will be new wrinkles that will need to be addressed as we go along.

Thanks again, your points have been very helpful,


I am not a lawyer, and this is not a legal opinion ... 


--- (Edited on 3/ 1/2007 10:06 pm [GMT-0500] by kmaclean) ---

--- (Edited on 3/ 1/2007 11:07 pm [GMT-0500] by kmaclean) ---

Re: Google Summer of Code
User: Tony Robinson
Date: 3/5/2007 5:08 am
Views: 474
Rating: 41

Thanks Ken, I'm happy about the IP/copyright position(s) now. 

Also, from your news it's now possible to evaluate the degradation of 128kbps MP3 over clean speech.  All we need to do is push some or all of the uncompressed and MP3 audio through an existing recogniser and score both results.   It doesn't matter too much if the acoustic and language models used aren't very well adapted to the task, or if the audio doesn't quite match the transcription, as the performance degradation will be pretty much the same for both sets of audio.   If we see a small difference in the scores then chances are the MP3 data is fine, if there is a large difference then chances are that the MP3 process has introduced noise/channel distortion that will degrade acoustic models trained on MP3 data.
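
"Scoring both results" usually means computing a word error rate against the reference text. A minimal sketch using word-level edit distance follows; the function names are illustrative, and established scoring tools do more (text normalisation, alignment reports, and so on):

```python
from typing import Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, insertions and deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances for an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]


def word_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words over (ref, hyp) pairs."""
    pairs = list(pairs)
    errors = sum(word_errors(r, h) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words if words else 0.0
```

Running the same recogniser over the uncompressed and the MP3 versions and calling `word_error_rate` on each set of outputs gives the two numbers to compare.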

 So, my question to everyone out there reading this, does anyone have a working LVCSR system that they can point at the two sets of audio and provide two lots of recognition results?




--- (Edited on 3/ 5/2007 5:08 am [GMT-0600] by Tony Robinson) ---

Re: Google Summer of Code
User: kmaclean
Date: 3/6/2007 9:08 am
Views: 421
Rating: 31

Hi Tony,

Could you clarify what you mean by: "LVCSR system that they can point at the two sets of audio and provide two lots of recognition results".   Do you mean we should be testing mp3 based Acoustic Models using a speech recognition engine that uses a language model (i.e. dictation based speech recognition) as opposed to a grammar file (i.e. command and control based speech recognition)?

I would have thought that creating two Acoustic Models (one generated with wav data and another generated with MP3 audio converted to wav) and running them separately through a grammar based speech recognition engine, and gathering the recognition results, would give us the answer we are looking for - i.e. whether MP3 audio, converted to Wav, improves or degrades speech recognition performance.  Or would these results not be conclusive enough because of limitations inherent in grammar based speech recognition?




--- (Edited on 3/ 6/2007 10:08 am [GMT-0500] by kmaclean) ---

Re: Google Summer of Code
User: Tony Robinson
Date: 3/6/2007 1:38 pm
Views: 546
Rating: 37

Hi Ken,

My background and interest is in speaker independent large vocabulary recognition, so I sometimes only think in these terms - let me start again. 

If we (or anyone else reading this) had a speaker independent large vocabulary system to hand, then they could load in the MP3 audio and get some words out and then load in the uncompressed audio and get (almost the same) words out (this is a quick check, far less work than the total project, with the aim to make sure the project runs smoothly).  Now this system will have been trained on different acoustics and will have a different language model than the domain of the book in question, but my point was that this would not matter too much.   All we care about is that the MP3 audio produces about the same recognition rate as the uncompressed audio.   If the difference in the error rates is not statistically significant then we can say that the noise/distortion introduced by the MP3 process is not significant with respect to all the other sources of noise/distortion/mismatch in the system.   It doesn't guarantee that the MP3 distortion won't show up with a better trained recognition system, but it is a good indicator.

A preexisting system would just be an executable that read in acoustic and language models and then read in the audio and spat out the words.   Speech recognition companies will have these but may not want to disclose the error rates; university research groups will have these but they perhaps won't be packaged up enough to process long chunks of audio without writing a lot of special code.    My appeal was for anyone who has a system already to run it and post the numbers.
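
One simple way to check whether a difference in error rates is statistically significant is a paired sign test over utterances. This is just one possible choice of test, not necessarily the one meant above, and the function name and input format (per-utterance error counts for the same utterances, in the same order) are assumptions:

```python
from math import comb
from typing import Sequence


def sign_test_p_value(errors_a: Sequence[int], errors_b: Sequence[int]) -> float:
    """Two-sided exact sign test on paired per-utterance error counts.

    Ties are ignored. A small p-value suggests one condition is
    consistently worse than the other.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both conditions must cover the same utterances")
    a_better = sum(1 for a, b in zip(errors_a, errors_b) if a < b)
    b_better = sum(1 for a, b in zip(errors_a, errors_b) if a > b)
    n = a_better + b_better
    if n == 0:
        return 1.0
    k = min(a_better, b_better)
    # Probability of a split at least this lopsided if each side were equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Here `errors_a` would hold the word errors per utterance for the uncompressed audio and `errors_b` those for the MP3 audio; a large p-value is consistent with the MP3 distortion being lost among the other sources of mismatch.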

If we could have these numbers before the SoC proposal deadline then I think it strengthens the proposal massively (assuming a positive result).

Hope that made more sense this time,




Dr Tony Robinson, CEO Cantab Research Ltd
Phone:  +44 845 009 7530, Fax: +44 845 009 7532

--- (Edited on 6-March-2007 7:38 pm [GMT+0000] by Tony Robinson) ---

Re: Google Summer of Code
User: kmaclean
Date: 3/11/2007 10:33 am
Views: 498
Rating: 39

Hi Tony,

I created a quick 'sanity test' to compare Acoustic Models trained with wav audio versus mp3 audio.  Basically I took the larger wav directories in the VoxForge corpus, and converted them to MP3 and then converted them back.  I then compared Acoustic Models ("AM") created with the original wav data, to AMs trained with converted mp3 data to get an idea of any performance differences.

The tests with Julius performed as expected, with a bit of a degradation of performance by using mp3-based Acoustic Models. 

The tests with HTK are a bit more confusing, since these show some improvement in performance when using AMs based on mp3 audio.  

Basically I need to use a larger test sample with a more complex grammar to get a better test.  Regardless, the use of MP3 audio for AM training looks promising.
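
For anyone wanting to try a similar round trip, the conversion step could be scripted roughly as below. The post does not say which tools were actually used; this sketch assumes `ffmpeg` built with the `libmp3lame` encoder is installed, and the output naming scheme is made up for the example:

```python
from pathlib import Path
from typing import List, Tuple


def round_trip_commands(wav_path: str, out_dir: str,
                        bitrate_kbps: int = 128) -> Tuple[List[str], List[str]]:
    """Build ffmpeg commands to encode a WAV to MP3 and decode it back to WAV."""
    src = Path(wav_path)
    out = Path(out_dir)
    mp3 = out / (src.stem + ".mp3")
    back = out / (src.stem + "_from_mp3.wav")
    # -y overwrites existing output, -b:a sets the audio bit rate.
    encode = ["ffmpeg", "-y", "-i", str(src),
              "-codec:a", "libmp3lame", "-b:a", f"{bitrate_kbps}k", str(mp3)]
    decode = ["ffmpeg", "-y", "-i", str(mp3), str(back)]
    return encode, decode
```

Each command list can be passed to `subprocess.run(cmd, check=True)`; the decoded WAV files then take the place of the originals when training the second set of Acoustic Models.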


--- (Edited on 3/11/2007 11:33 am [GMT-0400] by kmaclean) ---