Audio and Prompts Discussions

Speech recognition on MPEG/Audio encoded files
User: kmaclean
Date: 7/17/2007 7:15 pm

The approach VoxForge has taken in processing LibriVox audiobooks is to ask LibriVox users to submit their wav files to VoxForge before they compress them to mp3 format (see the uploads page). 

We've also run some tests that convert mp3 speech files to wav format and train acoustic models from the resulting wav files, and the results look promising (see the Convert Audio to MP3 and Compare Results with Original Wav link).
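For anyone who wants to try the mp3-to-wav step themselves, here is a minimal sketch. It assumes ffmpeg is installed and on the PATH, which is not necessarily the tool used in the VoxForge tests. The function names, the 16 kHz sample rate, and the mono downmix are illustrative assumptions, not VoxForge settings.

```python
import subprocess


def build_mp3_to_wav_command(mp3_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command that decodes an mp3 file to 16-bit mono PCM wav.

    Assumption: 16 kHz mono output; adjust to whatever your training setup expects.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it already exists
        "-i", str(mp3_path),       # input mp3 file
        "-ac", "1",                # downmix to a single (mono) channel
        "-ar", str(sample_rate),   # resample to the requested rate in Hz
        "-acodec", "pcm_s16le",    # write 16-bit little-endian PCM samples
        str(wav_path),
    ]


def convert_mp3_to_wav(mp3_path, wav_path, sample_rate=16000):
    """Run the conversion; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(build_mp3_to_wav_command(mp3_path, wav_path, sample_rate), check=True)
```

Calling convert_mp3_to_wav("chapter01.mp3", "chapter01.wav") would then write the decoded wav file (the file names here are hypothetical).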

I recently found a patent that trains acoustic models on MPEG-compressed audio directly (i.e. no conversion to an intermediate wav file is required before training).  It shows a novel(?) way of indexing videos by training acoustic models on a video's compressed audio track (not sure how they filter out music or other non-speech noise...). They used the HTK toolkit for this approach.  Here is the abstract of the patent:

United States Patent 6370504
Abstract:
A technique to perform speech recognition directly from audio files compressed using the MPEG/Audio coding standard. The technique works in the compressed domain and does not require the MPEG/Audio file to be decompressed. Only the encoded subband signals are extracted and processed for training and recognition. The underlying speech recognition engine is based on the Hidden Markov model. The technique is applicable to layers I and II of MPEG/Audio and training under one layer can be used to recognize the other.

What is interesting is that they compare speech recognition results for acoustic models trained from the original speech wav files (for a single speaker):

                             TABLE 2
Results Using Raw PCM for Both Training and Recognition

Accuracy (%)   Words   Deletions   Substitutions   Insertions
99.65          2,018   5           0               2

and recognition results using acoustic models trained on MPEG/Audio Layer I encoded audio (for a single speaker):

                             TABLE 3
Training At One Bit Rate & Recognition At Other Bit Rates (Layer I)

Training     Recognition at kbit/s
(kbit/s)     32       64       96       128      160      192
32           90.68%   50.35%   47.77%   47.87%   47.92%   47.87%
96            9.51%   97.87%   97.87%   97.92%   97.97%   97.92%
192           9.51%   97.97%   97.92%   98.02%   97.97%   98.02%
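As a sanity check on Table 2, the accuracy figure appears consistent with the HTK-style word accuracy formula, (N - D - S - I) / N, where N is the number of words. That the patent used exactly this formula is an assumption on my part, and the function name below is just for illustration.

```python
def word_accuracy(words, deletions, substitutions, insertions):
    """HTK-style word accuracy in percent: (N - D - S - I) / N * 100.

    Assumption: Table 2 reports accuracy this way; the patent text is not quoted here.
    """
    return 100.0 * (words - deletions - substitutions - insertions) / words


# Counts taken from Table 2 (raw PCM for both training and recognition).
accuracy = word_accuracy(words=2018, deletions=5, substitutions=0, insertions=2)
print(f"{accuracy:.2f}")  # prints 99.65
```

With 7 errors in 2,018 words, the formula gives 2011 / 2018, which rounds to the 99.65 shown in the table.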

This would seem to confirm that our approach of converting mp3 audio to wav files is suitable for training acoustic models.
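To make the pattern in Table 3 easier to see, here is a small sketch that finds the best and worst recognition bit rate for each training bit rate. The numbers are copied from Table 3; the variable and function names are my own.

```python
# Recognition bit rates (kbit/s), in the column order of Table 3.
RECOGNITION_RATES = [32, 64, 96, 128, 160, 192]

# Accuracy (%) per training bit rate (kbit/s), copied from Table 3.
TABLE_3 = {
    32: [90.68, 50.35, 47.77, 47.87, 47.92, 47.87],
    96: [9.51, 97.87, 97.87, 97.92, 97.97, 97.92],
    192: [9.51, 97.97, 97.92, 98.02, 97.97, 98.02],
}


def summarize(table, rates):
    """Return, per training rate, the (recognition rate, accuracy) of the best and worst cell."""
    summary = {}
    for train_rate, accuracies in table.items():
        best = max(accuracies)
        worst = min(accuracies)
        summary[train_rate] = {
            # index() returns the first column holding that value if there are ties
            "best": (rates[accuracies.index(best)], best),
            "worst": (rates[accuracies.index(worst)], worst),
        }
    return summary


for train_rate, result in summarize(TABLE_3, RECOGNITION_RATES).items():
    print(train_rate, result)
```

Models trained at 96 or 192 kbit/s stay near 98% at recognition rates of 64 kbit/s and above but fall to 9.51% at 32 kbit/s, while the model trained at 32 kbit/s does best (90.68%) on 32 kbit/s audio and drops to about 48-50% elsewhere.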

Ken 

--- (Edited on 7/17/2007 8:15 pm [GMT-0400] by kmaclean) ---
