
How to Manually Segment an Audio Book (Draft)

To use the audio and text from an Audio Book in the creation of the VoxForge Acoustic Model, the Audio Book must be segmented into 5-10 second audio files.  This involves using a silence detection tool to determine the location of pauses in the Audio Book, and segmenting the large audio file into a series of smaller files based on these pauses.  Next, the corresponding eText must also be segmented using a Perl script that can determine the location of sentences in a text.

Once we have a few hundred hours of speech audio, we will be able to use the VoxForge Acoustic Model to perform 'Forced Alignment' to automatically segment the speech audio and text into files.

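Required Tools:

  • Audacity (to check whether the audio is mono or stereo)
  • SoX (to convert stereo audio to mono)
  • adintool, from Julius (to segment the audio at pauses)
  • Perl, with the 'Lingua::EN::Sentence' module from CPAN and the eText2Prompts.pl script (to segment the eText)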

Step 1: Confirm Mono Audio

Check to see if the speech audio file is in Mono or Stereo (i.e. recorded with one channel or two).  You can use Audacity to do this - just import the wav file into Audacity.  If there are two tracks shown (one for each channel), then the audio is stereo.  If there is only one track, then it is mono.

If the audio file is in stereo, use SoX to convert it to mono (i.e. to a single channel) as follows:

    $ sox stereo_file.wav -c 1 mono_file.wav

where the parameters are as follows:

  • -c  sets the number of channels in the output file (here 1, i.e. mono)
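
If Audacity is not at hand, the channel count can also be read from the WAV header.  The following is a minimal sketch using Python's standard `wave` module.  It assumes an uncompressed PCM WAV file, and the function names `channel_count` and `describe` are illustrative only; they are not part of the VoxForge tooling.

```python
import wave

def channel_count(path):
    """Return the number of channels recorded in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels()

def describe(path):
    """Return 'mono', 'stereo', or an 'N channels' string for a WAV file."""
    n = channel_count(path)
    return {1: "mono", 2: "stereo"}.get(n, f"{n} channels")
```

For example, `describe("mono_file.wav")` should report mono after the SoX conversion above; if it still reports stereo, the conversion did not write the file you are inspecting.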

Step 2: Segment the Audio Using Silence Detection

Go to the directory where the speech audio file you want to segment is located.  Create a sub-directory called 'wav'.

Use the Julius adintool to segment the audio file as follows:

    $ adintool -in file -out file -filename wav/segment -startid 1000 -freq 44100 -lv 1000 

The following will appear in your console:

----
Input-Source: Wave File (filename from stdin)
Segmentation: on, continuous
  SampleRate: 44100 Hz
       Level: 1000 / 32767
   ZeroCross: 60 per sec.
  HeadMargin: 400 msec.
  TailMargin: 400 msec.
  ZeroFrames: drop
   remove DC: off
Recording: segment.1000.wav, segment.1001.wav, ...
----
[start recording]
enter filename->

At the "enter filename->" prompt, type the name of the audio file to be segmented.

The parameters are as follows:

  • -lv  is the level threshold (default: 2000).  If the audio input amplitude goes over this threshold for a period, this triggers the beginning of a speech segment.  If the level goes below this threshold after triggering, it is the end of the speech segment.

  • -zc  is the zero-crossing threshold per second (default: 60).  Fewer crossings in a second signify a pause in speech.

  • -headmargin  is the silence margin kept at the start of each speech segment, in milliseconds (default: 400).

  • -tailmargin  is the silence margin kept at the end of each speech segment, in milliseconds (default: 400).

All audio is recorded differently.  As a result, you will have to adjust the '-lv', '-zc', '-headmargin' and '-tailmargin' parameters until you get a set of files that are each less than 10 seconds in duration.
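
Listening to every segment after each parameter change is slow, so it can help to measure the durations instead.  The following is a rough sketch, assuming the segments are PCM WAV files in the 'wav' sub-directory created in Step 2.  The helper names and the 10-second default are illustrative and are not part of adintool.

```python
import glob
import os
import wave

def duration_seconds(path):
    """Duration of a PCM WAV file, computed as frame count / sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def too_long(directory="wav", limit=10.0):
    """Return (file name, seconds) for each segment longer than `limit`."""
    results = []
    for path in sorted(glob.glob(os.path.join(directory, "*.wav"))):
        seconds = duration_seconds(path)
        if seconds > limit:
            results.append((os.path.basename(path), round(seconds, 2)))
    return results

if __name__ == "__main__":
    # Print every over-length segment in ./wav, one per line.
    for name, seconds in too_long():
        print(f"{name}: {seconds} s")
```

An empty result suggests the current settings are splitting at enough pauses; any files listed point to passages where the thresholds or margins may still need adjusting.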

Step 3: Segment the eText into Sentences

First, download and install the 'Lingua::EN::Sentence' module from CPAN.

Next, right-click and save the eText2Prompts.pl script (change the suffix from "_pl.txt" to ".pl"), and execute it as follows (file names are case-sensitive on Linux, so type the script name exactly as you saved it):

    $ perl ./eText2prompts.pl eText prompts

This will create a file ("prompts") that contains all the sentences in the eText, separated by a line feed.

Step 4: Match the eText Sentences to the Generated Speech Audio Segments

Next, review each wav file generated in Step 2, and label the text in the prompts file with the name of the audio file it corresponds to.  You may have to break up sentences into separate prompt lines because the audio segmentation in Step 2 might have broken up a sentence into 2 or more files (depending on the number of pauses it detected in the audio file).
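
Once the prompts file is labeled, a quick cross-check can catch segments that were missed.  This sketch makes an assumption that this page does not confirm: that each prompt line starts with the audio file name without its ".wav" extension, followed by whitespace and the text.  Adjust the parsing if your labels look different.  The function names are hypothetical.

```python
import glob
import os

def prompt_labels(prompts_path):
    """Collect the first whitespace-separated token of each non-empty line."""
    labels = set()
    with open(prompts_path, encoding="utf-8") as prompts:
        for line in prompts:
            fields = line.split()
            if fields:
                labels.add(fields[0])
    return labels

def unmatched(prompts_path="prompts", directory="wav"):
    """Return (segments without a prompt, prompts without a segment)."""
    labels = prompt_labels(prompts_path)
    # Assumed convention: label == wav file name minus the ".wav" extension.
    segments = {
        os.path.splitext(os.path.basename(path))[0]
        for path in glob.glob(os.path.join(directory, "*.wav"))
    }
    return sorted(segments - labels), sorted(labels - segments)
```

Both returned lists should be empty before you package the submission; a name in the first list is an audio segment with no text, and a name in the second is a prompt line that refers to no audio file.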

Step 5: Submit Your Segmented Files to VoxForge

Create Readme file

Create a README file that describes your submission.  Right-click this link and save the file to your upload folder.  Modify the entries where appropriate:

  • Each line in the README has a question with some possible answers within brackets.  Please replace the suggestions between the brackets with your answer.
  • Take your best guess as to the original author's dialect (follow this link for help on this).  If you are not sure, just put in Librivox.

Create License file 

Next, create a LICENSE file for your submission.  Right-click this link and save the file to your upload folder.   Change the year to the current year, and the 'name of author' to your name (or to the 'Free Software Foundation' - if you wish to assign your copyright to the FSF). 

Although the audio book you segmented is likely in the public domain, you have copyright over the way the audio was segmented (because you have rearranged this audio in a unique way) and therefore you can license the segmented audio under GPL. 

Tar your files.

Please create a single compressed tar file containing the following files:
  • your segmented wav files;
  • the corresponding prompts file;
  • your updated eText file (remove any references to Project Gutenberg - see this FAQ for an explanation why);
  • any changes/updates you might have made to the VoxForge Lexicon; and
  • your README and LICENSE files.
Name your tar file as follows: "[voxforge username]-[year][month][day].tgz".  For example, if you stored all these files in the /home/myusername/segment folder, you would execute the following commands to create your gzipped tar file:

    $ cd /home/myusername
    $ tar -zcvf kmaclean-20070125.tgz segment
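
The same naming convention can be generated rather than typed by hand.  The sketch below uses Python's standard `tarfile` module as a stand-in for the tar command above; the function names are illustrative, and the username and folder are whatever applies to your submission.

```python
import os
import tarfile
from datetime import date

def archive_name(username, day=None):
    """Build '[voxforge username]-[year][month][day].tgz' for the given date."""
    day = day or date.today()
    return f"{username}-{day:%Y%m%d}.tgz"

def make_archive(folder, username, day=None):
    """Create a gzipped tar of `folder` next to it and return the archive path."""
    folder = os.path.abspath(folder)
    parent, base = os.path.split(folder)
    target = os.path.join(parent, archive_name(username, day))
    with tarfile.open(target, "w:gz") as tar:
        # arcname keeps member paths relative (segment/...), which mirrors
        # running tar from the parent directory as in the commands above.
        tar.add(folder, arcname=base)
    return target
```

Calling `make_archive("/home/myusername/segment", "kmaclean")` would write the archive into /home/myusername, named with today's date.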

Connect to the VoxForge FTP site

Connect to the site using your favourite FTP client (see link below). 

If you are using Firefox 1.5 or greater, you can use FireFTP, a cross-platform FTP client.  For Linux you can use Nautilus (Gnome), for Windows you can use FileZilla or WinSCP, and Cyberduck can be used on a Mac.  

(Note: You need to be registered on the VoxForge site for the link to display and to get the current password.)

Copy your TarFile

Copy your compressed tarfile to the VoxForge FTP site. 

Submission Notification

Please add a note to say that you have submitted some audio to the VoxForge FTP site, or to ask questions about the FTP submission process.  You can do this by clicking the 'Add' link below (note: it is only visible if you are logged in).

thanks!

By kmaclean - 4/20/2009: Here is a Perl audio segmentation script that I worked on a while ago: Audiobook.pm.

By kmaclean - 8/23/2014 Guenter has also created an alignment script:


By kmaclean - 2/20/2014 SailAlign is an open-source software toolkit for robust long speech-text