
I Forgot My Password - How can I get a new one?

If you forget your password, you can have a new one e-mailed to you by performing the following steps:

  1. Click "Login" in the top menu;
  2. Click "click here to register";
  3. Click "I forgot my password";
  4. Enter your email address and click save;
  5. An email will be sent to you with a new password.

[top]

Java WebStart and VoxForge SpeechSubmission Application

For Ubuntu (20.04):

Install Java

$ sudo apt install default-jdk

 

Options:

A) From Browser: 

This option no longer works.

B) From Command line:

You can use javaws to download the .jnlp file and run it locally from the command line:

$ javaws http://read.voxforge1.org/speech/SpeechSubmission.jnlp

-or-

download the .jnlp file to your computer first, and then run it locally:

$ wget http://read.voxforge1.org/speech/SpeechSubmission.jnlp

$ javaws SpeechSubmission.jnlp

 

C) If all else fails:

Download the jar file directly from the command line (which is what the .jnlp file does from the browser) and run java against the jar:

$ wget http://read.voxforge1.org/speech/speechrecorder_standalone.jar

$ java -jar speechrecorder_standalone.jar

[top]

Julius Confidence Scores

From this thread: For Noisy Input

For a recognition result like this:

### read waveform input
Stat: adin_file: input speechfile: seven.wav
STAT: 12447 samples (1.56 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
............................................................................pass1_best: <s> 5
pass1_best_wordseq: 0 2
pass1_best_phonemeseq: sil | f ay v
pass1_best_score: -1867.966309
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 120 generated, 120 pushed, 14 nodes popped in 76
sentence1: <s> 5 </s>
wseq1: 0 2 1
phseq1: sil | f ay v | sil
cmscore1: 1.000 0.316 1.000

score1: -1944.799561

(tpavelka's post): Julius outputs two types of scores:

The Viterbi score, e.g.:

score1: -1944.799561

This is the cumulative score of the most likely HMM path. The Viterbi algorithm (decoder) is just a graph search which compares the scores of all possible paths through the HMM and outputs the best one. The problem is that the score of a path (sentence) depends on the sound file's length and also on the sound file itself (see this thread for more discussion). This means that Viterbi scores for different files are not comparable. I understand that you want some kind of measure that can tell you whether the result found by Julius is believable or not. In that case, have a look at

The confidence score, in your example:

cmscore1: 1.000 0.316 1.000

Julius outputs a separate score for each word, so in your example the starting silence has confidence score of 1.0 (i.e. 100%), the word "five" has the score 0.316 (i.e. not that reliable) and the ending silence has again 1.0.
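As a rough illustration of how the per-word confidence scores might be used, the sketch below pairs each word of a "sentence1:" line with its value from the matching "cmscore1:" line and reports the words that fall below a cut-off. The function name and the 0.5 threshold are assumptions made for this example; they are not part of Julius.

```python
# Hypothetical helper: pairs each word in a Julius "sentence1:" line with
# its score from the matching "cmscore1:" line.
def low_confidence_words(sentence_line, cmscore_line, threshold=0.5):
    # Everything after the first colon is the whitespace-separated payload.
    words = sentence_line.split(":", 1)[1].split()
    scores = [float(s) for s in cmscore_line.split(":", 1)[1].split()]
    if len(words) != len(scores):
        raise ValueError("word count and score count differ")
    # Keep only the words whose confidence is below the (assumed) threshold.
    return [(w, s) for w, s in zip(words, scores) if s < threshold]


flagged = low_confidence_words("sentence1: <s> 5 </s>",
                               "cmscore1: 1.000 0.316 1.000")
print(flagged)  # [('5', 0.316)]
```

A suitable threshold depends on your acoustic model and application, so it is best chosen by testing on your own recordings.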

[top]

Language Model Toolkits

Popular language modeling tools are:

[top]

Licensing of Public Domain Audio Books (LibriVox)

VoxForge uses Public Domain Audio Books (LibriVox) recordings to create derivative works, which will be licensed under the GPL (with all applicable rights held by the Free Software Foundation).

This will not affect the legal status of the recording you submitted here or anywhere else (e.g. LibriVox). Therefore you will still be able to put your recording into the public domain (or it will remain in the public domain if it's already in there).

[top]

Linux: How do I Adjust my Recording Volume Levels Using Audacity?

First make sure your microphone volume in Audacity is set to 1.0.  Then click Record (i.e. the red circle button) and begin speaking in your normal voice for a few seconds, and then click Stop (i.e. the yellow square button). 

Look at the Waveform Display for the audio track you just created (see image below).  The Vertical Ruler to the left of the Waveform Display provides you with a guide to the audio levels.  Try to keep your recording levels between 0.5 and -0.5, averaging around 0.3 to -0.3.  It is OK to have a few spikes go outside the 0.5 to -0.5 range, but avoid having any go beyond the 1.0 to -1.0 range, as this will generate distortion (see image):
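The same check can be approximated outside Audacity. The sketch below is only an illustration: it assumes a 16-bit PCM WAV file, scales the largest sample to the -1.0 to 1.0 ruler, and compares it with the 0.5 and 1.0 guidance above. The function names and the 0.999 "clipping" cut-off are assumptions made for this example.

```python
import array
import sys
import wave


def read_samples(path):
    """Read all samples from a 16-bit PCM WAV file (assumed format)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        data = array.array("h")
        data.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        data.byteswap()  # WAV data is little-endian
    return data


def classify_level(samples, full_scale=32768):
    """Compare the peak sample with the 0.5 / 1.0 guidance."""
    if not samples:
        raise ValueError("no samples")
    # Normalise the largest absolute sample to the -1.0..1.0 ruler.
    peak = max(abs(s) for s in samples) / full_scale
    if peak >= 0.999:  # assumed cut-off for "at or beyond 1.0"
        return peak, "clipping risk: reduce the volume"
    if peak > 0.5:
        return peak, "peaks above 0.5: acceptable if only occasional"
    return peak, "within the 0.5 to -0.5 range"


# Usage (with a file of your own):
#   peak, verdict = classify_level(read_samples("test.wav"))
```

This only looks at the single largest sample, so it cannot tell a few spikes from a recording that is too loud throughout; the waveform display remains the better guide.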


If your Sound Level is too Low 

If you have increased your volume to the maximum and still are not getting an acceptable sound level, you may need to turn on the 'Mic Boost' switch in your Linux mixer.  Fedora's mixer (i.e. gnome-volume-control) is the "Volume Control" utility located in Applications>Sound & Video>Volume Control menu.  Select the 'Switches' tab of the Volume Control utility and then select 'Mic Boost (+20dB)' (see image below):

Note: Audacity's microphone volume control overrides any other microphone volume settings you may have in your Linux mixer (i.e. in the Capture tab).  

Hit the ctrl-z key in Audacity (to 'undo' your previous recording) and try recording again. 

If your Sound Level is too High

If the waveform display on your track extends beyond the 1.0 to -1.0 range (i.e. the waveforms have been clipped off at the top or bottom), your volume is too high.  Reduce it with Audacity's microphone volume control, hit ctrl-z in Audacity and try again.  From a speech recognition perspective, it is better to err on the side of a lower volume level - clipped speech sounds distorted.

Once you are satisfied that the volume is acceptable, try playing the file back by clicking Play (i.e. the green triangle button) in Audacity.  You will likely need to adjust the Master Volume and the PCM Volume sliders for your speakers under the 'Playback' tab in your Volume Control utility, see image:

Note: Your Audacity Volume control slider and your mixer's PCM Volume Control slider move in tandem - i.e. moving one will move the other.  But you may still need to adjust your Master Volume control in your mixer to hear sound from your speakers.

You need to hear your utterances after each recording to make sure they sound OK - but make sure that your speakers are turned off when you are recording.  Hit ctrl-z in Audacity to remove the track you just created.

[top]

Linux: How do I tar my audio files and prompts for submission to VoxForge

Please create a single compressed tar file containing the following files:

Name your tar file as follows "[voxforge username]-[year][month][day].tgz" .  For example, if you stored all these files in the /home/myusername/train folder, you would execute the following command to create your gzipped tar file:

$ cd /home/myusername
$ tar -zcvf kmaclean-06092006.tgz train
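If you prefer to script this step, the sketch below builds the archive name from the stated pattern and creates the gzipped tar with Python's standard tarfile module. The function name is hypothetical, and the date is written as a four-digit year followed by month and day, which is an assumption about the pattern (the example file name above uses a different digit order).

```python
import datetime
import os
import tarfile


def make_submission_tar(username, folder, when=None):
    """Create [username]-[year][month][day].tgz next to `folder`."""
    when = when or datetime.date.today()
    folder = os.path.abspath(folder)
    # Assumed date layout: four-digit year, then month, then day.
    archive = os.path.join(os.path.dirname(folder),
                           f"{username}-{when:%Y%m%d}.tgz")
    # "w:gz" writes a gzip-compressed tar, like `tar -zcvf`.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(folder, arcname=os.path.basename(folder))
    return archive


# Usage:
#   make_submission_tar("kmaclean", "/home/myusername/train")
```

The archive contains the folder itself (e.g. "train/...") rather than loose files, matching what the tar command above produces.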

[top]

Linux: how to adjust your microphone volume using GNOME

To set your microphone volume in Linux, you need to use your distro's mixer.  To start the Gnome mixer, select:

    System>Preferences>Volume Control

and then click the Capture tab:

Move the sliders up or down to increase or decrease your microphone's recording volume. 

Determining optimal microphone volume settings

First make sure your microphone slider is set to its mid-point.  Then click Record in the VoxForge Speech Submission Application, speak in your normal voice for a few seconds, and then click Stop.

Look at the Waveform Display for the recording you just created.   Adjust your microphone volume up or down depending on the size of the Waveforms.

If your Sound Level is too Low

If you have increased your volume to the maximum and still are not getting an acceptable sound level, you may need to either increase the volume settings or turn on the 'Mic Boost' switch in your Linux mixer.  Select the 'Switches' tab of the Volume Control utility and then select 'Mic Boost (+20dB)' (see image below):



Try re-recording some speech - you might have to reduce your microphone volume to compensate for the Mic Boost.

If your Sound Level is too High

If the waveforms in the display have been clipped off at the top or bottom, then your volume is too high.  Reduce your microphone volume, and re-record some speech.  It is better to err on the side of having a lower volume level from a speech recognition perspective - clipped speech sounds distorted.  But you also need it to be loud enough such that you can see your speech waveforms in the display (i.e. you should be able to see squiggly lines that correspond to your speech).

Adjusting your Playback Volume

Once you are satisfied that the volume is acceptable, try playing the file back by clicking Play.  You will likely need to adjust the Master Volume and the PCM Volume sliders for your speakers in your Volume Control utility, see image:

[top]

Linux: how to adjust your microphone volume using KDE

To set your microphone volume in Linux, you need to use your distro's mixer.  To start the KDE mixer (which is included with the "kdemultimedia" package), select:

    System>Multimedia>KMix

and then click the Input tab:

KMix-input.jpg

Move the "Mic" slider up or down to increase or decrease your microphone's recording volume. 

Determining optimal microphone volume settings

First make sure your microphone slider is set to its mid-point.  Then click Record in the VoxForge Speech Submission Application, speak in your normal voice for a few seconds, and then click Stop.

Look at the Waveform Display for the recording you just created.   Adjust your microphone volume up or down depending on the size of the Waveforms.

If your Sound Level is too Low

If you have increased your volume to the maximum and still are not getting an acceptable sound level, you may need to either increase the volume settings or turn on the 'Mic Boost' switch in your Linux mixer.  Select the 'Switches' tab of the KMix utility and then select 'Mic Boost (+20dB)' (see image below):

KMix-Switches.jpg

Try re-recording some speech - you might have to reduce your microphone volume to compensate for the Mic Boost.

If your Sound Level is too High

If the waveforms in the display have been clipped off at the top or bottom, then your volume is too high.  Reduce your microphone volume, and re-record some speech.  It is better to err on the side of having a lower volume level from a speech recognition perspective - clipped speech sounds distorted.  But you also need it to be loud enough such that you can see your speech waveforms in the display (i.e. you should be able to see "squiggly" lines that correspond to your speech).

Adjusting your Playback Volume

Once you are satisfied that the volume is acceptable, try playing the file back by clicking Play.  You might need to adjust the Master Volume and the PCM Volume sliders for your speakers under the KMix "Output" tab.

[top]

Linux: How to Change your Audacity Preferences to Record VoxForge Speech Audio

VoxForge collects speech audio at the highest Sample Rate that your Sound Card can support (up to a Sampling Rate of 48kHz, at 16 Bits Per Sample).  You'll need to look at your Sound Card's manual to determine the maximum it supports (see this FAQ entry for more info on your sound card and recording rates).  For this example we will assume a 48kHz Sample Rate.

Project Sampling rate

In Audacity, you set the Project Sampling Rate in your Preferences.  First go to 'File', then select 'Preferences...', next click the 'Quality' tab, and then set your 'Default Sample Rate' by clicking the up/down arrows to change it to 48000Hz (the default is usually 44100Hz), see image below:

AudacityPreferences_Quality.jpg

Sample Format

Still in the 'Preferences...' menu, and still under the 'Quality' tab, click 'Default Sample Format'.  Click the up/down arrows to change it to 16-bit, see image above.

Channels

While still in the 'Preferences...' menu, click the 'Audio I/O' tab, and then set your 'Channels' to 1 (Mono), see image below:

 

Export File Format 

While still in the 'Preferences...' menu, click the 'File Formats' tab, and then set your 'Uncompressed Export Format' to WAV (Microsoft 16 bit PCM), see image below:

 

You can also submit speech using FLAC format. 

Note: Please only submit audio files in an uncompressed format such as WAV or AIFF or lossless compressed format such as FLAC.

Click OK to save your settings. 

Making your settings active 

Now you need to exit and re-start Audacity to make these Project Setting changes active.   In Audacity, click File>Exit.  Restart Audacity by clicking Applications>Sound & Video>Audacity.

Look at the Project Rate selector in the bottom left-hand corner of the Audacity window and make sure it says 48000.  If it does, then you are ready to continue.  If not, then re-check your Preferences tab to make sure your settings are correct.
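Once you have exported a recording, you can also confirm that the file itself matches these settings. The sketch below uses Python's standard wave module to compare a WAV file against the example settings used in this entry (48000Hz, 16 bits, mono). The function name is hypothetical, and the default values should be changed if your sound card's maximum rate differs.

```python
import wave


def check_wav_format(path, rate=48000, bits=16, channels=1):
    """Return a list of mismatches between the file and the target settings."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {rate}")
        # getsampwidth() is in bytes, so multiply by 8 to get bits.
        if wav.getsampwidth() * 8 != bits:
            problems.append(
                f"sample size is {wav.getsampwidth() * 8} bits, expected {bits}")
        if wav.getnchannels() != channels:
            problems.append(
                f"file has {wav.getnchannels()} channels, expected {channels}")
    return problems


# Usage: an empty list means the file matches.
#   print(check_wav_format("my_recording.wav"))
```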

[top]

Linux: How to determine your audio card's, or USB mic's, maximum sampling rate

To submit audio to VoxForge, you need to make sure your sound card and your device driver both support 48kHz sampling at 16 bits per sample.

You can use arecord, the command-line sound recorder (and player) for the ALSA sound-card driver.  It should be included with your Linux distribution (type in "man arecord" at the command line to confirm this).

The approach here is to use the 'arecord' command to try to record your speech at a sampling rate higher than what your sound card supports.  arecord balks at this and returns a warning message stating the maximum rate your sound card or USB mic can give you.  Details of this approach can be found near the end of this thread (go to the second page).  Many thanks to Robin for helping out on this one.

1. Sound Card or Integrated Audio

If you have a sound card or audio processing integrated into your motherboard, get a list of all the audio devices on your PC by executing this command:

$ arecord --list-devices

You should get output similar to this: 

**** List of CAPTURE Hardware Devices ****
card 0: IXP [ATI IXP], device 0: ATI IXP AC97 [ATI IXP AC97]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

This says that my integrated audio card is on card 0, device 0.
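If you have several devices, reading the card and device numbers by eye can get tedious. The sketch below shows one possible way to turn a listing in the format shown above into the "hw:card,device" names that arecord's -D option takes. The function name and the regular expression are assumptions based only on the sample output in this entry.

```python
import re

# Matches lines such as:
#   card 0: IXP [ATI IXP], device 0: ATI IXP AC97 [ATI IXP AC97]
CARD_LINE = re.compile(r"^card (\d+): .*?, device (\d+):")


def capture_devices(listing):
    """Return ALSA 'hw:card,device' names found in the listing text."""
    names = []
    for line in listing.splitlines():
        match = CARD_LINE.match(line.strip())
        if match:
            names.append(f"hw:{match.group(1)},{match.group(2)}")
    return names


sample = """**** List of CAPTURE Hardware Devices ****
card 0: IXP [ATI IXP], device 0: ATI IXP AC97 [ATI IXP AC97]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
"""
print(capture_devices(sample))  # ['hw:0,0']
```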

Next, try to record your speech at a rate higher than what you think your highest recording rate might be (replacing the numbers in hw:0,0 with your card and device number):

$ arecord -f dat -r 60000 -D hw:0,0 -d 5 test.wav

The 60000 corresponds to a sampling rate of 60kHz.  Your output should look something like this: 

Recording WAVE 'test.wav' : Signed 16 bit Little Endian, Rate 60000 Hz, Stereo
Warning: rate is not accurate (requested = 60000Hz, got = 48000Hz)
         please, try the plug plugin (-Dplug:hw:0,0)
Aborted by signal Interrupt...

This tells us that the maximum sampling rate supported on my integrated audio card is 48000Hz (or 48kHz).  You may have to experiment with different sampling rates to get the Warning message. 
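The warning line can also be picked out automatically. The following sketch extracts the requested and actual rates from output in the format shown above; the function name is hypothetical, and the pattern is based only on the sample warning in this entry.

```python
import re

# Matches: Warning: rate is not accurate (requested = 60000Hz, got = 48000Hz)
RATE_WARNING = re.compile(r"requested = (\d+)Hz, got = (\d+)Hz")


def supported_rate(arecord_output):
    """Return the rate arecord fell back to, or None if no warning appears."""
    match = RATE_WARNING.search(arecord_output)
    return int(match.group(2)) if match else None


output = """Recording WAVE 'test.wav' : Signed 16 bit Little Endian, Rate 60000 Hz, Stereo
Warning: rate is not accurate (requested = 60000Hz, got = 48000Hz)
         please, try the plug plugin (-Dplug:hw:0,0)
"""
print(supported_rate(output))  # 48000
```

A result of None means no warning was found, in which case you may need to try a higher requested rate, as noted above.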

2. USB Microphone or USB audio pod

If you have USB based audio, first get a list of all the audio devices on your PC using this command:

$ arecord --list-devices

You should get a listing similar to this:

[...]
card 1: default [Samson C01U              ], device 0: USB Audio [USB Audio]
Subdevices: 1/1
Subdevice #0: subdevice #0

This says that the USB microphone is on card 1, device 0.

Next, try to record your speech at a rate higher than what you think your highest recording rate might be (replacing the numbers in hw:1,0 with your card and device number):

$ arecord -f S16_LE -r 60000 -D hw:1,0 -d 5 testS16_LE.wav

"S16_LE" means 'Signed 16 bit Little Endian'.  This command will output something like this: 

Recording WAVE 'test.wav' : Signed 16 bit Little Endian, Rate 60000 Hz, Stereo
Warning: rate is not accurate (requested = 60000Hz, got = 48000Hz)
         please, try the plug plugin (-Dplug:hw:0,0)
Aborted by signal Interrupt...

The arecord output tells us that the maximum sampling rate supported by my USB microphone is 48000Hz (or 48kHz).  You may have to experiment with different sampling rates to get the Warning message.

There is some additional information on USB mics on the Audacity site.

[top]

Posts: Nested and Flat Layouts

A message or post is the smallest unit in a discussion.  A threaded discussion (or thread) is a series of posts related to the same topic or subject.

A post's layout can be set using the Nested/Flat link in a VoxForge forum (located on the top right hand corner of a thread) and can be:

Example of flat structure:

Nested structure:

Flat/nested choice often determines the way people discuss. In the flat layout there is only one "path" - that is why it is sometimes called "linear".

Nested layout offers more freedom, more digressions and more paths in the discussion, but it is more difficult to spot new posts (unless you use an RSS feed or watch the thread).

[top]

Project Gutenberg and Librivox Copyright Status

LibriVox 

LibriVox audio is public domain (they use a Creative Commons Public Domain Dedication).  They use ebooks obtained from Project Gutenberg.  Many, but not all, Project Gutenberg ebooks are also public domain.  To make sure that they only release audio readings of public domain texts, LibriVox relies on Project Gutenberg's legal work to establish the copyright status of their books.

Project Gutenberg

The Acoustic Model creation process requires that we segment any user submitted audio (and its corresponding text transcriptions) into 5-10 second speech audio snippets.  But, in doing so, we would contravene the Gutenberg Project's Trademark licensing terms if we kept any references to Gutenberg in the eText that accompanies the speech audio.  For this reason, we need to remove all references to Gutenberg in any speech audio and text submission made to VoxForge.

To distribute Project Gutenberg e-texts with the "Project Gutenberg" trademark name, you must follow some licensing provisions that include a requirement that the text not be broken up in any way, and pay a licensing fee.  If  you don't use the Project Gutenberg Name, and delete any reference to it in the text, you can distribute the text in any way you see fit.

For example, for the Herman Melville book 'Typee', the Librivox audio is public domain (it uses the Creative Commons Dedication).  The text of the Gutenberg Typee ebook has the following "license".  It says the following in the intro:

ABOUT PROJECT GUTENBERG-TM ETEXTS
This PROJECT GUTENBERG-tm etext, like most PROJECT GUTENBERG-
tm etexts, is a "public domain" work distributed by Professor
Michael S. Hart through the Project Gutenberg Association at
Carnegie-Mellon University (the "Project"). Among other
things, this means that no one owns a United States copyright
on or for this work, so the Project (and you!) can copy and
distribute it in the United States without permission and
without paying copyright royalties.

Then it goes on to say: 

Special rules, set forth
below, apply if you wish to copy and distribute this etext
under the Project's "PROJECT GUTENBERG" trademark.

So basically it says that no one owns copyright on the written text of this book in the US (and likely most other jurisdictions), and you can copy and distribute as you please.  But, if you want to copy and distribute the book along with references to the Gutenberg TradeMark, then you need to follow some special rules. 

Further on in the document it says:

          DISTRIBUTION UNDER "PROJECT GUTENBERG-tm"

You may distribute copies of this etext electronically, or by
disk, book or any other medium if you either delete this
"Small Print!" and all other references to Project Gutenberg,

This clarifies what you need to do if you want to distribute the ebook without any restrictions - basically you delete the 'license' and the Gutenberg trademarks.

It then goes on to elaborate the conditions you must follow if you do want to distribute the text with the Gutenberg trademarks:  

or:
[1] Only give exact copies of it ...
[2] Honor the etext refund and replacement provisions of this
"Small Print!" statement.
[3] Pay a trademark license fee to the Project of 20% of the
net profits ...

 

Note: I am not a lawyer, and this is not a legal opinion.

[top]

Speech Recognition Tutorial

A nice overview of speech recognition by Professor Don Colton: Automatic Speech Recognition Tutorial

[top]

Speech recognition vs voice recognition

From an article on DZone: Introduction to Synthetic Agents: Speech Recognition - Part 1

Technically, speech recognition extracts the words that are spoken whereas voice recognition identifies the voice that is speaking. Speech recognition is "what someone said" and voice recognition is "who said it". The underlying technologies do overlap but they serve very different purposes.

[top]

Speech Submission Mirrors

Full Mirror of VoxForge site (thanks to Coral Cache):

Partial Mirrors (only VoxForge Speech Submission app & some supporting docs):

[top]

Speech Submission: the Upload Link does not Appear in my Browser

You need JavaScript enabled in your browser in order for the upload link to appear on the submission page.

[top]

Sphinx 3 Quickstart Guide

Hello World Decoder QuickStart Guide - a tutorial to help you convert audio files containing speech into text.

[top]

Tips for Recording VoxForge Prompts with Audacity

The easiest way to record VoxForge Prompts with Audacity is to open the prompts file in its own browser window or tab.  Then maximize your browser to take up all your screen.

Next, open Audacity into a smaller window - almost the same width as your browser but only 1/4 the height - see image below:

 

Use the top of the Audacity window as a ruler to highlight the line that you are reading.  When you finish one line, use your mouse to move the Audacity window down one line.

When you get too close to the bottom of your screen, just scroll up the prompts file in your browser window, and continue recording your prompts.

[top]

Video Transcription Software

From their website:

Version 1.1.0 of the open source transLectures-UPV toolkit (TLK) is out now for Linux and Mac, featuring new high-level scripts to make it simpler to run speech recognition tasks. Download TLK and try the new tutorial!

TLK, the transLectures-UPV toolkit, is the open source automatic speech recognition (ASR) software developed at Universitat Politècnica de València. It comprises a set of command-line tools for building, training and applying acoustic models that can be used, among other things, to generate transcripts for video lectures. Indeed, it is the software running behind the transLectures automatic subtitling system in the UPV’s Polimedia video lecture repository.

[top]

What are Phoneme Coverage Prompts?

Phoneme Coverage Prompts are prompts designed to provide good coverage of all the different phonemes in the English language.

[top]

What are Sampling Rate and Bits per Sample?

From the Audacity Digital Audio Tutorial:

The main device used in digital recording is an Analog-to-Digital Converter (ADC). The ADC captures a snapshot of the electric voltage on an audio line and represents it as a digital number that can be sent to a computer. By capturing the voltage thousands of times per second, you can get a very good approximation to the original audio signal:

http://manual.audacityteam.org/m/images/0/0e/Waveform_sample_rates.png

Each dot in the figure above represents one audio sample. There are two factors that determine the quality of a digital recording:

Higher sampling rates allow a digital recording to accurately record higher frequencies of sound. The sampling rate should be at least twice the highest frequency you want to represent. Humans can't hear frequencies above about 20,000 Hz, so 44,100 Hz was chosen as the rate for audio CDs to just include all human frequencies. Sample rates of 96 and 192 kHz are starting to become more common, particularly in DVD-Audio, but many people honestly can't hear the difference.
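The "at least twice the highest frequency" rule is easy to work through with numbers. The short sketch below (function names are made up for this example) applies it in both directions: from a frequency to the minimum sampling rate, and from a sampling rate to the highest frequency it can represent.

```python
def min_sampling_rate(highest_frequency_hz):
    """Smallest sampling rate (Hz) that is twice the given frequency."""
    return 2 * highest_frequency_hz


def highest_representable_frequency(sampling_rate_hz):
    """Highest frequency (Hz) a given sampling rate can represent."""
    return sampling_rate_hz / 2


print(min_sampling_rate(20_000))                # 40000
print(highest_representable_frequency(44_100))  # 22050.0
print(highest_representable_frequency(48_000))  # 24000.0
```

So the 44,100 Hz CD rate sits a little above the 40,000 Hz needed for a 20,000 Hz upper limit, and a 48kHz recording can represent frequencies up to 24,000 Hz.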

Higher sample sizes allow for more dynamic range - louder louds and softer softs. If you are familiar with the decibel (dB) scale, the dynamic range on an audio CD is theoretically about 90 dB, but realistically signals that are -24 dB or more in volume are greatly reduced in quality. Audacity supports two additional sample sizes: 24-bit, which is commonly used in digital recording, and 32-bit float, which has almost infinite dynamic range, and only takes up twice as much storage as 16-bit samples.
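To put rough numbers on this, a common rule of thumb estimates the ideal dynamic range of N-bit integer samples as 20 * log10(2^N) dB, or about 6 dB per bit. The sketch below applies that rule and also compares storage needs. Both function names are hypothetical, and the dB figures are idealized upper bounds (in the same ballpark as the "about 90 dB" quoted above for 16 bits), not measurements of real hardware.

```python
import math


def ideal_dynamic_range_db(bits):
    """Rule-of-thumb dynamic range of integer samples: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bits)


def bytes_per_second(sample_rate_hz, bits, channels=1):
    """Uncompressed storage needed for one second of audio."""
    return sample_rate_hz * (bits // 8) * channels


print(round(ideal_dynamic_range_db(16), 1))  # 96.3
print(round(ideal_dynamic_range_db(24), 1))  # 144.5
print(bytes_per_second(48_000, 16))          # 96000
print(bytes_per_second(48_000, 32))          # 192000
```

The last two lines show the "twice as much storage" point: at the same sampling rate, 32-bit samples take exactly double the space of 16-bit samples.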

Here are some additional articles that provide more information on sampling rate and bit depth (i.e. bits per sample):

[top]

What are the HTK specific changes that were made to the lexicon file?

The VoxForge lexicon file contains special entries for SENT-START and SENT-END.  If you plan to use a non-VoxForge dictionary, be sure to add these entries:

SENT-END        []              sil
SENT-START      []              sil
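If you are converting another dictionary, a small script can check for these two entries and append them when they are missing. The sketch below is one possible way to do that; the function name is hypothetical, and it assumes the first whitespace-separated field on each line is the word.

```python
# The two special entries, in the layout shown above.
REQUIRED_ENTRIES = {
    "SENT-END": "SENT-END        []              sil",
    "SENT-START": "SENT-START      []              sil",
}


def add_missing_entries(lexicon_lines):
    """Return the lexicon lines with any missing SENT-* entry appended."""
    # Assumption: the word is the first whitespace-separated field.
    present = {line.split()[0] for line in lexicon_lines if line.strip()}
    result = list(lexicon_lines)
    for word, entry in REQUIRED_ENTRIES.items():
        if word not in present:
            result.append(entry)
    return result


# Usage:
#   with open("my_lexicon") as f:
#       lines = add_missing_entries(f.read().splitlines())
```

The entries are simply appended, so re-sort the result afterwards if your tools expect the dictionary in sorted order.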

[top]

What does tuning your speech recognition engine mean?

When the commercial SRE providers say that you need to 'tune' the SRE to your location, they are saying that you (or their consultants) need to either:

[top]

What is a Codec?

A Codec is a device or program capable of performing encoding and decoding on a digital data stream or signal.

[top]

What is a Desktop Command and Control Application?

It typically refers to a capability of voice recognition systems on a personal computer that lets you select menus and other functions by speaking the commands into a microphone. 

[top]

What is a Dialog Manager?

A Dialog Manager is one component of a Speech Recognition System.

Telephony and Command & Control Dialog Managers

A Dialog Manager used in Telephony applications (IVR - Interactive Voice Response), and in some desktop Command and Control Applications, assigns meaning to the words recognized by the Speech Recognition Engine, determines how the utterance fits into the dialog spoken so far, and decides what to do next.  It might need to retrieve information from an external source.  If a response to the user is required, it will choose the words and phrases to be used in its response, and transmit these to the Text-to-Speech System to speak the response to the user.

Dictation Dialog Manager 

A Dictation Dialog Manager will typically take the words recognized by the Speech Recognition Engine and type out the corresponding text on your computer screen.  It may also have some Command and Control elements, but these are usually limited to the types of commands typically used in a word processing program.  It usually responds to the user using text (i.e. it might not use Text to Speech to respond to the user).

Examples 

Examples of Telephony Dialog Managers include: 

Examples of Command & Control Dialog Managers:

Examples of Dictation Dialog Managers, with Command & Control elements, would be:

You can also write a domain-specific application to perform Dialog Manager-like tasks using a traditional programming language (C, C++, Java, etc.) or a scripting language (Perl, Python, Ruby, etc.). For example:

[top]

What is a Dictation Application?

A Dictation application uses Speech Recognition to translate your speech into written text on your computer. 

A Dictation application lets you speak into a microphone attached to your computer, and have the text print out on your computer screen.  It can recognize a larger number and variety of words.  It can recognize arbitrary phrases with words in any order.

This is different from a Command and Control application which also uses speech recognition but is limited to controlling your computer and software applications by speaking short commands.  Here, the vocabulary that the speech recognition engine can recognize is much smaller than in dictation, and is limited to a small set of words and predefined phrases.

Commercial dictation systems usually include a command and control system.

[top]

What is a Grammar?

A Speech Recognition Grammar sets out all the acceptable words and phrases that a user might say at a particular point in a dialog with a Speech Recognition System.  A Grammar file is used in Desktop Command & Control or Telephony IVR (Interactive Voice Response) Speech Recognition applications.

A simple grammar (in HTK format) might look like the following:

$name = [ STEVE ] YOUNG | [ JOHN ] DOE;
( START (PHONE|CALL) $name END )

Basically this tells the Speech Recognition Engine to recognize the following utterances (note that the vertical bar in the grammar denotes 'or', and the contents of a set of square brackets indicates an optional utterance):
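One way to see what the grammar accepts is to expand it by hand. The sketch below is only an illustration, not an HTK tool: it lists the four forms of $name (each first name being optional), combines them with the two verbs, and treats START and END as literal tokens. All names in the code are made up for this example.

```python
from itertools import product

# $name = [ STEVE ] YOUNG | [ JOHN ] DOE;  (first names are optional)
NAMES = ["STEVE YOUNG", "YOUNG", "JOHN DOE", "DOE"]
# (PHONE|CALL)
VERBS = ["PHONE", "CALL"]


def accepted_utterances():
    """Every word sequence the grammar above allows."""
    return [f"START {verb} {name} END"
            for verb, name in product(VERBS, NAMES)]


def in_grammar(utterance):
    """True if the utterance is one of the accepted word sequences."""
    return utterance in accepted_utterances()


for utterance in accepted_utterances():
    print(utterance)

print(in_grammar("START CALL DOE END"))    # True
print(in_grammar("START PHONE JOHN END"))  # False
```

The loop prints eight utterances (two verbs times four name forms); anything else, such as "PHONE JOHN" without a surname, falls outside the grammar.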

Any other utterance is ignored by the Speech Recognition Engine, which usually returns an 'out of grammar' error.  So the following utterances would be rejected by the Speech Recognition Engine:

For additional information, see these links: 

[top]

What is a Hertz or kilohertz?

A hertz (symbol: "Hz") is a unit of frequency.  The sampling rate, sample rate, or sampling frequency defines the number of samples per second taken from an audio signal to make a digital representation of that signal.

One hertz means one cycle per second; one hundred hertz means one hundred cycles per second; one thousand hertz (or a kilohertz - symbol "kHz") means one thousand cycles per second.

[top]