Step 1 - Task Grammar

Background - Speech Recognition Engines

All Speech Recognition Engines ("SRE"s) are made up of the following components:

  • Language Model or Grammar - A Language Model contains a very large list of words and the probability of each occurring in a given sequence; Language Models are used in dictation applications.  A Grammar is a much smaller file containing sets of predefined combinations of words; Grammars are used in IVR or desktop Command and Control applications.  Each word in a Language Model or Grammar has an associated list of phonemes (which correspond to the distinct sounds that make up a word).
  • Acoustic Model - Contains a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar.  Each distinct sound corresponds to a phoneme.
  • Decoder - Software program that takes the sounds spoken by a user and searches the Acoustic Model for the equivalent sounds.  When a match is made, the Decoder determines the phoneme corresponding to the sound.  It keeps track of the matching phonemes until it reaches a pause in the user's speech.  It then searches the Language Model or Grammar file for the equivalent series of phonemes.  If a match is made, it returns the text of the corresponding word or phrase to the calling program.

Although Julius uses acoustic models created with the HTK toolkit, it uses its own Grammar definition format.

Grammar

A recognition Grammar essentially defines constraints on what the SRE can expect as input.  It is a list of words and/or phrases that the SRE listens for.  When one of these predefined words or phrases is heard, the SRE returns the word or phrase to the calling program - usually a Dialog Manager (but could also be a script written in Perl, Python, etc.).  The Dialog Manager then does some processing based on this word or phrase. 

The example in the HTK book is that of a voice-operated interface for phone dialling.  If the SRE hears the sequence of words 'Call Steve Young', it returns the textual representation of this phrase to the Dialog Manager, which then looks up Steve's telephone number and dials it.

It is very important to understand that the words that you can use in your Grammar are limited to the words that you have 'trained' in your Acoustic Model.  The two are tied very closely together.

Acoustic Model

An Acoustic Model is a file that contains a statistical representation of each distinct sound that makes up a spoken word.  It must contain the sounds for each word used in your grammar.  The words in your grammar give the SRE the sequence of sounds it must listen for.  The SRE then listens for the sequence of sounds that make up a particular word, and when it finds a particular sequence, returns the textual representation of the word to the calling program (usually a Dialog Manager).  Thus, when an SRE is listening for words, it is actually listening for the sequence of sounds that make up one of the words you defined in your Grammar.  The Grammar and the Acoustic Model work together.

Therefore, when you train your Acoustic Model to recognize the phrase 'call Steve Young', the SRE is actually listening for the phoneme sequence "k", "ao", "l", "s", "t", "iy", "v", "y", "ah" and "ng".  If you say each of these phonemes aloud in sequence, it will give you an idea of what the SRE is looking for. 
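
As a rough illustration of the mapping just described, the Python sketch below looks up each word of a phrase in a small pronunciation table and concatenates the phonemes.  The table and the function name phrase_to_phonemes are hypothetical; only the phonemes themselves come from the text above.

```python
# Hypothetical pronunciation table; the phonemes are the ones quoted in the text.
PRONUNCIATIONS = {
    "CALL": ["k", "ao", "l"],
    "STEVE": ["s", "t", "iy", "v"],
    "YOUNG": ["y", "ah", "ng"],
}


def phrase_to_phonemes(phrase):
    """Return the flat phoneme sequence for a space-separated phrase."""
    phonemes = []
    for word in phrase.upper().split():
        # A KeyError here means the word has no pronunciation entry.
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


print(phrase_to_phonemes("call Steve Young"))
```

A word missing from the table raises a KeyError, which mirrors the point that a word can only be recognized if its pronunciation is known.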

Commercial SREs use large databases of speech audio to create their Acoustic Models.  Because of this, most common words that might be used in a Grammar are already included in their Acoustic Model.  

When creating your own Acoustic Models and Grammars, you need to make sure that all the phonemes that make up the words in your Grammar are included in your Acoustic Model.
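
One way to check this requirement is sketched below in Python.  The function name missing_phonemes and the sample data are assumptions made for illustration; in particular, the "trained" phoneme set is invented and deliberately lacks "ng" so that the check has something to report.

```python
def missing_phonemes(vocabulary, acoustic_model_phonemes):
    """Return {word: [phonemes absent from the acoustic model]} for problem words."""
    problems = {}
    for word, phonemes in vocabulary.items():
        missing = [p for p in phonemes if p not in acoustic_model_phonemes]
        if missing:
            problems[word] = missing
    return problems


# Illustrative data only: this phoneme set is made up and omits "ng" on purpose.
trained = {"k", "ao", "l", "s", "t", "iy", "v", "y", "ah"}
vocabulary = {
    "CALL": ["k", "ao", "l"],
    "YOUNG": ["y", "ah", "ng"],
}

print(missing_phonemes(vocabulary, trained))
```

An empty result means every word in the vocabulary can be built from phonemes the Acoustic Model knows about.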

Background - Julius Grammars

In Julius, a recognition grammar is separated into two files: 

  • the ".grammar" file which defines a set of rules governing the words the SRE is expected to recognize;  rather than listing out each word in the .grammar file, a Julius grammar file uses "Word Categories" - which is the name for a list of words to be recognized (which are defined in a separate ".voca" file);
  • the ".voca" file which defines the actual "Word Candidates" in each Word Category and their pronunciation information (Note: the phonemes that make up this pronunciation information must be the same as will be used to train your Acoustic Model).

.grammar file

The rules governing the allowed words are defined in the .grammar file using a modified BNF format.  A .grammar specification in Julius uses a set of derivation rules, written as:

    Symbol: [expression with Symbols]

where:

  • Symbol is a nonterminal; and
  • [expression with Symbols] is an expression which consists of sequences of Symbols, which can be terminals and/or nonterminals. 

A terminal is BNF jargon for a symbol that represents a constant value.  It never appears to the left of the colon.  In Julius terminals represent Word Categories - lists of words that are further defined in a separate ".voca" file.  

A nonterminal is BNF jargon for a symbol that can be expressed in terms of other symbols.  It can be replaced as a result of substitution rules.

For example, look at the following derivation rules:

S : NS_B LOOKUP NS_E
LOOKUP: CONNECT NAME

In this example, "S" is the initial sentence symbol.   NS_B and NS_E correspond to the silence that occurs just before the utterance you want to recognize and after.   "S", "NS_B" and "NS_E" are required in all Julius grammars.

"NS_B", "NS_E", "CONNECT", and "NAME" are terminals, and represent Word Categories that must be defined in the ".voca" file.  In the ".voca" file, "CONNECT" corresponds to two words, "PHONE" and "CALL", and their pronunciations.  "NAME" corresponds to two words, "STEVE" and "YOUNG", and their pronunciations.

"LOOKUP" is a nonterminal, and does not have any definition in the .voca file.  It does have a further definition in the .grammar file, where it is replaced by the expression "CONNECT NAME".  All nonterminals must be further defined in the .grammar file until they are finally represented by terminals (which are then defined in the .voca file as Word Categories).

With Julius, only one Substitution Rule per line is permitted, with the colon ":" as the separator.   Alphanumeric ASCII characters and the underscore are permitted for Symbol names, and these are case sensitive.
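
The rule format described above can be read with a few lines of Python.  This is only a sketch of the syntax covered in this section (one rule per line, a colon separator, Symbol names made of ASCII letters, digits and underscores); parse_grammar and classify are hypothetical helper names, and this is not how Julius itself parses the file.

```python
import re


def parse_grammar(text):
    """Parse 'Symbol: expression' lines into a list of (lhs, [rhs symbols])."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        lhs, sep, rhs = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in rule: {line!r}")
        lhs = lhs.strip()
        rhs_symbols = rhs.split()
        for symbol in [lhs] + rhs_symbols:
            # Symbol names: ASCII letters, digits and underscore only.
            if not re.fullmatch(r"[A-Za-z0-9_]+", symbol):
                raise ValueError(f"invalid symbol name: {symbol!r}")
        rules.append((lhs, rhs_symbols))
    return rules


def classify(rules):
    """Split symbols into nonterminals (left of a colon) and terminals (never left)."""
    nonterminals = {lhs for lhs, _ in rules}
    terminals = {s for _, rhs in rules for s in rhs} - nonterminals
    return nonterminals, terminals


rules = parse_grammar("S : NS_B LOOKUP NS_E\nLOOKUP: CONNECT NAME\n")
nonterminals, terminals = classify(rules)
print(sorted(nonterminals), sorted(terminals))
```

Every name reported as a terminal is a Word Category that needs a matching entry in the .voca file.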

.voca file

The ".voca" file contains Word Definitions for each Word Category defined in the .grammar file.

Each Word Category must be defined with "%" preceding it.  Word Definitions in each Word Category are then defined one per line.  The first column is the string which will be output when recognized, and the rest is the pronunciation.  Spaces and/or tabs can act as field separators.

Format:

    %[Word Category]
    [Word Definition]   [pronunciation ...]
    ...

For example the Word Categories "NS_B", "NS_E", "CONNECT", and "NAME" were referenced in the ".grammar" file above and are defined in a ".voca" as follows:

% NS_B
<s>        sil

% NS_E
</s>        sil

% CONNECT
PHONE        f ow n
CALL        k ao l

% NAME
STEVE        s t iy v
YOUNG        y ah ng

In the above example, the NS_B and NS_E Word Categories each have one Word Definition with a silence model ('sil' is a special silence model defined in your Acoustic Model).  These correspond to the head and tail silence in speech input.

"CONNECT" is broken out into two words, "PHONE" and "CALL", with pronunciation information, which are the phonemes that make up the words to be recognized (and which correspond to phonemes that will be included in your Acoustic Model).  "NAME" is broken out into two words, "STEVE" and "YOUNG", and their phonemes.

The phonemes used here must match the phonemes used in the creation of your Acoustic Model (which we will create in later steps). 

If you have words with different pronunciations, simply create the additional entries on separate lines for the same word but with the different pronunciation.
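
The .voca layout described above is simple enough to read with a short Python sketch.  The function name parse_voca is hypothetical, and the sketch handles only what this section covers: "%" category lines followed by one Word Definition per line, with whitespace-separated fields.  Because entries are stored in a list, a word that appears on several lines with different pronunciations is kept once per line.

```python
def parse_voca(text):
    """Parse .voca text into {category: [(word, [phonemes]), ...]}."""
    categories = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("%"):
            # Start of a new Word Category, e.g. "% CONNECT".
            current = line[1:].strip()
            categories[current] = []
        else:
            if current is None:
                raise ValueError(f"word definition before any category: {line!r}")
            # First field is the output string, the rest is the pronunciation.
            word, *phonemes = line.split()
            categories[current].append((word, phonemes))
    return categories


sample = """% NS_B
<s>        sil

% CONNECT
PHONE        f ow n
CALL        k ao l
"""
print(parse_voca(sample))
```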

The .grammar and .voca files working together

Julius needs a predefined word lattice file where each word and each word-to-word transition is listed explicitly.  We get this by compiling the ".grammar" and ".voca" files together with a script to generate the word lattice file (actually it is two files, but more on that later).  The mkdfa script (mkdfa.pl in the Julius distribution; this tutorial uses the Julia version, mkdfa.jl) does this by looking for the Initial Sentence Symbol "S" in the .grammar file, replacing the Word Categories with all the possible Word Candidates from the .voca file, and making a predefined list of all the possible combinations of words and phrases Julius must recognize.  In this case, the list of all possible sentences would be:

<s> PHONE STEVE </s>
<s> PHONE YOUNG </s>
<s> CALL STEVE </s>
<s> CALL YOUNG </s> 

What this means is that when Julius hears the sounds that make up a word or phrase uttered by a user, it tries to match these sounds to the statistical representations of sounds contained in the Acoustic Model.  When a match is made, Julius determines the phoneme corresponding to the sound.  It keeps track of the matching phonemes until it reaches a pause in the user's speech.  It then searches the compiled grammar for the equivalent series of phonemes.  You can think of the compiled grammar as looking something like this: 

sil  f ow n s t iy v sil
sil  f ow n y ah ng sil
sil  k ao l s t iy v sil
sil  k ao l y ah ng sil

If, for example, a match is made with the list of phonemes: "sil  k ao l s t iy v sil", Julius returns the words " <s> CALL STEVE </s>" to the calling program.
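
The expansion and lookup described above can be imitated in Python.  This is a conceptual sketch only: Julius works from the compiled .dfa and .dict files rather than a flat list, and the names RULES, VOCA, expand and recognize are hypothetical.  It also assumes the grammar is not recursive, which holds for the example grammar.

```python
from itertools import product

# The example grammar and vocabulary from this section, written as Python data.
RULES = {
    "S": [["NS_B", "LOOKUP", "NS_E"]],
    "LOOKUP": [["CONNECT", "NAME"]],
}
VOCA = {
    "NS_B": [("<s>", ["sil"])],
    "NS_E": [("</s>", ["sil"])],
    "CONNECT": [("PHONE", ["f", "ow", "n"]), ("CALL", ["k", "ao", "l"])],
    "NAME": [("STEVE", ["s", "t", "iy", "v"]), ("YOUNG", ["y", "ah", "ng"])],
}


def expand(symbol):
    """Return every sentence derivable from symbol as lists of (word, phonemes)."""
    if symbol in VOCA:  # terminal: one single-word sentence per Word Candidate
        return [[entry] for entry in VOCA[symbol]]
    sentences = []
    for rhs in RULES[symbol]:  # nonterminal: combine the expansions of each part
        for combo in product(*(expand(s) for s in rhs)):
            sentences.append([entry for part in combo for entry in part])
    return sentences


def recognize(phonemes):
    """Return the sentence whose phoneme sequence equals phonemes, or None."""
    for sentence in expand("S"):
        if [p for _, ph in sentence for p in ph] == phonemes:
            return " ".join(word for word, _ in sentence)
    return None


for sentence in expand("S"):
    print(" ".join(word for word, _ in sentence))
print(recognize("sil k ao l s t iy v sil".split()))
```

Exact matching is a simplification; a real decoder scores candidate sequences statistically against the Acoustic Model.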

Tutorial

.grammar file

For this tutorial, go to the 'voxforge' folder you created in your home directory.  Create a new directory called 'tutorial'. 

Next create a file called sample.grammar in your new 'voxforge/tutorial' folder, and add the following text:

S : NS_B SENT NS_E
SENT: CALL_V NAME_N
SENT: DIAL_V DIGIT


 

In this case, NS_B, NS_E, CALL_V, NAME_N, DIAL_V, DIGIT are Word Categories (i.e. terminals in BNF jargon), and they must be defined in a separate .voca file. 

"SENT" is the only nonterminal symbol.  The "SENT" in the first line will be substituted with either of the following Word Category Phrases:

  • "CALL_V NAME_N" (from the second line); or
  • "DIAL_V DIGIT" (from the third line).

Each Word Category (i.e. "CALL_V", "NAME_N", "DIAL_V", or "DIGIT") is replaced by  one of the Word Definitions set out in the .voca file below.

.voca file

For this tutorial, create a file called: sample.voca in your 'voxforge/tutorial' folder, and add the following text:

% NS_B
<s>        sil

% NS_E
</s>        sil

% CALL_V
PHONE        f ow n
CALL        k ao l

% DIAL_V
DIAL        d ay l

% NAME_N
STEVE        s t iy v
YOUNG        y ah ng

% DIGIT
FIVE        f ay v
FOUR        f ow r
NINE        n ay n
EIGHT        ey t
OH        ow
ONE        w ah n
SEVEN        s eh v ih n
SIX        s ih k s
THREE        th r iy
TWO        t uw
ZERO       z iy r ow
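
As a sanity check on the tutorial grammar, the Python sketch below counts how many distinct sentences sample.grammar and sample.voca allow together.  The names RULES, CATEGORY_SIZES and count_sentences are hypothetical; the category sizes are simply the number of Word Definitions listed under each category above.

```python
from math import prod

# Rules from sample.grammar.
RULES = {
    "S": [["NS_B", "SENT", "NS_E"]],
    "SENT": [["CALL_V", "NAME_N"], ["DIAL_V", "DIGIT"]],
}
# Number of Word Definitions per Word Category in sample.voca.
CATEGORY_SIZES = {
    "NS_B": 1,
    "NS_E": 1,
    "CALL_V": 2,
    "DIAL_V": 1,
    "NAME_N": 2,
    "DIGIT": 11,
}


def count_sentences(symbol):
    """Count the word sequences derivable from symbol (non-recursive grammars only)."""
    if symbol in CATEGORY_SIZES:
        return CATEGORY_SIZES[symbol]
    # Alternatives add up; symbols within one alternative multiply.
    return sum(prod(count_sentences(s) for s in rhs) for rhs in RULES[symbol])


print(count_sentences("S"))
```

The count is 2 x 2 sentences from "CALL_V NAME_N" plus 1 x 11 from "DIAL_V DIGIT".  Note that, as written, a DIAL sentence contains exactly one digit.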

Compiling your Grammar

The .grammar and .voca files now need to be compiled into ".dfa"  and ".dict" files so that Julius can use them. 

Download the Julia mkdfa.jl grammar compiler script to your 'voxforge/bin' folder.

Note: the mkdfa.jl script assumes that the following Julius programs:

  • Linux: dfa_minimize and mkfa,
  • Windows:  dfa_minimize.exe and mkfa.exe

are accessible from your PATH (which should be the case since they are included as part of the Julius executable you just downloaded). 
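
If you want to confirm this before compiling, a small Python check along the following lines can help.  It is a sketch, not part of the Julius tooling: missing_tools and REQUIRED_TOOLS are hypothetical names, and the check relies on Python's standard shutil.which, which on Windows also tries the usual executable extensions such as .exe.

```python
import shutil

# Helper programs that the text says mkdfa.jl expects to find on PATH.
REQUIRED_TOOLS = ["mkfa", "dfa_minimize"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required Julius tools were found on PATH.")
```

The which parameter exists only so the lookup can be replaced in tests; in normal use, call missing_tools() with no arguments.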

The .grammar and .voca files need to have the same file prefix, and this prefix is then specified to the mkdfa.jl script.  From a command prompt in your 'voxforge/tutorial' directory, compile your files (sample.grammar and sample.voca) using the following command:

julia ../bin/mkdfa.jl sample

Where 'julia' invokes the Julia programming language; "../bin/mkdfa.jl" tells Julia to go up one directory, then down into the bin directory, to run the "mkdfa.jl" script; and "sample" is the prefix of your grammar files (i.e. your grammar files are "sample.grammar" and "sample.voca").

The following shows the expected output from running the mkdfa.jl script:

julia ../bin/mkdfa.jl sample

sample.grammar has 3 rules
---
sample.voca has 6 categories and 23 words
generated: sample.term
---
Now parsing grammar file
Now modifying grammar to minimize states[-1]
Now parsing vocabulary file
Now making nondeterministic finite automaton[6/6]
Now making deterministic finite automaton[6/6]
Now making triplet list[6/6]
6 categories, 6 nodes, 6 arcs
-> minimized: 6 nodes, 6 arcs
generated: sample.dict

 

The generated sample.dfa and sample.term files contain finite automaton information, and the sample.dict  file contains word dictionary information.  All are in Julius format.
