 
    VoxForge
I've just downloaded the julius4-segmentation toolkit with the intention of performing forced alignment for phonemes. According to the README a transcription file listing the phonemes in the input speech must be supplied.
Can Julius perform forced alignment given:
1. a transcription file containing words, not phonemes
2. a phonetic dictionary
This approach is simpler and potentially allows Julius to select the best of several possible transcriptions for a word.
--- (Edited on 7/12/2013 6:52 pm [GMT-0500] by Olumide) ---
You should be able to perform forced alignment with julius allowing it to select the best of several possible transcriptions for a word.
Indeed, I have used julius for this task (following the info in the julius4-segmentation toolkit), and I can give you an example of how it works for me.
For an input sentence like this:
"facendo attività fisica tre o più volte alla settimana si incrementa il metabolismo anche finché si riposa"
I have generated the following two files:
1. a phndict file (N.B.: there are multiple pronunciations for the words "tre" and "o"):
0 [sil] sil
1 [FACENDO] f a tS E1 n d o
2 [ATTIVITÀ] a t i v i t a1
3 [FISICA] f i1 z i k a
4 [TRE] t r E1
4 [TRE] t r e1
5 [O] o1
5 [O] O1
6 [PIÙ] p i u1
7 [VOLTE] v O1 l t e
8 [ALLA] a1 l a
9 [SETTIMANA] s e t i m a1 n a
10 [SI] s i1
11 [INCREMENTA] i ng k r e m e1 n t a
12 [IL] i1 l
13 [METABOLISMO] m e t a b o l i1 z m o
14 [ANCHE] a1 ng k e
15 [FINCHÉ] f i ng k e1
16 [SI] s i1
17 [RIPOSA] r i p O1 s a
18 [sil] sil
2. and a dfa file:
0 18 1 0 1
1 17 2 0 0
2 16 3 0 0
3 15 4 0 0
4 14 5 0 0
5 13 6 0 0
6 12 7 0 0
7 11 8 0 0
8 10 9 0 0
9 9 10 0 0
10 8 11 0 0
11 7 12 0 0
12 6 13 0 0
13 5 14 0 0
14 4 15 0 0
15 3 16 0 0
16 2 17 0 0
17 1 18 0 0
18 0 19 0 0
19 -1 -1 1 0
The syntax of the dfa is not straightforward to understand.
Basically, every line describes an edge of an automaton.
The line
"a b c d e"
means
"there is an edge from state 'c' to state 'a'; the label on that edge is the word 'b'".
So the sample dfa file above is to be read backwards (bottom-up). The word number in the dfa file makes reference to the word number of the phndict file. 'd' and 'e' are used in the cases of initial and final states. The special state -1 and the word -1 are used in the case of the initial state.
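To make the bottom-up reading concrete, here is a minimal Python sketch. It is not part of Julius or the toolkit, and the function name word_order_from_dfa is made up for illustration. It assumes a single-path dfa like the one above (no branches, no loops) and simply follows the edges, then reverses the result to get the word numbers in spoken order.

```python
def word_order_from_dfa(dfa_text):
    """Return word numbers in spoken order from a linear, reversed .dfa."""
    edges = {}
    for line in dfa_text.strip().splitlines():
        a, b, c, _d, _e = (int(field) for field in line.split())
        if b == -1:
            # Terminating line (word -1, state -1): it carries no word.
            continue
        edges[a] = (b, c)

    order = []
    state = 0
    # Walk the chain 0 -> 1 -> 2 ... as written in the file.
    while state in edges:
        word, state = edges[state]
        order.append(word)

    # The file lists the sentence backwards, so flip it.
    order.reverse()
    return order
```

Run on the dfa file above, this should give the word numbers 0 to 18 in order, i.e. the lines of the phndict from top to bottom.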
If one wants Julius to perform FA on an arbitrary audio file plus transcription, one needs a script that generates the .dfa and .phndict files (for example, the julius4-segmentation toolkit can do this task).
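A minimal sketch of such a generator in Python follows. It is not the toolkit's script; the function name build_alignment_files and the lexicon layout (upper-case word mapped to a list of phoneme strings) are assumptions made for illustration. It reproduces the layout of the two files shown above: one phndict line per pronunciation, and a dfa that lists the words in reverse order.

```python
def build_alignment_files(sentence, lexicon, silence="sil"):
    """Return (phndict_text, dfa_text) for a single-path alignment grammar.

    lexicon maps an upper-case word to a list of pronunciations, each a
    space-separated phoneme string (hypothetical layout).
    """
    words = [w.upper() for w in sentence.split()]

    # First and last entries are the leading and trailing silence.
    entries = [("sil", [silence])]
    for word in words:
        if word not in lexicon:
            raise KeyError(f"word not in lexicon: {word}")
        entries.append((word, lexicon[word]))
    entries.append(("sil", [silence]))

    # One dictionary line per pronunciation; variants share a word number.
    dict_lines = []
    for number, (label, pronunciations) in enumerate(entries):
        for pron in pronunciations:
            dict_lines.append(f"{number} [{label}] {pron}")

    # The dfa lists the sentence backwards: state 0 carries the last word.
    n = len(entries)
    dfa_lines = []
    for state in range(n):
        last_field = 1 if state == 0 else 0  # copied from the sample file
        dfa_lines.append(f"{state} {n - 1 - state} {state + 1} 0 {last_field}")
    dfa_lines.append(f"{n} -1 -1 1 0")

    return "\n".join(dict_lines) + "\n", "\n".join(dfa_lines) + "\n"
```

The sketch does no text normalisation: punctuation must be stripped from the sentence beforehand, and every word must already be in the lexicon.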
Once the .dfa and .phndict files are ready, one must call julius to perform forced alignment.
The following options are mandatory:
-dfa file.dfa \
-v file.phndict \
-walign \
-palign \
Having said that, I also recommend this thread
http://www.voxforge.org/home/forums/message-boards/speech-recognition-engines/problem-with-silence/fillers-using-julius4-forced-alignment
in which I describe the problems I encountered when trying to perform FA with julius while also allowing optional silence and fillers (like breath, lipsmacks, ..).
--- (Edited on 7/15/2013 5:27 am [GMT-0500] by azeem) ---
Thanks for replying. I am reading up on the deterministic finite automaton file (.dfa) and a dictionary file (.dict) in the Julius book. I'll probably have some more questions later on.
Before then, how do you think julius compares against HTK's HVite for decoding?
Update:
According to the Julius book and http://julius.sourceforge.jp/en_index.php?q=en_grammar.html a .grammar and .voca file are required in order to generate the .dfa and .dict files respectively. Unfortunately I do not have a grammar file. They should not be required for forced alignment, neither should word categories. HTK HVite for example does not require them.
I see your .dfa and .dict file are different and do not seem to be derived from a grammar. Why is that so?
--- (Edited on 7/15/2013 8:11 pm [GMT-0500] by Olumide) ---
> Before then, how do you think julius compares against HTK's HVite for decoding?
Honestly, I don't have an opinion on that. I would like to run some accuracy comparison tests, but I lack the time. :-(
It is worth saying that comparing two (or more) forced alignment results is way trickier than comparing mere Word Error Rates from "standard" ASR.
> Update:
> According to the Julius book and http://julius.sourceforge.jp/en_index.php?q=en_grammar.html a .grammar and .voca file are required in order to generate the .dfa and .dict files respectively.
I think I tried the tool that generates .dfa and .phndict files from .grammar and .voca files.
As far as I remember, .grammar uses a higher-level syntax than the .dfa format. Thus, .grammar files are easier to write.
But, since the grammar for forced alignment is very simple, I thought it would be faster to write my own script that directly generates a .dfa file and a .phndict file from a lexicon and the sentence to be aligned.
> Unfortunately I do not have a grammar file. They should not be required for forced alignment, neither should word categories. HTK HVite for example does not require them.
I don't have much experience with HVite, but yes, I know it does not require a grammar for FA.
As far as I understand (and sorry if I sound pedantic), we can say that Julius does not "know" the difference between recognition and forced alignment.
So yes, with Julius a grammar (albeit a very simple one) is indeed needed.
Basically, to perform FA, such a grammar is expected to be
"<initial_silence> word1 word2 word3 ... lastword <final_silence>"
This forces Julius recognition to follow that *single* path (made up of the words of the sentence to be aligned). This special case of recognition is, practically, equal to FA.
This is also the philosophy of the julius4-segmentation toolkit.
> I see your .dfa and .dict file are different and do not seem to be derived from a grammar. Why is that so?
As I said before, it is because I actually didn't derive them from .voca and .grammar.
I can try to manually build a .grammar file from the sentence to be aligned, then generate a .dfa, and then compare with my .dfa.
If this does not take too long, I'll report this comparison later on.
--- (Edited on 7/16/2013 5:49 am [GMT-0500] by azeem) ---
I just created .grammar and .voca files for forced alignment and generated the .dfa and .dict files with the mkdfa.pl script provided with the julius distribution (as explained in http://julius.sourceforge.jp/en_index.php?q=en_grammar.html).
They actually look like the .dfa and .phndict I posted above. There are small naming differences, but the idea is the same.
I post the two (mkdfa.pl-generated) files here:
.dfa:
0 1 1 0 0
1 18 2 0 0
2 17 3 0 0
3 16 4 0 0
4 15 5 0 0
5 14 6 0 0
6 13 7 0 0
7 12 8 0 0
8 11 9 0 0
9 10 10 0 0
10 9 11 0 0
11 8 12 0 0
12 7 13 0 0
13 6 14 0 0
14 5 15 0 0
15 4 16 0 0
16 3 17 0 0
17 2 18 0 0
18 0 19 0 0
19 -1 -1 1 0
.dict:
0 [<s>] sil
1 [</s>] sil
2 [FACENDO] f a tS E1 n d o
3 [ATTIVITA] a t i v i t a1
4 [FISICA] f i1 z i k a
5 [TRE] t r E1
5 [TRE] t r e1
6 [O] o1
6 [O] O1
7 [PIU] p i u1
8 [VOLTE] v O1 l t e
9 [ALLA] a1 l a
10 [SETTIMANA] s e t i m a1 n a
11 [SI] s i1
12 [INCREMENTA] i ng k r e m e1 n t a
13 [IL] i1 l
14 [METABOLISMO] m e t a b o l i1 z m o
15 [ANCHE] a1 ng k e
16 [FINCHE] f i ng k e1
17 [SI] s i1
18 [RIPOSA] r i p O1 s a
--- (Edited on 7/16/2013 8:17 am [GMT-0500] by azeem) ---
Thanks. I've now created a .dfa and a .dic file, and I'm trying to run Julius as follows:
julius -dfa file.dfa -v file.dic -walign -palign -h hmmdefs -hlist tiedlist -multipath -spmodel "sp" -iwsp -b 200 -b2 200 -bs 200 -sb 200.0 -gprune safe -iwcd1 max -iwsppenalty -30.0 -input file
Unfortunately all I'm getting is an impressive log about the various inputs, e.g. acoustic model, dictionary etc. I try to type the file name but nothing happens.
Please find the dump of the output here
http://pastebin.com/mkg9PS0K
--- (Edited on 7/17/2013 9:19 pm [GMT-0500] by Olumide) ---
I see from the dump file that your julius output abruptly ends like this:
<<
(...)
FrontEnd:
Input stream:
input type = waveform
input source = waveform file
input filelist = (none, get file name from stdin)
sampling freq. = 16000 Hz required
threaded A/D-in = not supported (live input may be dropped)
zero frames stripping = on
silence cutting = off
long-term DC removal = off
reject short input = off
----------------------- System Information end -----------------------
------
>>
I think that this means that you didn't actually provide an audio file.
According to the invocation command you posted above, the julius program should wait for the user to type in the name of the audio file to be processed.
One way to avoid typing the file name at the prompt is to invoke julius like this:
$ echo my_audio_file.wav \
| julius \
-dfa file.dfa \
-v file.phndict \
..all other parameters here.. \
-input file
The above tells julius to use my_audio_file.wav for decoding (at least, this works on Linux; I don't know if it's OK on Windows).
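If you want to drive this from a script instead of the shell, the same idea can be sketched in Python. The helper names build_julius_command and align are made up for illustration; the only julius options used are the ones already mentioned in this thread, and any further options (acoustic model, beam widths, ..) are passed through untouched via extra_args.

```python
import subprocess


def build_julius_command(dfa_path, dict_path, extra_args=()):
    """Assemble the julius command line for forced alignment."""
    cmd = ["julius", "-dfa", dfa_path, "-v", dict_path, "-walign", "-palign"]
    cmd.extend(extra_args)           # e.g. ["-h", "hmmdefs", "-hlist", "tiedlist"]
    cmd.extend(["-input", "file"])   # julius then reads the file name from stdin
    return cmd


def align(wav_path, dfa_path, dict_path, extra_args=()):
    """Run julius on one audio file and return its standard output as text."""
    cmd = build_julius_command(dfa_path, dict_path, extra_args)
    # Feeding the file name on stdin plays the role of `echo file.wav | julius ...`.
    result = subprocess.run(cmd, input=wav_path + "\n",
                            capture_output=True, text=True)
    return result.stdout
```

This assumes the julius executable is on the PATH; parsing the alignment out of the returned text is left out of the sketch.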
--- (Edited on 7/18/2013 3:17 am [GMT-0500] by azeem) ---
The command works on Cygwin, however the decoding failed with the error message:
### read waveform input
Stat: adin_file: input speechfile: mono16.wav
STAT: 88681 samples (5.54 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
pass1_best: sil
pass1_best_wordseq: 0
pass1_best_phonemeseq: sil
pass1_best_score: -15571.160156
### Recognition: 2nd pass (RL heuristic best-first)
WARNING: 00 _default: hypothesis stack exhausted, terminate search now
STAT: 00 _default: 0 sentences have been found
WARNING: 00 _default: got no candidates, search failed
STAT: 00 _default: 0 generated, 0 pushed, 0 nodes popped in 552
<search failed>
--- (Edited on 7/18/2013 3:17 pm [GMT-0500] by Visitor) ---
A few comments on your output:
1) julius 1st pass decoding seems to have been performed OK. However, it has outputted "sil" as the best result.
Is that reasonable? What is the content of the audio file?
2) julius second pass failed.
I can tell you that <search failed> is an error that I got on some inputs, too. This happens for me especially when I try to perform forced alignment with some kind of "advanced" grammar that allows optional silences and fillers (such as breath, coughs, ..) after every word. This increases the size and complexity of the search space.
However, regarding simple forced alignment (i.e. with a grammar made up solely of the sentence to be aligned, without fillers) I can tell you that julius works fine.
Moreover, since the first pass gave "sil" as the best result, the decoding of your file does not seem to have been very complicated.
So I think that there should be another reason behind that search failure.
So I suggest you check the following conditions:
A) make sure that the audio format matches the acoustic model format (they must share the same sample rate, encoding,..)
B) review your grammar (is it correctly designed? Do all of its paths go from the start state to the final one? Does it describe a graph that contains cycles?)
C) also check your acoustic model: did you build it yourself? Or did you download a pre-built one?
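For point A, a quick way to inspect the audio file is Python's standard wave module. The sketch below is only an illustration (the function name check_wav is made up). The 16000 Hz default comes from the "sampling freq. = 16000 Hz required" line in your dump; mono and 16-bit samples are my assumption about what the acoustic model expects.

```python
import wave


def check_wav(path, expected_rate=16000):
    """Return a list of mismatches; an empty list means the header looks fine."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate} Hz")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit")
    return problems
```

This only checks the WAV header; it says nothing about whether the features the model was trained on match the ones julius computes.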
If you are willing to share your data (or just a small part of it), you can write me a private message and send me the relevant files, and I can test them in more depth on my system. I am not an ultimate julius guru, but maybe I can help you out.
The outcomes of those tests will be published on this thread, so the people who read the forum will know the (hopefully happy) end of the story.
--- (Edited on 7/19/2013 4:42 am [GMT-0500] by azeem) ---
I don't mind sharing my acoustic model and sample data with you. Perhaps we could also take this conversation offline, as it has already become a monologue. I'll post the results for posterity when I get some.
I may also be able to give you hints on HVite as I have already got it working but am merely testing other toolkits.
My address is videohead<at_char>mail<d0t>c0m . I hope you can decipher it.
--- (Edited on 7/19/2013 7:46 pm [GMT-0500] by Visitor) ---