VoxForge
Basically, I am not able to spot silences/fillers with forced alignment in Julius.
1 Forced Alignment with Julius
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.1 word / phoneme segmentation kit
====================================
I started from the word/phoneme segmentation kit available on the
home page of the Julius project:
+ [http://sourceforge.jp/projects/julius/downloads/32570/julius4-segmentation-kit-v1.0.tar.gz]
This package contains a Perl script that automatically generates
the speech grammar files 'tmp.dfa', 'tmp.phndict' and 'tmp.dict'
from a transcription. Recognition is then performed with Julius
using the -walign and/or -palign options.
1.2 What I have done successfully
==================================
So I used Julius for forced alignment and prepared a script that,
like the Perl script mentioned above, builds a grammar for each
input file from its text transcription. I obtained very good
results in both word and phone alignment.
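For reference, here is a minimal sketch of that grammar-building
step (not my actual script): it writes a one-rule .grammar/.voca
pair for a single transcription and compiles it with mkdfa.pl,
which ships with Julius, into the .dfa/.dict files that -dfa/-v
expect. The PRON table and the silence model name "sil" are
assumptions; a real script would look pronunciations up in the
acoustic model's full dictionary.

#!/usr/bin/env python
# Minimal sketch, not the actual script (see caveats above).
import subprocess

PRON = {                       # hypothetical pronunciation lookup
    "he": "hh iy",
    "updates": "ah p d ey t s",
}

def build_grammar(words, base="tmp"):
    cats = ["NS_B"] + ["W%d" % i for i in range(len(words))] + ["NS_E"]
    with open(base + ".grammar", "w") as g:
        # one linear rule: the words of the transcription, in order
        g.write("S : %s\n" % " ".join(cats))
    with open(base + ".voca", "w") as v:
        v.write("% NS_B\n<s> sil\n")
        for i, word in enumerate(words):
            v.write("%% W%d\n%s %s\n" % (i, word, PRON[word]))
        v.write("% NS_E\n</s> sil\n")
    subprocess.check_call(["mkdfa.pl", base])  # emits tmp.dfa, tmp.dict

build_grammar("he updates".split())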
1.3 What I am not able to do
=============================
Unfortunately, I am not able to implement the following feature:
I would like Julius to spot silences (and, ideally, even
non-verbal sounds) that may occur between words (or even between
phones), without explicitly designing a grammar that also
contains optional states for silence or filler words.
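To make this concrete, the kind of construction I want to avoid
looks like the following hypothetical mkdfa-style fragment: a
filler word category for "sp" is spelled out, and a duplicated
rule is needed for every position where silence may optionally
occur:

# tmp.grammar: one extra rule per optional silence position
S : NS_B W0 W1 NS_E
S : NS_B W0 SP W1 NS_E

# tmp.voca: the filler word category
% SP
<sp> sp

With more than a few words this duplication explodes, which is
why I am looking for a decoder-side solution instead.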
1.4 -iwsp parameter
====================
The "-iwsp" parameter seems to be related to the -iwsp option:
$ julius -help
...
"[-iwsp] insert sp for all word end (multipath)(off)"
...
I have tried it, but without the expected results: with input
audio files containing silences between words, no "sp" was
detected in the output.
2 My question
~~~~~~~~~~~~~~
Does anybody know how to spot silences or non-verbal sounds in a
forced alignment procedure with Julius4, without explicitly
designing a grammar that includes states associated with
silence/non-verbal sounds?
3 Test with voxforge acoustic model for English
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If someone is interested, I can share a test example I'm struggling
with. It's about forced alignment for the English language.
3.1 Acoustic model
===================
I downloaded a pre-built AM, contained in this tarball:
+ [http://www.repository.voxforge1.org/downloads/software/julius-3.5.2-quickstart-linux.tgz]
The "tiedlist" file, namely
julius-3.5.2-quickstart-linux/acoustic_model_files_build726/tiedlist
is lacking several triphones, and this can cause Julius to give an
error (see below).
3.2 A sample audio and grammar
===============================
Here follows a link to a tarball with files to test this issue:
[https://dl.dropboxusercontent.com/u/10183668/julius_FA/test_julius_fa.tar.gz]
3.2.1 Description of each file contained in the tarball
--------------------------------------------------------
- sample_eng/c31c030s_plus_sil.wav
This is a sample audio file we used to test Julius forced
alignment: a sentence taken from the WSJ corpus, to which a
silence chunk has been manually added.
- sample_eng/c31c030s_plus_sil.txt
This is the corresponding text file:
+ "he updates his list of things to do today before going home each
evening"
The silence has been inserted between the words "today" and "before".
- fa_files/file.dfa, fa_files/file.phndict and fa_files/file.dict
Files that specify the grammar and dictionary for forced alignment
- output_files/c31c030s_plus_sil.phn and output_files/c31c030s_plus_sil.wrd
These two files are the output I obtained from Julius. There is
no silence (nor "sp", short pause) inserted between "today" and
"before". In fact, the phoneme "ey" in the alignment lasts more
than a second and a half, spanning the long silence. These files
were obtained by converting the actual output of Julius from
timings in frames to timings in seconds.
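A minimal sketch of that conversion, assuming Julius's default
10 ms frame shift and a hypothetical intermediate format of
"<begin_frame> <end_frame> <label>" per line extracted from the
-walign/-palign output:

# assumed default of 10 ms per frame
FRAME_SHIFT_S = 0.01

def frames_to_seconds(in_path, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            begin, end, label = line.split()
            # end frame assumed inclusive, hence end + 1
            fout.write("%g %g %s\n" % (int(begin) * FRAME_SHIFT_S,
                                       (int(end) + 1) * FRAME_SHIFT_S,
                                       label))

frames_to_seconds("c31c030s_plus_sil.frames",  # hypothetical name
                  "output_files/c31c030s_plus_sil.wrd")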
3.3 Missing triphones
======================
In the sample sentence I provided there are triphones that are
missing from the VoxForge acoustic model. One can map some of
them to other triphones and append the mappings to the VoxForge
tiedlist file, like this:
$ echo "ah-p+d ah-p+ch
ax-p+d ax-p+t
b-iy+f b-iy+s
hh-ix+z hh-ix+s
iy-f+ao iy-f+ow
ow-ix+n ow-ix+ng
p-d+ey n-d+ey
uw-d+ey ax-d+ey" \
>> julius-3.5.2-quickstart-linux/acoustic_model_files_build726/tiedlist
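A minimal sketch of how one could predict such missing triphones
before running Julius: build "l-p+r" names from the sentence's
phone string and check them against the tiedlist. The short
PHONES list below is a hypothetical stand-in for the full
pronunciation of the test sentence:

TIEDLIST = ("julius-3.5.2-quickstart-linux/"
            "acoustic_model_files_build726/tiedlist")
PHONES = "sil hh iy ah p d ey t s sil".split()  # hypothetical stand-in

# collect every name in the tiedlist, logical or physical
known = set()
for line in open(TIEDLIST):
    known.update(line.split())

for left, p, right in zip(PHONES, PHONES[1:], PHONES[2:]):
    if p == "sil":
        continue  # context handling around silence is simplified here
    tri = "%s-%s+%s" % (left, p, right)
    if tri not in known:
        print("missing: " + tri)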
3.4 The Julius command line I used
===================================
$ echo sample_eng/c31c030s_plus_sil.wav | julius \
-h julius-3.5.2-quickstart-linux/acoustic_model_files_build726/hmmdefs \
-hlist julius-3.5.2-quickstart-linux/acoustic_model_files_build726/tiedlist \
-dfa fa_files/file.dfa \
-v fa_files/file.phndict \
-walign \
-palign \
-multipath \
-spmodel "sp" \
-iwsp \
-b 200 \
-b2 200 \
-bs 200 \
-sb 200.0 \
-gprune safe \
-iwcd1 max \
-iwsppenalty -30.0 \
-input file
--- (Edited on 4/23/2013 6:08 am [GMT-0500] by azeem) ---
> I would like Julius to spot silences (and, ideally, even non-verbal
> sounds) that may occur between words (or even between phones).
First, for better results, use a current nightly build of the
VoxForge acoustic models.
Second, if you look at the output of the forced alignment, you
should have word and phoneme timestamps. You should be able to
create a script that collects the timestamps from the end of one
word to the beginning of the next, to give you an idea of where
there might be silence.
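A minimal sketch of such a gap-collecting script (assuming .wrd
lines of the form "<begin_s> <end_s> <word>", as in the output
posted elsewhere in this thread):

MIN_GAP_S = 0.1  # arbitrary reporting threshold

def find_gaps(wrd_path):
    entries = [line.split() for line in open(wrd_path)]
    for (b1, e1, w1), (b2, e2, w2) in zip(entries, entries[1:]):
        gap = float(b2) - float(e1)
        if gap > MIN_GAP_S:
            print("possible silence of %.2f s between '%s' and '%s'"
                  % (gap, w1, w2))

find_gaps("output_files/c31c030s_plus_sil.wrd")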
I used HTK's HVite for forced alignment in this tutorial (on how
to segment a speech file). The word and phoneme timestamps from
that tutorial look like this... the "sp" (short pause) entries
correspond to the silence you are looking for.
--- (Edited on 4/24/2013 11:28 am [GMT-0400] by kmaclean) ---
@kmaclean: First of all, thank you very much for your advice.
> First, for better results, use a current nightly build of the
> VoxForge acoustic models
Thanks, I will surely try those!
> Second, if you look at the output of the forced alignment, you
> should have word and phoneme timestamps. You should be able to
> create a script that collects the timestamps from the end of one
> word to the beginning of the next, to give you an idea of where
> there might be silence.
Yes, I have already prepared such a script. Actually, the output
I linked in my post contains an example with timestamps (I
realize that my post may be too long and difficult to read,
hiding this information). I have two versions, .wrd and .phn.
Here follows the word version; note that the word "today" is very
long (and that is the error: a silence should have been spotted
right after that word):
0 0.32 sil
0.32 0.45 he
0.45 0.88 updates
0.88 1.04 his
1.04 1.31 list
1.31 1.37 of
1.37 1.75 things
1.75 1.87 to
1.87 2.06 do
2.06 3.98 today
3.98 4.31 before
4.31 4.6 going
4.6 4.88 home
4.88 5.08 each
5.08 5.63 evening
5.63 5.91 sil
> I used HTK's HVite for forced alignment in this tutorial (on how
> to segment a speech file). The word and phoneme timestamps from
> that tutorial look like this... the "sp" (short pause) entries
> correspond to the silence you are looking for.
Sure, HVite is another option that I want to try. But in this
thread I would like to troubleshoot Julius, i.e. to investigate
whether it is capable of automatically spotting silence/filler
events in a forced alignment task, without designing a specific
grammar containing those nodes.
In particular, the -iwsp option seems to deliver what I am
looking for, and I would like to understand whether I am using it
correctly.
--- (Edited on 4/28/2013 10:18 am [GMT-0500] by azeem) ---