VoxForge
Hi NadaDeNada,
>I am looking for a tool to decompose my dictionary words into Italian
>phonemes.
Some open source text-to-speech engines, such as Festival or eSpeak, let you do this. If an engine can generate speech for Italian, then it should also have a command that outputs the phonemes it uses.
For example, with Festival, to determine the pronunciation of a word, you need to use the "lex.lookup" command as follows:
festival> (lex.lookup "internet")
("internet" nil (((ih n t) 1) ((er n) 0) ((eh t) 1)))
Festival lists the phonemes in the word, grouped by syllable, and follows each syllable with a number (these indicate the "lexical stress" of that syllable). Ignore the parentheses and numbers, and you have Festival's view of the phonemes that make up the word you entered. Therefore, for the word "internet", Festival says its phonemes are: "ih n t er n eh t".
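If you need to do the "ignore the parentheses and numbers" step for many words, it can be scripted. Here is a minimal Python sketch, assuming you have captured the output of lex.lookup as text; the function name phonemes_from_lex_entry is just something I made up for illustration, it is not part of Festival.

```python
import re

def phonemes_from_lex_entry(entry: str) -> list:
    """Return the phoneme symbols found in a Festival lexicon entry string."""
    # Skip the quoted headword and the part-of-speech field: the syllable
    # list starts at the first "(((".
    start = entry.find("(((")
    if start == -1:
        return []
    syllables = entry[start:]
    phones = []
    # Each syllable looks like "((ih n t) 1)": the phonemes sit in the inner
    # parentheses and are followed by the stress number.
    for group in re.findall(r"\(\(([^()]*)\)\s*\d+\)", syllables):
        phones.extend(group.split())
    return phones

entry = '("internet" nil (((ih n t) 1) ((er n) 0) ((eh t) 1)))'
print(" ".join(phonemes_from_lex_entry(entry)))  # ih n t er n eh t
```

The sketch assumes the headword itself never contains "(((", which should hold for ordinary dictionary words.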
For the Italian version of Festival, see the "FESTIVAL speaks Italian!" page.
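For eSpeak, the rough equivalent is to ask it for phoneme mnemonics instead of audio. Below is a small Python sketch that wraps the command line; it assumes the espeak binary is on your PATH and that an Italian voice called "it" is installed. The function names are hypothetical. Keep in mind that eSpeak prints its own phoneme mnemonics, which are not the same symbols as Festival's phone set.

```python
import shutil
import subprocess

def espeak_command(word: str, voice: str = "it") -> list:
    """Build the espeak command line that prints phonemes for one word."""
    # -q: do not play audio, -x: write phoneme mnemonics to stdout,
    # -v: select the voice (assumed here to be "it" for Italian).
    return ["espeak", "-q", "-x", "-v", voice, word]

def espeak_phonemes(word: str, voice: str = "it"):
    """Return eSpeak's phoneme string, or None if espeak is missing or fails."""
    if shutil.which("espeak") is None:
        return None
    result = subprocess.run(espeak_command(word, voice),
                            capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return result.stdout.strip()

if __name__ == "__main__":
    print(espeak_phonemes("internet"))
```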
Ken
Look, I don't know how to upload a file to this site, but I will give you the Perl script that normalizes the Festival dictionary.
1) First, find the festlex_IFD.tar file (containing 500,000 words in Festival format).
2) Untar it and look inside the folders for lex.out (30 MB).
3) Run the Perl script (at the end of this post) as:
perl cleandict.pl lex.out normalized.dict
You now have the 500K Italian words with their phonetic transcriptions. You should also find it useful to get the phonetic table under festival/lib/italian_scm/italian_phoneset.scm
Someone has already formatted it to resemble the phonetic table you'll use in HTK. Look around for it, or just get mine here:
a
a1
b
d
dz
dZZ
e
e1
EE
f
g
i
i1
j
JJ
k
l
LL
m
n
nf
ng
o
o1
OO
p
r
s
SIL
SS
t
ts
tSS
u
u1
v
w
z
I think I got it from the user "nsh". Anyway, here is the cleandict.pl script as well:
#!/usr/local/bin/perl -w
#
# -- Script used to clean up the dictionary taken from Festival
#    and turn it into a simple list of words and phonemes:
#    fonemi f OO n e1 m i
#
# Input and output are both ISO-8859-1 (Latin-1). The I/O layers do the
# decoding and encoding, so characters are never re-encoded by hand.
#
# TODO
# In the end everything should be put in plain ASCII.
#
use strict;
use warnings;

if (@ARGV != 2) {
    print STDERR "usage: $0 Festival-like.dic Normalized.dic\n\n";
    exit(1);
}
my ($srcdic, $dstdic) = @ARGV;

open(my $SRCDIC, "<:encoding(iso-8859-1)", $srcdic) or die "cannot open $srcdic: $!\n";
open(my $DSTDIC, ">:encoding(iso-8859-1)", $dstdic) or die "cannot open $dstdic: $!\n";

my $nlinee = 0;
while (my $linea = <$SRCDIC>) {
    $nlinee++;
    my $got  = 0;    # true while inside the double-quoted headword
    my $deep = 0;    # current parenthesis depth
    for my $c (split(//, $linea)) {
        if ($c eq '"') {
            if ($got == 0) {
                $got = 1;
            } else {
                $got = 0;
                print $DSTDIC " ";    # separate the word from its phonemes
            }
        } elsif ($c eq '(') {
            print $DSTDIC " " if $deep == 3;    # a new syllable starts
            $deep++;
        } elsif ($c eq ')') {
            $deep--;
        } elsif ($c =~ /[ 1-9a-zA-Z\xE8\xE9\xEC\xED\xF9\xFA\xE0\xE1\xF2\xF3]/) {
            # keep only the headword and the phonemes (depth > 3);
            # the \x.. escapes are the accented vowels in Latin-1
            print $DSTDIC $c if $got || $deep > 3;
        }
    }
    print $DSTDIC "\n";
}
close($SRCDIC);
close($DSTDIC);
print "processed $nlinee words\n";
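In case the quote and parenthesis counting in the script is hard to follow, here is the same idea as a small Python sketch, run on the "internet" entry quoted earlier in this thread. It is only an illustration of the logic, not a replacement for cleandict.pl, and the name normalize_entry is made up.

```python
ACCENTED = "èéìíùúàáòó"
DIGITS = "123456789"

def normalize_entry(line: str) -> str:
    """Turn one Festival lexicon line into 'word phonemes' text."""
    out = []
    in_quotes = False  # True while reading the quoted headword
    depth = 0          # current parenthesis depth
    for ch in line:
        if ch == '"':
            if in_quotes:
                out.append(" ")  # headword finished, add a separator
            in_quotes = not in_quotes
        elif ch == "(":
            if depth == 3:
                out.append(" ")  # a new syllable starts
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == " " or ch in DIGITS or ch.isascii() and ch.isalpha() or ch in ACCENTED:
            # Keep the headword and anything deeper than level 3 (the
            # phonemes); stress numbers sit at level 3 and are dropped.
            if in_quotes or depth > 3:
                out.append(ch)
    return "".join(out)

entry = '("internet" nil (((ih n t) 1) ((er n) 0) ((eh t) 1)))'
print(normalize_entry(entry).split())
# ['internet', 'ih', 'n', 't', 'er', 'n', 'eh', 't']
```

Note that the raw output has two spaces between the word and the first phoneme (one from the closing quote, one from the first syllable), which is why the example splits on whitespace.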
A note on the phonetic table above: that particular phone list already includes the SIL phoneme, which I inserted myself, so watch out if you don't mean to use it that way.
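Once you have normalized.dict, it may be worth checking that every phoneme in it really appears in the phone list before feeding it to HTK. A minimal Python sketch of such a check follows. It assumes each dictionary line is a single-token word followed by space-separated phonemes; the function name and the second sample line (with the bogus symbol XX) are made up for illustration.

```python
# The phone list from this thread, including the manually added SIL.
PHONES = set("""a a1 b d dz dZZ e e1 EE f g i i1 j JJ k l LL m n nf ng
o o1 OO p r s SIL SS t ts tSS u u1 v w z""".split())

def unknown_phones(dict_lines, phones=PHONES):
    """Map each phoneme missing from `phones` to the words that use it."""
    unknown = {}
    for line in dict_lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        word, pron = fields[0], fields[1:]
        for p in pron:
            if p not in phones:
                unknown.setdefault(p, []).append(word)
    return unknown

# First line is the example from the script's header comment; the second
# is a hypothetical entry containing a symbol that is not in the list.
sample = ["fonemi  f OO n e1 m i", "prova  p r XX v a"]
print(unknown_phones(sample))  # {'XX': ['prova']}
```

To run it on the real file you would pass the open file object instead of the sample list, reading it with the same encoding the Perl script wrote (ISO-8859-1).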