VoxForge
Hello everybody,
While trying to build a German acoustic model from the current VoxForge corpus, I noticed two things.
In the 16khz_16bit folder:
- rebecca-20071016_de2
The WAV file de2-027 is missing (it is listed in the prompts, though).
- anonymous-20100108-vhh
None of the WAV files contain any audible sound. Maybe this happened during the downsampling process.
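Both kinds of problem (a prompt without a recording, and recordings with no audible signal) can be checked for automatically. The following is only a rough sketch using Python's standard library. The function names, the `<prompt id>.wav` file naming, and the silence threshold of 100 are assumptions for illustration, not part of any VoxForge tooling.

```python
import array
import wave
from pathlib import Path


def peak_amplitude(wav_path):
    """Return the largest absolute sample value in a 16-bit PCM WAV file."""
    with wave.open(str(wav_path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError(f"{wav_path}: expected 16-bit samples")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    return max((abs(s) for s in samples), default=0)


def check_submission(prompt_ids, wav_dir, silence_threshold=100):
    """Return (missing, silent) lists of prompt ids.

    Assumes each prompt id maps to '<id>.wav' inside wav_dir (hypothetical
    layout). The threshold is an arbitrary guess and may need tuning.
    """
    wav_dir = Path(wav_dir)
    missing, silent = [], []
    for prompt_id in prompt_ids:
        path = wav_dir / f"{prompt_id}.wav"
        if not path.exists():
            missing.append(prompt_id)
        elif peak_amplitude(path) < silence_threshold:
            silent.append(prompt_id)
    return missing, silent
```

Called with the prompt ids of one submission and its WAV directory, this returns the ids that have no recording and the ids whose recording never rises above the threshold.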
I just thought I'd let you know.
Binh
Edit: Another thing. Several of the German prompts seem to contain the words "two thousand". Some even contain "two thousand0", which indicates that 2000 was replaced by "two thousand", so 2000 became "two thousand" and 20000 became "two thousand0".
Hi Binh,
thanks for the fixes!
updated in: Changeset 6857
> rebecca-20071016_de2 the wav de2-027 is missing. ( it is in the prompt though )
removed de2-027 from prompt list
>anonymous-20100108-vhh
removed submission from repository
>Several of the German prompts seem to contain the word two thousand.
Not sure I understand what you are getting at here... I am not German, so if you can give me the prompt ids of the prompts with the problem and a correction, I can fix the submission applet
thanks,
Ken
I'm sorry. I forgot it isn't so obvious if you don't have it in front of your nose all the time.
Every file is in the main folder/16khz_16bit
anonymous-20080405-phz
*/de5-088 ES GIBT ZAHLREICHE BUCHTEN AN DER ETWA TWO THOUSAND0 KM LANGEN ATLANTIKKÜSTE
Should be: ES GIBT ZAHLREICHE BUCHTEN AN DER ETWA 20000 KM LANGEN ATLANTIKKÜSTE
or better: ES GIBT ZAHLREICHE BUCHTEN AN DER ETWA ZWANZIGTAUSEND KM LANGEN ATLANTIKKÜSTE
(ZWANZIGTAUSEND is the German word for the number 20000)
justmoon-20080204-hbp
*/de5-085 IM JAHR 1998 LEBTEN DORT TWO THOUSAND BÜRGER
Should be: */de5-085 IM JAHR 1998 LEBTEN DORT 2000 BÜRGER
or better: */de5-085 IM JAHR 1998 LEBTEN DORT ZWEITAUSEND BÜRGER
(ZWEITAUSEND is the German word for 2000)
The rest show the same error (mostly in these two sentences) and should be corrected the same way.
justmoon-20080204-hbp
*/de5-088 ES GIBT ZAHLREICHE BUCHTEN AN DER ETWA TWO THOUSAND0 KM LANGEN ATLANTIKKÜSTE
ralfherzog-20070822_de5
*/de5-085 IM JAHR 1998 LEBTEN DORT TWO THOUSAND BÜRGER
ralfherzog-20070822_de5
*/de5-088 ES GIBT ZAHLREICHE BUCHTEN AN DER ETWA TWO THOUSAND0 KM LANGEN ATLANTIKKÜSTE
ralfherzog-20070826_de9
*/de9-059 AM 21 SEPTEMBER TWO THOUSAND IST DAS PATENT ABGELAUFEN
timiobaumann-20080418-ryd
*/de5-085 IM JAHR 1998 LEBTEN DORT TWO THOUSAND BÜRGER
That is what I meant when I said it looked like a search and replace. Every occurrence of the number 2000 seems to have been replaced by the English words for 2000 (two thousand).
Since 2000 is part of 20000, we got some strange prompts with TWO THOUSAND0.
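To illustrate the suspected mechanism: a plain substring replacement also matches the "2000" inside "20000", while a replacement restricted to whole tokens does not. This is only a guess at what happened, since the tool that actually produced the prompts is not known. The snippet just reproduces the symptom on one of the prompts above.

```python
import re

prompt = "ES GIBT ZAHLREICHE BUCHTEN AN DER ETWA 20000 KM LANGEN ATLANTIKKÜSTE"

# Plain substring replacement: the first four digits of "20000" match "2000",
# and the trailing "0" is left behind.
naive = prompt.replace("2000", "TWO THOUSAND")

# \b word boundaries only match a standalone "2000", so "20000" is untouched.
guarded = re.sub(r"\b2000\b", "TWO THOUSAND", prompt)

print(naive)
print(guarded)
```

The first line printed contains TWO THOUSAND0, and the second is the unchanged prompt.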
In case anyone wondered why I said it is better to use the word than the number: I ran into some serious problems while testing training with HTK when the prompts contained numbers.
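A small sketch of how the affected prompt lines could be corrected to the German words and then checked for remaining numerals. The replacement table only contains the two cases listed in this thread. The function names are made up for illustration, and the line format (prompt id, a space, then the text) is assumed from the examples above.

```python
import re

# Corrections taken from the examples in this thread.
# The longer pattern must come first so "TWO THOUSAND0" is not split.
CORRECTIONS = [
    ("TWO THOUSAND0", "ZWANZIGTAUSEND"),
    ("TWO THOUSAND", "ZWEITAUSEND"),
]


def fix_prompt(line):
    """Replace the English leftovers in one prompt line with German words."""
    for wrong, right in CORRECTIONS:
        line = line.replace(wrong, right)
    return line


def remaining_numbers(line):
    """Return numerals still present in the prompt text (the id is skipped)."""
    _prompt_id, _, text = line.partition(" ")
    return re.findall(r"\d+", text)


line = "*/de5-085 IM JAHR 1998 LEBTEN DORT TWO THOUSAND BÜRGER"
fixed = fix_prompt(line)
print(fixed)
print(remaining_numbers(fixed))
```

Note that the corrected de5-085 prompt still contains the numeral 1998, so it would still be flagged by `remaining_numbers` and need to be spelled out by hand.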
Hope it helps
Binh
Found another dead file
16khz_16bit:
anonymous-20080310-rdy
All the WAV files are just empty.
>One invalid audio data and one missing file in voxforge german audio corpus
This looks like a problem with the acoustic model creation scripts... created a ticket to track this