User:
timobaumann
Date: 12/20/2007 2:40 am
Hi Ralf,
well, no reason to be impressed. It was you who read those 3400 utterances, comprising 26000 word tokens and 5000 distinct word types. It is your countless hours of work that really help the project.
Now, praise aside:
You are right, the lines with the special characters should be removed. They are just there because of the scripting I did (I will write more about that when I find the time).
The same goes for upper-/lowercase words. It would be best not to discard this information completely by uppercasing or lowercasing everything; we will need some slightly smart way of figuring this out. For the moment, it would probably suffice to always join the two cases.
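As a rough illustration of what "joining the two cases" could look like, here is a minimal Python sketch. It assumes the lexicon is available as (word, pronunciation) pairs; the function name `join_cases` and the sample pronunciations are made up for the example and are not taken from the extracted lexicon.

```python
from collections import defaultdict


def join_cases(entries):
    """Merge entries whose spellings differ only in case.

    entries: iterable of (word, pronunciation) pairs (assumed format).
    Returns a dict keyed by the lowercased word. The original spellings
    are kept alongside the pronunciations, so the case information is
    not thrown away and can be revisited later.
    """
    merged = defaultdict(lambda: {"spellings": set(), "prons": set()})
    for word, pron in entries:
        slot = merged[word.lower()]
        slot["spellings"].add(word)
        slot["prons"].add(pron)
    return dict(merged)


# Illustrative sample data, not real lexicon content.
sample = [
    ("Essen", "QEs@n"),
    ("essen", "QEs@n"),
    ("Weg", "ve:k"),
    ("weg", "vEk"),
]
merged = join_cases(sample)
```

A pair like "Weg"/"weg" shows why the spellings are kept: after joining, the entry carries two pronunciations, and only the case tells them apart.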
The numbers shouldn't be discarded. They should be handled differently, though. We need some number normalization (5 -> fünf, 85 -> fünf und achtzig) and then just keep the word tokens forming the numbers. (NB: This is not quite correct, because the three separate words "fünf und achtzig" are /fYnf QUnt QaxtsiC/, while "85" spoken as a single word should be /fYn vUn daxtsiC/; there is no Auslautverhärtung, because there is no Auslaut within the number.)
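To make the normalization idea concrete, here is a small Python sketch that expands digit strings into German number-word tokens. It is only an assumption of how this could be done: it covers 0 to 99, leaves larger numbers untouched, and the function names are invented for the example. It produces separate tokens, so the pronunciation caveat from the NB still applies.

```python
import re

UNITS = ["null", "eins", "zwei", "drei", "vier",
         "fünf", "sechs", "sieben", "acht", "neun"]
TEENS = ["zehn", "elf", "zwölf", "dreizehn", "vierzehn",
         "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
        6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def number_to_tokens(n):
    """Return the German number words for 0 <= n <= 99 as a token list."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 10:
        return [UNITS[n]]
    if n < 20:
        return [TEENS[n - 10]]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return [TENS[tens]]
    # "eins" loses its -s inside compounds: 21 -> ein und zwanzig
    unit_word = "ein" if unit == 1 else UNITS[unit]
    return [unit_word, "und", TENS[tens]]


def normalize_numbers(text):
    """Replace every digit run up to 99 with its number-word tokens."""
    def expand(match):
        n = int(match.group())
        if n > 99:
            return match.group()  # out of scope for this sketch
        return " ".join(number_to_tokens(n))
    return re.sub(r"\d+", expand, text)
```

For example, `number_to_tokens(85)` gives the tokens fünf, und, achtzig, which can then be looked up in the lexicon like any other word.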
If we start to build (as in "with our own hands") a *serious* pronunciation resource, we will have to put some thought into how we organize our data. A good starting point for the storage of pronunciation lexicon data is the W3C Pronunciation Lexicon Specification ( http://www.w3.org/TR/pronunciation-lexicon/ ).
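For a feel of what lexicon data in that format looks like, here is a minimal Python sketch that writes entries as a PLS-style document with `lexicon`, `lexeme`, `grapheme` and `phoneme` elements. The helper name `build_pls`, the choice of "x-sampa" as the alphabet label and the sample entry are assumptions for illustration; the specification itself is the authority on what a valid lexicon must contain.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_pls(entries, lang="de", alphabet="x-sampa"):
    """Serialize (grapheme, [pronunciations]) pairs as a PLS-style lexicon.

    The alphabet label is an assumption of this sketch; use whatever
    matches the transcription scheme of the actual data.
    """
    ET.register_namespace("", PLS_NS)  # emit PLS as the default namespace
    root = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {
            "version": "1.0",
            "alphabet": alphabet,
            f"{{{XML_NS}}}lang": lang,  # serialized as xml:lang
        },
    )
    for grapheme, phonemes in entries:
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        for phoneme in phonemes:
            ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = phoneme
    return ET.tostring(root, encoding="unicode")


# Illustrative entry only.
pls_xml = build_pls([("fünf", ["fYnf"])])
```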
I think, though, that our manual work should be on a more structured level: separated by noun, verb and adjective (thus allowing us to automatically guess all the different forms), leaving traces (so that errors can be traced and corrected in the base form), and several more things. It's probably smartest to build a tool that helps us manage our yet-to-be-built great pr0nlex.
All this takes time, which I will not have before January. My next steps will be to improve and publish the process by which I extracted the lexicon above and then to think about how we can start to manually review, sort and improve the lexicon.
Have a nice Christmas!
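One possible shape for such structured data, purely as a sketch: base entries carry a part of speech, and every generated form keeps a pointer back to its base entry and the rule that produced it. The class names, the toy rule `plural_n` and the sample word are hypothetical; real German inflection needs far more than this one rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaseEntry:
    word: str
    pos: str   # e.g. "noun", "verb", "adjective"
    pron: str


@dataclass(frozen=True)
class DerivedEntry:
    word: str
    pron: str
    base: BaseEntry  # trace back to the manually reviewed base form
    rule: str        # name of the rule that generated this form


def plural_n(base: BaseEntry) -> Optional[DerivedEntry]:
    """Toy rule: nouns ending in -e get a plural in -n (Katze -> Katzen)."""
    if base.pos != "noun" or not base.word.endswith("e"):
        return None
    return DerivedEntry(base.word + "n", base.pron + "n", base, "plural_n")


# Illustrative data only.
katze = BaseEntry("Katze", "noun", "kats@")
katzen = plural_n(katze)
```

If the pronunciation of a generated form turns out to be wrong, the `base` and `rule` fields show whether to fix the base entry or the rule.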
Timo