Hello
We have a system where employees call a voicemail and say a score, like this:
ONE FIVE SEVEN NINE
The recordings are then manually listened to and typed into a database. I want this process automated so that the recordings are automatically converted into text that can be processed by a program. We do not need to recognise sentences, just the numbers from 0 to 9, and I have old recordings that I can train the system with.
I have done some research and found out that Julius should be able to do something like this, but the guides, documentation etc. that I have found so far are centered on advanced sentence recognition. I would really appreciate it if anyone could give a hint on the more specific capabilities of Julius (and maybe others?) that I need to look into to accomplish this project.
Thanks in advance.
--- (Edited on 10/15/2010 7:28 am [GMT-0500] by smith388) ---
Julius isn't really a good fit for this task. One of the reasons is that it doesn't have a good acoustic model for numbers. I suggest you try pocketsphinx, which has TIDIGITS - an acoustic model trained from a commercial digits database, and a very accurate one (WER is 1%).
I've recently written a post on using pocketsphinx for voicemail transcription in Asterisk; check it out:
http://nsh.nexiwave.com/2010/09/voicemail-transcription-with.html
To recognize only digits you need to make one small change to the process described there. The pocketsphinx arguments must be
-hmm model/hmm/en/tidigits/ -dic model/lm/en/tidigits.dic -fsg model/lm/en/tidigits.fsg
instead of
-hmm Communicator
This way only digits will be recognized, using the highly accurate TIDIGITS model.
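If you prefer to write the grammar yourself instead of using the shipped tidigits.fsg, a minimal digit loop in JSGF would look roughly like this (just a sketch - the word names are assumed to match the entries in tidigits.dic, and recent pocketsphinx snapshots should accept it via -jsgf instead of -fsg):
#JSGF V1.0;
grammar digits;
public <digits> = ( OH | ZERO | ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE )+;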
--- (Edited on 10/15/2010 16:57 [GMT+0400] by nsh) ---
This seems to be exactly what I want, although some difficulties have arisen.
I have installed the latest pocketsphinx-snapshot and sphinxbase-snapshot for testing. Pocketsphinx is executed like this:
pocketsphinx_continuous \
-infile /home/smith388/one_to_ten.wav \
-hmm /usr/local/share/pocketsphinx/model/hmm/en/tidigits/ \
-dic /usr/local/share/pocketsphinx/model/lm/en/tidigits.dic \
-fsg /usr/local/share/pocketsphinx/model/lm/en/tidigits.fsg \
-samprate 8000
The following error messages appear (I have only pasted lines 1 and 133436 of the "Bad ciphone" messages):
*********
ERROR: "dict.c", line 194: Line 1: Bad ciphone: EH; word !exclamation-point ignored
.
.
.
ERROR: "dict.c", line 194: Line 133436: Bad ciphone: R; word }right-brace ignored
INFO: dict.c(212): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(324): 0 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(404): Allocating 34^3 * 2 bytes (76 KiB) for word-initial triphones
INFO: dict2pid.c(131): Allocated 14008 bytes (13 KiB) for word-final triphones
INFO: dict2pid.c(195): Allocated 14008 bytes (13 KiB) for single-phone word triphones
INFO: fsg_search.c(145): FSG(beam: -1080, pbeam: -1080, wbeam: -634; wip: -26, pip: 0)
INFO: fsg_model.c(678): FSG: 24 states, 11 unique words, 11 transitions (23 null)
INFO: fsg_model.c(213): Computing transitive closure for null transitions
INFO: fsg_model.c(264): 143 null transitions added
ERROR: "fsg_search.c", line 324: The word 'ONE' is missing in the dictionary
*********
Any ideas on what I might be doing wrong?
--- (Edited on 10/15/2010 10:22 am [GMT-0500] by smith388) ---
Thanks, I should have noticed that -dic is not an option in the manpage.
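For reference, here is what I assume the corrected invocation should look like (assuming -dict is the option the manpage actually lists; paths as in my earlier post):
pocketsphinx_continuous \
-infile /home/smith388/one_to_ten.wav \
-hmm /usr/local/share/pocketsphinx/model/hmm/en/tidigits/ \
-dict /usr/local/share/pocketsphinx/model/lm/en/tidigits.dic \
-fsg /usr/local/share/pocketsphinx/model/lm/en/tidigits.fsg \
-samprate 8000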
I have made a quick test with an audio file that contained the word 'ONE'; it got translated to '000000000: EIGHT OH EIGHT', but I have not read the documentation thoroughly yet. Is it possible to calibrate it with some of my old recordings?
I have seen the following message a couple of times:
FATAL_ERROR: "continuous.c", line 149: Failed to calibrate voice activity detection
It seems that this has something to do with how the audio is exported, but it's not clear to me yet.
I definitely think that Pocketsphinx is the best solution for my project; I just need to understand it better.
--- (Edited on 10/15/2010 4:20 pm [GMT-0500] by Visitor) ---
> I have made a quick test with an audio file that contained the word 'ONE'; it got translated to '000000000: EIGHT OH EIGHT', but I have not read the documentation thoroughly yet. Is it possible to calibrate it with some of my old recordings?
> FATAL_ERROR: "continuous.c", line 149: Failed to calibrate voice activity detection
> It seems that this has something to do with how the audio is exported, but it's not clear to me yet.
That also looks like an issue with the format. If you have issues with some audio files, can you just share them so I can take a look?
--- (Edited on 10/16/2010 04:32 [GMT+0400] by nsh) ---
Thanks again, the problem was the audio format; after converting and downsampling the clip with sox, the error message disappeared and ONE was correctly recognized. Tests with the other numbers have also been almost 100% correct.
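In case it helps anyone else, the conversion was done with something along these lines (exact flags depend on the source format; 8000 Hz, mono, 16-bit matches the -samprate 8000 used above):
sox original.wav -r 8000 -c 1 -b 16 converted.wav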
I will try to test with different voices now. In case of a wrong answer from Pocketsphinx, is it possible to tell it the right answer so it can learn from it? I don't know if this will be needed at all, but this functionality could come in handy in some situations, if it exists.
--- (Edited on 10/16/2010 8:07 am [GMT-0500] by Visitor) ---
> In case of a wrong answer from Pocketsphinx, is it possible to tell it the right answer so it can learn from it? I don't know if this will be needed at all, but this functionality could come in handy in some situations, if it exists.
Yes, the CMUSphinx project provides tools for acoustic model adaptation; you could use the incorrectly recognized data to adapt the TIDIGITS acoustic model and increase the accuracy. You need to run a few commands, which could be organized in a simple shell script. See
http://cmusphinx.sourceforge.net/wiki/tutorialadapt
for more details.
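Roughly speaking, the adaptation data is just a list of utterance ids plus matching transcripts; the file names below are only examples:
adapt.fileids:
recording_001
recording_002
adapt.transcription:
<s> ONE FIVE SEVEN NINE </s> (recording_001)
<s> THREE THREE EIGHT </s> (recording_002)
The tutorial then runs sphinx_fe and the adaptation tools on these files to produce an updated model.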
--- (Edited on 10/18/2010 17:16 [GMT+0400] by nsh) ---
I have found a recording that is not recognized and made some experiments adapting the TIDIGITS model by following the tutorial; everything up to the chapter "Generating acoustic feature files" has been successful.
I create the feature files with the following command and arguments:
sphinx_fe `cat tidigits/feat.params` -samprate 16000 -c unrecognized.fileids -di . -do . -ei wav -eo mfc -mswav yes
This generates the following output:
INFO: cmd_ln.c(512): Parsing command line:
sphinx_fe \
-dither yes \
-lowerf 1 \
-upperf 4000 \
-nfilt 20 \
-transform dct \
-round_filters no \
-remove_dc yes \
-wlen 0.025 \
-feat s2_4x \
-agc none \
-cmn current \
-cmninit 63,-1,1 \
-varnorm no \
-samprate 16000 \
-c unrecognized.fileids \
-di . \
-do . \
-ei wav \
-eo mfc \
-mswav yes
ERROR: "cmd_ln.c", line 567: Unknown argument name '-feat'
ERROR: "cmd_ln.c", line 658: cmd_ln_parse_r failed
To successfully create the mfc file I need to delete the following arguments from tidigits/feat.params:
-feat s2_4x
-agc none
-cmn current
-cmninit 63,-1,1
-varnorm no
But I don't think it is a good idea to remove them; maybe I am missing something?
--- (Edited on 10/19/2010 3:57 pm [GMT-0500] by Visitor) ---
> But I don't think it is a good idea to remove them; maybe I am missing something?
Removing them was a good idea, actually: `cat feat.params` is a broken way to pass options. I've updated the adaptation tutorial; it now uses sphinx_fe -argfile feat.params
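So in your case the feature extraction step becomes something like this (same paths and options as in your post, assuming your sphinx_fe snapshot supports -argfile):
sphinx_fe -argfile tidigits/feat.params -samprate 16000 -c unrecognized.fileids -di . -do . -ei wav -eo mfc -mswav yes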
--- (Edited on 10/20/2010 01:24 [GMT+0400] by nsh) ---