General Discussion

Corpus Thresholds
User: serial_strat
Date: 4/3/2007 11:44 am
Are there already statistics on the current size of the corpus being built, and an estimate of the "distance" to a working threshold?

--- (Edited on 4/ 3/2007 11:44 am [GMT-0500] by serial_strat) ---

Re: Corpus Thresholds
User: kmaclean
Date: 4/3/2007 8:02 pm

Hi serial_strat,

The Metrics link on the VoxForge Download page gives statistics on the VoxForge Speech Corpus.

The HDMan output from the previous night's Acoustic Model Creation run is in the HTK.tgz tar file (which is updated nightly). Go to the /HTK/AMCreate_scripts/logs/ directory and open the Step2_HDMan1_log file.
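
If you would rather not unpack the whole archive by hand, something like the following Python sketch should pull the log text straight out of the tarball. It assumes HTK.tgz is a gzip-compressed tar that you have already downloaded locally, and it matches on the file name only because the exact member path inside the archive is an assumption here. The function name read_hdman_log is just illustrative.

```python
import tarfile


def read_hdman_log(tgz_path, log_name="Step2_HDMan1_log"):
    """Return the text of the first regular file in the archive named log_name."""
    with tarfile.open(tgz_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Compare the final path component only, since the leading
            # directories inside the archive are assumed, not confirmed.
            if member.isfile() and member.name.split("/")[-1] == log_name:
                raw = tar.extractfile(member).read()
                return raw.decode("utf-8", errors="replace")
    raise FileNotFoundError(f"{log_name} not found in {tgz_path}")
```

For example, read_hdman_log("HTK.tgz") would return the log contents as one string, or raise FileNotFoundError if no such file is in the archive.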

Last night's (April 3, 2007) summary statistics are as follows:

Dictionary Usage Statistics
  Dictionary     TotalWords  WordsUsed  TotalProns  PronsUsed
  VoxForgeDict       129528       4367      129545       4371
  dict                 4367       4367        4371       4371

4367 words required, 0 missing

New Phone Usage Counts
  1. ax    :  1762
  2. sp    :  4369
  3. ae    :   579
  4. b     :   476
  5. l     :  1270
  6. ow    :   411
  7. n     :  1638
  8. d     :  1005
  9. m     :   711
 10. t     :  1336
 11. iy    :   858
 12. s     :  1388
 13. aa    :   536
 14. z     :   560
 15. er    :   752
 16. ix    :   722
 17. ey    :   447
 18. ao    :   324
 19. r     :  1308
 20. sh    :   276
 21. aw    :    98
 22. ng    :   366
 23. ah    :   269
 24. v     :   337
 25. k     :   993
 26. dx    :   319
 27. uw    :   301
 28. eh    :   782
 29. p     :   659
 30. ch    :   169
 31. jh    :   223
 32. w     :   299
 33. y     :   160
 34. ih    :   584
 35. f     :   457
 36. ay    :   367
 37. g     :   309
 38. th    :   134
 39. hh    :   238
 40. dh    :    59
 41. uh    :    86
 42. zh    :    40
 43. oy    :    53
 44. sil   :     2
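
To see at a glance which phones have the fewest entries in a listing like the one above, a small Python sketch along these lines may help. It assumes each line has the "N. phone : count" shape shown here; the cutoff of 100 is an arbitrary illustration, not a VoxForge or HTK threshold, and the function names are made up for this example.

```python
import re

# Matches lines such as " 21. aw    :    98" from the phone usage listing.
PHONE_LINE = re.compile(r"^\s*\d+\.\s+(\S+)\s*:\s*(\d+)\s*$")


def parse_phone_counts(log_text):
    """Map each phone name to its count, ignoring lines that do not match."""
    counts = {}
    for line in log_text.splitlines():
        match = PHONE_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def rare_phones(counts, minimum=100):
    """Return phones whose count is below minimum, lowest count first."""
    return sorted((p for p, c in counts.items() if c < minimum), key=counts.get)


# A few lines copied from the listing above.
sample = """\
 21. aw    :    98
 40. dh    :    59
 42. zh    :    40
 44. sil   :     2
  2. sp    :  4369
"""
print(rare_phones(parse_phone_counts(sample)))  # ['sil', 'zh', 'dh', 'aw']
```

Run against the full listing, the same call would flag every phone under the chosen cutoff.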

>estimate of the "distance" to a working threshold?

There is no real estimate of a working threshold other than 140 hours of speech, which corresponds to the number of hours of speech used in the Sphinx group's Acoustic Models.

Note that a working threshold is relative. Dictation applications would require much more speech audio than 140 hours; this figure is more of a target for Command and Control and Telephony IVR type applications.


--- (Edited on 4/ 3/2007 9:02 pm [GMT-0400] by kmaclean) ---

--- (Edited on 4/ 3/2007 9:06 pm [GMT-0400] by kmaclean) ---
