
Russian data and models
User: nsh
Date: 6/9/2007 10:21 am
Views: 11588
Rating: 27

I'd like to revive a discussion we had some time ago. We collect and distribute public Russian models:

The size is growing (currently 10 speakers take 170 MB), and it has become harder to distribute this through BerliOS. It would be really nice to upload the files to VoxForge and make it the central point of distribution.

Btw, an unrelated thing: a voice control application has been accepted as a GNOME SoC project. Multilingual models are one of our goals, and we'll probably record data to create a model for desktop control. I wonder whether it is possible to organize such recording with VoxForge.

Re: Russian data and models
User: kmaclean
Date: 6/10/2007 3:40 pm
Views: 363
Rating: 29


>It would be really nice to upload the files to VoxForge and make it the central point of distribution.

Yes we can do that!

I've set up a separate subversion/trac environment for Russian speech and models at:

I will email a password to the email address you set up on VoxForge.  Please commit your files to the subversion site, taking care to ensure that all wav audio files, transcriptions, and acoustic models are released under GPL.  You should have at least one short-form license header for each person who submitted audio - something like this:

Copyright (C) 2007  [name of person who submitted the audio]
These files are free software; you can redistribute them and/or modify them under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

These files are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.
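
As a rough sketch only (this is not a VoxForge tool), the per-speaker headers described above could be generated with a few lines of Python. The one-directory-per-speaker layout, the `LICENSE` file name, and the function name `write_license_headers` are assumptions made for illustration; only the header wording comes from the text above.

```python
from pathlib import Path

# Short-form GPL header; the wording is copied from the post above.
HEADER_TEMPLATE = """\
Copyright (C) {year}  {name}
These files are free software; you can redistribute them and/or modify them under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

These files are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.
"""


def write_license_headers(corpus_root, speakers, year=2007, filename="LICENSE"):
    """Write one short-form license header into each speaker directory.

    `speakers` maps a directory name (hypothetical layout) to the name of
    the person who submitted the audio. Returns the paths that were written.
    """
    written = []
    for dirname, name in sorted(speakers.items()):
        speaker_dir = Path(corpus_root) / dirname
        speaker_dir.mkdir(parents=True, exist_ok=True)
        target = speaker_dir / filename
        target.write_text(HEADER_TEMPLATE.format(year=year, name=name),
                          encoding="utf-8")
        written.append(target)
    return written
```

Calling `write_license_headers("russian", {"speaker01": "Name of submitter"})` would create `russian/speaker01/LICENSE`; the directory and speaker names here are placeholders.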

Once I get a better idea of the file structure, I'll look into updating the nightly mirroring script (it uses GNU 'make' ...) to create a Russian tar file on the VF repository for high-speed downloading.  Please do *not* use Trac or Subversion as the primary download point; it will slow down the VoxForge web site too much.

You can also update the Trac site in Russian.

I have also set up a new Russian Speech Submission Forum for users who wish to submit speech.

>a voice control application has been accepted as a GNOME SoC project. Multilingual models are one of our goals, and we'll probably record data to create a model for desktop control. I wonder whether it is possible to organize such recording with VoxForge.

We will help out if we can, although I need a better idea of what you are looking for.

All the best, 




Re: Russian data and models
User: nsh
Date: 6/10/2007 5:16 pm
Views: 343
Rating: 27

Hello Ken

Thanks a lot for your quick support. Excuse me, but I'm actually a bit lost. Do you really think I should commit audio data to svn? I've looked into the main repository, and it doesn't seem like you store English audio there.

As for Trac, I don't think we need it right now; the svn repo alone will be enough. As for the forum, it's a great idea. I'd like to organize things a bit and submit a lexicon for recording. I'll try to do that soon.

As for SoC, I haven't decided yet. We'll certainly need multilingual models for it to work properly, but that's really a lot of work, so it is only at the planning stage.

Re: Russian data and models
User: nsh
Date: 6/10/2007 5:22 pm
Views: 314
Rating: 33
Oh, excuse my ignorance. I see the data is just in another repo. OK, I'll commit the Russian files then and update you. Thanks a lot!
Re: Russian data and models
User: nsh
Date: 6/11/2007 9:44 pm
Views: 317
Rating: 26
So I've just committed the data. I hope everything is ok. Thanks a lot.
Re: Russian data and models
User: kmaclean
Date: 6/11/2007 10:41 pm
Views: 306
Rating: 29

>So I've just committed the data. I hope everything is ok. Thanks a lot.

Looks good ...
Please group the audio by user.  That way I can create gzipped tar files by user for the audio.  The reason for this is that it reduces the load when I "rsync" with the VoxForge Repository, and it also allows users to download just the new audio (rather than having to download a single large tar file with all the audio every time new audio is submitted).
I will also create a separate gzipped tar file for the Acoustic Models and scripts - most users are only interested in the Acoustic Models ("AM"s).  You might want to include a short how-to for users to tell them how to use the AM with Sphinx. 
Once things are set up, if you add any new audio files in the Subversion repository or update any scripts, the nightly script should pick up the change, create the tar files, and update the mirror on the VoxForge Repository automatically.
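
To illustrate the per-user packaging idea, here is a minimal Python sketch. It is not the actual VoxForge nightly script (which uses GNU 'make'); the one-subdirectory-per-speaker layout, the `.tgz` naming, and the function name `archive_by_speaker` are assumptions. It imitates make-style timestamp checks: a speaker's archive is rebuilt only when one of that speaker's files is newer than the existing archive, so unchanged archives stay byte-identical and a later rsync has nothing to transfer for them.

```python
import tarfile
from pathlib import Path


def archive_by_speaker(audio_root, out_dir):
    """Create one gzipped tar file per speaker directory.

    An archive is rebuilt only if it is missing or older than the newest
    file in the speaker's directory. Returns the names of rebuilt archives.
    """
    audio_root, out_dir = Path(audio_root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rebuilt = []
    for speaker_dir in sorted(p for p in audio_root.iterdir() if p.is_dir()):
        archive = out_dir / f"{speaker_dir.name}.tgz"
        files = [f for f in speaker_dir.rglob("*") if f.is_file()]
        newest = max((f.stat().st_mtime for f in files), default=0.0)
        if archive.exists() and archive.stat().st_mtime >= newest:
            continue  # archive is up to date; leave it untouched
        with tarfile.open(archive, "w:gz") as tar:
            # Store paths relative to the speaker name, not the full path.
            tar.add(speaker_dir, arcname=speaker_dir.name)
        rebuilt.append(archive.name)
    return rebuilt
```

With this layout, someone who already has the older archives only needs to fetch the ones listed in the return value after new audio is committed.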


P.S.  Please remember to keep backups of all files you submit to VoxForge.  While we will use reasonable efforts to back up site data and make such data available in the event of loss or deletion, we take no responsibility or liability for the deletion or failure to store any Content.  See VoxForge's terms and conditions for more information.

Re: Russian data and models
User: nsh
Date: 6/13/2007 3:29 am
Views: 391
Rating: 20

> Please group the audio by user.  That way I can create gzipped tar files by user for the audio.  The reason for this is that it reduces the load when I "rsync" with the VoxForge Repository, and it also allows users to download just the new audio (rather than having to download a single large tar file with all the audio every time new audio is submitted).


>  I will also create a separate gzipped tar file for the Acoustic Models and scripts - most users are only interested in the Acoustic Models ("AM"s).  You might want to include a short how-to for users to tell them how to use the AM with Sphinx.

Will do

 >Please remember to keep backups of all files you submit to VoxForge



Re: Russian data and models
User: kmaclean
Date: 6/19/2007 9:36 am
Views: 352
Rating: 29

Everything is set up!

I've got a link to the Trac front-end to Subversion on the VoxForge dev page

I also have links to the Russian Speech Corpus on the VoxForge Downloads page:

I hope I didn't mangle your original directory structure too much - I wanted to keep the changes to the mirroring scripts to a minimum.

If there is anything else, please let me know,


Re: Russian data and models
User: nsh
Date: 6/19/2007 10:12 am
Views: 3953
Rating: 29

Amazing, thanks a lot Ken.

Now the only problem is the audio data, but that's a minor thing :)
