VoxForge
In short, creating a good recording space boils down to two things:
1) Eliminating external noise.
2) Breaking up as much surface as possible to avoid echo.
As for eliminating external noise, there is only so much you can do without spending a small [or huge] fortune: Pick the room that is furthest away from traffic noise. Close doors and windows. Shut the blinds or pull the curtains. (I take it that you have read the documentation, so telling you to turn off the air conditioner, fan, etc. should not be necessary at this point.)
Now we get to the FUN part! You see, the art of breaking up surfaces is the art of doing what your mother told you never to do: Making A Mess(TM)! That's right. What you need to do is to "scientifically" make a mess of the room.
First, if there is no carpet on the floor, spreading out books about a foot apart is a good start, but don't forget to stand them up open if you can. Also, moving all the plants you have in the other rooms into your recording studio gives good results, as plants have a huge surface. Preferably, place the plants on chairs or the like, evenly distributed around the room.
But the big problem is the walls... bare walls kill good recordings! Closets, "littered" shelves, racks and framed pictures help a lot here. Just remember that pictures with glass covers are actually worse than a bare wall, as glass reflects more sound than wallpaper! And while we are at it... so does the hard, unbroken surface of a door. The only easy/cheap way I can come up with is to place a mattress in front of it, or if it has a hook, hang your biggest coat on it.
Then you systematically inspect the room to check whether you can come up with a solution for every surface you see: Can you stand something in front of it? Can you move it out of the room? Can you pull a blanket over it? Use poster gum to fasten something to it? Etc. Be inventive!
Once your homebrew recording studio looks pretty much like a war zone, you are ready to create clear and noise-free recordings... that is... if you can get hold of a decent microphone!
Have fun making a mess and recording :-)
/macavity
--
FSF Associate member number 3423.
--- (Edited on 10/12/2006 7:33 am [GMT-0500] by Visitor) ---
Awesome - just when we think we have everything covered in the how-to, someone else points out the obvious (after it has been pointed out, of course),
thanks for the info,
Ken
--- (Edited on 10/12/2006 10:41 am [GMT-0400] by kmaclean) ---
I am by no means an expert on this. The work I have done for this project is about all the experience I have.
That said, I have had good feedback on recordings performed in the following environments in my home.
1- inside my car, parked inside my closed garage
2- inside a walk-in closet with the closet and room doors closed
The closet is more comfortable than the car.
In both cases I did think about "breaking up the surfaces", but I didn't want to truly make a mess around my house. I was looking for places where I could minimize setup time and just sit down and record.
-joe
--- (Edited on 12/ 5/2006 14:34:05 [GMT-0500] by jaiger) ---
The more closely the training data matches the data provided by users, the better speech recognition systems tend to work. So I don't think making special changes to the recording environment to reduce echo is the best course of action. It seems to me that all you need to do regarding the recording environment is:
(1) Use a microphone type (headset being an example of a type) that the users of the type of application you are interested in would be using. If the application you are interested in is dictation, I recommend using a headset mic, since dictation users usually use headset mics.
(2) Avoid environments with loud background noise such as people talking in the same room, radio or TV on, etc.
The comp.speech.research newsgroup is a good place to post if you want expert feedback on your recording guidelines. Some very experienced people like Tony Robinson (HTK team) and Arthur Chan (Sphinx team) read that newsgroup.
David Gelbart
a somewhat experienced person
International Computer Science Institute, Speech Group
--- (Edited on 12/15/2006 12:19 am [GMT-0600] by Visitor) ---
Regarding my post above:
I do think Ken's comments in the FAQ regarding headset mic placement are important: "Your microphone should be a bit to the side and below your mouth (so the microphone won't pick-up your breathing), and no more than a half inch (1-2 cm) away."
As I recall, NaturallySpeaking provides instructions like this as part of the installation process, and an open source dictation package could (and should) do the same.
David Gelbart
--- (Edited on 12/15/2006 12:25 am [GMT-0600] by Visitor) ---
Hi David,
Thanks for the post.
I have been including all user-submitted audio in the creation of the current VoxForge Acoustic Models. My plan was to create different types of Acoustic Models based on the quality of the audio. But, based on your comments, this is likely not a good idea.
VoxForge was set up to collect audio for desktop command and control and VoIP IVR type applications, and once enough data was collected, to start looking at dictation applications. In all these applications, I assumed that the majority of people would be using headset mics ... so I think we are on the right track with respect to microphone types. I also assumed that any VoIP IVR applications would use lossless codecs, so I think we are OK there also (though there may be jitter/latency issues).
It seems that in addition to ensuring we get good monophone and triphone coverage, and including as many different people and dialects as possible, we also need to ensure that we get recordings covering as many different hardware configurations (different microphone types, computers with audio cards and with on-board audio, noisy and quiet cooling fans/hard drives) and as many different recording environments (rooms with and without echo) as possible, so you can create as robust an acoustic model as possible.
The current rating approach (thumbs up or down) will likely have to be updated to reflect these requirements.
I'll follow-up with the comp.speech.research newsgroup to get more feedback on our recording guidelines.
all the best,
Ken
--- (Edited on 12/16/2006 8:53 pm [GMT-0500] by kmaclean) ---
Hello Ken,
I am not sure if the majority of users will favor a headset mic for command and control. (Although the more different possible commands there are -- and also, the more limited the training data is -- the more likely a headset mic will be needed for good performance.) This link from Apple is interesting:
http://www.apple.com/education/accessibility/technology/speech_recognition.html
According to Apple, "Over one hundred speakable commands are already created for you... Apple’s Speech Recognition ... is optimized to work with the built-in microphones in Apple’s all-in-one machines such as iBooks, eMacs, iMacs, and PowerBooks. So you don’t have to tether yourself to the machine with a head-mounted microphone...."
When I make VoIP calls I tend to use a 'multimedia' microphone that sits next to my computer, rather than a headset. I don't know what most people do. VoIP codecs are lossy, so I recommend using a different acoustic model for VoIP than for dictation. If you can get a copy of the codec in question, you can take non-VoIP recordings and run them through the codec, thus simulating VoIP. If you have a model of network effects (packet loss and timing effects) you can simulate that as well. There are academic ASR researchers studying VoIP ASR who have been simulating VoIP for years, and perhaps you could partner with one of them to do the simulation. (If you want to find such researchers, you could try comp.speech.research, Google, Google Scholar, and the specialized Google search box at http://www.isca-students.org/search_tools that allows you to search abstracts in the ISCA Archive. I'm not sure what are the best search terms. 'VoIP OR "voice over IP" OR Internet' turns up some stuff in the ISCA archive.) I suppose there are also networking and speech coding researchers with VoIP simulators but that's out of my area.
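As a rough sketch of the codec-simulation idea above: G.711 mu-law is one codec commonly used for telephony and VoIP, and its companding curve can be approximated in a few lines of Python. This is not a bit-exact G.711 implementation, the function names (`mulaw_encode`, `mulaw_decode`, `simulate_codec`) are made up for illustration, and a real experiment would use the actual codec deployed in the target system.

```python
import math

MU = 255.0       # companding constant of the mu-law curve
PEAK = 32768.0   # full scale for signed 16-bit samples


def mulaw_encode(sample: int) -> int:
    """Compress one signed 16-bit sample to an 8-bit code (0..255)."""
    x = max(-1.0, min(1.0, sample / PEAK))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255.0))


def mulaw_decode(code: int) -> int:
    """Expand an 8-bit code back to an approximate signed 16-bit sample."""
    y = code / 255.0 * 2.0 - 1.0
    x = math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
    return int(round(max(-32768.0, min(32767.0, x * PEAK))))


def simulate_codec(samples):
    """Round-trip clean samples through the lossy curve (hypothetical helper)."""
    return [mulaw_decode(mulaw_encode(s)) for s in samples]


if __name__ == "__main__":
    # One 440 Hz test tone sampled at 8 kHz stands in for a real recording.
    clean = [int(10000 * math.sin(2 * math.pi * 440 * n / 8000)) for n in range(80)]
    degraded = simulate_codec(clean)
    print(max(abs(a - b) for a, b in zip(clean, degraded)))
```

The degraded samples could then be used as training or test data in place of the clean ones, so that the acoustic model sees the same quantization error a VoIP caller's audio would carry.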
When doing recordings, my research group has found it useful to display a software VU meter while people are adjusting their mics. This way people can see whether the VU meter level changes while they breathe or not. Perhaps there is some good, free VU meter software out there that could be mentioned in the FAQ.
Regards,
David
PS Given the right hardware, the Asterisk open source PBX (which seems like the most popular one?) can be used with callers from the public telephone network, as well as VoIP callers. I just wanted to mention that in case you weren't aware; I'm not criticizing your decision to focus on VoIP. I found an interesting page (I'm sure there's more out there) which talks about using Sphinx with Asterisk:
http://www.voip-info.org/wiki-Sphinx
--- (Edited on 12/18/2006 3:39 pm [GMT-0600] by Visitor) ---
Hi David,
You're 'keeping me on my toes' with your comments ... thanks. They really help focus where VoxForge should be going and what it should be doing.
> I am not sure if the majority of users will favor a headset mic for command and control.
I really don't know either. I just assumed (once again ...) that most users would want to use a headset microphone because it allows you to work hands-free, without having to move your mouth closer to the microphone when issuing commands. But based on your comments (and the Apple web site), the quality of built-in microphones seems to have improved greatly since the 5-year-old Dell laptop I tested with a while back.
This is what our VoxForge contributors have used for microphones:
Headset microphones:
crxssi - Andrea NC-8 (headset microphone)
csawtell - Anonymous East Asian headset
jaiger - Logitech Precision PC Gaming Headset
mfread - Logitech Digital USB Headset
pmahomey - Labtec gaming fx:1 (headset)
kmaclean - Cyberacoustics headset
Desktop microphones:
corno1979 - Labtec AM-242 (deskboom microphone)
kylegoetz - Logitech Microphone (Pro 4000) – WebCam built-in mic
Studio microphone:
rusty - Samson C15 Studio Condenser (studio microphone)
I think a survey is in order – I'll set one up on the front page to ask users what type of mic they might use for Speech Recognition.
>VoIP codecs are lossy, so I recommend using a different acoustic model for VoIP than for dictation.
That was the intent of creating the 8kHz:16-bit Acoustic Model – Asterisk seems to use this sampling/bit rate internally when you redirect a call's audio output to an outside application. It seemed to work when I managed to get Asterisk talking to Julius a while back, but I did not have access to a good enough acoustic model to test it well (this is really where the idea for VoxForge originated ...).
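One possible way to reuse recordings made at a higher sampling rate for an 8 kHz model is to low-pass filter them and keep every second sample. The sketch below assumes 16 kHz, 16-bit mono input held as a list of integers; `lowpass_taps` and `decimate` are illustrative names, and the filter length and cutoff are arbitrary choices rather than anything VoxForge or Asterisk prescribes. In practice an established resampler (for example the one in Audacity or SoX) would be a better choice than hand-written code.

```python
import math


def lowpass_taps(cutoff, num_taps=63):
    """Hamming-windowed sinc low-pass filter.

    cutoff is given as a fraction of the input sample rate (0 .. 0.5).
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [tap / total for tap in taps]  # normalize to unity gain at DC


def decimate(samples, factor=2, num_taps=63):
    """Low-pass filter, then keep every `factor`-th sample (e.g. 16 kHz -> 8 kHz)."""
    # Cut off slightly below the new Nyquist frequency to leave a transition band.
    taps = lowpass_taps(0.5 / factor * 0.9, num_taps)
    half = num_taps // 2
    out = []
    for i in range(0, len(samples), factor):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = i + k - half  # filter is centered on sample i (no added delay)
            if 0 <= j < len(samples):
                acc += tap * samples[j]
        out.append(int(round(max(-32768, min(32767, acc)))))
    return out
```

Skipping the low-pass step and simply dropping samples would fold everything above 4 kHz back into the speech band as aliasing, which is why the filter comes first.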
> If you can get a copy of the codec in question, you can take non-VoIP recordings and run them through the codec, thus simulating VoIP. If you have a model of network effects (packet loss and timing effects) you can simulate that as well.
That's an excellent approach, thanks! I will follow-up on it.
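The network-effects half of that suggestion could start from something as simple as the sketch below. It assumes independent random packet loss, 20 ms packets and silence substituted for each lost packet; all three are illustrative assumptions, and `simulate_packet_loss` is a hypothetical name. Real VoIP stacks often conceal losses rather than inserting silence, and timing effects such as jitter are not modeled here.

```python
import random


def simulate_packet_loss(samples, rate_hz=8000, packet_ms=20,
                         loss_prob=0.05, seed=0):
    """Replace randomly chosen packets of audio with silence.

    samples   -- list of signed 16-bit sample values
    rate_hz   -- sampling rate of the recording
    packet_ms -- assumed duration of audio carried by one packet
    loss_prob -- assumed probability that any given packet is lost
    seed      -- fixed seed so that a simulation run can be repeated
    """
    rng = random.Random(seed)
    packet_len = max(1, rate_hz * packet_ms // 1000)
    out = []
    for start in range(0, len(samples), packet_len):
        packet = samples[start:start + packet_len]
        if rng.random() < loss_prob:
            out.extend([0] * len(packet))  # lost packet: play silence
        else:
            out.extend(packet)             # delivered packet: unchanged
    return out
```

Running the same clean recording through several loss probabilities would give a set of progressively degraded copies to test an acoustic model against.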
>When doing recordings, my research group has found it useful to display a software VU meter while people are adjusting their mics.
VoxForge recommends that users record with Audacity. Audacity has a VU meter and can be used on many platforms.
>PS Given the right hardware, the Asterisk open source PBX (which seems like the most popular one?) can be used with callers from the public telephone network, as well as VoIP callers.
I was trying to keep things simple initially – ASR over PSTN or cell networks just seemed to add too great a level of complexity for the project at this point. Focusing on VoIP, at least initially, would make things easier for any user to develop and test.
All the best,
Ken
--- (Edited on 12/18/2006 9:53 pm [GMT-0500] by kmaclean) ---
Hello again Ken,
Thanks for your reply.
> Audacity has a VU meter and can be used on many platforms.
That's fantastic! I suggest mentioning explicitly in the recording instructions that people can check the meter to make sure a headset mic is not picking up their breathing. That's what we did in my research group at the start of recording sessions for one project -- we directed everyone in the session to take a look at the VU meter.
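For anyone who wants to run the same check on an already recorded file rather than by watching a live meter, the level a meter displays can be approximated as the RMS of short frames, expressed in dB relative to full scale. The sketch below is only an illustration: `frame_levels_dbfs` is a hypothetical helper, and the 50 ms frame length and 16-bit full-scale value are assumptions.

```python
import math


def frame_levels_dbfs(samples, rate_hz=16000, frame_ms=50):
    """Return one RMS level (in dBFS) per frame of signed 16-bit audio."""
    frame_len = max(1, rate_hz * frame_ms // 1000)
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        # Floor the RMS at 1 so that digital silence does not hit log10(0).
        levels.append(20 * math.log10(max(rms, 1.0) / 32768.0))
    return levels
```

If the levels for a stretch where the speaker is only breathing sit clearly above the levels for a stretch of room silence, the microphone is probably positioned in the breath stream and should be moved to the side.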
Cheers,
David
--- (Edited on 12/19/2006 8:00 pm [GMT-0600] by Visitor) ---
> If you can get a copy of the codec in question, you can take non-VoIP recordings and run them through the codec, thus simulating VoIP. If you have a model of network effects (packet loss and timing effects) you can simulate that as well.
It would also be possible to push non-VoIP recordings through an actual VoIP connection and re-record them at the far end.
With somewhat more effort it would be possible to include the effects of other (non-VoIP) kinds of telephone connections, for example by connecting the audio output of a computer to the microphone input of a phone, and so on.
--- (Edited on 10/18/2007 7:35 pm [GMT-0500] by Visitor) ---