VoxForge
Hi,
I am wondering if it is possible to use the HMMs of the trained Acoustic Model to synthesize speech. It should be possible to generate the most likely output sequence of MFCC frames given any input sequence of phonemes.
Would this synthesized speech resemble the voice of the speaker who trained the AM (assuming that the AM was trained by a single speaker)?
Maybe this is the standard synthesis method in combined speech recognition & synthesis tools? Can someone point me to examples, ideally with a technical description of the synthesizer? If this approach is not recommended, why not (poor speech quality, wasted computing time, ...)?
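To make my question concrete, here is a toy sketch of what I have in mind (not HTK code; the model structure and numbers are made up): for each phoneme HMM, emit each state's Gaussian mean vector, held for the state's expected duration given its self-loop probability.

```python
# Toy sketch: the "most likely" MFCC trajectory from left-to-right phoneme
# HMMs, approximated by holding each state's mean for its expected duration.

def expected_duration(self_loop_prob):
    """Expected dwell time of a state with geometric duration distribution."""
    return 1.0 / (1.0 - self_loop_prob)

def generate_trajectory(phoneme_models, phoneme_seq):
    """Concatenate state mean vectors, each repeated for its expected duration.

    phoneme_models maps a phoneme to a list of (mean_vector, self_loop_prob)
    pairs, one per HMM state, in left-to-right order.
    """
    frames = []
    for ph in phoneme_seq:
        for mean, a_ii in phoneme_models[ph]:
            frames.extend([mean] * round(expected_duration(a_ii)))
    return frames

# Purely illustrative 1- and 2-state models with 2-dim "cepstral" means.
models = {
    "a": [([1.0, 0.5], 0.5), ([1.2, 0.4], 0.75)],
    "t": [([0.2, 0.9], 0.5)],
}
traj = generate_trajectory(models, ["t", "a"])
print(len(traj))  # 2 frames for "t", then 2 + 4 frames for "a": 8 total
```

Of course this ignores transition structure beyond the self-loops and produces piecewise-constant trajectories, but it is the basic idea I mean by "most likely output sequence".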
Thanks
John
--- (Edited on 7/2/2007 5:41 am [GMT-0500] by Visitor) ---
Well, I found the answer in this forum:
http://www.voxforge.org/home/forums/message-boards/acoustic-model-discussions/using-the-models-as-source-for-speech-synthesis2
Has anyone tried HTS and can suggest whether it is worth examining?
John
--- (Edited on 7/2/2007 5:51 am [GMT-0500] by Visitor) ---
According to the Blizzard Challenge 2006, HTS voices are currently the best:
http://festvox.org/blizzard/blizzard2006.html
The only problem is that the voices currently sound a little buzzy. That's because mel-cepstral coefficients are modelled well by HMMs, but the residual is not; usually the residual is modelled by simple white noise. The situation can be improved by using a different feature set, but this area is mostly unexplored.
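The standard excitation scheme behind that buzziness can be sketched like this (toy Python with made-up frame values, not the actual HTS code): voiced frames get a periodic pulse train, unvoiced frames get white noise, and the spectral filter is then driven by this signal.

```python
import random

def excitation(frames, frame_len=80):
    """Build a crude pulse/noise excitation signal.

    frames is a list of (voiced, pitch_period_in_samples) pairs; for
    unvoiced frames the period is ignored and Gaussian white noise is used.
    """
    signal = []
    phase = 0
    for voiced, period in frames:
        for _ in range(frame_len):
            if voiced:
                # One unit pulse per pitch period, zeros in between.
                signal.append(1.0 if phase == 0 else 0.0)
                phase = (phase + 1) % period
            else:
                signal.append(random.gauss(0.0, 0.3))  # white noise
    return signal

# One voiced frame (pitch period 100 samples), then one unvoiced frame.
sig = excitation([(True, 100), (False, 0)])
```

The buzziness comes precisely from that hard pulse/noise decision: real residuals are neither clean pulse trains nor pure noise.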
As for your idea of unifying recognition and synthesis: it doesn't work that way. Recognition and synthesis are different tasks, and they have to be solved differently. For example, the set of prompts in the database certainly must differ, the dictionary should be different, not to mention the decision-tree questions, and so on. So I suppose it's not possible to unify recognition and synthesis on the basis of HMMs, and it's not actually necessary.
--- (Edited on 7/2/2007 8:42 am [GMT-0500] by nsh) ---
Thank you, nsh.
I have very little experience with speech processing, but I am learning the basics behind HTK.
I understand that the excitation function is one problem that only plays a role in synthesis, not in recognition. But why is this special for HMM-based synthesizers? If one knows a good mapping of phoneme contexts onto excitation functions in any synthesizer, this mapping can also be used in HTS, correct?
I also understand that it might be advantageous to choose different training data for the two tasks (although I don't see how to modify the data to optimize either the recognition or the synthesis task, but that is not important).
Suppose I have set up a working speech recognizer based on triphone HMMs on very limited hardware (small RAM and ROM). Would it be feasible to use exactly these HMMs for synthesis? Or is there some reason why I should forget about this idea without trying?
Please say a few more words on what you meant by
"So I suppose it's not possible to unify recognition and synthesis on the basis of HMM. And it's not required actually."
John
--- (Edited on 7/2/2007 4:59 pm [GMT-0500] by Visitor) ---
>I understand that the excitation function is one problem that only plays a role in synthesis, not in recognition. But why is this special for HMM-based synthesizers? If one knows a good mapping of phoneme contexts onto excitation functions in any synthesizer, this mapping can also be used in HTS, correct?
Yes, exactly. It's a generic problem of compressing speech in a way that allows good HMM modelling. It's easy to model 12 mel-cepstral coefficients, but it's not easy to model real speech. Currently the HTS people are looking at different parametrizations like STRAIGHT, and these indeed show good results, but unfortunately they are patented.
> Would it be feasable to use exactly these HMMs for synthesis?
You can try, but the results will not be as good as you might expect. By the way, the HTS model data is very small (around 1 MB), and you already have the dictionary and text-processing functions. So 1 MB for synthesis is not a big cost.
Those are all my points, nothing more. By the way, you can also try asking on the HTS mailing list; they will be glad to help.
--- (Edited on 7/3/2007 1:10 am [GMT-0500] by nsh) ---