Authors: Shigeo Morishima, Shin Ogata, Kazumasa Murai, Satoshi Nakamura
DOI: 10.1109/ICASSP.2002.5745053
Keywords:
Abstract: Speech-to-speech translation has been studied to realize natural human communication beyond language barriers. Toward further multi-modal communication, visual information such as face and lip movements will be necessary. In this paper, we introduce a multi-modal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's lip motion while synchronizing it to the translated speech. To retain the speaker's facial expression, we substitute only the speech organ's image with a synthesized one, which is generated by a three-dimensional wire-frame model that is adaptable to any speaker. Our approach enables image synthesis with an extremely small database. We conduct a subjective evaluation of connected digit discrimination using data with and without audio-visual lip-synchronicity. The results confirm the sufficient quality of the proposed audio-visual translation system.
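To make the mouth-region substitution idea concrete, below is a minimal illustrative sketch, not the authors' implementation. It assumes a hypothetical viseme-to-lip-opening table and a stand-in patch renderer; in the actual system the patch would come from projecting the speaker-adapted 3-D wire-frame model, driven by the phoneme timing of the translated speech.

```python
# Illustrative sketch only (hypothetical names and parameters): substitute the
# mouth region of a video frame with a synthesized patch keyed to the current
# phoneme of the translated speech, leaving the rest of the face untouched.
import numpy as np

# Hypothetical mapping: phoneme -> lip-opening parameter (0 = closed, 1 = open).
VISEME_OPENNESS = {"a": 1.0, "i": 0.4, "u": 0.3, "e": 0.6, "o": 0.8, "sil": 0.0}

def render_mouth_patch(openness: float, size: int = 32) -> np.ndarray:
    """Render a stand-in grayscale mouth patch. A real system would instead
    project a speaker-adapted 3-D wire-frame model at this lip opening."""
    patch = np.full((size, size), 200, dtype=np.uint8)  # placeholder skin tone
    half_h = max(1, int(openness * size / 2))
    # Dark rectangle approximates the open mouth cavity.
    patch[size // 2 - half_h : size // 2 + half_h, size // 4 : 3 * size // 4] = 40
    return patch

def composite(frame: np.ndarray, patch: np.ndarray, top: int, left: int) -> np.ndarray:
    """Overwrite only the mouth region so facial expression elsewhere is kept."""
    out = frame.copy()
    h, w = patch.shape
    out[top : top + h, left : left + w] = patch
    return out

# Usage: one frame at an instant where the translated audio utters /a/.
frame = np.full((120, 160), 180, dtype=np.uint8)      # placeholder face frame
lip_frame = composite(frame, render_mouth_patch(VISEME_OPENNESS["a"]), top=70, left=64)
```

Because only a small patch per viseme needs to be synthesized and the rest of the frame is reused, a scheme like this needs only a very small database, which is the property the abstract highlights.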