Authors: Chalapathy Neti, Gerasimos Potamianos, Giridharan Iyengar, Eric Helmuth
DOI:
Keywords:
Abstract: We compare automatic recognition with human perception of audio-visual speech in the large-vocabulary, continuous speech recognition (LVCSR) domain. Specifically, we study the benefit of the visual modality for both machines and humans when it is combined with audio degraded by speech-babble noise at various signal-to-noise ratios (SNRs). We first consider an automatic speechreading system with a pixel-based visual front end that uses feature fusion for bimodal integration, and we compare its performance to that of an audio-only LVCSR system. We then describe the results of human perception experiments, where subjects are asked to transcribe audio-only and audio-visual utterances at various SNRs. For both machines and humans, we observe an effective SNR gain of approximately 6 dB relative to audio-only performance at 10 dB; however, these gains diverge significantly at other SNRs. Furthermore, the audio-visual system outperforms the audio-only system at low SNRs.
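The "feature fusion" named in the abstract refers to frame-level fusion: the audio and visual feature streams are combined into a single bimodal observation vector per frame before decoding with one model. Below is a minimal sketch of that idea, assuming frame-synchronous streams; the function name, array dimensions, and usage are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def feature_fusion(audio_feats: np.ndarray, video_feats: np.ndarray) -> np.ndarray:
    """Frame-level feature fusion: concatenate synchronized audio and
    visual feature vectors into one bimodal observation per frame.

    audio_feats: (T, Da) array, e.g. cepstral features per audio frame.
    video_feats: (T, Dv) array, e.g. pixel-based mouth-region features
                 upsampled to the audio frame rate.
    Returns a (T, Da + Dv) array that a single recognizer consumes.
    """
    # Fusion at the feature level requires the two streams to be
    # frame-synchronous (same number of frames T).
    assert audio_feats.shape[0] == video_feats.shape[0], "streams must be frame-synchronous"
    return np.concatenate([audio_feats, video_feats], axis=1)

# Hypothetical usage: 100 frames of 39-dim audio and 41-dim visual features.
fused = feature_fusion(np.random.randn(100, 39), np.random.randn(100, 41))
print(fused.shape)  # (100, 80)
```

In practice, feature-fusion systems often follow the concatenation with a dimensionality-reducing transform before modeling, but the essential design choice shown here is a single observation stream rather than separate per-modality models.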
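The "effective SNR gain" is the horizontal shift between the audio-only and audio-visual performance-versus-SNR curves: how many additional dB of audio SNR the audio-only system would need to match audio-visual performance at a given operating point. A small sketch of that computation, assuming word error rate (WER) curves sampled on an SNR grid; all numbers below are made up for illustration and are not the paper's results:

```python
import numpy as np

def effective_snr_gain(snr_db, wer_audio, wer_av, at_snr_db):
    """Effective SNR gain at at_snr_db: the extra audio SNR (in dB) the
    audio-only system needs to reach the audio-visual WER at that point.

    snr_db:    SNR grid in dB, ascending.
    wer_audio: audio-only WER (%) at each grid point (assumed monotonically
               decreasing in SNR so the curve can be inverted).
    wer_av:    audio-visual WER (%) at each grid point.
    """
    # Audio-visual WER at the query SNR.
    target = np.interp(at_snr_db, snr_db, wer_av)
    # Invert the audio-only curve: find the SNR where it reaches `target`.
    order = np.argsort(wer_audio)  # np.interp needs ascending x-values
    equiv_snr = np.interp(target, np.array(wer_audio)[order], np.array(snr_db)[order])
    return equiv_snr - at_snr_db

# Illustrative (fabricated) curves, queried at 10 dB as in the abstract.
snr = [0, 5, 10, 15, 20]
gain = effective_snr_gain(snr, wer_audio=[60, 40, 25, 15, 10],
                          wer_av=[35, 22, 14, 10, 8], at_snr_db=10)
print(f"effective SNR gain at 10 dB: {gain:.1f} dB")
```

The divergence the abstract mentions corresponds to this gain varying with the SNR at which it is measured, i.e. the two curves are not simple horizontal translations of each other.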