The selective use of gaze in automatic speech recognition

Author: Ao Shen

DOI:

Keywords: Language model, Eye movement, Feature extraction, Speech processing, Gaze, Psychology, Speech recognition, Natural language processing, Modality (human–computer interaction), Speech enhancement, Artificial intelligence, Eye tracking

Abstract: The performance of automatic speech recognition (ASR) degrades significantly in natural environments compared to laboratory assessments. Being a major source of interference, acoustic noise affects intelligibility during the ASR process. Noise causes two main problems: the first is contamination of the speech signal; the second is changes in speakers' vocal and non-vocal behaviour. These phenomena elicit a mismatch between training and deployment conditions, which leads to considerable degradation in recognition performance. To improve noise-robustness, popular approaches exploit prior knowledge through speech enhancement and feature extraction models. An alternative approach, presented in this thesis, is to introduce eye gaze as an extra modality. Eye behaviours play various roles in interaction and contain information about cognition and visual attention, but not all of this information is relevant to speech. Therefore, gaze must be used selectively to improve performance. This is achieved through inference procedures that use noise-dependent models of the temporal and semantic relationship between gaze and speech. 'Selective gaze-contingent ASR' systems are proposed and evaluated on a corpus of eye movement and related speech recorded in clean and noisy environments. The best-performing systems utilise gaze for both inference and language model adaptation.
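
To make the abstract's idea of "selective" gaze use concrete, here is a minimal, hypothetical Python sketch of gaze-contingent rescoring of an ASR n-best list. It is not the thesis's actual method: the names (`Hypothesis`, `gaze_boost`, `rescore`, `noise_level`) and the simple noise-gated interpolation are all assumptions introduced for illustration. It only shows the general principle the abstract states: gaze-derived context influences the language-model side of recognition, and its influence is weighted by the noise conditions.

```python
"""Illustrative sketch (not the thesis's implementation) of selective
gaze-contingent rescoring: gaze-derived context words boost matching
hypotheses, and the boost is gated by an acoustic noise estimate, so
gaze matters most when the audio evidence is least reliable."""

from dataclasses import dataclass


@dataclass
class Hypothesis:
    words: list[str]       # recognised word sequence
    acoustic_score: float  # log-likelihood from the acoustic model
    lm_score: float        # log-probability from the baseline language model


def gaze_boost(words: list[str], gaze_words: set[str]) -> float:
    """Count hypothesis words that match gaze-derived context words."""
    return sum(1.0 for w in words if w.lower() in gaze_words)


def rescore(hyps: list[Hypothesis],
            gaze_words: set[str],
            noise_level: float,
            lm_weight: float = 1.0,
            gaze_weight: float = 2.0) -> Hypothesis:
    """Pick the best hypothesis from an n-best list.

    `noise_level` in [0, 1] gates the gaze contribution: in clean audio
    (0.0) gaze is ignored; in heavy noise (1.0) it gets its full weight.
    This gating is the 'selective' use of gaze.
    """
    def score(h: Hypothesis) -> float:
        return (h.acoustic_score
                + lm_weight * h.lm_score
                + noise_level * gaze_weight * gaze_boost(h.words, gaze_words))

    return max(hyps, key=score)


if __name__ == "__main__":
    nbest = [
        Hypothesis(["wreck", "a", "nice", "beach"],
                   acoustic_score=-10.2, lm_score=-6.1),
        Hypothesis(["recognise", "speech"],
                   acoustic_score=-10.8, lm_score=-5.9),
    ]
    # Hypothetical gaze context, e.g. words linked to a fixated object.
    gaze_words = {"recognise", "speech"}
    # In heavy noise the gaze-consistent hypothesis wins the rescoring.
    print(rescore(nbest, gaze_words, noise_level=0.8).words)
```

In this toy run, the acoustically weaker but gaze-consistent hypothesis wins once `noise_level` is high; with `noise_level=0.0` the baseline scores decide, mirroring the abstract's claim that gaze should help selectively rather than always.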
