The Challenge of Multispeaker Lip-Reading

Authors: Yuxuan Lan, Richard W. Harvey, Barry-John Theobald, Jacob L. Newman, Stephen J. Cox

DOI:

Keywords:

Abstract: In speech recognition, the problem of speaker variability has been well studied. Common approaches to dealing with it include normalising for a speaker's vocal tract length and learning a linear transform that moves the speaker-independent models closer to a new speaker. In pure lip-reading (no audio), the problem has been less well studied. Results are often presented based on speaker-dependent (single speaker) or multispeaker (speakers in the test-set are also in the training-set) data, situations that are of limited use in real applications. This paper shows the danger of not using different speakers in the training- and test-sets. Firstly, we present classification results on a new single-word database, AVletters 2, which is a high-definition version of the well-known AVletters database. By careful choice of features, we show that it is possible for the performance of visual-only lip-reading to be very close to that of audio-only recognition in the single-speaker and multispeaker configurations. However, in the speaker-independent configuration, the performance of the visual-only channel degrades dramatically. By applying multidimensional scaling (MDS) to both the audio features and the visual features, we demonstrate that, when compared with the MFCCs commonly used for audio speech recognition, the visual features have inherently small variation within a single speaker across all classes spoken. However, visual features are highly sensitive to the identity of the speaker, whereas audio features are relatively invariant.
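The difference between the multispeaker and speaker-independent configurations named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' protocol: the leave-one-speaker-out scheme, the function names, and the toy (speaker, utterance) data are all assumptions made for illustration.

```python
from collections import defaultdict


def multispeaker_split(samples, test_every=3):
    """Multispeaker: every speaker contributes to both train and test."""
    train, test = [], []
    seen = defaultdict(int)  # utterances seen so far, per speaker
    for speaker, item in samples:
        seen[speaker] += 1
        # Every `test_every`-th utterance of a speaker goes to the test set.
        target = test if seen[speaker] % test_every == 0 else train
        target.append((speaker, item))
    return train, test


def speaker_independent_splits(samples):
    """Speaker-independent (assumed leave-one-speaker-out):
    the held-out speaker never appears in the training set."""
    speakers = sorted({s for s, _ in samples})
    for held_out in speakers:
        train = [(s, x) for s, x in samples if s != held_out]
        test = [(s, x) for s, x in samples if s == held_out]
        yield held_out, train, test


# Hypothetical toy data: 3 speakers, 6 utterances each.
samples = [(spk, f"{spk}_utt{i}") for spk in ("A", "B", "C") for i in range(6)]

ms_train, ms_test = multispeaker_split(samples)
si_folds = list(speaker_independent_splits(samples))
```

In the multispeaker split the same speakers occur on both sides, so a classifier can exploit speaker-specific appearance; in each speaker-independent fold the test speaker is unseen, which is the condition under which the abstract reports the visual-only channel degrading.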

References (27)
Patrick J. Lucey, Gerasimos Potamianos, Sridha Sridharan. A Unified Approach to Multi-Pose Audio-Visual ASR. Faculty of Built Environment and Engineering; Information Security Institute (2007).
T. Kohonen. The self-organizing map. Proceedings of the IEEE, vol. 78, pp. 1464-1480 (1990). doi:10.1109/5.58325
Kathleen E. Finn, Allen A. Montgomery. Automatic optically-based recognition of speech. Pattern Recognition Letters, vol. 8, pp. 159-164 (1988). doi:10.1016/0167-8655(88)90094-3
Brian E. Walden, Robert A. Prosek, Allen A. Montgomery, Charlene K. Scherr, Carla J. Jones. Effects of Training on the Visual Recognition of Consonants. Journal of Speech and Hearing Research, vol. 20, pp. 130-145 (1977). doi:10.1044/JSHR.2001.130
J. Richard Franks, Joan Kimble. The Confusion of English Consonant Clusters in Lipreading. Journal of Speech and Hearing Research, vol. 15, pp. 474-482 (1972). doi:10.1044/JSHR.1503.474
Barry J. Theobald, Richard Harvey, Stephen J. Cox, Colin Lewis, Gari P. Owen. Lip-reading enhancement for law enforcement. Optics and Photonics for Counterterrorism and Crime Fighting II, vol. 6402, p. 640205 (2006). doi:10.1117/12.689960
Mary F. Woodward, Carroll G. Barber. Phoneme Perception in Lipreading. Journal of Speech and Hearing Research, vol. 3, pp. 212-222 (1960). doi:10.1044/JSHR.0303.212
Juergen Luettin, Neil A. Thacker. Speechreading using Probabilistic Models. Computer Vision and Image Understanding, vol. 65, pp. 163-178 (1997). doi:10.1006/CVIU.1996.0570
Elmer Owens, Barbara Blazek. Visemes observed by hearing-impaired and normal-hearing adult viewers. Journal of Speech Language and Hearing Research, vol. 28, pp. 381-393 (1985). doi:10.1044/JSHR.2803.381
Cletus G. Fisher. Confusions Among Visually Perceived Consonants. Journal of Speech and Hearing Research, vol. 11, pp. 796-804 (1968). doi:10.1044/JSHR.1104.796