LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild

Authors: Shiguang Shan, Xilin Chen, Shuang Yang, Keyu Long, Mingmin Yang

DOI:

Keywords:

Abstract: Large-scale datasets have successively proven their fundamental importance in several research fields, especially for early progress in some emerging topics. In this paper, we focus on the problem of visual speech recognition, also known as lipreading, which has received increasing interest in recent years. We present a naturally-distributed large-scale benchmark for lip reading in the wild, named LRW-1000, which contains 1,000 classes with 718,018 samples from more than 2,000 individual speakers. Each class corresponds to the syllables of a Mandarin word composed of one or several Chinese characters. To the best of our knowledge, it is currently the largest word-level lipreading dataset and the only public large-scale Mandarin lipreading dataset. This dataset aims at covering a "natural" variability over different speech modes and imaging conditions to incorporate the challenges encountered in practical applications. It shows large variation in several aspects, including the number of samples in each class, video resolution, lighting conditions, and speakers' attributes such as pose, age, gender, and make-up. Besides providing a detailed description of the dataset and its collection pipeline, we evaluate several typical popular lipreading methods and perform a thorough analysis of the results from several aspects. The results demonstrate the consistency and challenges of our dataset, which may open up new promising directions for future work.
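The abstract characterizes each sample by a word class, a speaker, and recording attributes (resolution, lighting, pose, age, gender). As a purely illustrative sketch of how such per-sample metadata might be organized for analysis (e.g., inspecting the class imbalance mentioned above), the snippet below uses hypothetical field and function names; it is not the official LRW-1000 release format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LipSample:
    """Hypothetical per-sample metadata record (not the official LRW-1000 schema)."""
    word_class: str              # Mandarin word label, one of the 1,000 classes
    speaker_id: str              # one of the 2,000+ individual speakers
    video_path: str              # path to the mouth-region video clip
    resolution: Tuple[int, int]  # (width, height) of the source video
    pose_yaw_deg: float          # approximate head pose, to study pose variation

def samples_for_class(samples: List[LipSample], word: str) -> List[LipSample]:
    """Collect all samples of one word class, e.g. to measure per-class sample counts."""
    return [s for s in samples if s.word_class == word]

if __name__ == "__main__":
    toy = [
        LipSample("ni-hao", "spk_0001", "clips/000001.mp4", (1024, 576), 12.5),
        LipSample("xin-wen", "spk_0042", "clips/000002.mp4", (1920, 1080), -3.0),
    ]
    print(len(samples_for_class(toy, "ni-hao")))  # -> 1
```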
