Video Frame Prediction by Deep Multi-Branch Mask Network

Authors: Sen Li, Jianwu Fang, Hongke Xu, Jianru Xue

DOI: 10.1109/TCSVT.2020.2984783

Abstract: Future frame prediction in video is one of the most important problems in computer vision, and is useful for a range of practical applications, such as intention prediction or video anomaly detection. However, this task is challenging because of the complex and dynamic evolution of the scene. The difficulty of video frame prediction is to model the inherent spatio-temporal correlation between frames and to pose an adaptive and flexible framework for large motion change or appearance variation. In this paper, we construct a deep multi-branch mask network (DMMNet) which adaptively fuses the advantages of optical flow warping and RGB pixel synthesizing methods, i.e., the two common kinds of approaches in this task. In the procedure of DMMNet, we add a mask layer in each branch to adjust the magnitude range of the estimated optical flow and the weight of the frames predicted by optical flow warping and RGB pixel synthesizing, respectively. In other words, we provide a more flexible masking network for motion and appearance fusion on video frame prediction. Exhaustive experiments on the Caltech pedestrian and UCF101 datasets show that the proposed model can obtain favorable video frame prediction performance compared with the state-of-the-art methods. In addition, we also put our model into the video anomaly detection problem, and the superiority is verified by experiments on the UCSD dataset.
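The fusion idea in the abstract can be illustrated with a small NumPy sketch. It is not the DMMNet architecture: the abstract does not give the layer definitions, so the function names (`warp_with_flow`, `fuse_prediction`), the use of a per-pixel mask in [0, 1] to scale the flow, the per-pixel convex blend of the two branch outputs, and the nearest-neighbour warping are all assumptions made for illustration. In the paper the flow, the synthesized frame, and the masks would be produced by learned network branches; here they are plain input arrays.

```python
import numpy as np


def warp_with_flow(frame, flow):
    """Backward-warp `frame` (H, W, C) with `flow` (H, W, 2).

    flow[..., 0] is the x displacement and flow[..., 1] the y displacement.
    Nearest-neighbour sampling with border clamping is an assumption chosen
    to keep the sketch short; a real model would likely use bilinear sampling.
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]


def fuse_prediction(last_frame, flow, flow_mask, synthesized, fusion_mask):
    """Blend a flow-warped frame with a directly synthesized frame.

    flow_mask   (H, W), values in [0, 1]: scales the flow magnitude per pixel
                (hypothetical stand-in for the mask on the flow branch).
    fusion_mask (H, W), values in [0, 1]: weight of the warped frame; the
                synthesized frame receives 1 - fusion_mask.
    """
    scaled_flow = flow * flow_mask[..., None]
    warped = warp_with_flow(last_frame, scaled_flow)
    m = fusion_mask[..., None]
    return m * warped + (1.0 - m) * synthesized
```

With a fusion mask of all ones the result is the warped frame alone, and with all zeros it is the synthesized frame alone; intermediate values let each pixel lean on whichever branch is more reliable there, which is the intuition behind combining motion-based and appearance-based prediction.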
