HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval.

Authors: Haoqi Fan, Song Liu, Yiru Chen, Wenkui Ding, Zhongyuan Wang

DOI:

Keywords: Matching (statistics), Discriminative model, Contrast (statistics), Feature (machine learning), Momentum (technical analysis), Benchmark (computing), The Internet, Computer science, Transformer (machine learning model), Machine learning, Artificial intelligence

Abstract: … Matching to achieve multi-view and comprehensive video-text re… Bootstrap your own latent - A new approach to self-… contextual attention network for fake news detection. In the 44th …

References (65)
Ryan Kiros, Ruslan Salakhutdinov, Richard S Zemel. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv: Learning (2014).
Yoon Kim. Convolutional Neural Networks for Sentence Classification. Empirical Methods in Natural Language Processing, pp. 1746-1751 (2014). DOI: 10.3115/V1/D14-1181
Benjamin Klein, Guy Lev, Gil Sadeh, Lior Wolf. Associating neural word embeddings with deep image representations using Fisher Vectors. Computer Vision and Pattern Recognition, pp. 4437-4446 (2015). DOI: 10.1109/CVPR.2015.7299073
Fangxiang Feng, Xiaojie Wang, Ruifan Li. Cross-modal Retrieval with Correspondence Autoencoder. ACM Multimedia, pp. 7-16 (2014). DOI: 10.1145/2647868.2654902
Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele. A dataset for Movie Description. Computer Vision and Pattern Recognition, pp. 3202-3212 (2015). DOI: 10.1109/CVPR.2015.7298940
Tomas Mikolov, Andrea Frome, Marc'Aurelio Ranzato, Greg S Corrado, Samy Bengio, Jeff Dean, Jon Shlens. DeViSE: A Deep Visual-Semantic Embedding Model. Neural Information Processing Systems, vol. 26, pp. 2121-2129 (2013).
Jun Xu, Tao Mei, Ting Yao, Yong Rui. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. Computer Vision and Pattern Recognition, pp. 5288-5296 (2016). DOI: 10.1109/CVPR.2016.571
Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, Kevin Wilson. CNN architectures for large-scale audio classification. International Conference on Acoustics, Speech, and Signal Processing, pp. 131-135 (2017). DOI: 10.1109/ICASSP.2017.7952132
Youngjae Yu, Hyungjin Ko, Jongwook Choi, Gunhee Kim. End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering. Computer Vision and Pattern Recognition, pp. 3261-3269 (2017). DOI: 10.1109/CVPR.2017.347
Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, Heng Tao Shen. Adversarial Cross-Modal Retrieval. ACM Multimedia, pp. 154-162 (2017). DOI: 10.1145/3123266.3123326