HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval.

Authors: Haoqi Fan, Song Liu, Yiru Chen, Wenkui Ding, Zhongyuan Wang

DOI:

Keywords: Matching (statistics), Discriminative model, Contrast (statistics), Feature (machine learning), Momentum (technical analysis), Benchmark (computing), The Internet, Computer science, Transformer (machine learning model), Machine learning, Artificial intelligence

Abstract: … Matching to achieve multi-view and comprehensive video-text re… Bootstrap your own latent - A new approach to self-… contextual attention network for fake news detection. In the 44th …

References (65)

Ryan Kiros, Ruslan Salakhutdinov, Richard S Zemel. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. arXiv: Learning (2014).

Yoon Kim. Convolutional Neural Networks for Sentence Classification. Empirical Methods in Natural Language Processing, pp. 1746-1751 (2014). DOI: 10.3115/V1/D14-1181

Benjamin Klein, Guy Lev, Gil Sadeh, Lior Wolf. Associating neural word embeddings with deep image representations using Fisher Vectors. Computer Vision and Pattern Recognition, pp. 4437-4446 (2015). DOI: 10.1109/CVPR.2015.7299073

Fangxiang Feng, Xiaojie Wang, Ruifan Li. Cross-modal Retrieval with Correspondence Autoencoder. ACM Multimedia, pp. 7-16 (2014). DOI: 10.1145/2647868.2654902

Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele. A dataset for Movie Description. Computer Vision and Pattern Recognition, pp. 3202-3212 (2015). DOI: 10.1109/CVPR.2015.7298940

Tomas Mikolov, Andrea Frome, Marc'Aurelio Ranzato, Greg S Corrado, Samy Bengio, Jeff Dean, Jon Shlens. DeViSE: A Deep Visual-Semantic Embedding Model. Neural Information Processing Systems, vol. 26, pp. 2121-2129 (2013).

Jun Xu, Tao Mei, Ting Yao, Yong Rui. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. Computer Vision and Pattern Recognition, pp. 5288-5296 (2016). DOI: 10.1109/CVPR.2016.571

Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, Kevin Wilson. CNN architectures for large-scale audio classification. International Conference on Acoustics, Speech, and Signal Processing, pp. 131-135 (2017). DOI: 10.1109/ICASSP.2017.7952132

Youngjae Yu, Hyungjin Ko, Jongwook Choi, Gunhee Kim. End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering. Computer Vision and Pattern Recognition, pp. 3261-3269 (2017). DOI: 10.1109/CVPR.2017.347


Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, Heng Tao Shen. Adversarial Cross-Modal Retrieval. ACM Multimedia, pp. 154-162 (2017). DOI: 10.1145/3123266.3123326


