Authors: Kehan Wang, Seth Z Zhao, David Chan, Avideh Zakhor, John Canny
DOI:
Keywords:
Abstract: Short videos have become the most popular form of social media in recent years. In this work, we focus on the threat scenario in which a video, its audio, and its text description are semantically mismatched in order to mislead the audience. We develop self-supervised methods to detect semantic mismatch across three modalities: video, audio, and text. We use state-of-the-art language, video, and audio models to extract dense features from each modality, and explore transformer architectures together with contrastive learning methods on a dataset of one million Twitter posts from 2021 to 2022. Our best-performing method benefits from the robustness of the Noise-Contrastive loss and from the context provided by fusing the modalities with a cross-transformer. It outperforms the state of the art by over 9% in accuracy. We further characterize the performance of our system on topic-specific datasets containing COVID-19 and …
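The abstract does not state the exact form of the Noise-Contrastive loss, so the following is only a minimal sketch of one common variant (InfoNCE) under stated assumptions: each video embedding is paired with the text embedding at the same index, all other texts in the batch act as negatives, and similarity is temperature-scaled cosine similarity. The function names, the toy 2-D embeddings, and the `temperature` value are illustrative and do not come from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce_loss(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss, assuming anchors[i] is the true match of candidates[i].

    For each anchor, the loss is the negative log-softmax score of its own
    candidate against every candidate in the batch.
    """
    a = [l2_normalize(v) for v in anchors]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, cj)) / temperature for cj in c]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings (assumed values): three "video" vectors and their "text" partners.
video = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text_matched = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Rotate the texts by one position to simulate mismatched posts.
text_mismatched = text_matched[1:] + text_matched[:1]

matched_loss = info_nce_loss(video, text_matched)
mismatched_loss = info_nce_loss(video, text_mismatched)
print(matched_loss < mismatched_loss)
```

In this toy setting the correctly paired batch yields a lower loss than the rotated one, which is the kind of signal a mismatch detector could threshold or learn from. The cross-transformer fusion mentioned in the abstract is not modeled here.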