Authors: Sumedh Pendurkar, Sameer Kolpekwar, Shreyas Dhoot, Yashodhara V. Haribhakta, Biplab Banerjee
DOI: 10.1016/J.PROCS.2020.04.047
Keywords:
Abstract: Open-ended video question answering is a very challenging problem with widespread applications in real life. Existing systems tend to focus on single-word video question answering, which cannot be easily extended to develop open-ended video question answering systems. In this paper, we propose using an architecture popularly used for video captioning to solve the problem of open-ended video-based question answering. To generate good answers, the model is required to understand each frame separately as well as to understand how to link the information from different frames to generate the answer. The model also needs to keep in mind the different modalities and adapt itself accordingly while processing videos and questions. We propose an attention-based multimodal fusion architecture (AMF-VQA) that uses an attention mechanism at every time step to generate the output word. This kind of architecture allows outputting open-ended answers. The proposed architecture is flexible: other modalities, such as audio features, captions, etc., can simply be added to the existing architecture, which is then fine-tuned to get improved results if these new features are available.
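
Illustrative sketch (not from the paper): the abstract describes attention over several modalities at each decoding time step, with the option of adding further modalities later. The NumPy code below shows one common way such a fusion step could look, using additive attention over one feature vector per modality. The function name `fuse_modalities`, the parameter names `W_state`, `W_feat`, `v`, the feature dimension, and the random inputs are all assumptions made for illustration; the actual AMF-VQA formulation may differ.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def fuse_modalities(decoder_state, modality_features, W_state, W_feat, v):
    """One attention-based fusion step (hypothetical formulation).

    decoder_state:      (d,) decoder hidden state at the current time step
    modality_features:  dict mapping modality name -> (d,) feature vector
    W_state, W_feat:    (d, d) projection matrices (assumed, randomly set here)
    v:                  (d,) scoring vector (assumed)
    Returns the fused (d,) vector and the per-modality attention weights.
    """
    names = list(modality_features.keys())
    stacked = np.stack([modality_features[n] for n in names])  # (M, d)
    # Additive attention score for each modality given the decoder state.
    scores = np.array(
        [v @ np.tanh(W_state @ decoder_state + W_feat @ feat) for feat in stacked]
    )
    weights = softmax(scores)   # (M,), sums to 1
    fused = weights @ stacked   # weighted sum of modality features, (d,)
    return fused, dict(zip(names, weights))


rng = np.random.default_rng(0)
d = 8  # assumed common feature dimension
W_state = rng.normal(size=(d, d))
W_feat = rng.normal(size=(d, d))
v = rng.normal(size=d)
state = rng.normal(size=d)

# Two modalities: video frame features and question features.
features = {"frames": rng.normal(size=d), "question": rng.normal(size=d)}
fused, weights = fuse_modalities(state, features, W_state, W_feat, v)

# Adding a further modality (e.g. audio) needs no change to the fusion step.
features["audio"] = rng.normal(size=d)
fused3, weights3 = fuse_modalities(state, features, W_state, W_feat, v)
```

In a full decoder, a step like this would run once per generated word, with the fused vector fed into the word prediction layer; in this sketch, an extra modality is just another entry in the dictionary.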