Authors: Arun Mallya, Svetlana Lazebnik
DOI: 10.1007/978-3-319-46448-0_25
Keywords: Computer science, Learning models, Context (language use), Multiple choice, Task (project management), Network model, Question answering, Transfer (computing), Machine learning, Object (computer science), Artificial intelligence
Abstract: This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision on the level of individual person instances, and weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can be used to improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions on person activity and person-object relationship and show improvements over generic features trained on the ImageNet classification task.