Authors: Georgia Gkioxari, Jitendra Malik, Ross Girshick
DOI:
Keywords:
Abstract: There are multiple cues in an image which reveal what action a person is performing. For example, a jogger has a pose that is characteristic of jogging, but the scene (e.g., road, trail) and the presence of other joggers can be an additional source of information. In this work, we exploit the simple observation that actions are accompanied by contextual cues to build a strong action recognition system. We adapt RCNN to use more than one region for classification while still maintaining the ability to localize the action. We call our system R*CNN. The action-specific models and the feature maps are trained jointly, allowing action-specific representations to emerge. R*CNN achieves 90.2% mean AP on the PASCAL VOC Action dataset, outperforming all other approaches in the field by a significant margin. Last, we show that R*CNN is not limited to action recognition. In particular, R*CNN can also be used to tackle fine-grained tasks such as attribute classification. We validate this claim by reporting state-of-the-art performance on the Berkeley Attributes of People dataset.
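The idea of classifying with more than one region can be illustrated with a small sketch. The abstract does not say how the regions are combined, so the rule used here (a primary-region score plus the best-scoring contextual region, followed by a softmax over actions) is an assumption made for illustration. The names `action_scores`, `w_primary`, and `w_secondary` are hypothetical, and the feature vectors are toy values rather than CNN features.

```python
import math


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def action_scores(primary_feat, secondary_feats, w_primary, w_secondary):
    """Score each action from a primary region plus one contextual region.

    Assumption (not stated in the abstract): the contextual term is the
    maximum response over all candidate secondary regions.
    """
    scores = []
    for wp, ws in zip(w_primary, w_secondary):
        primary = dot(wp, primary_feat)
        context = max(dot(ws, feat) for feat in secondary_feats)
        scores.append(primary + context)
    return scores


def softmax(scores):
    """Turn raw scores into probabilities, numerically stabilised."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: two actions, one person box, two candidate context regions.
primary_feat = [1.0, 0.0]
secondary_feats = [[0.0, 1.0], [1.0, 1.0]]
w_primary = [[2.0, 0.0], [0.0, 1.0]]     # one weight vector per action
w_secondary = [[0.0, 1.0], [1.0, -1.0]]  # one weight vector per action

scores = action_scores(primary_feat, secondary_feats, w_primary, w_secondary)
probs = softmax(scores)
```

Because the primary region stays tied to the person box, the prediction remains localized to that box while the contextual term only adjusts the score.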