Authors: Yiannis Aloimonos, Hal Daumé III, Yezhou Yang, Ching Teo
DOI:
Keywords:
Abstract: We propose a sentence generation strategy that describes images by predicting the most likely nouns, verbs, scenes and prepositions that make up the core sentence structure. The input consists of initial noisy estimates of the objects and scenes detected in the image using state-of-the-art trained detectors. As predicting actions from still images directly is unreliable, we use a language model trained on the English Gigaword corpus to obtain their estimates, together with probabilities of co-located nouns, scenes and prepositions. We use these estimates as parameters on an HMM that models the sentence generation process, with hidden nodes as sentence components and image detections as the emissions. Experimental results show that our strategy of combining vision and language produces readable and descriptive sentences compared to naive strategies that use vision alone.
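
The HMM formulation in the abstract can be illustrated with a minimal Viterbi-style sketch. Everything specific here is an assumption made for illustration: the three-slot chain (noun, verb, scene), the candidate words, and all probabilities are hypothetical toy values, not the paper's model or data. Detector confidences play the role of emissions, corpus co-occurrence probabilities play the role of transitions, and verbs receive no detector score because actions are not detected directly.

```python
import math

# Hypothetical toy setup; slot order, words, and numbers are illustrative only.
SLOTS = ["noun", "verb", "scene"]
CANDIDATES = {
    "noun": ["person", "dog"],
    "verb": ["ride", "sit"],
    "scene": ["street", "kitchen"],
}

# Emissions: noisy detector confidences for objects and scenes.
# Verbs are absent because no detector observes them directly.
DETECTION = {"person": 0.7, "dog": 0.3, "street": 0.6, "kitchen": 0.4}

# Transitions: corpus-derived probability of the next word given the previous one.
COOCCUR = {
    ("person", "ride"): 0.6, ("person", "sit"): 0.4,
    ("dog", "ride"): 0.1, ("dog", "sit"): 0.9,
    ("ride", "street"): 0.8, ("ride", "kitchen"): 0.2,
    ("sit", "street"): 0.3, ("sit", "kitchen"): 0.7,
}


def emission_logp(word):
    # Words without a detector score contribute log(1) = 0.
    return math.log(DETECTION.get(word, 1.0))


def decode():
    """Return the highest-scoring word sequence, one word per slot."""
    # best[word] = (log score of best path ending in word, that path)
    best = {w: (emission_logp(w), [w]) for w in CANDIDATES[SLOTS[0]]}
    for slot in SLOTS[1:]:
        nxt = {}
        for w in CANDIDATES[slot]:
            # Pick the best predecessor for this candidate word.
            score, path = max(
                (prev_score + math.log(COOCCUR[(p, w)]), prev_path)
                for p, (prev_score, prev_path) in best.items()
            )
            nxt[w] = (score + emission_logp(w), path + [w])
        best = nxt
    return max(best.values())[1]


if __name__ == "__main__":
    print(decode())
```

In this toy setting the verb is chosen purely from corpus statistics conditioned on the detected noun and scene, which mirrors the abstract's motivation for using a language model where visual prediction is unreliable.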