Authors: Alexander Hauptmann, Rong Yan, Wei-Hao Lin, Michael Christel, Howard Wactlar
DOI: 10.1007/978-1-84800-076-6_10
Keywords:
Abstract: Digital images and motion video have proliferated in the past few years, ranging from ever-growing personal photo collections to professional news and documentary archives. In searching through these archives, digital imagery indexing based on low-level image features like colour and texture, or on manually entered text annotations, often fails to meet the user's information need, i.e. there is a semantic gap produced by "the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation" (Smeulders, Worring, Santini, Gupta and Jain 2000). The image/video analysis community has long struggled to bridge this gap between low-level features (colour histograms, texture, shape) and the semantic content description of video. Early retrieval systems (Lew 2002; Smith, Lin, Naphade, Natsev and Tseng 2002) usually modelled video clips with a set of (low-level) detectable features generated from different modalities. It is possible to extract such features accurately and automatically, for example colour histograms in HSV, RGB or YUV space, Gabor texture or wavelets, and structure through edge direction histograms and edge maps. However, because the meaning of the content cannot be expressed this way, this approach had very restricted success for semantic queries. Several studies have confirmed the difficulty of addressing users' information needs with such features (Markkula and Sormunen 2000; Rodden, Basalaj, Sinclair and Wood 2001). To overcome the "semantic gap", one approach is to utilise intermediate textual descriptors that can be reliably applied to visual content as semantic concepts (e.g. outdoors, faces, animals). Many researchers have been developing automatic concept classifiers, such as those related to people (face, anchor, etc.), acoustics (speech, music, significant pause), objects (image blobs, buildings, graphics), …
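For illustration only, the following is a minimal sketch, not taken from the chapter, of the kind of low-level colour feature the abstract mentions: a normalised HSV colour histogram for a single keyframe, computed with OpenCV. The file name `frame.jpg` and the bin counts are hypothetical choices.

```python
import cv2


def hsv_colour_histogram(image_path, bins=(8, 8, 8)):
    """Compute a normalised 3-D colour histogram in HSV space for one image."""
    image = cv2.imread(image_path)                 # BGR uint8 array
    if image is None:
        raise FileNotFoundError(image_path)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)   # convert BGR -> HSV
    # Joint histogram over the H, S and V channels.
    # Note: OpenCV stores hue in [0, 180), saturation and value in [0, 256).
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    hist = cv2.normalize(hist, hist).flatten()     # L2-normalised feature vector
    return hist


if __name__ == "__main__":
    # 'frame.jpg' stands in for a keyframe extracted from a video clip.
    features = hsv_colour_histogram("frame.jpg")
    print(features.shape)  # (512,) for 8x8x8 bins
```

Such a vector captures colour distribution but carries no semantics, which is precisely why, as the abstract argues, intermediate concept detectors are needed on top of it.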