Thumbnail Summarization Techniques for Web Archives

Authors: Ahmed AlSum, Michael L. Nelson

DOI: 10.1007/978-3-319-06028-6_25

Keywords: Computer science, Information retrieval, Automatic summarization, Thumbnail, Web page, Levenshtein distance, Style sheet, Digital library, World Wide Web

Abstract: Thumbnails of archived web pages as they appear in common browsers such as Firefox or Chrome can be useful to convey the nature of a web page and how it has changed over time. However, creating thumbnails for all archived web pages is not feasible for large collections, both in terms of the time to create the thumbnails and the space to store them. Furthermore, at least for the purposes of initial exploration and collection understanding, people will likely only need a few dozen thumbnails and not thousands. In this paper, we develop different algorithms to optimize the thumbnail creation procedure for web archives based on information retrieval techniques. We study different features based on HTML text that correlate with changes in rendered thumbnails so we can know in advance which archived pages to use for thumbnails. We find that SimHash correlates with changes in the thumbnails (ρ = 0.59, p < 0.005). We propose different algorithms for thumbnail creation suitable for different applications, reducing the number of thumbnails to be generated to between 9% and 27% of the total size.
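The abstract names SimHash over HTML text as the feature that best predicts a visible change in the thumbnail. The following is a minimal Python sketch of that idea, not the authors' implementation: it computes a 64-bit SimHash from word tokens and keeps a capture for thumbnailing only when its fingerprint differs enough from the last capture kept. The function names, the word-level tokenization, the use of MD5 as the per-token hash, and the `threshold` value are all assumptions made for illustration.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed; matches the 8 digest bytes used below)


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint of `text` using word tokens as features."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # MD5 serves only as a stable mixing function here, not for security.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def select_for_thumbnails(pages, threshold=8):
    """Return indices of captures to thumbnail.

    `pages` is a chronologically ordered list of HTML strings for one URL.
    A capture is kept when its fingerprint is more than `threshold` bits
    away from the last kept capture (hypothetical selection rule).
    """
    selected = []
    last_kept = None
    for index, html in enumerate(pages):
        fp = simhash(html)
        if last_kept is None or hamming_distance(fp, last_kept) > threshold:
            selected.append(index)
            last_kept = fp
    return selected
```

Calling `select_for_thumbnails` on a list of captures of the same URL returns the subset worth rendering. Identical or near-identical captures collapse onto the first one, which is how a selection of this kind reduces the number of thumbnails generated. A suitable `threshold` would have to be tuned against real rendered thumbnails.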
