Authors: Ahmed AlSum, Michael L. Nelson
DOI: 10.1007/978-3-319-06028-6_25
Keywords: Computer science, Information retrieval, Automatic summarization, Thumbnail, Web page, Levenshtein distance, Style sheet, Digital library, World Wide Web
Abstract: Thumbnails of archived web pages as they appear in common browsers such as Firefox or Chrome can be useful to convey the nature of a web page and how it has changed over time. However, creating thumbnails for all archived web pages is not feasible for large collections, both in terms of the time to create the thumbnails and the space to store them. Furthermore, at least for the purposes of initial exploration and collection understanding, people will likely only need a few dozen thumbnails and not thousands. In this paper, we develop different algorithms to optimize the thumbnail creation procedure for web archives based on information retrieval techniques. We study different features based on HTML text that correlate with changes in rendered thumbnails so we can know in advance which archived pages to use for thumbnails. We find that SimHash correlates with changes in the thumbnails (ρ = 0.59, p < 0.005). We propose different algorithms for thumbnail creation suitable for different applications, reducing the number of thumbnails to be generated to 9% to 27% of the total size.
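
The abstract's core idea, using a cheap text fingerprint to predict when a rendered page has visibly changed, can be illustrated with a minimal sketch. This is not the authors' algorithm: the tokenizer, the MD5-based token hash, the 64-bit fingerprint width, the function names, and the `threshold` value are all assumptions chosen for illustration. The sketch computes a SimHash for each archived HTML version and keeps a version for thumbnailing only when its fingerprint differs enough from the last kept one.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumption; the abstract does not state one)


def simhash(text):
    """Return a 64-bit SimHash fingerprint of the word tokens in text."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to 64 bits using the first 8 bytes of its MD5 digest.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def select_for_thumbnails(html_versions, threshold=8):
    """Return indices of versions whose fingerprint differs from the last
    selected version by at least `threshold` bits (hypothetical cutoff)."""
    selected = []
    last_fp = None
    for idx, html in enumerate(html_versions):
        fp = simhash(html)
        if last_fp is None or hamming_distance(fp, last_fp) >= threshold:
            selected.append(idx)
            last_fp = fp
    return selected
```

Calling `select_for_thumbnails` on a chronologically ordered list of HTML strings always keeps the first version and then skips near-duplicates. A real threshold would need to be tuned against observed thumbnail differences, which is the kind of correlation the abstract reports.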