Enhancing Contents-Link Coupled Web Page Clustering and Its Evaluation

Authors: Masaru Kitsuregawa, Yitong Wang

Abstract: Web page clustering is a fundamental technique that offers solutions for data management, information locating and its interpretation, and that facilitates users' navigation, discrimination and understanding. Most existing clustering algorithms cannot be adapted well to Web pages directly in terms of efficiency and effectiveness. Combining contents analysis and hyperlink structure has been proven to be a better approach. However, how to effectively combine the two features with different nature to get satisfactory results remains an open problem, and there is still little work on it. In this paper, we present an experimental study on enhancing the coupling of links and contents of web pages for robust clustering. In particular, we introduce two techniques, in-link reinforcement and anchor window, to improve the adaptability of contents-link coupled clustering. Our detailed evaluation indicates that those techniques can improve clustering quality for a wide range of topics.

1. Introduction

There are more than 2 billion web pages, without counting the so-called hidden web pages that can be generated from underlying databases. At the same time, 100 million web pages become obsolete every month. Locating truly needed information and interpreting it appropriately is a big challenge faced by researchers in the fields of database, Information Retrieval (IR) and data mining. So, clustering web pages correctly, for both source data and search engines, is very important to help end users with discrimination, summarization and understanding of the Web.

The well-cited topic directories such as Yahoo! (www.yahoo.com) and the dmoz directory (www.dmoz.com) are mainly created and maintained manually by domain experts. Therefore, they cover only a small portion of the whole Web due to the extremely low scalability of manual creation and maintenance. They are also often outdated, since the Web changes all the time. Some topics have no corresponding sub-categories in Yahoo! or the dmoz directory. Such unsatisfactory performance calls for semi-automatic or automatic clustering that is expected to scale well and to be able to follow the evolution of the Web as well.

Document clustering has been well studied in the field of traditional IR. The most commonly used algorithms were developed under the vector-space model. Under
