Authors: Masaru Kitsuregawa, Yitong Wang
DOI:
Keywords:
Abstract: Web page clustering is a fundamental technique that offers a solution for data management, information locating and interpretation of Web data, and that facilitates users' navigation, discrimination and understanding. Most existing clustering algorithms cannot be adapted directly to Web page clustering in terms of efficiency and effectiveness. Combining contents analysis and hyperlink structure has been proven to be a better approach. However, how to effectively combine the two features, which differ in nature, to get satisfactory results remains an open problem, and there is still little work on it. In this paper, we present an experimental study on enhancing the coupling of links and contents of web pages for robust web page clustering. In particular, we introduce two techniques, in-link reinforcement and anchor window, to improve the adaptability of contents-link coupled clustering. Our detailed evaluation indicates that those techniques can improve the clustering quality for a wide range of topics.

1. Introduction

There are more than 2 billion web pages, without counting the so-called hidden web pages that can be generated from the underlying databases. At the same time, 100 million web pages become obsolete every month. Locating the truly needed information and interpreting it appropriately is a big challenge faced by researchers in the fields of database, Information Retrieval (IR) and data mining. So, clustering web pages correctly, for both source data and search engines, is very important to help end users with navigation, discrimination, summarization and understanding of the Web. The well-cited topic directories such as Yahoo! (www.yahoo.com) and the open directory (www.dmoz.com) are mainly created and maintained manually by domain experts. Therefore they cover only a small portion of the whole Web, due to the extremely low scalability of manual creation and maintenance. They are also often outdated, as the Web changes all the time. Some topics have no corresponding sub-categories in Yahoo or the open directory. Such unsatisfactory performance calls for semi-automatic or automatic web page clustering, which is expected to scale and to be able to follow the evolution of the Web well. Document clustering has been studied in the field of traditional IR. The most commonly used clustering algorithms are developed under the vector-space model. Under