Web page clustering enhanced by summarization

Authors: Xuanhui Wang, Dou Shen, Hua-Jun Zeng, Zheng Chen, Wei-Ying Ma

DOI: 10.1145/1031171.1031223

Keywords: Factor (programming language), Feature vector, Cluster analysis, Computer science, Web page, Data mining, Information retrieval, Representation (mathematics), Automatic summarization, Latent semantic analysis, HITS algorithm

Abstract: Traditional Web page clustering algorithms use the full text of documents to generate feature vectors. Such methods often produce unsatisfactory results because Web pages contain much noisy information, such as decoration, interaction, and advertisement. The varying length of pages is also a significant negative factor affecting performance. In this paper, we investigate several summarization techniques to tackle these issues when clustering Web pages. Compared with the full-text representation of Web pages, our experimental results indicate that the proposed approach effectively solves the problems of noisy information and varying length, and thus significantly boosts clustering performance.
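The idea in the abstract, building feature vectors from a summary instead of the full text, can be illustrated with a minimal Python sketch. It is not the paper's method: the scorer is a simplified frequency-based heuristic loosely in the spirit of Luhn's approach, and the stopword list, the function names (`tokenize`, `summarize`, `feature_vector`, `cosine`), and the `num_sentences` parameter are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "now", "here"}


def tokenize(text):
    """Lowercase the text and return its alphabetic, non-stopword tokens."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(page_text, num_sentences=2):
    """Keep the sentences whose words are, on average, most frequent in the page."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", page_text) if s.strip()]
    freq = Counter(tokenize(page_text))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # restore the original sentence order
    return " ".join(sentences[i] for i in keep)


def feature_vector(text):
    """Term-frequency feature vector, stored as a Counter."""
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    page = ("Clustering groups similar pages. Clustering pages uses feature vectors. "
            "Buy cheap watches now! Click here to subscribe.")
    summary = summarize(page)
    print(summary)
    print(feature_vector(summary))
```

Because every page is reduced to the same number of sentences, the resulting vectors are less sensitive to page length, and low-scoring sentences such as advertisements tend to drop out before clustering. Any standard clustering algorithm could then be run on the summary-based vectors, for example using `cosine` as the similarity measure.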

References (7)
Víctor Pàmies. Open Directory Project. Softcatalà (http://www.softcatala.org/) (2003).
George Karypis, Michael Steinbach, Vipin Kumar. A Comparison of Document Clustering Techniques (2000).
Yihong Gong, Xin Liu. Generic text summarization using relevance measure and latent semantic analysis. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 19-25 (2001). DOI: 10.1145/383952.383955
H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, vol. 2, pp. 159-165 (1958). DOI: 10.1147/RD.22.0159
James P. Callan. Passage-level evidence in document retrieval. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 302-310 (1994). DOI: 10.5555/188490.188589
Gerard Salton, J. Allan, Chris Buckley. Approaches to passage retrieval in full text information systems. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 49-58 (1993). DOI: 10.1145/160688.160693
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic Analysis. Journal of the Association for Information Science and Technology, vol. 41, pp. 391-407 (1990). DOI: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9