Cluster Generation and Cluster Labelling for Web Snippets

Authors: Marco Maggini, Marco Pellegrini, Filippo Geraci, Fabrizio Sebastiani

DOI:

Keywords: Information retrieval, Cluster analysis, Benchmark (computing), Fuzzy clustering, Snippet, Document clustering, Computer science, Relevance (information retrieval), Clustering high-dimensional data, Metric (mathematics), Data mining

Abstract: This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted "external" metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough evaluation of both the clustering and the labelling algorithms.
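As a rough illustration of the furthest-point-first idea named in the abstract, the sketch below implements the basic greedy heuristic for metric k-center: pick an arbitrary first center, then repeatedly promote the point that lies furthest from all centers chosen so far. This is not the "fast version" used in Armil, whose details are not given here. The function name `furthest_point_first`, the choice of index 0 as the first center, and the use of Euclidean distance on toy 2-D points are all assumptions made for the example; snippet clustering would operate on term vectors with a suitable metric.

```python
import math


def furthest_point_first(points, k, dist=math.dist):
    """Basic greedy furthest-point-first heuristic for metric k-center.

    Returns (centers, assign): indices of the chosen centers, and for each
    point the position (0..k-1) of its closest center in `centers`.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    centers = [0]  # assumption: the first center is simply point 0
    # nearest[i] holds the distance from point i to its closest center so far
    nearest = [dist(p, points[0]) for p in points]
    assign = [0] * len(points)

    while len(centers) < k:
        # The next center is the point furthest from every current center.
        nxt = max(range(len(points)), key=nearest.__getitem__)
        centers.append(nxt)
        # Reassign any point that is closer to the new center.
        for i, p in enumerate(points):
            d = dist(p, points[nxt])
            if d < nearest[i]:
                nearest[i] = d
                assign[i] = len(centers) - 1

    return centers, assign


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
    print(furthest_point_first(toy, 3))
```

Each round costs one pass over the points, so the basic heuristic needs about n·k distance computations for n points and k clusters, and it produces disjoint clusters, as the abstract describes.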

References (39)
Dorit S. Hochbaum, David Shmoys. A best possible approximation algorithm for the k-center problem. Mathematics of Operations Research (1985).
Israel Z. Ben-Shaul, Yoelle S. Maarek, Dan Pelleg, Ronald Fagin. Ephemeral Document Clustering for Web Applications (2001).
Piotr Indyk, Taher H. Haveliwala, Aristides Gionis. Scalable Techniques for Clustering the Web. WebDB (Informal Proceedings), pp. 129-134 (2000).
Emilio Di Giacomo, Walter Didimo, Luca Grilli, Giuseppe Liotta. A Topology-Driven Approach to the Design of Web Meta-search Clustering Engines. SOFSEM 2005: Theory and Practice of Computer Science, pp. 106-116 (2005). DOI: 10.1007/978-3-540-30577-4_14
Dell Zhang, Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results. Asia-Pacific Web Conference, pp. 69-78 (2004). DOI: 10.1007/978-3-540-24655-8_8
Oren Etzioni, Oren Zamir, Richard M. Karp, Omid Madani. Fast and intuitive clustering of web documents. Knowledge Discovery and Data Mining, pp. 287-290 (1997).
Joydeep Ghosh, Raymond Mooney, Alexander Strehl. Impact of Similarity Measures on Web-page Clustering (2000).
Steven J. Phillips. Acceleration of K-Means and Related Clustering Algorithms. Algorithm Engineering and Experimentation, pp. 166-177 (2002). DOI: 10.1007/3-540-45643-0_13
Filippo Geraci, Marco Pellegrini, Marco Maggini, Fabrizio Sebastiani. Cluster generation and cluster labelling for web snippets: a fast and accurate hierarchical solution. String Processing and Information Retrieval, vol. 4209, pp. 25-36 (2006). DOI: 10.1007/11880561_3