Graph-Theoretic Techniques for Web Content Mining

Authors: Adam Schenker, Abraham Kandel, Horst Bunke

Abstract: In this dissertation we introduce several novel techniques for performing data mining on web documents which utilize graph representations of document content. Graphs are more robust than typical vector representations, as they can model structural information that is usually lost when converting the original web document content to a vector representation. For example, graphs can capture information such as the location, order, and proximity of term occurrence, which is discarded under standard vector representation models. Many machine learning methods rely on distance computations, centroid calculations, and other numerical techniques. Thus many of these methods have not been applied to data represented by graphs, since no suitable graph-theoretical concepts were previously available. We introduce the Graph Hierarchy Construction Algorithm (GHCA), which performs topic-oriented hierarchical clustering of web search results modeled using graphs. The system created around this new algorithm and its prior version is compared with similar web search systems to gauge its usefulness. An important advantage of this approach over conventional web search systems is that the results are better organized and more easily browsed by users. Next we present extensions to classical machine learning algorithms, such as the k-means clustering algorithm and the k-Nearest Neighbors classification algorithm, which allow the use of graphs as fundamental data items instead of vectors. We perform experiments comparing the performance of the new graph-based methods to traditional vector-based methods on three web document collections. Our experimental results show an improvement for the graph approaches over the vector approaches for both clustering and classification of web documents. We also propose graph representations that allow the computation of graph similarity in polynomial time; in general, the determination of graph similarity is an NP-Complete problem. In fact, there are some cases where the execution time of the graph-oriented approach was faster than that of the vector approaches.
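
The abstract's central idea, using graphs rather than vectors as the data items for a distance-based method such as k-Nearest Neighbors, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the function names (`build_graph`, `mcs_distance`, `knn_classify`) are hypothetical, documents are reduced to token lists, each unique term becomes one node, a directed edge links each term to the term that follows it, graph size is counted as nodes plus edges, and the distance is one minus the relative size of the maximum common subgraph. Because every node label is unique in this representation, the common subgraph reduces to plain set intersections, which is why the distance is computable in polynomial time here.

```python
from collections import Counter


def build_graph(tokens):
    """Build a (nodes, edges) pair from a token list.

    Assumed representation: one node per unique term, and a directed
    edge from each term to the term immediately following it.
    """
    nodes = set(tokens)
    edges = {(a, b) for a, b in zip(tokens, tokens[1:]) if a != b}
    return nodes, edges


def graph_size(graph):
    """Assumed size measure: number of nodes plus number of edges."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Distance based on the maximum common subgraph (mcs).

    With unique node labels, the mcs is the intersection of the node
    sets together with the intersection of the edge sets.
    """
    common = (g1[0] & g2[0], g1[1] & g2[1])
    denom = max(graph_size(g1), graph_size(g2))
    if denom == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - graph_size(common) / denom


def knn_classify(query, training, k=3):
    """Label a query graph by majority vote among its k nearest graphs.

    `training` is a list of (graph, label) pairs.
    """
    ranked = sorted(training, key=lambda item: mcs_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    training = [
        (build_graph("web search engine results".split()), "search"),
        (build_graph("graph matching distance measure".split()), "graphs"),
        (build_graph("web search clustering results".split()), "search"),
    ]
    query = build_graph("clustering web search results".split())
    print(knn_classify(query, training, k=1))
```

Unlike a bag-of-words vector, the edge set records which terms are adjacent, so two documents with the same vocabulary in a different order are not treated as identical.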
