Measuring the Structural Similarity of Web-based Documents: A Novel Approach

Authors: Frank Emmert-Streib, Jürgen Kilian, Alexander Mehler, Matthias Dehmer

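Abstract: Most known methods for measuring the structural similarity of document structures are based on, e.g., tag measures, path metrics and tree measures in terms of their DOM-Trees. Other methods measure the similarity in the framework of the well known vector space model. In contrast to these we present a new approach to measuring the structural similarity of web-based documents represented by so called generalized trees, which are more general than DOM-Trees, which represent only directed rooted trees. We will design a new similarity measure for graphs representing web-based hypertext structures. Our similarity measure is mainly based on a novel representation of a graph as strings of linear integers, whose components represent structural properties of the graph. The similarity of two graphs is then defined as the optimal alignment of the underlying property strings. In this paper we apply the well known technique of sequence alignments to solve a novel and challenging problem: measuring the structural similarity of generalized trees. More precisely, we first transform our graphs, considered as high dimensional objects, into linear structures. Then we derive similarity values from the alignments of the property strings in order to measure the structural similarity of generalized trees. Hence, we transform a graph similarity problem into a string similarity problem. We demonstrate that our similarity measure captures important structural information by applying it to two different test sets consisting of graphs representing web-based documents. Keywords: Graph similarity, hierarchical graphs, hypertext, generalized trees, web structure mining.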
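
The idea of reducing a graph similarity problem to a string similarity problem can be illustrated with a minimal Python sketch. It is not the authors' measure: the abstract does not specify which structural properties the strings encode or how alignments are scored, so everything below is an assumption made for illustration. Here the property string is the sequence of out-degrees in breadth-first order from the root, the alignment is a standard global (Needleman-Wunsch style) dynamic program, and the names out_degree_string, alignment_cost and similarity are hypothetical.

```python
from collections import deque


def out_degree_string(adj, root):
    """Linearize a rooted graph (assumed property: out-degree).

    Vertices are visited in breadth-first order from the root, and each
    vertex contributes its out-degree as one integer of the string.
    """
    seen = {root}
    queue = deque([root])
    degrees = []
    while queue:
        v = queue.popleft()
        successors = adj.get(v, [])
        degrees.append(len(successors))
        for w in successors:
            if w not in seen:  # back or cross edges do not revisit a vertex
                seen.add(w)
                queue.append(w)
    return degrees


def alignment_cost(s, t, gap=1.0):
    """Minimum cost of a global alignment of two integer strings.

    Assumed scoring: aligning a with b costs |a - b| / max(a, b, 1),
    a value in [0, 1]; aligning an integer with a gap costs `gap`.
    """
    n, m = len(s), len(t)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = s[i - 1], t[j - 1]
            substitute = abs(a - b) / max(a, b, 1)
            dp[i][j] = min(
                dp[i - 1][j - 1] + substitute,  # align a with b
                dp[i - 1][j] + gap,             # a against a gap
                dp[i][j - 1] + gap,             # b against a gap
            )
    return dp[n][m]


def similarity(adj1, root1, adj2, root2):
    """Map the optimal alignment cost to a similarity value in [0, 1]."""
    s = out_degree_string(adj1, root1)
    t = out_degree_string(adj2, root2)
    # With gap = 1 the cost never exceeds the longer string's length.
    return 1.0 - alignment_cost(s, t) / max(len(s), len(t))


# Two toy hypertext graphs as adjacency lists; g2 has an edge back to the
# root, which a generalized tree permits but a DOM-Tree does not.
g1 = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
g2 = {"a": ["b", "c"], "b": [], "c": ["a"]}

print(out_degree_string(g1, "a"))
print(similarity(g1, "a", g1, "a"))
print(similarity(g1, "a", g2, "a"))
```

A graph compared with itself yields a similarity of 1.0, since its property string aligns with itself at zero cost. A single out-degree string discards much of the structure that the paper's property strings are said to capture, so the sketch shows only the overall pipeline: linearize, align, normalize.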
