A flexible structured-based representation for XML document mining

Authors: Anne-Marie Vercoustre, Mounir Fegas, Saba Gul, Yves Lechevallier

DOI: 10.1007/978-3-540-34963-1_34

Abstract: This paper reports on the INRIA group’s approach to XML mining while participating in the INEX XML Mining track 2005. We use a flexible representation of XML documents that allows taking into account the structure only or both the structure and the content. Our approach consists of representing XML documents by a set of their sub-paths, defined according to some criteria (length, root beginning, leaf ending). By considering those sub-paths as words, we can use standard methods for vocabulary reduction, and simple clustering methods such as k-means. We use an implementation of the clustering algorithm known as dynamic clouds, which can work with distinct groups of independent modalities put in separate variables. This is useful in our model since embedded sub-paths are not independent: we split potentially dependent paths into separate variables, resulting in each of them containing independent paths. Experiments with the INEX collections show good results for the structure-only collections, but our approach could not scale well for large structure-and-content collections.
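The sub-path representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_len`, `from_root`, and `to_leaf` parameters, and the choice of joining tags with `/` are assumptions made here to mirror the three criteria the abstract names (length, root beginning, leaf ending). Only the element structure is used; text content is ignored.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(root):
    """Yield the tag sequence of every root-to-leaf path in the tree."""
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            yield path
        for child in children:
            stack.append((child, path + (child.tag,)))


def sub_paths(xml_text, max_len=3, from_root=False, to_leaf=False):
    """Return the set of contiguous tag sub-paths of a document.

    Hypothetical parameters (assumed for illustration):
      max_len   -- maximum number of tags in a sub-path
      from_root -- keep only sub-paths that start at the document root
      to_leaf   -- keep only sub-paths that end at a leaf element
    """
    root = ET.fromstring(xml_text)
    words = set()
    for path in root_to_leaf_paths(root):
        n = len(path)
        for start in range(n):
            if from_root and start != 0:
                break
            for end in range(start + 1, min(start + max_len, n) + 1):
                if to_leaf and end != n:
                    continue
                # Each sub-path becomes one "word" of the vocabulary.
                words.add("/".join(path[start:end]))
    return words


doc = "<article><title>t</title><body><sec><p>x</p></sec></body></article>"
all_words = sub_paths(doc, max_len=2)
root_words = sub_paths(doc, max_len=2, from_root=True)
leaf_words = sub_paths(doc, max_len=2, to_leaf=True)
```

Each document then reduces to a set of sub-path "words", so a collection can be turned into a document-by-word matrix and passed to any standard vocabulary reduction or clustering method such as k-means. Note that nested sub-paths like `body` and `body/sec` always co-occur, which is the dependence between embedded sub-paths that the abstract says motivates placing them in separate variables.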
