Discovery of concept entities from web sites using web unit mining

Authors: Ming Yin, Dion Hoe‐lian Goh, Ee‐Peng Lim, Aixin Sun

DOI: 10.1108/17440080580000088

Abstract: A web site usually contains a large number of concept entities, each consisting of one or more web pages connected by hyperlinks. In order to discover these concept entities for more expressive web site queries and other applications, the web unit mining problem has been proposed. Web unit mining aims to determine the web pages that constitute a concept entity and to classify concept entities into categories. Nevertheless, the performance of an existing web unit mining algorithm, iWUM, suffers because it may create more than one web unit (incomplete web units) from a single concept entity. This paper presents two methods to solve this problem. The first method introduces a more effective web fragment construction method so as to reduce later classification errors. The second method incorporates site‐specific knowledge to discover and handle incomplete web units. Experiments show that incomplete web units can be removed and that overall accuracy is significantly improved, especially on the precision and F1 measures.
