PEBL: Web page classification without negative examples

Authors: Hwanjo Yu, Jiawei Han, K.C. Chang

DOI: 10.1109/TKDE.2004.1264823

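Abstract: Web page classification is one of the essential techniques for Web mining, because classifying Web pages of an interesting class is often the first step of mining the Web. However, constructing a classifier for an interesting class requires laborious preprocessing, such as collecting positive and negative training examples. For instance, in order to construct a "homepage" classifier, one needs to collect a sample of homepages (positive examples) and a sample of nonhomepages (negative examples). In particular, collecting negative training examples requires arduous work and caution to avoid bias. The paper presents a framework, called positive example based learning (PEBL), for Web page classification, which eliminates the need for manually collecting negative training examples in preprocessing. The PEBL framework applies an algorithm, called mapping-convergence (M-C), to achieve classification accuracy (with positive and unlabeled data) as high as that of a traditional SVM (with positive and negative data). M-C runs in two stages: the mapping stage and the convergence stage. In the mapping stage, the algorithm uses a weak classifier that draws an initial approximation of "strong" negative data. Based on the initial approximation, the convergence stage iteratively runs an internal classifier (e.g., SVM) that maximizes margins to progressively improve the approximation of negative data. Thus, the class boundary eventually converges to the true boundary of the positive class in the feature space. We present the M-C algorithm with supporting theoretical and experimental justifications. Our experiments show that, given the same set of positive examples, the M-C algorithm outperforms one-class SVMs, and it is almost as accurate as traditional SVMs.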
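The two-stage loop described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation. Several things here are assumptions made for illustration: documents are modeled as sets of tokens, the weak classifier in the mapping stage is a simple frequency-comparison rule, and the internal classifier is a plain frequency-difference linear scorer swapped in for the margin-maximizing SVM the abstract names. All function names and the sample data are hypothetical.

```python
from collections import Counter


def doc_freq(docs):
    """Fraction of documents in `docs` that contain each token."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {tok: c / len(docs) for tok, c in counts.items()}


def mapping_stage(positives, unlabeled):
    """Weak classifier (assumed rule): a token is a 'positive feature' if it is
    more frequent in the positive set than in the unlabeled set. Unlabeled
    documents with no positive feature are taken as 'strong' negatives."""
    pf, uf = doc_freq(positives), doc_freq(unlabeled)
    positive_features = {t for t, f in pf.items() if f > uf.get(t, 0.0)}
    return {i for i, doc in enumerate(unlabeled) if not (doc & positive_features)}


def train(positives, negatives):
    """Stand-in internal classifier (NOT an SVM): per-token weight equal to
    positive document frequency minus negative document frequency."""
    pf, nf = doc_freq(positives), doc_freq(negatives)
    return {t: pf.get(t, 0.0) - nf.get(t, 0.0) for t in set(pf) | set(nf)}


def score(weights, doc):
    """Sum of token weights; a score below zero means 'negative'."""
    return sum(weights.get(t, 0.0) for t in doc)


def mapping_convergence(positives, unlabeled):
    """Grow the negative set until the classifier finds no new negatives."""
    negatives = mapping_stage(positives, unlabeled)
    while True:
        weights = train(positives, [unlabeled[i] for i in sorted(negatives)])
        new = {
            i
            for i, doc in enumerate(unlabeled)
            if i not in negatives and score(weights, doc) < 0
        }
        if not new:
            return weights, negatives
        negatives |= new


# Hypothetical sample data: "homepage"-like positives and a mixed unlabeled pool.
P = [
    {"home", "welcome", "cv"},
    {"home", "research", "cv"},
    {"welcome", "research", "home"},
]
U = [
    {"home", "cv", "welcome"},      # hidden positive
    {"research", "cv", "home"},     # hidden positive
    {"price", "cart", "shop"},      # strong negative
    {"news", "sports", "score"},    # strong negative
    {"shop", "price", "research"},  # negative sharing one positive feature
    {"news", "cv", "score"},        # negative sharing one positive feature
]

weights, negatives = mapping_convergence(P, U)
print(sorted(negatives))  # [2, 3, 4, 5]
```

On this sample, the mapping stage flags only documents 2 and 3, and the first convergence iteration adds documents 4 and 5, which share a token with the positives but score below zero. The second iteration finds nothing new, so the loop stops with the two hidden positives left outside the negative set. A real setup would replace `train` and `score` with an SVM, since the abstract ties the convergence behavior to margin maximization.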
