Enhanced hypertext categorization using hyperlinks

Authors: Soumen Chakrabarti, Byron Dom, Piotr Indyk

DOI: 10.1145/276304.276332

Keywords:

Abstract: A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas
