Enhanced hypertext categorization using hyperlinks

Authors: Soumen Chakrabarti, Byron Dom, Piotr Indyk

Abstract: A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased the error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas ...
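The idea of relaxation labeling over a link neighborhood can be illustrated with a small sketch. This is not the authors' statistical model: it is a simplified averaging scheme, assumed here for illustration only, in which each unlabeled document repeatedly blends its text-only class probabilities with the current class beliefs of its linked neighbors, while documents with known topics stay fixed. The function name `relaxation_labeling`, the `weight` parameter, and the toy data are all hypothetical.

```python
def relaxation_labeling(text_probs, links, known, n_iter=10, weight=0.5):
    """Iteratively refine class beliefs using linked neighbours.

    text_probs: doc -> {class: probability} from a text-only classifier (assumed given).
    links:      doc -> list of neighbouring docs.
    known:      doc -> class label for documents whose topic is already known.
    weight:     illustrative mixing factor between text and neighbour evidence.
    """
    classes = sorted({c for probs in text_probs.values() for c in probs})

    def clamp(doc):
        # A known document puts all its belief on its given label.
        return {c: 1.0 if c == known[doc] else 0.0 for c in classes}

    # Start from the known labels where available, else from text alone.
    beliefs = {
        d: clamp(d) if d in known
        else {c: text_probs[d].get(c, 0.0) for c in classes}
        for d in text_probs
    }

    for _ in range(n_iter):
        updated = {}
        for d in text_probs:
            if d in known:
                updated[d] = beliefs[d]  # known topics never change
                continue
            nbrs = [n for n in links.get(d, []) if n in beliefs]
            scores = {}
            for c in classes:
                text = text_probs[d].get(c, 0.0)
                if nbrs:
                    nbr_avg = sum(beliefs[n][c] for n in nbrs) / len(nbrs)
                    scores[c] = (1 - weight) * text + weight * nbr_avg
                else:
                    scores[c] = text  # no neighbours: fall back to text only
            total = sum(scores.values())
            updated[d] = (
                {c: s / total for c, s in scores.items()} if total else beliefs[d]
            )
        beliefs = updated

    # Final label is the class with the highest belief.
    return {d: max(b, key=b.get) for d, b in beliefs.items()}


# Toy data: "b" leans towards class "y" on text alone,
# but both of its neighbours are known to be class "x".
text_probs = {
    "a": {"x": 0.9, "y": 0.1},
    "b": {"x": 0.45, "y": 0.55},
    "c": {"x": 0.6, "y": 0.4},
}
links = {"b": ["a", "c"]}
known = {"a": "x", "c": "x"}

labels_with_links = relaxation_labeling(text_probs, links, known)
labels_text_only = relaxation_labeling(text_probs, {}, {})
```

In this toy setup the neighbor evidence outweighs the weak text evidence for "b", whereas with no links and no known topics the result reduces to the text-only decision. The sketch uses neighbor class beliefs rather than neighbor terms, in line with the abstract's observation that naive use of neighborhood terms can hurt accuracy.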
