Authors: Soumen Chakrabarti, Byron Dom, Piotr Indyk
Keywords:
Abstract: A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo! and the US Patent Database. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas