Multiple sets of features for automatic genre classification of web documents

Authors: Chul Su Lim, Kong Joo Lee, Gil Chang Kim

DOI: 10.1016/J.IPM.2004.06.004

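Abstract: With the increase of information on the Web, it is difficult to quickly find the desired information among the documents retrieved by a search engine. One way to solve this problem is to classify web documents according to various criteria. Most document classification has focused on the subject or topic of a document. A genre or style is another view of a document, different from its subject or topic, and genre is also a criterion for classifying documents. In this paper, we suggest multiple sets of features for classifying the genres of web documents. The basic set of features, which has been proposed in previous studies, is acquired from the textual properties of documents, such as the number of sentences, the number of occurrences of a certain word, etc. However, web documents differ from plain textual documents in that they contain a URL and HTML tags within the pages. We introduce new sets of features specific to web documents, which are extracted from the URL and HTML tags. The present work is an attempt to evaluate the performance of the proposed sets of features and to discuss their characteristics. Finally, we conclude which set of features is appropriate for automatic genre classification of web documents.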
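The three feature families named in the abstract (textual properties, URL, HTML tags) can be illustrated with a minimal sketch. This is not the authors' feature set: the function name `extract_features`, the feature keys, the `cue_word` parameter, and the specific choices of URL depth and raw tag counts are all assumptions made for illustration, using only the Python standard library.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class _TagAndTextCollector(HTMLParser):
    """Counts start tags and gathers visible text, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def extract_features(url, html, cue_word="the"):
    """Return a flat dict of illustrative (hypothetical) genre features."""
    collector = _TagAndTextCollector()
    collector.feed(html)
    collector.close()

    # Textual features: crude sentence and word counts over the visible text.
    text = " ".join(collector.text_parts)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())

    # URL features: path depth and presence of a query string.
    parsed = urlparse(url)
    path_segments = [p for p in parsed.path.split("/") if p]

    features = {
        "text.num_sentences": len(sentences),
        "text.num_words": len(words),
        "text.count_cue_word": words.count(cue_word.lower()),
        "url.depth": len(path_segments),
        "url.has_query": int(bool(parsed.query)),
    }

    # HTML features: one count per tag name that appears in the page.
    for tag, count in collector.tag_counts.items():
        features["html.tag." + tag] = count
    return features
```

Each feature family is kept under its own key prefix (`text.`, `url.`, `html.`), so the sets can be evaluated separately or in combination by selecting keys before passing the vectors to a classifier of choice.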

References (21)
Niklas Wolkert, Jussi Karlgren, Johan Dewe, Ivan Bretan, Anders Hallberg. Iterative Information Retrieval Using Fast Clustering and Usage-Specific Genres (1999).
Jussi Karlgren, Johan Dewe, Ivan Bretan. Assembling a Balanced Corpus from the Internet. Proceedings of the 11th Nordic Conference of Computational Linguistics (NODALIDA 1998), pp. 100-108 (1998).
Rich Caruana, Dayne Freitag. Greedy Attribute Selection. Machine Learning Proceedings 1994, pp. 28-36 (1994). DOI: 10.1016/B978-1-55860-335-6.50012-X
Fiona J. Tweedie, R. Harald Baayen. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, vol. 32, pp. 323-352 (1998). DOI: 10.1023/A:1001749303137
Wessel Kraaij, Thijs Westerveld, Djoerd Hiemstra. The Importance of Prior Probabilities for Entry Page Search. Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '02), pp. 27-34 (2002). DOI: 10.1145/564376.564383
Douglass R. Cutting, David R. Karger, Jan O. Pedersen. Constant interaction-time scatter/gather browsing of very large document collections. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 126-134 (1993). DOI: 10.1145/160688.160706
E. Stamatatos, N. Fakotakis, G. Kokkinakis. Text genre detection using common word frequencies. International Conference on Computational Linguistics, pp. 808-814 (2000). DOI: 10.3115/992730.992763