Content Extraction from News Pages Using Particle Swarm Optimization

Authors: Cai-Nicolas Ziegler, Michal Skubacz

DOI: 10.1007/978-3-642-27714-6_8

Abstract: Today’s Web pages are commonly made up of more than merely one cohesive block of information. For instance, news pages from popular media channels such as Financial Times or Washington Post consist of no more than 30%-50% textual news, next to advertisements, link lists to related articles, disclaimer information, and so forth.
