Thai Related Foreign Language Specific Web Crawling Approach

Authors: Tanaphol Suebchua, Bundit Manaskasemsak, Arnon Rungsawang

DOI: 10.1007/978-981-4585-18-7_72

Abstract: National web archives have been successfully made available through domain- and language-specific crawlers for years. We here propose another focused crawler that collects foreign-language pages that are also related to a nation. Rather than finding the most relevant pages, an ensemble machine learning model is trained with selective features to find clusters of unvisited pages called website segments. During consecutive crawling cycles, the model is retrained with features extracted from newly found website segments. Preliminary experiments in the real web space on Thai-tourism topics show that this approach can take advantage of recent crawling experience and produce more promising harvest rates than traditional breadth-first and best-first baselines.
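
The abstract compares crawlers by harvest rate but does not state the formula. A minimal sketch, assuming the common definition (relevant pages downloaded divided by total pages downloaded); the function names and the boolean relevance flags are hypothetical and not taken from the paper:

```python
from typing import List, Sequence


def harvest_rate(relevance_flags: Sequence[bool]) -> float:
    """Fraction of downloaded pages judged relevant (assumed definition)."""
    if not relevance_flags:
        return 0.0  # nothing crawled yet, so no pages harvested
    return sum(bool(f) for f in relevance_flags) / len(relevance_flags)


def cumulative_harvest_rates(relevance_flags: Sequence[bool]) -> List[float]:
    """Harvest rate after each downloaded page, in crawl order."""
    rates = []
    relevant = 0
    for n, flag in enumerate(relevance_flags, start=1):
        relevant += bool(flag)
        rates.append(relevant / n)
    return rates


if __name__ == "__main__":
    # Hypothetical crawl log: True means the page was judged relevant.
    crawl_log = [True, False, True, True]
    print(harvest_rate(crawl_log))
    print(cumulative_harvest_rates(crawl_log))
```

Plotting the cumulative series for each crawler against the number of pages downloaded is one way a segment-based crawler could be compared with breadth-first and best-first baselines.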
