Thai Related Foreign Language Specific Web Crawling Approach

Authors: Tanaphol Suebchua, Bundit Manaskasemsak, Arnon Rungsawang


Abstract: National web archives have been successfully made available through domain- and language-specific crawlers for years. We here propose another focused crawler that collects foreign-language pages that are also related to a nation. Rather than finding the most relevant pages, an ensemble machine learning model is trained with selective features to find clusters of unvisited pages called website segments. During consecutive crawling cycles, the model is retrained with features extracted from newly found segments. Preliminary experiments in a real web space on Thai-tourism topics show that this approach can take advantage of recent crawling experience and produce more promising harvest rates than traditional breadth-first and best-first baselines.
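To make the evaluation terms in the abstract concrete, the following is a minimal sketch of the harvest rate (the fraction of crawled pages that are relevant) and of the two baseline orderings the abstract names, breadth-first and best-first. It is not the authors' segment-based method. The toy link graph, the set of relevant pages, and the `score` function (a stand-in for a trained classifier) are all hypothetical and exist only for illustration.

```python
import heapq
from collections import deque


def harvest_rate(crawled, relevant):
    """Fraction of crawled pages that are relevant (0.0 if nothing was crawled)."""
    if not crawled:
        return 0.0
    return sum(1 for page in crawled if page in relevant) / len(crawled)


def breadth_first(graph, seed, budget):
    """Crawl in FIFO order until the page budget is spent."""
    seen, queue, crawled = {seed}, deque([seed]), []
    while queue and len(crawled) < budget:
        page = queue.popleft()
        crawled.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled


def best_first(graph, seed, budget, score):
    """Crawl the highest-scoring frontier page first."""
    seen, crawled = {seed}, []
    frontier = [(-score(seed), seed)]  # negate: heapq is a min-heap
    while frontier and len(crawled) < budget:
        _, page = heapq.heappop(frontier)
        crawled.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return crawled


# Hypothetical toy web: only the pages under "c" are relevant.
graph = {
    "seed": ["a", "b", "c"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "c": ["c1", "c2"],
}
relevant = {"c", "c1", "c2"}


def score(url):
    """Assumed relevance score; a real crawler would use a trained model."""
    return 1.0 if url.startswith("c") else 0.0


bfs_rate = harvest_rate(breadth_first(graph, "seed", 4), relevant)
best_rate = harvest_rate(best_first(graph, "seed", 4, score), relevant)
```

With a budget of four pages, the best-first ordering spends its budget inside the relevant cluster while breadth-first spreads it across all outgoing links of the seed, which is the kind of gap a harvest-rate comparison is meant to expose.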

