Authors: Tanaphol Suebchua, Bundit Manaskasemsak, Arnon Rungsawang
DOI: 10.1007/978-981-4585-18-7_72
Keywords:
Abstract: National web archives have been successfully made available through domain- and language-specific crawlers for years. We here propose another focused crawler that collects foreign-language pages that are also related to a nation. Rather than finding the most relevant pages, an ensemble machine learning classifier is trained with selective features to find clusters of unvisited pages called website segments. During consecutive crawling cycles, the classifier is retrained with features extracted from newly found segments. Preliminary experiments in real web space on Thai-tourism topics show that this approach can take advantage of recent experience to produce more promising harvest rates than traditional breadth-first and best-first baselines.