Authors: Ali Moosavi, Salman Hooshmand, Sara Baghbanzadeh, Guy-Vincent Jourdan, Gregor V. Bochmann
DOI: 10.1007/978-3-319-08245-5_12
Keywords: Web crawler, Component (UML), Ajax, JavaScript, Finite-state machine, Rich Internet application, Computer science, Search engine indexing, Crawling, Real-time computing
Abstract: Automatic crawling of Rich Internet Applications (RIAs) is a challenge because client-side code modifies the client dynamically, fetching server-side data asynchronously. Most existing solutions model RIAs as state machines, with DOMs as states and JavaScript event executions as transitions. This approach fails when used with “real-life”, complex RIAs, because the size of the produced model is much too large to be practical. In this paper, we propose a new method to crawl AJAX-based RIAs in an efficient manner by detecting “components”, which are areas of the DOM that are independent from each other, and by crawling each component separately. This leads to a dramatic reduction of the required state space for the model, without loss of content coverage. Our method does not require prior knowledge of the RIA nor a predefined definition of components. Instead, we infer the components by observing the behavior of the RIA during crawling. Our experimental results show that our method can quickly and completely index industrial RIAs that are simply out of reach for traditional methods.