Authors: Pei Wang, Yongjun He, Ryan Shea, Jiannan Wang, Eugene Wu
Keywords: Computer science, Task (project management), Web API, Information retrieval, Component (UML), Resource (project management), Deep Web, Crawling, Systems architecture, Purchasing
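Abstract: Data scientists often spend more than 80% of their time on data preparation. Data enrichment, the act of extending a local database with new attributes from external data sources, is among the most time-consuming tasks. Existing data enrichment works are resource intensive: data-intensive by relying on web tables or knowledge bases, monetarily-intensive by purchasing entire datasets, or time-intensive by fully crawling a web-based data source. In this work, we explore a more targeted alternative that uses resources (in terms of web API calls) proportional to the size of the local database of interest. We build Deeper, a data enrichment system powered by the deep web. The goal of Deeper is to help data scientists link a local database to a hidden database so that they can easily enrich the local database with attributes from the hidden database. We find that a challenging problem is how to crawl a hidden database. This is different from a typical deep web crawling problem, whose goal is to crawl the entire hidden database rather than only the content relating to the data enrichment task. We demonstrate the limitations of straightforward solutions and propose an effective new crawling strategy. We also present the Deeper system architecture and discuss how to implement each component. During the demo, we will use Deeper to enrich a publication database, and aim to show that (1) Deeper is an end-to-end data enrichment solution, and (2) the proposed crawling strategy is superior to the straightforward ones.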