Deeper: A Data Enrichment System Powered by Deep Web

Authors: Pei Wang, Yongjun He, Ryan Shea, Jiannan Wang, Eugene Wu

DOI: 10.1145/3183713.3193569

Keywords: Computer science, Task (project management), Web API, Information retrieval, Component (UML), Resource (project management), Deep Web, Crawling, Systems architecture, Purchasing

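Abstract: Data scientists often spend more than 80% of their time on data preparation. Data enrichment, the act of extending a local database with new attributes from external data sources, is among the most time-consuming tasks. Existing data enrichment works are resource intensive: data-intensive by relying on web tables or knowledge bases, monetarily-intensive by purchasing entire datasets, or time-intensive by fully crawling a web-based data source. In this work, we explore a more targeted alternative that uses resources (in terms of web API calls) proportional to the size of the local database of interest. We build Deeper, a data enrichment system powered by the deep web. The goal of Deeper is to help data scientists link a local database to a hidden database so that they can easily enrich the local database with attributes from the hidden database. We find that a challenging problem is how to crawl a hidden database. This is different from a typical deep web crawling problem, whose goal is to crawl the entire hidden database rather than only the content relating to the data enrichment task. We demonstrate the limitations of straightforward solutions and propose an effective new crawling strategy. We also present the Deeper system architecture and discuss how to implement each component. During the demo, we will use Deeper to enrich a publication database, and aim to show that (1) Deeper is an end-to-end data enrichment solution, and (2) the proposed crawling strategy is superior to the straightforward ones.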
