Authors: Mourad Ouzzani, Saravanan Thirumuruganathan, Nan Tang
DOI:
Keywords: Data science, Variety (cybernetics), Data curation, Artificial intelligence, Deep learning, Data management, Computer science, Process (engineering)
Abstract: Past. Data curation - the process of discovering, integrating, and cleaning data - is one of the oldest data management problems. Unfortunately, it is still the most time-consuming and least enjoyable work of data scientists. So far, successful data curation stories are mainly ad-hoc solutions that are either domain-specific (for example, ETL rules) or task-specific (for example, entity resolution). Present. The power of current data curation solutions is not keeping up with the ever-changing data ecosystem in terms of volume, velocity, variety, and veracity, mainly due to the high human cost, instead of machine cost, needed for providing the ad-hoc solutions mentioned above. Meanwhile, deep learning is making strides in achieving remarkable successes in areas such as image recognition, natural language processing, and speech recognition. This is largely due to its ability to understand features that are neither domain-specific nor task-specific. Future. Data curation solutions need to keep pace with the fast-changing data ecosystem, where the main hope is to devise domain-agnostic and task-agnostic solutions. To this end, we start a new research project, called AutoDC, to unleash the potential of deep learning towards self-driving data curation. We will discuss how different deep learning concepts can be adapted and extended to solve various data curation problems. We showcase some low-hanging fruits about the early encounters between deep learning and data curation happening in AutoDC. We believe that the directions pointed out by this work will not only drive AutoDC towards democratizing data curation, but also serve as a cornerstone for researchers and practitioners to move to a new realm of data curation solutions.