Authors: Reham Afifi Abd El Aziz, Doaa Elzanfaly, Marwa Salah Farhan
DOI:
Keywords:
Abstract: Data integration is a major challenge in big data analytics, because inaccurate integration can lead to incorrect analysis results. Entity resolution, which identifies similar entities across different data sources, is a crucial step in the integration process. Blocking techniques group similar entities before the matching step, but existing techniques often neglect semantic criteria, which reduces blocking quality. To address this, this paper proposes a new blocking architecture that incorporates a semantic similarity layer built on natural language processing and deep learning techniques. The architecture is schema-agnostic and treats datasets as unstructured records to improve accuracy. Experimental results on a benchmark dataset demonstrate the effectiveness of the proposed architecture in terms of recall, reduction ratio, and F-measure.
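
The three evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the definitions commonly used in the blocking literature (recall as the share of true matches that survive blocking, reduction ratio as the share of pairwise comparisons avoided, F-measure as the harmonic mean of precision and recall over candidate pairs) and a single record collection rather than two separate sources. The function names `candidate_pairs` and `blocking_metrics` and the toy data are hypothetical.

```python
from itertools import combinations


def candidate_pairs(blocks):
    """Return the set of unordered record pairs that share at least one block."""
    pairs = set()
    for records in blocks.values():
        # sorted() gives each pair a canonical (smaller, larger) order
        for a, b in combinations(sorted(set(records)), 2):
            pairs.add((a, b))
    return pairs


def blocking_metrics(blocks, true_matches, n_records):
    """Compute recall, reduction ratio, and F-measure for a blocking result.

    Assumed definitions (not taken from the paper):
      recall          = true matches kept as candidates / all true matches
      precision       = true matches kept as candidates / all candidates
      reduction ratio = 1 - candidates / all possible pairs
      F-measure       = harmonic mean of precision and recall
    """
    candidates = candidate_pairs(blocks)
    truth = {tuple(sorted(pair)) for pair in true_matches}
    total_pairs = n_records * (n_records - 1) // 2
    found = len(candidates & truth)

    recall = found / len(truth) if truth else 0.0
    precision = found / len(candidates) if candidates else 0.0
    reduction_ratio = 1 - len(candidates) / total_pairs if total_pairs else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "reduction_ratio": reduction_ratio,
        "f_measure": f_measure,
    }


# Toy example: six records, three blocks, three known duplicate pairs.
toy_blocks = {"a": [1, 2, 3], "b": [4, 5], "c": [3, 6]}
toy_truth = [(1, 2), (4, 5), (2, 6)]
print(blocking_metrics(toy_blocks, toy_truth, n_records=6))
```

In this toy case blocking keeps 5 of the 15 possible pairs and retains two of the three true matches; the pair (2, 6) is lost because the two records never share a block, which is the kind of miss a semantic similarity layer is meant to reduce.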