Authors: Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, Justin Zobel
DOI: 10.1145/3131611
Keywords:
Abstract: The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale studies. However, the underlying quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: curation, where detected duplicates are removed to improve curation efficiency, and search, where duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to deduplication. Since an exhaustive all-by-all pairwise comparison cannot scale to a high volume of data, heuristic approaches have been recruited, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the best-known clustering tools for deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, the application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on the practical application of the methods. The results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This leads to recommendations for users on more effective uses of the clustering tools.
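
The threshold-based heuristic mentioned in the abstract can be illustrated with a minimal sketch of greedy incremental clustering. This is not the CD-HIT or UCLUST implementation: the function names `identity` and `greedy_cluster` are hypothetical, and Python's `difflib.SequenceMatcher` ratio is assumed here as a rough stand-in for alignment-based sequence identity. Each sequence is compared only with existing cluster representatives, which avoids the all-by-all pairwise comparison.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; an assumed stand-in for alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.9):
    """Assign each sequence to the first representative it matches.

    Sequences are processed longest first, so every representative is
    at least as long as the members of its cluster. A sequence that
    matches no representative starts a new cluster.
    """
    ordered = sorted(sequences, key=len, reverse=True)
    clusters = []  # list of (representative, members) pairs
    for seq in ordered:
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: seq becomes one.
            clusters.append((seq, [seq]))
    return clusters


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]  # illustrative data only
    for rep, members in greedy_cluster(toy, threshold=0.8):
        print(rep, members)
```

In this sketch, lowering `threshold` merges more sequences into each cluster while only their similarity to the representative is checked, so members of a large cluster may be far less similar to one another than the threshold suggests. That is one plausible reading of why accuracy degrades at low thresholds and large cluster sizes.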