Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases

Authors: Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, Justin Zobel

DOI: 10.1145/3131611

Abstract: The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale for a high volume of data, heuristic approaches have been recruited, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on the practical application of the sequence clustering methods. The results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This evaluation leads to practical recommendations for users for more effective uses of the sequence clustering tools for deduplication.
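
As a rough illustration of the threshold-based heuristic mentioned in the abstract, the sketch below clusters sequences greedily: each sequence is compared only against existing cluster representatives instead of against every other sequence. It is a minimal sketch under stated assumptions and not the implementation of CD-HIT or UCLUST. The function names `identity` and `greedy_cluster` are hypothetical, and `difflib.SequenceMatcher` is swapped in as a crude stand-in for alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; an assumed stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering (illustrative only).

    Sequences are processed longest first. Each one joins the first cluster
    whose representative meets the similarity threshold; otherwise it starts
    a new cluster and becomes that cluster's representative.
    """
    clusters = []  # list of (representative, members) pairs
    for s in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, s) >= threshold:
                members.append(s)
                break
        else:
            # No representative was similar enough: open a new cluster.
            clusters.append((s, [s]))
    return clusters


if __name__ == "__main__":
    toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGGCCCCC"]  # made-up toy sequences
    for rep, members in greedy_cluster(toy, threshold=0.8):
        print(rep, members)
```

Because members are compared only with their representative, two members of the same cluster can be less similar to each other than the threshold suggests, which is one way redundancy can remain after deduplication when thresholds are low and clusters are large.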
