Scalable Protein Sequence Similarity Search using Locality-Sensitive Hashing and MapReduce

Authors: Federico M. Lauro, Srikumar Venugopal, Freddie Sunarso

Abstract: Metagenomics is the study of environments through genetic sampling of their microbiota. Metagenomic studies produce large datasets that are estimated to grow at a faster rate than the available computational capacity. A key step in the study of metagenome data is sequence similarity searching, which is computationally intensive over large datasets. Tools such as BLAST require dedicated computing infrastructure to perform such analysis and may not be available to every researcher. In this paper, we propose a novel approach called ScalLoPS that performs similarity searching on protein sequences using LSH (Locality-Sensitive Hashing), implemented on the MapReduce distributed framework. ScalLoPS is designed to scale across computing resources sourced from cloud providers. We present the design and implementation of ScalLoPS, followed by an evaluation with datasets derived from both traditional and metagenomic studies. Our experiments show that the method approximates the quality of BLAST results while improving the scalability of protein sequence search.
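
The general idea of LSH-based candidate search expressed as map and reduce steps can be illustrated with a minimal sketch. The abstract does not specify the hash family, the k-mer length, or the bucketing scheme used by ScalLoPS, so everything below is an assumption made for illustration: a simhash-style bit signature over protein k-mers, split into bands that serve as MapReduce keys. The function names (`kmers`, `simhash`, `map_phase`, `reduce_phase`) and all parameter values are hypothetical and are not taken from the paper.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def simhash(seq, k=3, bits=64):
    """Simhash-style signature: each k-mer votes +1/-1 on every bit position."""
    counts = [0] * bits
    for kmer in kmers(seq, k):
        # Assumed hash: first 8 bytes of MD5, giving a 64-bit integer.
        h = int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def map_phase(records, bands=8, bits=64):
    """Map step: emit ((band index, band value), sequence id) pairs."""
    width = bits // bands
    mask = (1 << width) - 1
    for seq_id, seq in records:
        sig = simhash(seq, bits=bits)
        for band in range(bands):
            yield (band, (sig >> (band * width)) & mask), seq_id


def reduce_phase(pairs):
    """Reduce step: sequences sharing any band key become candidate pairs."""
    buckets = defaultdict(set)
    for key, seq_id in pairs:
        buckets[key].add(seq_id)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


if __name__ == "__main__":
    # Toy records: (id, protein sequence); the sequences are made up.
    records = [
        ("q1", "MKTAYIAKQRQISFVKSHFSRQ"),
        ("q2", "MKTAYIAKQRQISFVKSHFSRQ"),
        ("q3", "GGSPWWLLNDECHHRTTVVAAP"),
    ]
    print(reduce_phase(map_phase(records)))
```

In a real deployment the candidate pairs would still need to be verified, for example by comparing signature Hamming distances or by running an alignment on the reduced candidate set. Banding trades recall for speed: fewer, wider bands produce fewer candidates but miss more true matches.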
