VariantSpark: population scale clustering of genotype information

Authors: Aidan R. O’Brien, Neil F. W. Saunders, Yi Guo, Fabian A. Buske, Rodney J. Scott

DOI: 10.1186/S12864-015-2269-7

Abstract: Genomic information is increasingly used in medical practice, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and its associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks. However, many genomic analyses do not fit the Map-Reduce paradigm. We therefore utilise the recently developed Spark engine, along with its associated machine learning library, MLlib, which offers more flexibility in the parallelisation of population-scale bioinformatics tasks. The resulting tool, VariantSpark, provides an interface from MLlib to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results. To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VariantSpark is 80 % faster than the Spark-based genome clustering approach, ADAM, the comparable implementation using Hadoop/Mahout, as well as Admixture, a commonly used tool for determining individual ancestries. It is over 90 % faster than traditional implementations using R and Python. The benefits of speed, resource consumption and scalability enable VariantSpark to open up the usage of advanced, efficient machine learning algorithms to genomic data.
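To make the idea of clustering individuals from VCF genotypes concrete, the following is a minimal single-machine sketch in plain Python. It is not VariantSpark's code and does not use Spark or MLlib. It assumes that each genotype is encoded as its count of non-reference alleles (missing calls are treated as reference), that the GT subfield comes first in each sample column, and that k-means is the clustering algorithm, which the abstract does not name. The function names and the toy data are hypothetical.

```python
import random


def genotype_to_count(gt):
    """Map a VCF GT value such as '0|1' or '1/1' to its number of non-reference alleles.

    Assumption: missing alleles ('.') are counted as reference.
    """
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for a in alleles if a not in ("0", "."))


def vcf_to_matrix(lines):
    """Return (samples, matrix); matrix[i][j] is the alt-allele count of sample i at variant j."""
    samples, columns = [], []
    for line in lines:
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the nine fixed columns
            continue
        # Assumption: GT is the first colon-separated subfield of each sample column.
        columns.append([genotype_to_count(f.split(":")[0]) for f in fields[9:]])
    # VCF stores one variant per row; transpose so each row is one individual.
    matrix = [list(row) for row in zip(*columns)]
    return samples, matrix


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means with squared Euclidean distance; returns one label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data: four individuals, three variants.
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
          "A", "B", "C", "D"]
fixed = ["1", "100", ".", "A", "G", ".", "PASS", ".", "GT"]
toy_vcf = [
    "##fileformat=VCFv4.2",
    "\t".join(header),
    "\t".join(fixed + ["0|0", "0|0", "1|1", "1|1"]),
    "\t".join(fixed + ["0|0", "0|1", "1|1", "1|0"]),
    "\t".join(fixed + ["0|0", "0|0", "1|1", "1|1"]),
]

samples, matrix = vcf_to_matrix(toy_vcf)
labels = kmeans(matrix, k=2)
print(dict(zip(samples, labels)))
```

In this toy input, individuals A and B carry mostly reference alleles while C and D carry mostly alternate alleles, so a two-cluster run separates the two pairs. A population-scale run would distribute the same two steps, genotype encoding and iterative centroid updates, across a cluster rather than holding the matrix in memory.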
