Authors: Aidan R. O’Brien, Neil F. W. Saunders, Yi Guo, Fabian A. Buske, Rodney J. Scott
DOI: 10.1186/S12864-015-2269-7
Keywords:
Abstract: Genomic information is increasingly used in medical practice, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and its associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks. However, many genomic analyses do not fit the Map-Reduce paradigm. We therefore utilise the recently developed Spark engine, along with its associated machine learning library, MLlib, which offers more flexibility in the parallelisation of population-scale bioinformatics tasks. The resulting tool, VariantSpark, provides an interface from MLlib to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results. To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VariantSpark is 80 % faster than the Spark-based genome clustering approach, adam, the comparable implementation using Hadoop/Mahout, as well as Admixture, a commonly used tool for determining individual ancestries. It is over 90 % faster than traditional implementations using R and Python. The benefits of speed, resource consumption and scalability enable VariantSpark to open up the usage of advanced, efficient machine learning algorithms to genomic data.
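The workflow the abstract describes (read genotypes from VCF, turn each individual into a numeric vector, cluster the individuals) can be illustrated at toy scale. The following is a minimal sketch in plain Python, not VariantSpark's implementation, which runs on Spark and MLlib. It rests on several assumptions: k-means is used as the clustering method (the abstract only says the individuals were clustered), each genotype is encoded as its count of non-reference alleles, and the names `VCF_LINES`, `genotype_to_count`, `vcf_to_vectors` and `kmeans` as well as the four-sample data are hypothetical.

```python
import random

# Hypothetical miniature VCF: 4 variants (rows) by 4 individuals (columns S1..S4).
VCF_LINES = [
    "##fileformat=VCFv4.1",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "S1", "S2", "S3", "S4"]),
    "\t".join(["1", "100", ".", "A", "G", ".", "PASS", ".", "GT",
               "0|0", "0|0", "1|1", "1|1"]),
    "\t".join(["1", "200", ".", "C", "T", ".", "PASS", ".", "GT",
               "0|0", "0|1", "1|1", "1|1"]),
    "\t".join(["1", "300", ".", "G", "A", ".", "PASS", ".", "GT",
               "0|0", "0|0", "1|1", "1|1"]),
    "\t".join(["1", "400", ".", "T", "C", ".", "PASS", ".", "GT",
               "0|0", "0|0", "0|1", "1|1"]),
]


def genotype_to_count(gt):
    """Map a GT value such as '0|1' or '1/1' to its number of non-reference alleles."""
    alleles = gt.split(":")[0].replace("|", "/").split("/")
    return float(sum(1 for a in alleles if a not in ("0", ".")))


def vcf_to_vectors(lines):
    """Return (sample names, one genotype vector per sample) from VCF text lines."""
    samples, vectors = [], []
    for line in lines:
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the 9 fixed columns
            vectors = [[] for _ in samples]
            continue
        for vector, gt in zip(vectors, fields[9:]):
            vector.append(genotype_to_count(gt))
    return samples, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per input vector."""
    rng = random.Random(seed)
    centres = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each individual joins its nearest centre.
        labels = [min(range(k), key=lambda c: squared_distance(v, centres[c]))
                  for v in vectors]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centres[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


samples, vectors = vcf_to_vectors(VCF_LINES)
labels = kmeans(vectors, k=2)
print(dict(zip(samples, labels)))
```

In this toy data S1 and S2 carry mostly reference alleles while S3 and S4 carry mostly alternate alleles, so the two pairs end up in different clusters. At the scale the abstract reports (more than 3,000 individuals, 80 million variants each), the per-individual vectors and the distance computations are what a distributed engine such as Spark would parallelise.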