Authors: Zhemin Zhou, Nina Luhmann, Nabil-Fareed Alikhan, Christopher Quince, Mark Achtman
DOI: 10.1007/978-3-319-89929-9_15
Keywords: Statistical model, Taxonomic rank, False positive paradox, Genome, Computational biology, Identification (information), Computer science, Genetic diversity, Data structure, Metagenomics
Abstract: Exploring the genetic diversity of microbes within an environment through metagenomic sequencing first requires classifying the sequencing reads into taxonomic groups. Current methods compare the data with existing reference databases, which are biased and limited. Several recent evaluation studies demonstrate that current methods either lack sufficient sensitivity for species-level assignments or suffer from false positives, overestimating the number of species in the metagenome. Both shortcomings are especially problematic for the identification of low-abundance microbial species, e.g. detecting pathogens in ancient samples. We present a new method, SPARSE, which improves the taxonomic assignment of metagenomic reads. SPARSE balances reference databases by grouping genomes into similarity-based hierarchical clusters, implemented as an efficient incremental data structure. It assigns reads to these clusters using a probabilistic model that specifically penalizes non-specific mappings of reads from unknown sources and hence reduces false-positive assignments. Our evaluation on simulated datasets from two evaluation studies demonstrated improved precision in comparison with other methods for species-level classification. In a third simulation, our method successfully differentiated multiple co-existing Escherichia coli strains in the same sample. In real archaeological datasets, SPARSE identified ancient pathogens at ≤0.02% abundance, consistent with published findings that required additional sequencing data. Other methods either missed the targeted pathogens or reported non-existent ones.
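The idea of grouping genomes into similarity-based hierarchical clusters with an incremental data structure can be illustrated with a toy sketch. It is not the SPARSE implementation. The k-mer Jaccard similarity, the three thresholds, and the names `IncrementalClusters`, `kmers` and `jaccard` are all assumptions made for illustration. Each new genome is compared only against cluster representatives, level by level, so earlier genomes never need to be re-clustered.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical similarity thresholds, coarse to fine (not taken from the paper).
THRESHOLDS: Tuple[float, ...] = (0.5, 0.8, 0.95)


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (an assumed stand-in metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class IncrementalClusters:
    """Toy nested clustering that grows one genome at a time."""

    def __init__(self, thresholds: Tuple[float, ...] = THRESHOLDS) -> None:
        self.thresholds = thresholds
        # Per level: list of (parent cluster id, representative k-mer set).
        # The position in the list is the cluster id at that level.
        self.levels: List[List[Tuple[int, Set[str]]]] = [[] for _ in thresholds]
        self.labels: Dict[str, Tuple[int, ...]] = {}

    def add(self, name: str, seq: str) -> Tuple[int, ...]:
        """Place a genome in the hierarchy and return its cluster id per level."""
        profile = kmers(seq)
        path: List[int] = []
        parent = -1  # virtual root above the coarsest level
        for level, threshold in enumerate(self.thresholds):
            clusters = self.levels[level]
            # Only search clusters nested inside the cluster chosen one level up.
            match = next(
                (cid for cid, (par, rep) in enumerate(clusters)
                 if par == parent and jaccard(profile, rep) >= threshold),
                None,
            )
            if match is None:
                # No representative is similar enough: this genome founds a new cluster.
                clusters.append((parent, profile))
                match = len(clusters) - 1
            path.append(match)
            parent = match
        self.labels[name] = tuple(path)
        return self.labels[name]
```

In this sketch a genome that differs slightly from an existing representative shares its coarse clusters but founds a new cluster at the finest level, while an unrelated genome founds a new cluster at every level. Collapsing many near-identical genomes into one cluster is one way an over-represented species can be kept from dominating a reference database.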