Authors: Qian Zhang, Se-Ran Jun, Michael Leuze, David Ussery, Intawat Nookaew
DOI: 10.1038/SREP40712
Keywords:
Abstract: The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral "tree of life". However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) the average number of common features among genomes, and (3) the Shannon diversity index. This strategy was applied to all 3,905 complete viral genomes in RefSeq. The resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.
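
As a rough illustration of two of the three criteria named in the abstract, the following Python sketch counts overlapping k-mers and computes the Shannon diversity index and the average number of k-mers shared between genome pairs for a few values of k. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`kmer_counts`, `shannon_index`, `mean_common_kmers`) and the toy sequences are hypothetical, only one strand is counted, and the cumulative relative entropy criterion is not reproduced here.

```python
from collections import Counter
from itertools import combinations
from math import log


def kmer_counts(seq, k):
    """Count overlapping k-mers on one strand (assumption: no reverse
    complement and no special handling of ambiguous bases)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over observed k-mer frequencies."""
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())


def mean_common_kmers(profiles):
    """Average number of distinct k-mers shared by each pair of genomes.
    Requires at least two profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(len(a.keys() & b.keys()) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy sequences for illustration only, not real viral genomes.
    genomes = ["ACGTACGTAC", "ACGTTTGTAC", "GGGTACGTAA"]
    for k in (2, 3, 4):
        profiles = [kmer_counts(g, k) for g in genomes]
        diversity = [round(shannon_index(p), 3) for p in profiles]
        print(k, diversity, round(mean_common_kmers(profiles), 2))
```

Scanning such statistics over a range of k is one plausible way to see the trade-off the abstract alludes to: a small k leaves most k-mers shared by every genome, while a large k leaves few shared at all.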