Authors: Raphaël Léonard, Denis Baurain, Frédéric Kerff, Eric Sauvage, Damien Sirjacobs
DOI:
Keywords:
Abstract: The fast-growing number of available prokaryotic genomes, along with their uneven taxonomic distribution, is a problem when trying to assemble broadly sampled genome sets for phylogenomics and comparative genomics. Indeed, most of the new genomes belong to the same subset of hyper-sampled phyla, such as Proteobacteria and Firmicutes, or even to single species, such as Escherichia coli (almost 2000 genomes as of September 2015), while the continuous flow of newly discovered phyla calls for regular updates. This situation makes it difficult to maintain sets of representative genomes that combine lesser-known phyla, for which only a few species are available, with sound subsets of highly abundant phyla. A straightforward automated method is required, but none is publicly available. The LZ distance, in conjunction with the quality of the annotations, can be used to build an automated approach for selecting a non-redundant subset of representative genomes. We are planning to release this tool on a publicly available website.
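To make the idea in the abstract concrete, here is a minimal Python sketch of how an LZ-based distance and an annotation-quality score could be combined to pick non-redundant representatives. It is an illustration under stated assumptions, not the authors' tool: the phrase count follows a simple LZ76-style parsing, the distance is one of the normalized LZ distances found in the literature, and the quality score, the threshold, the greedy strategy and all function names are hypothetical.

```python
def lz_complexity(s: str) -> int:
    """Number of phrases in a simple LZ76-style parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the phrase while it already occurs in the text
        # that precedes its last character (overlap is allowed).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_distance(s: str, q: str) -> float:
    """Normalized LZ distance (assumed variant):
    max(c(sq) - c(s), c(qs) - c(q)) / max(c(s), c(q))."""
    cs, cq = lz_complexity(s), lz_complexity(q)
    if max(cs, cq) == 0:
        return 0.0  # both sequences are empty
    return max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq) / max(cs, cq)


def select_representatives(genomes, threshold):
    """Greedy selection (hypothetical strategy).

    genomes: list of (name, sequence, quality) tuples, where quality is an
    assumed annotation-quality score (higher is better).
    A genome is kept only if its distance to every genome already kept
    is at least `threshold`.
    """
    kept = []
    for name, seq, _quality in sorted(genomes, key=lambda g: g[2], reverse=True):
        if all(lz_distance(seq, kept_seq) >= threshold for _, kept_seq in kept):
            kept.append((name, seq))
    return [name for name, _ in kept]


if __name__ == "__main__":
    toy = [("g1", "abab", 0.9), ("g2", "abab", 0.5), ("g3", "cdcd", 0.7)]
    print(select_representatives(toy, threshold=0.5))
```

Visiting genomes in decreasing order of quality means that, within a cluster of near-identical sequences, the best-annotated one is retained. This naive parsing is quadratic or worse in sequence length, so real genomes would need a faster complexity estimate.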