Authors: Christophe Dessimoz, Marc Robinson-Rechavi, Alex Warwick Vesztrocy
DOI: 10.1093/BIOINFORMATICS/BTAB219
Keywords: Phylogenetics, Heuristic, Alpha (programming language), Sequence (medicine), Tree (data structure), Protein subfamily, Subfamily, Protein family, Computational biology, Computer science
Abstract: MOTIVATION: Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the alpha/beta duplication could be closest to an alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged, but they rely on gene trees, whose inference is computationally expensive. RESULTS: Here, we first show that in multiple animal and plant datasets, 18 to 62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily-informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND. AVAILABILITY: OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
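The idea of mapping k-mers to ancestral subfamilies can be illustrated with a small toy sketch. This is not OMAmer's implementation or API. The hierarchy, the sequences, the k-mer length and both thresholds are invented for illustration, and a single family is assumed. Each k-mer is indexed at the lowest common ancestor of the subfamilies in which it occurs, and a query descends into a subfamily only when enough of its k-mers support that step, which is one simple way to avoid over-specific assignments.

```python
from collections import Counter

K = 3  # toy k-mer length, chosen only for this example

# Hypothetical hierarchy: child -> parent (None marks the family root).
PARENT = {"globin": None, "alpha": "globin", "beta": "globin"}
ROOT = next(node for node, parent in PARENT.items() if parent is None)

# Invented reference sequences; they share the suffix "VKAAWGKV".
REFERENCES = {
    "alpha": ["MVLSPADKTNVKAAWGKV"],
    "beta": ["MVHLTPEEKSAVKAAWGKV"],
}

def kmers(seq, k=K):
    """Return the set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def ancestors(node):
    """Path from a node up to the family root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def lca(a, b):
    """Lowest common ancestor of two nodes of the toy hierarchy."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes belong to different families")

def children(node):
    return [c for c, p in PARENT.items() if p == node]

def build_index(references):
    """Map each k-mer to the LCA of all subfamilies containing it."""
    index = {}
    for subfamily, seqs in references.items():
        for seq in seqs:
            for kmer in kmers(seq):
                index[kmer] = lca(index[kmer], subfamily) if kmer in index else subfamily
    return index

def subtree_hits(node, hits):
    """Query k-mers mapped to a node or to any of its descendants."""
    return hits[node] + sum(subtree_hits(c, hits) for c in children(node))

def assign(query, index, min_family=0.2, min_subfamily=0.3):
    """Return the most specific supported node, or None if no family fits."""
    query_kmers = kmers(query)
    if not query_kmers:
        return None
    hits = Counter(index[kmer] for kmer in query_kmers if kmer in index)
    # Family-level check: reject queries with too few matching k-mers.
    if sum(hits.values()) / len(query_kmers) < min_family:
        return None
    node = ROOT
    while children(node):
        best = max(children(node), key=lambda c: subtree_hits(c, hits))
        # Stop at the ancestral node when support for the child is weak.
        if subtree_hits(best, hits) / len(query_kmers) < min_subfamily:
            break
        node = best
    return node

INDEX = build_index(REFERENCES)
```

With these toy inputs, `assign("MVLSPADKTNVKAAWGKV", INDEX)` descends into the alpha subfamily, a query such as `"MGQWTREDKVKAAWGKV"` that matches only the shared k-mers stays at the family root, and an unrelated query such as `"QQQQQQQQQQ"` is rejected.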