OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches.

Authors: Christophe Dessimoz, Marc Robinson-Rechavi, Alex Warwick Vesztrocy

DOI: 10.1093/BIOINFORMATICS/BTAB219

Keywords: Phylogenetics; Heuristic; Alpha (programming language); Sequence (medicine); Tree (data structure); Protein subfamily; Subfamily; Protein family; Computational biology; Computer science

Abstract: MOTIVATION: Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged, but they rely on gene trees, whose inference is computationally expensive. RESULTS: Here, we first show that in multiple animal and plant datasets, 18 to 62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily-informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND. AVAILABILITY: OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

References (39)
Daniel A. Dalquen, Adrian M. Altenhoff, Gaston H. Gonnet, Christophe Dessimoz. The Impact of Gene Duplication, Insertion, Deletion, Lateral Gene Transfer and Sequencing Error on Orthology Inference: A Simulation Study. PLoS ONE, vol. 8, e56925 (2013). DOI: 10.1371/JOURNAL.PONE.0056925
J. C. Opazo, F. G. Hoffmann, J. F. Storz. Differential loss of embryonic globin genes during the radiation of placental mammals. Proceedings of the National Academy of Sciences of the United States of America, vol. 105, pp. 12950-12955 (2008). DOI: 10.1073/PNAS.0804392105
Liisa B. Koski, G. Brian Golding. The closest BLAST hit is often not the nearest neighbor. Journal of Molecular Evolution, vol. 52, pp. 540-542 (2001). DOI: 10.1007/S002390010184
Li Li, Christian J. Stoeckert, David S. Roos. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research, vol. 13, pp. 2178-2189 (2003). DOI: 10.1101/GR.1224503
Erik L. L. Sonnhammer, Gabriel Östlund. InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Research, vol. 43, pp. 234-239 (2015). DOI: 10.1093/NAR/GKU1203
Toni Gabaldón, Eugene V. Koonin. Functional and evolutionary implications of gene orthology. Nature Reviews Genetics, vol. 14, pp. 360-366 (2013). DOI: 10.1038/NRG3456
Benjamin Buchfink, Chao Xie, Daniel H. Huson. Fast and sensitive protein alignment using DIAMOND. Nature Methods, vol. 12, pp. 59-60 (2015). DOI: 10.1038/NMETH.3176
S. Altschul, Warren Gish, Webb Miller, E. Myers, D. Lipman. Basic Local Alignment Search Tool. Journal of Molecular Biology, vol. 215, pp. 403-410 (1990). DOI: 10.1016/S0022-2836(05)80360-2