Uncertainty in homology inferences: Assessing and improving genomic sequence alignment

Authors: G. Lunter, A. Rocco, N. Mimouni, A. Heger, A. Caldeira

DOI: 10.1101/GR.6725608

Abstract: Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human–mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than the other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman–Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.
