Authors: G. Lunter, A. Rocco, N. Mimouni, A. Heger, A. Caldeira
DOI: 10.1101/GR.6725608
Keywords:
Abstract: Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human–mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than the other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvement over the classic Needleman–Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software to assist researchers in doing this is provided at http://genserv.anat.ox.ac.uk/grape/.