On Optimal Read Trimming in Next Generation Sequencing and Its Complexity

Authors: Ivo Hedtke, Ioana Lemnian, Matthias Müller-Hannemann, Ivo Grosse

DOI: 10.1007/978-3-319-07953-0_7

Keywords: Heuristics, Block (data storage), Trimming, Constrained optimization problem, Algorithm, Almost surely, Computer science, Quality (business)

Abstract: Read trimming is a fundamental first step in the analysis of next generation sequencing (NGS) data. Traditionally, read trimming has been performed heuristically, and algorithmic work in this area has been neglected. Here, we address this topic and formulate three constrained optimization problems for block-based trimming, i.e., truncating the same low-quality positions at both ends of all reads and removing low-quality truncated reads. We find that these problems are \(\mathcal{NP}\)-hard. However, the non-random distribution of quality scores in NGS data sets makes it tempting to speculate that some of the constraints are typically satisfied by fulfilling the others. Based on this speculation, we propose relaxed problems and develop efficient polynomial-time algorithms for them. We find that (i) the omitted constraints are indeed almost always satisfied and (ii) the relaxed problems yield a higher number of untrimmed bases than traditional heuristics.
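To make the notion of block-based trimming concrete, the following is a minimal Python sketch and not the authors' formulation or algorithm. It assumes equal-length reads given as lists of Phred-like quality scores, uses a hypothetical per-read mean-quality threshold `min_mean` as the read filter, and searches all (left, right) cut pairs by brute force to maximize the number of retained bases. The function names `block_trim`, `retained_bases`, and `best_block` are illustrative only.

```python
from typing import List, Tuple


def block_trim(quals: List[List[int]], left: int, right: int,
               min_mean: float) -> List[Tuple[int, List[int]]]:
    """Cut `left` positions from the start and `right` positions from the
    end of every read, then drop reads whose mean quality in the remaining
    window is below `min_mean` (an assumed filter criterion)."""
    kept = []
    for idx, q in enumerate(quals):
        window = q[left:len(q) - right]
        if window and sum(window) / len(window) >= min_mean:
            kept.append((idx, window))
    return kept


def retained_bases(kept: List[Tuple[int, List[int]]]) -> int:
    """Total number of bases that survive trimming and filtering."""
    return sum(len(window) for _, window in kept)


def best_block(quals: List[List[int]], min_mean: float) -> Tuple[int, int, int]:
    """Brute-force search over all (left, right) cuts.

    Returns (retained bases, left, right) for the first cut pair that
    maximizes the retained bases. Assumes all reads have equal length.
    """
    n = len(quals[0])
    best = (0, 0, 0)
    for left in range(n):
        for right in range(n - left):
            bases = retained_bases(block_trim(quals, left, right, min_mean))
            if bases > best[0]:
                best = (bases, left, right)
    return best
```

The exhaustive search evaluates a quadratic number of cut pairs, each in time linear in the data size, which is adequate for a toy example but is only meant to show what is being optimized. The constrained problems and the polynomial-time algorithms for the relaxed problems are defined in the paper itself.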
