Alignment by the numbers: sequence assembly using reduced dimensionality numerical representations

Authors: Avraam Tapinos, Bede Constantinides, Douglas B Kell, David L Robertson

DOI: 10.1101/011940

Keywords: Nucleic acid; Sequence variation; Curse of dimensionality; DNA sequencing; Algorithm; Sequence assembly; Sequence alignment; Sequence analysis; Population; Nucleotide; Genome; Metagenomics; Alignment-free sequence analysis; Computer science; Data mining

Abstract: DNA sequencing instruments are enabling genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and interpret sequence data. Established methods for computational sequence analysis generally consider nucleotide-level resolution of sequences, and while these approaches can be sufficiently accurate, increasingly ambitious and data-intensive analyses are rendering them impractical for demanding applications such as genome and metagenome assembly. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing and time series analysis. By representing nucleic acid composition numerically it is possible to apply dimensionality reduction methods from these fields to sequences of nucleotides, enabling their approximate representation. To explore the applicability of signal decomposition methods in sequence assembly, we implemented a short read aligner and evaluated its performance against simulated high diversity viral sequences alongside four existing aligners. Using our prototype implementation, approximate sequence representations reduced overall alignment time by up to 14-fold compared to that of uncompressed sequences, without any reduction in alignment accuracy. Despite using heavily approximated sequence representations, our implementation yielded alignments of similar overall accuracy to existing aligners, outperforming all other tools tested at high levels of sequence variation. Our approach was also applied to the de novo assembly of a simulated diverse viral population. We have demonstrated that full sequence resolution is not a prerequisite of accurate sequence alignment, and that analytical performance may be retained or even enhanced through appropriate dimensionality reduction of sequences.
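
The idea of representing nucleotides numerically and then reducing dimensionality can be illustrated with a small sketch. The abstract does not specify a numeric mapping or a particular reduction method, so everything here is an assumption made for illustration: the values in `NUCLEOTIDE_VALUES` are hypothetical, the cumulative-sum step is one possible way of turning a read into a series, and piecewise aggregate approximation (PAA) is a standard time series technique swapped in as an example, not necessarily the authors' method.

```python
from math import sqrt

# Hypothetical nucleotide-to-number mapping, chosen only for illustration.
NUCLEOTIDE_VALUES = {"A": 1.0, "C": -1.0, "G": 2.0, "T": -2.0}


def to_numeric(seq):
    """Turn a nucleotide string into a numeric series via a cumulative sum."""
    total = 0.0
    series = []
    for base in seq.upper():
        total += NUCLEOTIDE_VALUES[base]
        series.append(total)
    return series


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-width frame."""
    n = len(series)
    if segments <= 0 or n % segments != 0:
        raise ValueError("series length must be a positive multiple of segments")
    width = n // segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paa_distance(a_reduced, b_reduced, original_length):
    """Distance in the reduced space, scaled so that it never exceeds the
    Euclidean distance between the original full-length series."""
    scale = sqrt(original_length / len(a_reduced))
    return scale * euclidean(a_reduced, b_reduced)


if __name__ == "__main__":
    read = to_numeric("ACGTACGTACGTACGT")
    variant = to_numeric("ACGTACGAACGTACGT")  # one substitution
    reduced_read = paa(read, 4)
    reduced_variant = paa(variant, 4)
    print("full-resolution distance:", euclidean(read, variant))
    print("reduced-space distance:  ", paa_distance(reduced_read, reduced_variant, len(read)))
```

Here each 16-base read is compared using 4 values instead of 16. Because the scaled PAA distance is a lower bound on the full-resolution distance, candidates that are already far apart in the reduced space can be discarded without the full comparison, which is the kind of saving the abstract attributes to approximate representations.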
