Authors: Avraam Tapinos, Bede Constantinides, Douglas B Kell, David L Robertson
DOI: 10.1101/011940
Keywords: Nucleic acid, Sequence variation, Curse of dimensionality, DNA sequencing, Algorithm, Sequence assembly, Sequence alignment, Sequence analysis, Population, Nucleotide, Genome, Metagenomics, Alignment-free sequence analysis, Computer science, Data mining
Abstract: DNA sequencing instruments are enabling genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and interpret sequence data. Established methods for computational sequence analysis generally consider the nucleotide-level resolution of sequences, and while these approaches are sufficiently accurate, increasingly ambitious and data-intensive analyses are rendering them impractical for demanding applications such as genome and metagenome assembly. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing and time series analysis. By representing nucleic acid composition numerically, it is possible to apply dimensionality reduction methods from these fields to sequences of nucleotides, enabling their approximate representation. To explore the applicability of signal decomposition methods in sequence assembly, we implemented a short read aligner and evaluated its performance against simulated high diversity viral sequences alongside four existing aligners. Using our prototype implementation, approximate sequence representations reduced overall alignment time by up to 14-fold compared to that of uncompressed sequences, without any reduction in alignment accuracy. Despite using heavily approximated sequence representations, our implementation yielded alignments of similar overall accuracy to existing aligners, outperforming all other tools tested at high levels of sequence variation. Our approach was also applied to the de novo assembly of a simulated diverse viral population. We have demonstrated that full sequence resolution is not a prerequisite of accurate sequence alignment, and that analytical performance may be retained or even enhanced through appropriate dimensionality reduction of sequences.