Aligned genomic data compression via improved modeling.

Authors: Idoia Ochoa, Mikel Hernaez, Tsachy Weissman

DOI: 10.1142/S0219720014420025

Abstract: With the release of the latest Next-Generation Sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing the whole genome of a human is expected to drop to a mere $1000. This milestone in sequencing history marks the era of affordable sequencing of individuals and opens the doors to personalized medicine. In accord, unprecedented volumes of genomic data will require storage for processing. There will be dire need not only of compressing aligned data, but also of generating compressed files that can be fed directly to downstream applications to facilitate the analysis of and inference on the data. Several approaches to this challenge have been proposed in the literature; however, focus thus far has been on the low coverage regime, and most of the suggested compressors are not based on effective modeling of the data. We demonstrate the benefit of data modeling for compressing aligned reads. Specifically, we show that, by working with data models designed for the aligned data, we can improve considerably over the best compression ratio achieved by previously proposed algorithms. Our results indicate that the Pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) [Bonfield JK and Mahoney MV, Compression of FASTQ and SAM format sequencing data, PLOS ONE, 8(3):e59190, 2013.] does not apply for high coverage aligned data. Furthermore, our improved compression ratio is achieved by splitting the data in a manner conducive to operations in the compressed domain by downstream applications.

References (12)
Khalid Sayood, Introduction to data compression (1996)
E. Pennisi, Will Computers Crash Genomics?, Science, vol. 331, pp. 666-668 (2011), DOI: 10.1126/SCIENCE.331.6018.666
Idoia Ochoa, Himanshu Asnani, Dinesh Bharadia, Mainak Chowdhury, Tsachy Weissman, Golan Yona, QualComp: a new lossy compressor for quality scores based on rate distortion theory, BMC Bioinformatics, vol. 14, p. 187 (2013), DOI: 10.1186/1471-2105-14-187
Fabien Campagne, Kevin C. Dorff, Nyasha Chambwe, James T. Robinson, Jill P. Mesirov, Compression of Structured High-Throughput Sequencing Data, PLoS ONE, vol. 8, e79871 (2013), DOI: 10.1371/JOURNAL.PONE.0079871
Z. Zhu, Y. Zhang, Z. Ji, S. He, X. Yang, High-throughput DNA sequence data compression, Briefings in Bioinformatics, vol. 16, pp. 1-15 (2015), DOI: 10.1093/BIB/BBT087
James K. Bonfield, Matthew V. Mahoney, Compression of FASTQ and SAM format sequencing data, PLOS ONE, vol. 8, e59190 (2013), DOI: 10.1371/JOURNAL.PONE.0059190
H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, The Sequence Alignment/Map format and SAMtools, Bioinformatics, vol. 25, pp. 2078-2079 (2009), DOI: 10.1093/BIOINFORMATICS/BTP352
Rodrigo Cánovas, Alistair Moffat, Andrew Turpin, Lossy compression of quality scores in genomic data, Bioinformatics, vol. 30, pp. 2130-2136 (2014), DOI: 10.1093/BIOINFORMATICS/BTU183
M. Hsi-Yang Fritz, R. Leinonen, G. Cochrane, E. Birney, Efficient storage of high throughput DNA sequencing data using reference-based compression, Genome Research, vol. 21, pp. 734-740 (2011), DOI: 10.1101/GR.114819.110
Daniel C. Jones, Walter L. Ruzzo, Xinxia Peng, Michael G. Katze, Compression of next-generation sequencing reads aided by highly efficient de novo assembly, Nucleic Acids Research, vol. 40 (2012), DOI: 10.1093/NAR/GKS754