Authors: Idoia Ochoa, Mikel Hernaez, Tsachy Weissman
DOI: 10.1142/S0219720014420025
Keywords:
Abstract: With the release of the latest Next-Generation Sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing the whole genome of a human is expected to drop to a mere $1000. This milestone in sequencing history marks the era of affordable sequencing of individuals and opens the doors to personalized medicine. In accord, unprecedented volumes of genomic data will require storage for processing. There will be a dire need not only for compressing aligned data, but also for generating compressed files that can be fed directly to downstream applications to facilitate the analysis of and inference on the data. Several approaches to this challenge have been proposed in the literature; however, the focus thus far has been on the low coverage regime, and most of the suggested compressors are not based on effective modeling of the data. We demonstrate the benefit of data modeling for compressing aligned reads. Specifically, we show that, by working with data models designed for the aligned data, we can improve considerably over the best compression ratio achieved by previously proposed algorithms. Our results indicate that the pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) [Bonfield JK and Mahoney MV, Compression of FASTQ and SAM format sequencing data, PLOS ONE, 8(3):e59190, 2013.] does not apply for high coverage aligned data. Furthermore, our improved compression ratio is achieved by splitting the data in a manner conducive to operations in the compressed domain by downstream applications.