Authors: Charles D. Warden, Aaron W. Adamson, Susan L. Neuhausen, Xiwei Wu
DOI: 10.7717/PEERJ.600
Keywords:
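Abstract: The Genome Analysis Toolkit (GATK) is commonly used for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but an equally comprehensive comparison for VarScan has not yet been performed. More specifically, we compare (1) the effects of different pre-processing steps prior to variant calling with both GATK and VarScan, (2) VarScan variants called with increasingly conservative parameters, and (3) filtered and unfiltered GATK variant calls (for both the UnifiedGenotyper and the HaplotypeCaller). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. In most cases, pre-processing steps (e.g., indel realignment and quality score base recalibration using GATK) had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. Based upon concordance statistics presented in this study, we recommend that GATK users focus on "high-quality" GATK variants by filtering out variants flagged as low-quality. We also found that running VarScan with a conservative set of parameters (referred to as "VarScan-Cons") resulted in a reproducible list of variants, with high concordance (>97%) to high-quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84-88% of the high-quality GATK SNPs in the exome datasets. This study also provides limited evidence that VarScan-Cons has a decreased false positive rate among novel variants (relative to high-quality GATK SNPs) and that the GATK HaplotypeCaller has an increased false positive rate for indels (relative to VarScan-Cons and high-quality GATK UnifiedGenotyper indels). More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, are two additional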