Authors: Jennifer A. Tom, Jens Reeder, William F. Forrest, Robert R. Graham, Julie Hunkapiller
DOI: 10.1186/S12859-017-1756-Z
Keywords:
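Abstract: Large sample sets of whole genome sequencing with deep coverage are being generated; however, assembling datasets from different sources inevitably introduces batch effects. These batch effects are not well understood and can be due to changes in the sequencing protocol or in the bioinformatics tools used to process the data. No systematic algorithms or heuristics exist to detect and filter batch effects, or to remove associations impacted by batch effects, in whole genome sequencing data. We describe key quality metrics, provide a freely available software package to compute them, and demonstrate that identification of batch effects is aided by principal components analysis of these metrics. To mitigate batch effects, we developed new site-specific filters that identified and removed variants falsely associated with the phenotype due to batch effect. These include filtering based on: a haplotype-based genotype correction, a differential genotype quality test, and removing sites with a missing genotype rate greater than 30% after setting genotypes with quality scores less than 20 to missing. This method removed 96.1% of unconfirmed genome-wide significant SNP associations and 97.6% of unconfirmed genome-wide significant indel associations. We performed analyses to demonstrate that: 1) these filters impacted variants known to be disease associated, as 2 out of 16 confirmed associations in an AMD candidate SNP analysis were filtered, representing a reduction in power of 12.5%; 2) in the absence of batch effects, these filters removed only a small proportion of variants across the genome (type I error rate of 3%); and 3) in an independent dataset, the method removed 90.2% of unconfirmed genome-wide SNP associations and 89.8% of unconfirmed genome-wide indel associations. Researchers currently do not have effective tools to identify and mitigate batch effects in whole genome sequencing data. We developed and validated methods and filters to address this deficiency.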
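
The missingness filter mentioned in the abstract (set genotypes with quality scores below 20 to missing, then remove sites whose missing rate exceeds 30%) can be illustrated with a minimal sketch. The two thresholds come from the abstract; everything else is an assumption made for illustration: the data layout (one list of genotype codes and one list of quality scores per site, with `None` marking a missing value) and the function names are hypothetical and are not taken from the authors' software package.

```python
from typing import List, Optional, Sequence

GQ_THRESHOLD = 20        # from the abstract: quality scores below 20 are set to missing
MAX_MISSING_RATE = 0.30  # from the abstract: sites with missing rate above 30% are removed


def mask_low_quality(
    genotypes: Sequence[Optional[int]],
    quals: Sequence[Optional[float]],
    gq_threshold: float = GQ_THRESHOLD,
) -> List[Optional[int]]:
    """Return the genotypes of one site with low-quality calls replaced by None.

    Hypothetical layout: one entry per sample, None meaning "missing".
    """
    if len(genotypes) != len(quals):
        raise ValueError("genotypes and quals must have one entry per sample")
    return [
        None if g is None or q is None or q < gq_threshold else g
        for g, q in zip(genotypes, quals)
    ]


def missing_rate(genotypes: Sequence[Optional[int]]) -> float:
    """Fraction of samples with a missing genotype at one site."""
    if not genotypes:
        return 1.0  # assumption: a site with no samples counts as fully missing
    return sum(g is None for g in genotypes) / len(genotypes)


def passes_missingness_filter(
    genotypes: Sequence[Optional[int]],
    quals: Sequence[Optional[float]],
    gq_threshold: float = GQ_THRESHOLD,
    max_missing_rate: float = MAX_MISSING_RATE,
) -> bool:
    """Keep a site only if its post-masking missing rate is not above the cutoff."""
    masked = mask_low_quality(genotypes, quals, gq_threshold)
    return missing_rate(masked) <= max_missing_rate
```

In this sketch a site is kept when `passes_missingness_filter` returns `True`. The haplotype-based genotype correction and the differential genotype quality test named in the abstract are not covered here, since the abstract does not specify how they are computed.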