Identifying and mitigating batch effects in whole genome sequencing data.

Authors: Jennifer A. Tom, Jens Reeder, William F. Forrest, Robert R. Graham, Julie Hunkapiller

DOI: 10.1186/S12859-017-1756-Z

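Abstract: Large sample sets of whole genome sequencing with deep coverage are being generated; however, assembling datasets from different sources inevitably introduces batch effects. These batch effects are not well understood and can be due to changes in the sequencing protocol or in the bioinformatics tools used to process the data. No systematic algorithms or heuristics exist to detect and filter batch effects, or to remove associations impacted by batch effects, in whole genome sequencing data. We describe key quality metrics, provide a freely available software package to compute them, and demonstrate that identification of batch effects is aided by principal components analysis of these metrics. To mitigate batch effects, we developed new site-specific filters that identified and removed variants that were falsely associated with the phenotype due to batch effect. These include filtering based on: a haplotype-based genotype correction, a differential genotype quality test, and removing sites with a missing genotype rate greater than 30% after setting genotypes with quality scores less than 20 to missing. This method removed 96.1% of unconfirmed genome-wide significant SNP associations and 97.6% of unconfirmed genome-wide significant indel associations. We performed analyses to demonstrate that: 1) these filters impacted variants known to be disease associated, as 2 out of 16 confirmed associations in an AMD candidate SNP analysis were filtered, representing a reduction in power of 12.5%; 2) in the absence of batch effects, these filters removed only a small proportion of variants across the genome (type I error rate of 3%); and 3) in an independent dataset, the method removed 90.2% of unconfirmed genome-wide significant SNP associations and 89.8% of unconfirmed genome-wide significant indel associations. Researchers currently do not have effective tools to identify and mitigate batch effects in whole genome sequencing data. We developed and validated methods and filters to address this deficiency.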
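The missing-rate filter named in the abstract (genotype quality below 20 set to missing, sites with more than 30% missing genotypes removed) can be illustrated with a minimal sketch. This is not the authors' software package: the function names, the per-site list of genotype quality (GQ) values, and the use of `None` for an uncalled genotype are all assumptions made here for illustration. The two thresholds are the only values taken from the abstract.

```python
from typing import Optional, Sequence

# Thresholds taken from the abstract.
GQ_THRESHOLD = 20        # genotypes with quality below this are set to missing
MAX_MISSING_RATE = 0.30  # sites with a missing rate above this are removed


def missing_rate(gqs: Sequence[Optional[int]],
                 gq_threshold: int = GQ_THRESHOLD) -> float:
    """Fraction of samples whose genotype is missing at one site.

    Hypothetical input layout: one GQ value per sample, with None
    standing for a genotype that was never called. A genotype counts
    as missing if it is None or its GQ is below gq_threshold.
    """
    if not gqs:
        raise ValueError("site has no samples")
    missing = sum(1 for gq in gqs if gq is None or gq < gq_threshold)
    return missing / len(gqs)


def keep_site(gqs: Sequence[Optional[int]],
              gq_threshold: int = GQ_THRESHOLD,
              max_missing_rate: float = MAX_MISSING_RATE) -> bool:
    """Return True if the site survives the missing-rate filter.

    A site is removed only when its missing rate is strictly greater
    than max_missing_rate, matching "greater than 30%".
    """
    return missing_rate(gqs, gq_threshold) <= max_missing_rate


if __name__ == "__main__":
    site_a = [30, 25, 20, 50, 40, 35, 99, 10, None, 5]   # 3 of 10 missing
    site_b = [10, 10, 10, 10, 50, 50, 50, 50, 50, 50]    # 4 of 10 missing
    print(keep_site(site_a), keep_site(site_b))
```

In this sketch a GQ of exactly 20 is kept, because the abstract sets only scores less than 20 to missing, and a site at exactly 30% missing is retained, because only rates greater than 30% are removed.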
