Large-scale Inference of Population Structure in Presence of Missingness using PCA.

Authors: Anders Albrechtsen, Siyang Liu, Jonas Meisner, Mingxi Huang

DOI: 10.1093/BIOINFORMATICS/BTAB027

Keywords: Data mining, Inference, Scale (map), Principal component analysis, Population structure, Missing data, Computer science

Abstract: MOTIVATION: Principal component analysis (PCA) is a commonly used tool in genetics to capture and visualize population structure. Due to technological advances in sequencing, such as the widely used non-invasive prenatal test, massive datasets of ultra-low coverage sequencing are being generated. These datasets are characterized by having a large amount of missing genotype information. RESULTS: We present EMU, a method for inferring population structure in the presence of rampant non-random missingness. We show through simulations that several commonly used PCA methods cannot handle missing data arising from various sources, which leads to biased results, as individuals are projected into the PC space based on their amount of missingness. In terms of accuracy, EMU outperforms an existing method that also accommodates missingness while being competitively fast. We further tested EMU on around 100K individuals of the Phase 1 dataset of the Chinese Millionome Project, which were shallowly sequenced to around 0.08x. From this data we are able to capture the population structure of the Han Chinese and to reproduce a previous analysis in a matter of CPU hours instead of CPU years. EMU's capability to accurately infer population structure in the presence of missingness will be of increasing importance with the rising number of large-scale genetic datasets. AVAILABILITY: EMU is written in Python and is freely available at https://github.com/rosemeis/emu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
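
To make the idea of PCA under missingness concrete, the sketch below shows a generic iterative-SVD (EM-style) imputation loop in NumPy. It is an illustration only and should not be read as EMU's actual implementation: the function name `em_pca`, its parameters (`mask`, `k`, `n_iter`, `tol`) and the plain mean-centering are all assumptions made for this example, and the abstract does not specify the algorithm at this level of detail. The loop fills missing genotypes with per-site means, then alternates between a rank-`k` SVD and re-imputing only the missing entries from the low-rank reconstruction.

```python
import numpy as np


def em_pca(G, mask, k=2, n_iter=100, tol=1e-8):
    """Hypothetical EM-style PCA for a genotype matrix with missing entries.

    G    : (n_individuals, n_sites) array; values where mask is False are ignored.
    mask : boolean array of the same shape, True where a genotype was observed.
    k    : number of principal components used for the low-rank reconstruction.
    Returns (pcs, X): the top-k PC scores and the imputed matrix.
    """
    G = np.asarray(G, dtype=float)
    mask = np.asarray(mask, dtype=bool)

    # Start by filling missing entries with the per-site mean of observed values.
    n_obs = mask.sum(axis=0)
    site_mean = np.where(mask, G, 0.0).sum(axis=0) / np.maximum(n_obs, 1)
    X = np.where(mask, G, site_mean)

    for _ in range(n_iter):
        prev = X
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        # Rank-k reconstruction of the currently imputed matrix.
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k] + mu
        # Keep observed genotypes fixed; update only the missing entries.
        X = np.where(mask, G, low_rank)
        if np.sqrt(np.mean((X - prev) ** 2)) < tol:
            break

    # Final decomposition of the imputed matrix gives the PC scores.
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return U[:, :k] * s[:k], X
```

The point of updating only the missing entries is that an individual's position in PC space is then driven by its observed genotypes and the shared low-rank structure, rather than by how many sites happen to be missing. A full SVD per iteration as used here would not scale to datasets of the size described in the abstract; a truncated or randomized SVD would be the natural substitute.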
