Authors: Anders Albrechtsen, Siyang Liu, Jonas Meisner, Mingxi Huang
DOI: 10.1093/BIOINFORMATICS/BTAB027
Keywords: Data mining, Inference, Scale (map), Principal component analysis, Population structure, Missing data, Computer science
Abstract: MOTIVATION Principal component analysis (PCA) is a commonly used tool in genetics to capture and visualize population structure. Due to technological advances in sequencing, such as the widely used non-invasive prenatal test, massive datasets of ultra-low coverage sequencing are being generated. These datasets are characterized by having a large amount of missing genotype information. RESULTS We present EMU, a method for inferring population structure in the presence of rampant non-random missingness. We show through simulations that several commonly used PCA methods cannot handle missing data arisen from various sources, which leads to biased results as individuals are projected into the PC space based on their amount of missingness. In terms of accuracy, EMU outperforms an existing method that also accommodates missingness while being competitively fast. We further tested EMU on around 100K individuals of the Phase 1 dataset of the Chinese Millionome Project, that were shallowly sequenced to around 0.08x. From this data we are able to capture the population structure of the Han Chinese and to reproduce previous analysis in a matter of CPU hours instead of CPU years. EMU's capability to accurately infer population structure in the presence of missingness will be of increasing importance with the rising number of large-scale genetic datasets. AVAILABILITY EMU is written in Python and is freely available at https://github.com/rosemeis/emu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.