Estimating the number of unseen variants in the human genome.

Authors: I. Ionita-Laza, C. Lange, N. M. Laird

DOI: 10.1073/PNAS.0807815106

Keywords:

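Abstract: The different genetic variation discovery projects (The SNP Consortium, the International HapMap Project, the 1000 Genomes Project, etc.) aim to identify as much as possible of the underlying genetic variation in various human populations. The question we address in this article is how many new variants are yet to be found. This is an instance of the species problem in ecology, where the goal is to estimate the number of species in a closed population. We use a parametric beta-binomial model that allows us to calculate the expected number of new variants with a desired minimum frequency to be discovered in a new dataset of individuals of a specified size. The method can also be used to predict the number of individuals necessary to sequence in order to capture all (or a fraction of) the variation with a specified minimum frequency. We apply the method to three datasets: the ENCODE dataset, the SeattleSNPs dataset, and the National Institute of Environmental Health Sciences SNPs dataset. Consistent with previous descriptions, our results show that the African population is the most diverse in terms of the number of variants expected to exist, the Asian populations the least diverse, with the European population in-between. In addition, our results show a clear distinction between the Chinese and the Japanese populations, with the Japanese population being the less diverse. To find all common variants (frequency at least 1%), the number of individuals that need to be sequenced is small (∼350) and does not differ much among the different populations; our data show that, subject to sequence accuracy, the 1000 Genomes Project is likely to find most of these common variants and a high proportion of the rarer ones (frequency between 0.1 and 1%). The data reveal a rule of diminishing returns: a small number of individuals (∼150) is sufficient to identify 80% of variants with a frequency of at least 0.1%, while a much larger number of individuals (>3,000 individuals) is necessary to find all of those variants. Finally, our results also show a much higher diversity in environmental response genes compared with the average genome, especially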
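The beta-binomial idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each variant's population frequency p follows a Beta(a, b) distribution, that sampled chromosomes are independent draws, and that the total number of variants is known. The function names, the parameters `a`, `b`, `total_variants`, and the choice to count chromosomes rather than individuals are all hypothetical; the abstract does not give the paper's parameterization or fitting procedure.

```python
from math import exp, lgamma


def prob_unseen(a: float, b: float, n: int) -> float:
    """E[(1 - p)^n] for p ~ Beta(a, b).

    This is the chance that a variant is absent from n sampled chromosomes,
    computed in closed form as B(a, b + n) / B(a, b) via log-gamma.
    """
    return exp(lgamma(b + n) + lgamma(a + b) - lgamma(b) - lgamma(a + b + n))


def prob_seen(a: float, b: float, n: int) -> float:
    """Chance that a variant appears at least once among n chromosomes."""
    return 1.0 - prob_unseen(a, b, n)


def expected_new_variants(
    total_variants: float, a: float, b: float, n_old: int, n_new: int
) -> float:
    """Expected count of variants missed in n_old chromosomes but found
    once n_new further chromosomes are sequenced (assumed model)."""
    missed_before = prob_unseen(a, b, n_old)
    missed_after = prob_unseen(a, b, n_old + n_new)
    return total_variants * (missed_before - missed_after)


if __name__ == "__main__":
    # Illustrative values only: a uniform frequency prior (a = b = 1).
    print(prob_seen(1.0, 1.0, 9))
    print(expected_new_variants(1000, 1.0, 1.0, 9, 10))
```

With a uniform prior the unseen probability reduces to 1 / (n + 1), which gives a quick sanity check. Each doubling of the sample uncovers fewer new variants than the one before, in line with the diminishing returns the abstract describes.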

References (18)
Joseph G. Hacia, Jian-Bing Fan, Oliver Ryder, Li Jin, Keith Edgemon, Ghassan Ghandour, R. Aeryn Mayer, Bryan Sun, Linda Hsie, Christiane M. Robbins, Lawrence C. Brody, David Wang, Eric S. Lander, Robert Lipshutz, Stephen P.A. Fodor, Francis S. Collins. Determination of ancestral alleles for human single-nucleotide polymorphisms using high-density oligonucleotide arrays. Nature Genetics, vol. 22, pp. 164-167 (1999). DOI: 10.1038/9674
Tushar R Bhangale, Mark J Rieder, Deborah A Nickerson. Estimating coverage and power for genetic association studies using near-complete variation data. Nature Genetics, vol. 40, pp. 841-843 (2008). DOI: 10.1038/NG.180
R. H. Duerr, K. D. Taylor, S. R. Brant, J. D. Rioux, M. S. Silverberg, M. J. Daly, A. H. Steinhart, C. Abraham, M. Regueiro, A. Griffiths, T. Dassopoulos, A. Bitton, H. Yang, S. Targan, L. W. Datta, E. O. Kistner, L. P. Schumm, A. T. Lee, P. K. Gregersen, M. M. Barmada, J. I. Rotter, D. L. Nicolae, J. H. Cho. A Genome-Wide Association Study Identifies IL23R as an Inflammatory Bowel Disease Gene. Science, vol. 314, pp. 1461-1463 (2006). DOI: 10.1126/SCIENCE.1135245
Deborah A. Nickerson, Mark J. Rieder, Dana C. Crawford, Christopher S. Carlson, Robert J. Livingston. An overview of the environmental genome project. Environmental Health Perspectives, vol. 113, pp. 42-53 (2005). DOI: 10.1289/EHP.7922
Robert Welch, Amy Hutchinson, Junwen Wang, Kai Yu, Nilanjan Chatterjee, Nick Orr, Walter C Willett, Graham A Colditz, Regina G Ziegler, Christine D Berg, Saundra S Buys, Catherine A McCarty, Heather Spencer Feigelson, Eugenia E Calle, Michael J Thun, Richard B Hayes, Margaret Tucker, Daniela S Gerhard, Joseph F Fraumeni, Robert N Hoover, Gilles Thomas, Stephen J Chanock, David J Hunter, Peter Kraft, Kevin B Jacobs, David G Cox, Meredith Yeager, Susan E Hankinson, Sholom Wacholder, Zhaoming Wang. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nature Genetics, vol. 39, pp. 870-874 (2007). DOI: 10.1038/NG2075
G.A. Watterson, H.A. Guess. Is the most frequent allele the oldest? Theoretical Population Biology, vol. 11, pp. 141-160 (1977). DOI: 10.1016/0040-5809(77)90023-5
Bradley Efron, Ronald Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, vol. 63, pp. 435-447 (1976). DOI: 10.1093/BIOMET/63.3.435
Kaspar Mossman. The Wellcome Trust Case Control Consortium, U.K. Scientific American, vol. 298, p. 42 (2008). DOI: 10.1038/SCIENTIFICAMERICAN0108-42A