An algorithm for learning maximum entropy probability models of disease risk that efficiently searches and sparingly encodes multilocus genomic interactions

Authors: David J. Miller, Yanxin Zhang, Guoqiang Yu, Yongmei Liu, Li Chen

DOI: 10.1093/BIOINFORMATICS/BTP435

Keywords: Machine learning, Sample size determination, Computational biology, Artificial neural network, Support vector machine, Principle of maximum entropy, Conditional probability, Genome-wide association study, Entropy (information theory), Artificial intelligence, Bayesian information criterion, Mathematics

Abstract: Motivation: In both genome-wide association studies (GWAS) and pathway analysis, the modest sample size relative to the number of genetic markers presents formidable computational, statistical and methodological challenges for accurately identifying markers/interactions and for building phenotype-predictive models. Results: We address these objectives via maximum entropy conditional probability modeling (MECPM), coupled with a novel model structure search. Unlike neural networks and support vector machines (SVMs), MECPM makes explicit, and is determined by, the interactions that confer phenotype-predictive power. Our method identifies both a marker subset and the multiple k-way interactions between these markers. Additional key aspects are: (i) evaluation of a select subset of up to five-way interactions while retaining relatively low complexity; (ii) flexible single nucleotide polymorphism (SNP) coding (dominant, recessive) within each interaction; (iii) no mathematical interaction form assumed; (iv) model order selection based on the Bayesian Information Criterion, which fairly compares interactions at different orders and automatically sets the experiment-wide significance level; (v) MECPM directly yields a phenotype-predictive model. MECPM was compared with a panel of methods on datasets with up to 1000 SNPs and up to eight embedded penetrance function (i.e. ground-truth) interactions, including a five-way interaction, involving fewer than 20 SNPs. MECPM achieved improved sensitivity and specificity for detecting both ground-truth markers and interactions, compared with previous methods. Availability: http://www.cbil.ece.vt.edu/ResearchOngoingSNP.htm Contact: djmiller@engr.psu.edu Supplementary information: Supplementary data are available at Bioinformatics online.
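
Two ideas from the abstract, per-SNP dominant/recessive coding inside a k-way interaction and order selection by the Bayesian Information Criterion, can be illustrated with a small sketch. This is not the authors' implementation or their structure search. The genotype encoding (minor-allele counts 0, 1, 2), the function names, the toy data and the parameter counts are all assumptions made for illustration. The sketch also relies on the fact that a conditional model with a class prior and a single binary feature attains its maximum likelihood at the empirical class frequencies within each feature value, so no iterative fitting is needed.

```python
import math

def code_snp(genotype, mode):
    """Binary coding of one SNP; genotype is an assumed minor-allele count (0, 1, 2)."""
    if mode == "dominant":
        return genotype >= 1
    if mode == "recessive":
        return genotype == 2
    raise ValueError("mode must be 'dominant' or 'recessive'")

def interaction_feature(genotypes, terms):
    """Return 1 if every (snp_index, mode) term in the interaction is active, else 0."""
    return int(all(code_snp(genotypes[i], mode) for i, mode in terms))

def cond_log_likelihood(features, labels):
    """Maximized log-likelihood of binary labels given one binary feature.

    Within each feature value the fitted class probabilities equal the
    empirical class frequencies (saturated conditional model).
    """
    ll = 0.0
    for f in (0, 1):
        group = [y for x, y in zip(features, labels) if x == f]
        for c in (0, 1):
            n_c = sum(1 for y in group if y == c)
            if n_c:
                ll += n_c * math.log(n_c / len(group))
    return ll

def bic(log_likelihood, n_params, n_samples):
    """BIC cost: lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

# Hypothetical toy data: (SNP0, SNP1) genotypes and case/control labels.
samples = [(1, 2), (2, 2), (1, 2), (2, 2), (0, 2), (1, 1), (0, 0), (2, 1)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Candidate two-way interaction: SNP0 coded dominant AND SNP1 coded recessive.
terms = [(0, "dominant"), (1, "recessive")]
features = [interaction_feature(g, terms) for g in samples]

n = len(labels)
# Null model: class prior only (1 free parameter, assumed count).
bic_null = bic(cond_log_likelihood([0] * n, labels), 1, n)
# Model with the interaction: prior plus one interaction weight (2 parameters, assumed).
bic_inter = bic(cond_log_likelihood(features, labels), 2, n)

print("keep interaction" if bic_inter < bic_null else "reject interaction")
```

Because the penalty term grows with the number of parameters regardless of interaction order, the same comparison can be applied to candidates of different orders, which is the sense in which the abstract describes BIC as comparing interactions fairly.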
