作者: Michael Molla , P Andreae , Jude Shavlik
DOI:
关键词:
摘要: Microarray expression data is being generated by the gigabyte all over the world with undoubted exponential increases to come. Annotated genomic data is also rapidly pouring into public databases. Our goal is to develop automated ways of combining these two sources of information to produce insight into the operation of cells under various conditions. Our approach is to use machine-learning techniques to identify characteristics of genes that are upregulated or down-regulated in a particular microarray experiment. We seek models that are both accurate and easy to interpret. This paper explores the effectiveness of two algorithms for this task: PFOIL (a standard machine-learning rule-building algorithm) and GORB (a new rulebuilding algorithm devised by us). We use a permutation test to evaluate the statistical significance of the learned models. The paper reports on experiments using actual E. coli microarray data, discussing the strengths and weaknesses of the two algorithms and demonstrating the trade-offs between accuracy and comprehensibility.