Authors: Corinne Dahinden, G Parmiggiani, Mark C Emerick, Peter Bühlmann
DOI:
Keywords:
Abstract: Human protein diversity is due in part to alternative splicing, a process that allows different exon/intron combinations to arise from a single gene. The prevalence of these combinations is known to be specific to developmental stage and tissue, but the mechanism of alternative splicing is far from understood, as is what causes the spliceosome to produce an array of different proteins from the same genetic information at frequencies that change over time and across tissues. A first step toward understanding this process is the statistical analysis of the exon interaction structure. If we know which exons interact with each other, we may be able to draw conclusions about the associated functional domains. At present, the most advanced molecular technique for investigating this question is to generate large-scale single-gene transcriptome data, so-called full-length cDNA libraries. Not all theoretically possible exon/intron combinations are observed in these libraries, owing both to functional restrictions at the protein level and to the sheer number of possible combinations. Statistically, this poses the challenge of learning interactions in sparse contingency tables. To this end, we develop methods for model selection and parameter estimation in high-dimensional log-linear models. These include Bayesian methods as well as penalization approaches that generalize the Lasso algorithm to this context. We compare these procedures in a simulation study and apply them to full-length cDNA libraries, yielding valuable insight into the biological process of alternative splicing.
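
For orientation, the kind of model the abstract refers to can be sketched as follows. This is a generic textbook form of an l1-penalized log-linear model for a binary contingency table, written under our own assumptions: the notation (x, n(x), theta, lambda) is illustrative and not taken from the paper, and the authors' actual penalty and parameterization may differ.

```latex
% Illustrative notation only; symbols are assumptions, not the paper's.
% x = (x_1, ..., x_p): binary pattern, x_j = 1 if exon j is retained.
% n(x): number of transcripts in the library showing pattern x.
% theta_0: normalizing constant so that the p_theta(x) sum to one.
\begin{align}
  \log p_\theta(x)
    &= \theta_0 + \sum_{j} \theta_j x_j
       + \sum_{j<k} \theta_{jk}\, x_j x_k + \dots \\
  \hat\theta_\lambda
    &= \arg\min_{\theta}
       \Big\{ -\sum_{x \in \{0,1\}^p} n(x)\, \log p_\theta(x)
              + \lambda \sum_{a \neq \emptyset} \lvert \theta_a \rvert \Big\}
\end{align}
% The penalty sum runs over the non-empty exon subsets a in the model.
```

In this sketch, a larger lambda sets more interaction terms exactly to zero. If theta_{jk} and every higher-order term containing both j and k are zero, exons j and k are conditionally independent given the remaining exons, which is how a sparse fit can be read as an exon interaction structure.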