Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification

Authors: Sandve GK, Scheffer L, Kanduri C, Motwani K, Pavlović M

DOI: 10.1101/2021.05.23.445346

Abstract: Background: Machine learning (ML) methodology development for the classification of immune states in adaptive immune receptor repertoires (AIRR) has seen a recent surge of interest. However, so far, there does not exist a systematic evaluation of scenarios where classical ML methods (such as penalized logistic regression) already perform adequately for AIRR classification. This hinders investigative reorientation to those scenarios where further method development of more sophisticated ML approaches may be required. Results: To identify those scenarios where a baseline method is able to perform well for AIRR classification, we generated a collection of synthetic AIRR benchmark datasets encompassing a wide range of dataset architecture-associated and immune state-associated sequence pattern (signal) complexity. We trained ≈1300 ML models with varying assumptions regarding immune signal on ≈850 datasets with a total of ≈210000 repertoires containing ≈42 billion TCRβ CDR3 amino acid sequences, thereby surpassing the sample sizes of current state-of-the-art AIRR ML setups by two orders of magnitude. We found that L1-penalized logistic regression achieved high prediction accuracy even when the immune signal occurs only in 1 out of 50000 AIR sequences. Conclusions: We provide a reference benchmark to guide new AIRR ML classification methodology by: (i) identifying those scenarios, characterised by immune signal and dataset complexity, where baseline methods already achieve high prediction accuracy, and (ii) facilitating realistic expectations of the performance of AIRR ML models given training dataset properties and assumptions. Our study serves as a template for defining specialized AIRR benchmark datasets for comprehensive benchmarking of AIRR ML methods.
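To give a feel for how sparse a signal of 1 in 50000 sequences is, the following is a minimal sketch of implanting a short motif into a synthetic repertoire. It is not the authors' simulation pipeline: the uniform random sequence generator, the fixed CDR3 length of 14, the motif "GLY", and the names `make_repertoire`, `count_carriers`, and `signal_rate` are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_cdr3(rng, length=14):
    """Draw a uniformly random amino acid string (a crude stand-in for a CDR3)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def make_repertoire(rng, n_sequences, signal_kmer=None, signal_rate=0.0):
    """Build a synthetic repertoire and overwrite part of a fraction
    `signal_rate` of its sequences with `signal_kmer` (hypothetical scheme)."""
    sequences = [random_cdr3(rng) for _ in range(n_sequences)]
    if signal_kmer:
        n_implant = round(n_sequences * signal_rate)
        # Pick distinct sequences so each implant lands in a different one.
        for i in rng.sample(range(n_sequences), n_implant):
            seq = sequences[i]
            pos = rng.randrange(len(seq) - len(signal_kmer) + 1)
            sequences[i] = seq[:pos] + signal_kmer + seq[pos + len(signal_kmer):]
    return sequences


def count_carriers(sequences, kmer):
    """Number of sequences containing `kmer`, implanted or by chance."""
    return sum(kmer in seq for seq in sequences)


if __name__ == "__main__":
    rng = random.Random(0)
    # At a rate of 1 in 50000, a repertoire of 50000 sequences gets one implant.
    repertoire = make_repertoire(rng, 50_000, "GLY", 1 / 50_000)
    print(len(repertoire), count_carriers(repertoire, "GLY"))
```

Because short motifs also arise by chance in the random background, the carrier count in a signal-positive repertoire can be close to that of a negative one, which is why a classifier has to pool evidence across many repertoires rather than rely on a single implanted sequence.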

References (62)
Victor Greiff, Pooja Bhat, Skylar C. Cook, Ulrike Menzel, Wenjing Kang, Sai T. Reddy. A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome Medicine, vol. 7, p. 49 (2015). DOI: 10.1186/S13073-015-0169-8
Nathalie Japkowicz, Shaju Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, vol. 6, pp. 429-449 (2002). DOI: 10.3233/IDA-2002-6504
M. Valiev, E.J. Bylaska, N. Govind, K. Kowalski, T.P. Straatsma, H.J.J. Van Dam, D. Wang, J. Nieplocha, E. Apra, T.L. Windus, W.A. de Jong. NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications, vol. 181, pp. 1477-1489 (2010). DOI: 10.1016/J.CPC.2010.04.018
Susumu Tonegawa. Somatic generation of antibody diversity. Nature, vol. 302, pp. 575-581 (1983). DOI: 10.1038/302575A0
Vanessa Venturi, David A. Price, Daniel C. Douek, Miles P. Davenport. The molecular basis for public T-cell responses? Nature Reviews Immunology, vol. 8, pp. 231-238 (2008). DOI: 10.1038/NRI2260
Jorg J.A. Calis, Brad R. Rosenberg. Characterizing immune repertoires by high throughput sequencing: strategies and applications. Trends in Immunology, vol. 35, pp. 581-590 (2014). DOI: 10.1016/J.IT.2014.09.004
George Georgiou, Gregory C Ippolito, John Beausang, Christian E Busse, Hedda Wardemann, Stephen R Quake. The promise and challenge of high-throughput sequencing of the antibody repertoire. Nature Biotechnology, vol. 32, pp. 158-168 (2014). DOI: 10.1038/NBT.2782
Mark M. Davis, Pamela J. Bjorkman. T-cell antigen receptor genes and T-cell recognition. Nature, vol. 334, pp. 395-402 (1988). DOI: 10.1038/334395A0
Donna L. Farber, Naomi A. Yudanin, Nicholas P. Restifo. Human memory T cells: generation, compartmentalization and homeostasis. Nature Reviews Immunology, vol. 14, pp. 24-35 (2014). DOI: 10.1038/NRI3567
J. Glanville, W. Zhai, J. Berka, D. Telman, G. Huerta, G. R. Mehta, I. Ni, L. Mei, P. D. Sundar, G. M. R. Day, D. Cox, A. Rajpal, J. Pons. Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire. Proceedings of the National Academy of Sciences of the United States of America, vol. 106, pp. 20216-20221 (2009). DOI: 10.1073/PNAS.0909775106