Authors: Sandve GK, Scheffer L, Kanduri C, Motwani K, Pavlović M
DOI: 10.1101/2021.05.23.445346
Keywords:
Abstract: Background: Machine learning (ML) methodology development for classification of immune states in adaptive immune receptor repertoires (AIRR) has seen a recent surge of interest. However, so far, there does not exist a systematic evaluation of scenarios where classical ML methods (such as penalized logistic regression) already perform adequately for AIRR classification. This hinders investigative reorientation to those scenarios where further method development of more sophisticated ML approaches may be required. Results: To identify those scenarios where a baseline method is able to perform well for AIRR classification, we generated a collection of synthetic benchmark datasets encompassing a wide range of dataset architecture-associated and immune state-associated sequence pattern (signal) complexity. We trained ≈1300 ML models with varying assumptions regarding immune signal on ≈850 datasets with a total of ≈210000 repertoires containing ≈42 billion TCRβ CDR3 amino acid sequences, thereby surpassing the sample sizes of current state-of-the-art AIRR ML setups by two orders of magnitude. We found that L1-penalized logistic regression achieved high prediction accuracy even when the immune signal occurs only in 1 out of 50000 AIR sequences. Conclusions: We provide a reference benchmark to guide new AIRR ML classification methodology by: (i) identifying those scenarios, characterised by immune signal and dataset complexity, where baseline methods already achieve high prediction accuracy, and (ii) facilitating realistic expectations of the performance of AIRR ML models given training dataset properties and assumptions. Our study serves as a template for defining specialized AIRR benchmark datasets for comprehensive benchmarking of AIRR ML methods.