Authors: Krishnan A, Hawkins NT, Yannakopoulos A, Guare LA, Maldaver M
DOI: 10.1101/2021.05.10.443525
Keywords: Natural language, Metadata, Natural language processing, Text annotation, Supervised learning, String searching algorithm, Sample (statistics), Artificial intelligence, Computer science, Classifier (UML), Ontology (information science)
Abstract: There are currently >1.3 million human -omics samples that are publicly available. However, this valuable resource remains acutely underused because discovering samples, say from a particular tissue of interest, in this ever-growing data collection is still a significant challenge. The major impediment is that sample attributes such as tissue/cell-type of origin are routinely described using non-standard, varied terminologies written in unstructured natural language. Here, we propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for -omics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample text descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms in a structured ontology. Our approach significantly and substantially outperforms an advanced text annotation method (MetaSRA) that uses graph-based reasoning and a baseline annotation method (TAGGER) that annotates text based on exact string matching. Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text, as does the ability of these models to classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations, but have the distinct capability to classify samples irrespective of the -omics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto.
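
The core idea in the abstract, turning free-text sample descriptions into numerical vectors and feeding them to a supervised classifier, can be illustrated with a toy sketch. This is not the txt2onto code or its API: the bag-of-words vectors and the nearest-centroid classifier below are simplified stand-ins chosen so the example runs with the Python standard library alone, and the training descriptions and the plain labels "heart" and "liver" (placeholders for terms in a structured ontology) are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase free text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Bag-of-words count vector (assumed stand-in for the paper's text representation)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_samples):
    """Sum the vectors of each label's descriptions into one centroid per label."""
    centroids = defaultdict(Counter)
    for text, label in labeled_samples:
        centroids[label].update(embed(text))
    return dict(centroids)


def predict(centroids, text):
    """Return the label whose centroid is most similar to the description."""
    vec = embed(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical training metadata; real labels would be ontology terms.
TRAINING = [
    ("total RNA from left ventricle, cardiac biopsy", "heart"),
    ("cardiac muscle tissue, heart failure patient", "heart"),
    ("hepatocytes isolated from liver biopsy", "liver"),
    ("hepatic tissue, liver resection", "liver"),
]

model = train(TRAINING)
prediction = predict(model, "RNA-seq of cardiac ventricle sample")
print(prediction)  # heart
```

Note that the query never contains the word "heart", so an exact string match on the tissue name would find nothing, while the learned representation still assigns the label through co-occurring words such as "cardiac" and "ventricle". That is the kind of gap between string matching and a trained text model that the abstract describes, shown here only at toy scale.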