Systematic tissue annotations of –omics samples by modeling unstructured metadata

Authors: Krishnan A, Hawkins NT, Yannakopoulos A, Guare LA, Maldaver M

DOI: 10.1101/2021.05.10.443525

Keywords: Natural language, Metadata, Natural language processing, Text annotation, Supervised learning, String searching algorithm, Sample (statistics), Artificial intelligence, Computer science, Classifier (UML), Ontology (information science)

Abstract: There are currently >1.3 million human -omics samples that are publicly available. However, this valuable resource remains acutely underused because discovering samples, say from a particular tissue of interest, from this ever-growing data collection is still a significant challenge. The major impediment is that sample attributes such as tissue/cell-type of origin are routinely described using non-standard, varied terminologies written in unstructured natural language. Here, we propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for -omics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample text descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms in a structured ontology. Our approach significantly and substantially outperforms an advanced annotation method (MetaSRA) that uses graph-based reasoning and a baseline (TAGGER) that annotates using exact string matching. Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text, as does the ability of these models to classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations, but have the distinct capability to classify samples irrespective of the -omics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto.
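The idea described in the abstract (turn free-text sample descriptions into numerical features, then train a supervised classifier per tissue term, in contrast to exact string matching) can be illustrated with a toy sketch. This is not the txt2onto implementation: it assumes L2-normalized bag-of-words counts as the text representation and a plain logistic regression fit by gradient descent as the classifier, and the sample descriptions, labels, and function names below are all invented for illustration.

```python
import math
import re
from collections import Counter

# Made-up training descriptions with hypothetical tissue labels.
TRAIN = [
    ("adult liver biopsy tissue", "liver"),
    ("liver hepatocyte primary culture", "liver"),
    ("hepatic carcinoma cell line", "liver"),
    ("adult brain biopsy tissue", "brain"),
    ("brain neuron primary culture", "brain"),
    ("cortical glioma cell line", "brain"),
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(texts):
    """Map every training token to a feature index."""
    tokens = sorted({tok for t in texts for tok in tokenize(t)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(text, vocab):
    """L2-normalized bag-of-words vector; unseen tokens are ignored."""
    counts = Counter(tok for tok in tokenize(text) if tok in vocab)
    vec = [0.0] * len(vocab)
    for tok, c in counts.items():
        vec[vocab[tok]] = float(c)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=300, lr=0.5):
    """Binary logistic regression fit by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(text, vocab, model):
    """Probability that the description belongs to the model's tissue."""
    w, b = model
    x = vectorize(text, vocab)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def exact_match(text, term):
    """Baseline: annotate only if the term appears verbatim as a token."""
    return term in tokenize(text)

texts = [t for t, _ in TRAIN]
vocab = build_vocab(texts)
X = [vectorize(t, vocab) for t in texts]

# One binary (one-vs-rest) model per tissue term.
models = {
    tissue: train_logreg(X, [1.0 if lab == tissue else 0.0 for _, lab in TRAIN])
    for tissue in ("liver", "brain")
}

query = "primary hepatocyte culture from donor"
scores = {tissue: predict_proba(query, vocab, m) for tissue, m in models.items()}
best = max(scores, key=scores.get)
print(best, exact_match(query, "liver"))
```

The query never contains the word "liver", so the exact-match baseline does not annotate it, while the classifier still favors the liver label because "hepatocyte" occurred only in liver-labelled training text. Training one binary model per term is a choice made for this sketch; it keeps each tissue's prediction independent of the others.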
