Leveraging Wikipedia knowledge to classify multilingual biomedical documents

Authors: Marcos Antonio Mouriño García, Roberto Pérez Rodríguez, Luis Anido Rifón

DOI: 10.1016/J.ARTMED.2018.04.007

Keywords: Artificial intelligence, Computer science, Multilingualism, German, Icelandic, Natural language processing, Interlanguage, Encyclopedias as Topic, Romanian, Machine translation, Classifier (UML)

Abstract: This article presents a classifier that leverages Wikipedia knowledge to represent documents as vectors of concept weights, and analyses its suitability for classifying biomedical documents written in any language when it is trained only with English documents. We propose the cross-language concept matching technique, which relies on Wikipedia interlanguage links to convert concept vectors between languages. The performance of the classifier is compared with that of a classifier based on machine translation and two classifiers based on MetaMap. To perform the experiments, we created two multilingual corpora. The first one, Multi-Lingual UVigoMED (ML-UVigoMED), is composed of 23,647 Wikipedia documents about biomedical topics written in English, German, French, Spanish, Italian, Galician, Romanian, and Icelandic. The second one, English-French-Spanish-German UVigoMED (EFSG-UVigoMED), is composed of 19,210 biomedical abstracts extracted from MEDLINE, written in English, French, Spanish, and German. The performance of the proposed approach is superior to that of the state-of-the-art classifiers in the benchmark. We conclude that leveraging Wikipedia knowledge is of great advantage in tasks of multilingual classification of biomedical documents.
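
The idea of converting a concept vector between languages through interlanguage links can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `convert_concept_vector`, the toy link table `SPANISH_TO_ENGLISH`, the choice to drop concepts that have no link, and the choice to sum the weights of concepts that map to the same target are all assumptions made for illustration.

```python
from typing import Dict


def convert_concept_vector(
    vector: Dict[str, float],
    interlanguage_links: Dict[str, str],
) -> Dict[str, float]:
    """Map a concept vector from a source language to the target language.

    `vector` maps source-language concept titles to weights.
    `interlanguage_links` maps source-language titles to target-language
    titles (a hypothetical, pre-extracted link table).
    """
    converted: Dict[str, float] = {}
    for concept, weight in vector.items():
        target = interlanguage_links.get(concept)
        if target is None:
            # Assumption: concepts without an interlanguage link are dropped.
            continue
        # Assumption: weights of concepts sharing a target are summed.
        converted[target] = converted.get(target, 0.0) + weight
    return converted


# Hypothetical toy link table from Spanish titles to English titles.
SPANISH_TO_ENGLISH = {
    "Corazón": "Heart",
    "Insulina": "Insulin",
}

spanish_vector = {"Corazón": 0.5, "Insulina": 0.25, "Concepto sin enlace": 0.25}
english_vector = convert_concept_vector(spanish_vector, SPANISH_TO_ENGLISH)
```

Once a document's vector is expressed in English concepts, it could in principle be passed to a classifier trained only on English documents, which is the setting the abstract describes.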
