Word Embeddings for Entity-annotated Texts

Authors: Satya Almasian, Andreas Spitz, Michael Gertz

DOI: 10.1007/978-3-030-15712-8_20

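Abstract: Learned vector representations of words are useful tools for many information retrieval and natural language processing tasks due to their ability to capture lexical semantics. However, while many such tasks involve or even rely on named entities as central components, popular word embedding models have so far failed to include entities as first-class citizens. While it seems intuitive that annotating named entities in the training corpus should result in more intelligent word features for downstream tasks, performance issues arise when popular embedding approaches are naively applied to entity annotated corpora. Not only are the resulting entity embeddings less useful than expected, but one also finds that the performance of the non-entity word embeddings degrades in comparison to those trained on the raw, unannotated corpus. In this paper, we investigate approaches to jointly train word and entity embeddings on a large corpus with automatically annotated and linked entities. We discuss two distinct approaches to the generation of such embeddings, namely the training of state-of-the-art embeddings on raw-text and annotated versions of the corpus, as well as node embeddings of a co-occurrence graph representation of the annotated corpus. We compare the performance of annotated embeddings and classical word embeddings on a variety of word similarity, analogy, and clustering evaluation tasks, and investigate their performance in entity-specific tasks. Our findings show that it takes more than training popular word embedding models on an annotated corpus to create entity embeddings with acceptable performance on common test cases. Based on these results, we discuss how and when node embeddings of the co-occurrence graph representation of the text can restore the performance.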
