Named Entity Recognition in Tweets: An Experimental Study

Authors: Alan Ritter, Sam Clark, Mausam, Oren Etzioni

Keywords: Computer science; Pipeline (software); Named-entity recognition; F1 score; Artificial intelligence; Natural language processing; Chunking (computing); Chunking (psychology)

Abstract: People tweet more than 100 million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline, beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms co-training, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp
