TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text

Authors: Kalina Bontcheva, Niraj Aswani, Leon Derczynski, Adam Funk, Diana Maynard

DOI: 10.6084/M9.FIGSHARE.1003767.V2

Keywords:

Abstract: Twitter is the largest source of microblog text, responsible for gigabytes of human discourse every day. Processing microblog text is difficult: the genre is noisy, documents have little context, and utterances are very short. As such, conventional NLP tools fail when faced with tweets and other microblog text. We present TwitIE, an open-source NLP pipeline customised to microblog text at every stage. Additionally, it includes Twitter-specific data import and metadata handling. This paper introduces each stage of the TwitIE pipeline, which is a modification of the GATE ANNIE open-source pipeline for news text. An evaluation against some state-of-the-art systems is also presented.

References (29)
Deepayan Chakrabarti, Kunal Punera. Event Summarization Using Tweets. International Conference on Weblogs and Social Media (2011).
Patrick Paroubek, Alexander Pak. Twitter as a Corpus for Sentiment Analysis and Opinion Mining. Language Resources and Evaluation (2010).
Robert J. Gaizauskas, Mark Hepple, Yikun Guo, Angus Roberts. Combining Terminology Resources and Statistical Methods for Entity Recognition: an Evaluation. Language Resources and Evaluation (2008).
Paul Cook, Timothy Baldwin, Bo Han. Automatically Constructing a Normalisation Dictionary for Microblogs. Empirical Methods in Natural Language Processing, pp. 421-432 (2012).
J.M. Trenkle, W.B. Cavnar. N-gram-based text categorization (1994).
Mor Naaman, Hila Becker, Luis Gravano. Beyond Trending Topics: Real-World Event Identification on Twitter. International Conference on Weblogs and Social Media (2011). DOI: 10.7916/D81V5NVX
Pascal Hitzler, Krzysztof Janowicz. Semantic Web - Interoperability, Usability, Applicability. Semantic Web, vol. 1, pp. 1-2 (2010). DOI: 10.3233/SW-2010-0017
Mitch Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, vol. 19, pp. 313-330 (1993). DOI: 10.21236/ADA273556
Simon Carter, Wouter Weerkamp, Manos Tsagkias. Microblog language identification: overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, vol. 47, pp. 195-215 (2013). DOI: 10.1007/S10579-012-9195-Y
Kristina Toutanova, Dan Klein, Christopher D. Manning, Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - NAACL '03, pp. 173-180 (2003). DOI: 10.3115/1073445.1073478