Authors: Alan Ritter, Sam Clark, Mausam, Oren Etzioni
DOI:
Keywords: Computer science, Pipeline (software), Named-entity recognition, F1 score, Artificial intelligence, Natural language processing, Chunking (computing), Chunking (psychology)
Abstract: People tweet more than 100 million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline, beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles the F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms co-training, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp