Learning to Tag Multilingual Texts Through Observation

Authors: Chinatsu Aone, Scott W. Bennett

Keywords: Natural language processing, Machine translation, Hidden Markov model, Artificial neural network, Toponymy, Computer science, Artificial intelligence, Interactive learning, Information extraction, Pattern language, Proper noun

Abstract: This paper describes RoboTag, an advanced prototype for a machine learning-based multilingual information extraction system. First, we describe the general client/server architecture used in learning from observation. Then we give a detailed description of our novel decision-tree tagging approach. RoboTag performance on the proper noun tagging task in English and Japanese is compared against human-tagged keys and to the best hand-coded pattern performance (as reported in the MUC and MET evaluation results). Related work and future directions are presented.

1 Introduction

The ability to tag names such as organization, person, and place names in multilingual texts has great value for tasks like information extraction, information retrieval, and machine translation (Aone, Charocopos, Gorlinsky, 1997). The most successful systems currently rely on hand-coded patterns to identify the desired names in texts (Adv, 1995; Def, 1996). This approach achieves its multilingual tagging by using different rule sets for each language/domain pair. Several of these systems have improved ease of use, particularly the speed of the write pattern/evaluate performance/refine pattern loop, which plays a central role in the development process. One approach is to assist the creation of name-tagging rules by making it easier for the developer to mark the parts of the surrounding context to include in the pattern. This boosts productivity in hand-coding patterns but still requires a significant amount of effort [...]. A step up from this is to determine how to generalize patterns so that they are more broadly applicable, or to suggest high-value [...] for inclusion in the pattern. Nevertheless, a skilled developer with thorough knowledge of the particular language is essential.

Our goal in developing RoboTag was to make it possible for an end-user to build a tagging system simply by giving examples of what should be tagged, rather than requiring the user to understand the language. RoboTag uses a learning algorithm to discover features the training examples have in common. It then constructs a tagging procedure that can find additional, previously unseen examples for extraction. It is important (for the confidence of the users) that the induced procedure can be easily explained in terms of how it makes its decisions. This was one of the factors that led us to consider decision trees (Quinlan, 1993) as a component of RoboTag. Other potential statistical approaches to this problem (e.g., Neural Nets and Hidden Markov Models) did not offer this advantage. RoboTag is well instrumented for exploration of the learning parameters and inspection of the induced procedures.
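The idea of inducing an explainable tagging procedure from tagged examples can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two token features, the `ORG`/`O` tag names, the toy training sentences, and the plain ID3-style entropy split (not C4.5's gain ratio, and not RoboTag's actual feature set or algorithm). It shows only how a decision tree can tag an unseen token and report which tests led to the decision.

```python
import math
from collections import Counter

def token_features(tokens, i):
    """Hypothetical features for token i: its capitalization and the next token."""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else "<end>"
    return {
        "capitalized": tokens[i][:1].isupper(),
        "next_is_corp_suffix": nxt in {"inc.", "corp.", "ltd."},
    }

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, labels, features):
    """ID3-style induction over feature dicts (rows) and tag strings (labels)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority  # leaf: a single tag

    def split_entropy(feature):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[feature], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    best = min(features, key=split_entropy)  # lowest remaining entropy
    rest = [f for f in features if f != best]
    node = {"feature": best, "branches": {}, "default": majority}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["branches"][value] = build_tree(
            [r for r, _ in subset], [l for _, l in subset], rest
        )
    return node

def classify(tree, row, trace=None):
    """Walk the tree; optionally record each (feature, value) test for explanation."""
    while isinstance(tree, dict):
        value = row[tree["feature"]]
        if trace is not None:
            trace.append((tree["feature"], value))
        tree = tree["branches"].get(value, tree["default"])
    return tree

# Toy tagged examples: "ORG" marks the first word of an organization name.
tagged = [
    (["Acme", "Inc.", "hired", "Jane"], ["ORG", "O", "O", "O"]),
    (["the", "Globex", "Corp.", "grew"], ["O", "ORG", "O", "O"]),
]
rows = [token_features(toks, i) for toks, _ in tagged for i in range(len(toks))]
labels = [tag for _, tags in tagged for tag in tags]
tree = build_tree(rows, labels, ["capitalized", "next_is_corp_suffix"])

# Tag an unseen token and keep the decision path as the explanation.
unseen = ["Initech", "Ltd.", "expanded"]
trace = []
tag = classify(tree, token_features(unseen, 0), trace)
print(tag, trace)
```

The recorded `trace` is what makes the procedure easy to explain to a user: each entry names the feature tested and the value observed on the way to the final tag.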
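We first discuss the overall RoboTag architecture. Next, we focus on the [...] employed in learning. We then present experimental results that compare RoboTag to both human-tagged keys and hand-coded pattern systems. Lastly, related work and future directions are discussed.

2 Architecture

The design of the RoboTag architecture was motivated by the need for interactive [...]. It had to process a large number of [...], provide [...], visualize [...], and allow feedback [...]. To this end, we designed a client/server architecture. The client interface is an enhancement of a manual annotation tool. It works with multiple languages and includes support for single- and double-byte coding schemes. [...] paper. The server por-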

References (5)
Eric David Brill, A corpus-based approach to language learning, University of Pennsylvania (1993)
Chinatsu Aone, Nicholas Charocopos, James Gorlinsky, An Intelligent Multilingual Information Browsing and Retrieval System Using Information Extraction, Conference on Applied Natural Language Processing, pp. 332-339 (1997), 10.3115/974557.974606
George R. Krupka, SRA, Proceedings of the 6th Conference on Message Understanding - MUC6 '95, pp. 221-235 (1995), 10.3115/1072399.1072419
J. Ross Quinlan, C4.5: Programs for Machine Learning (1992)
Daniel M. Bikel, Scott Miller, Richard Schwartz, Ralph Weischedel, Nymble: a High-Performance Learning Name-finder, Conference on Applied Natural Language Processing, pp. 194-201 (1997), 10.3115/974557.974586