Parsing with automatically acquired, wide-coverage, robust, probabilistic LFG approximations

Author: Aoife Cahill

DOI:

Keywords: Rule-based machine translation, Natural language processing, Artificial intelligence, Grammar induction, Probabilistic logic, Treebank, Parsing, Grammar, Programming language, Computer science, Annotation, Dependency (UML)

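Abstract: Traditionally, rich, constraint-based grammatical resources have been hand-coded. Scaling such resources beyond toy fragments to unrestricted, real text is knowledge-intensive, time-consuming and expensive. The work reported in this thesis is part of a larger project to automate as much as possible the construction of wide-coverage, deep, constraint-based grammatical resources from treebanks. The Penn-II treebank is a large collection of parse-annotated newspaper text. We designed a Lexical-Functional Grammar (LFG) (Kaplan and Bresnan, 1982) f-structure annotation algorithm to automatically annotate this treebank with f-structure information approximating basic predicate-argument or dependency structures (Cahill et al., 2002c, 2004a). We then use the f-structure-annotated treebank resource to automatically extract grammars and lexical resources for parsing new text into f-structures. We designed and implemented the Treebank Tool Suite (TTS) to support the linguistic work that seeds the automatic f-structure annotation algorithm (Cahill and van Genabith, 2002) and the F-Structure Annotation Tool (FSAT) to validate and visualise the results of automatic f-structure annotation. We designed and implemented two PCFG-based probabilistic parsing architectures for parsing unseen text into f-structures: the pipeline model and the integrated model. Both architectures parse raw text into basic, but possibly incomplete, predicate-argument structures (“proto f-structures”) with long distance dependencies (LDDs) unresolved (Cahill et al., 2002c). We designed and implemented a method for automatically resolving LDDs at f-structure level based on a finite approximation of functional uncertainty equations (Kaplan and Zaenen, 1989) automatically acquired from the f-structure-annotated treebank resource (Cahill et al., 2004b). To date, the best result achieved by our own induced grammars is an f-score of 80.33% against the PARC 700, an improvement of 0.73% over the best hand-crafted grammar of Kaplan et al. (2004). The processing architecture developed in this thesis is highly flexible: using external, state-of-the-art parsing technologies (Charniak, 2000) in our pipeline model, we achieve an f-score of 81.79% against the PARC 700, an improvement of 2.19% over the results reported in Kaplan et al. (2004). We have also ported our grammar induction methodology to German and the TIGER treebank resource (Cahill et al., 2003a). We have developed a method for treebank-based, wide-coverage, deep, constraint-based grammar acquisition. The resulting PCFG-based LFG approximations parse the Penn-II treebank with wider coverage (measured in terms of complete spanning parse) and with parsing results comparable to or better than those achieved by the best hand-crafted grammars, with, we believe, considerably less grammar development effort. We believe that our approach successfully addresses the knowledge-acquisition bottleneck (familiar from rule-based approaches to AI and NLP) in wide-coverage, constraint-based grammar development. Our approach can provide an attractive, wide-coverage, multilingual, deep, constraint-based grammar acquisition paradigm.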

References (47)
Josef van Genabith, Aoife Cahill. TTS - A Treebank Tool Suite. Language Resources and Evaluation (2002).
Josef van Genabith, Ruth O'Donovan, Andy Way, Christian Rohrer, Aoife Cahill, Mairéad McCarthy, Martin Forst. Treebank-based multilingual unification-grammar development (2003).
Virach Sornlertlamvanich, Takenobu Tokunaga, Hozumi Tanaka, Kentaro Inui. A New Formalization of Probabilistic GLR Parsing. International Workshop/Conference on Parsing Technologies, pp. 123-134 (1997).
Martin Forst. Treebank Conversion - Establishing a testsuite for a broad-coverage LFG from the TIGER treebank. Proceedings of 4th International Workshop on Linguistically Interpreted Corpora (LINC-03) at EACL 2003 (2003).
Josef van Genabith, Andy Way, Aoife Cahill, Mairéad McCarthy. Automatic annotation of the Penn-treebank with LFG f-structure information (2002).
Eugene Charniak. A maximum-entropy-inspired parser. North American Chapter of the Association for Computational Linguistics, pp. 132-139 (2000).
Miriam Butt, Tracy Holloway King, María-Eugenia Niño, Frédérique Segond. A grammar writer's cookbook. CSLI Publications (1999).
Ronald M. Kaplan. The Formal Architecture of Lexical-Functional Grammar. Journal of Information Science and Engineering, vol. 5, pp. 305-322 (1989).
Mark Johnson. PCFG models of linguistic tree representations. Computational Linguistics, vol. 24, pp. 613-632 (1998).