Author: Aoife Cahill
DOI:
Keywords: Rule-based machine translation, Natural language processing, Artificial intelligence, Grammar induction, Probabilistic logic, Treebank, Parsing, Grammar, Programming language, Computer science, Annotation, Dependency (UML)
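Abstract: Traditionally, rich, constraint-based grammatical resources have been hand-coded. Scaling such resources beyond toy fragments to unrestricted, real text is knowledge-intensive, time-consuming and expensive. The work reported in this thesis is part of a larger project to automate as much as possible the construction of wide-coverage, deep, constraint-based grammatical resources from treebanks. The Penn-II treebank is a large collection of parse-annotated newspaper text. We have designed a Lexical-Functional Grammar (LFG) (Kaplan and Bresnan, 1982) f-structure annotation algorithm to automatically annotate this treebank with f-structure information approximating basic predicate-argument or dependency structures (Cahill et al., 2002c, 2004a). We then use the f-structure-annotated treebank resource to automatically extract grammars and lexical resources for parsing new text into f-structures. We have implemented the Treebank Tool Suite (TTS) to support the linguistic work that seeds the automatic f-structure annotation algorithm (Cahill and van Genabith, 2002), and the F-Structure Annotation Tool (FSAT) to validate and visualise the results of automatic f-structure annotation. We have implemented two PCFG-based probabilistic parsing architectures for parsing unseen text into f-structures: the pipeline model and the integrated model. Both architectures parse raw text into basic, but possibly incomplete, predicate-argument structures ("proto f-structures") with long distance dependencies (LDDs) unresolved (Cahill et al., 2002c). We have designed and implemented a method for automatically resolving LDDs at f-structure level, based on a finite approximation of functional uncertainty equations (Kaplan and Zaenen, 1989) automatically acquired from the f-structure-annotated treebank resource (Cahill et al., 2004b). To date, the best result achieved by our own Penn-II induced grammars is a dependency f-score of 80.33% against the PARC 700, an improvement of 0.73% over the best hand-crafted grammar of Kaplan et al. (2004). The processing architecture developed in this thesis is highly flexible: using external, state-of-the-art parsing technologies (Charniak, 2000) in our pipeline model, we achieve a dependency f-score of 81.79% against the PARC 700, an improvement of 2.19% over the results reported in Kaplan et al. (2004). We have also ported our grammar induction methodology to German and the TIGER treebank resource (Cahill et al., 2003a). We have developed a method for treebank-based, wide-coverage, deep, constraint-based grammar acquisition. The resulting PCFG-based LFG approximations parse the Penn-II treebank with wider coverage (measured in terms of complete spanning parse) and with parsing results comparable to or better than those achieved by the best hand-crafted grammars, with, we believe, considerably less grammar development effort. We believe that our approach successfully addresses the knowledge-acquisition bottleneck (familiar from rule-based approaches to AI and NLP) in wide-coverage, constraint-based grammar development. Our approach can provide an attractive, wide-coverage, multilingual, deep, constraint-based grammar acquisition paradigm.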
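
The results in the abstract are dependency f-scores, that is, the harmonic mean of precision and recall over dependency triples. The sketch below shows one way such a score could be computed. The `(relation, head, dependent)` encoding, the function name `dependency_f_score` and the toy triples are assumptions made for illustration; they are not the PARC 700 format or the evaluation software used in the thesis.

```python
from typing import Iterable, Tuple

# Hypothetical triple encoding: (relation, head, dependent).
Triple = Tuple[str, str, str]


def dependency_f_score(gold: Iterable[Triple], predicted: Iterable[Triple]) -> float:
    """Return the balanced f-score (in percent) of predicted triples against gold.

    Triples are compared as sets, so repeated triples count once.
    """
    gold_set, pred_set = set(gold), set(predicted)
    if not gold_set or not pred_set:
        return 0.0
    matched = len(gold_set & pred_set)
    if matched == 0:
        return 0.0
    precision = matched / len(pred_set)  # share of predicted triples that are correct
    recall = matched / len(gold_set)     # share of gold triples that were found
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy example with invented triples (not taken from the PARC 700).
gold = [
    ("subj", "sign", "company"),
    ("obj", "sign", "contract"),
    ("adjunct", "sign", "yesterday"),
]
predicted = [
    ("subj", "sign", "company"),
    ("obj", "sign", "contract"),
    ("adjunct", "contract", "yesterday"),  # wrong attachment
]
score = dependency_f_score(gold, predicted)
print(f"{score:.2f}")  # prints 66.67
```

Both improvements quoted in the abstract imply the same hand-crafted baseline: 80.33 − 0.73 = 79.60 and 81.79 − 2.19 = 79.60.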