Extracting Tree Adjoining Grammars from Bracketed Corpora

Authors: Aravind Joshi, Chung-hye Han, Anoop Sarkar, Martha Palmer, Fei Xia

Fei Xia
Department of Computer and Information Science, University of Pennsylvania
3401 Walnut Street, Suite 400A, Philadelphia PA 19104, USA
fxia@linc.cis.upenn.edu

Abstract: In this paper, we report our work on extracting lexicalized tree adjoining grammars (LTAGs) from partially bracketed corpora. The algorithm first fully brackets the corpora, then extracts elementary trees (etrees), and finally filters out invalid etrees using linguistic knowledge. We show that the set of extracted etrees may not be complete enough to cover the whole language, but this will not have a big impact on parsing.

1 Introduction

Lexicalized Tree Adjoining Grammar (LTAG) is a tree-rewriting formalism. It is more expressive than a context-free grammar (CFG) [1], and is therefore a better formalism for representing various phenomena in natural languages. In the last decade, it has been applied to various NLP tasks such as parsing (Srinivas, 1997), machine translation (Palmer et al., 1998), information retrieval (Chandrasekar and Srinivas, 1997), generation (Stone and Doran, 1997; McCoy et al., 1992), and summarization applications (Baldwin et al., 1997). A wide-coverage LTAG for a particular natural language often contains thousands of trees and takes years to build.

There has been work on extracting CFGs (Shirai et al., 1995; Charniak, 1996; Krotov and others, 1998) and lexicalized tree grammars (Neumann, 1998; Srinivas, 1997) from bracketed corpora. In this paper, we propose a new method for learning LTAGs from such corpora [2].

2 Penn Treebank and LTAG

2.1 Penn Treebank

In this paper, we use the English Penn Treebank as our bracketed corpus, which includes about 1 million [...]

The author wishes to thank Chung-hye Han, Aravind Joshi, Martha Palmer, Carlos Prolo, Anoop Sarkar, and three anonymous reviewers for many helpful comments.

[1] LTAG is more expressive than the CF formalism in both weak and strong generative capacity, e.g. it can handle cross dependency elegantly.

[2] A related work is (Srinivas, 1997), but its goal is not to learn a new LTAG but to extract useful information, such as dependency and frequency of trees, for an existing LTAG.
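
The introduction mentions earlier work on extracting CFGs from bracketed corpora. As a rough illustration of what reading a grammar off a bracketed tree involves, the sketch below parses one Penn-Treebank-style bracketed string and counts the context-free productions it contains. This is not the LTAG extraction algorithm of the paper, which also fully brackets the corpus, builds elementary trees, and filters them. The example sentence and the helper names `tokenize`, `parse`, and `cfg_rules` are assumptions made for illustration only.

```python
from collections import Counter


def tokenize(text):
    """Split a bracketed string into "(", ")" and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one constituent into a (label, children) tuple.

    A leaf word is returned as a plain string. The token list is
    consumed from the front as parsing proceeds.
    """
    token = tokens.pop(0)
    if token != "(":
        return token  # a word at the leaf level
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        children.append(parse(tokens))
    tokens.pop(0)  # discard the closing ")"
    return (label, children)


def cfg_rules(tree, counts=None):
    """Count the CFG productions (lhs, rhs) used in a parsed tree."""
    if counts is None:
        counts = Counter()
    if isinstance(tree, str):
        return counts  # words introduce no production of their own
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        cfg_rules(child, counts)
    return counts


# Hypothetical example sentence, not taken from the Penn Treebank.
sentence = "(S (NP (NNP John)) (VP (VBZ likes) (NP (NNS apples))))"
rules = cfg_rules(parse(tokenize(sentence)))
for (lhs, rhs), n in sorted(rules.items()):
    print(lhs, "->", " ".join(rhs), n)
```

Each production here is a tree of depth one. An LTAG elementary tree can span several such levels and is anchored by a lexical item, which is why extracting LTAGs takes more than this rule-reading step.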
