Data Driven Models for Language Evolution

Author: Antonella Delmestri

Keywords: Ancestor; Language family; Cognate; Tree (data structure); Natural language processing; Natural language; String metric; Identification (biology); Artificial intelligence; Computer science; Variation (linguistics)

Abstract: Natural languages that originate from a common ancestor are genetically related. Words are the core of any language, and cognates are words sharing the same etymology. Cognate identification, therefore, represents the foundation upon which the evolutionary history of languages may be discovered, while linguistic phylogenetic inference aims to estimate the genetic relationships that exist between them. In this thesis, using several techniques originally developed for biological sequence analysis, we have designed a data driven orthographic learning system for measuring string similarity and have successfully applied it to the tasks of cognate identification and phylogenetic inference. Our system has outperformed the best comparable phonetic models previously reported in the literature, with results that are statistically significant and remarkably stable, regardless of the variation in training dataset dimension. When applied to the Indo-European language family, whose higher structure does not yet have consensus, our method estimated phylogenies compatible with the benchmark tree and reproduced correctly all the established major groups and subgroups present in the dataset.
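As a rough illustration of how a biological sequence analysis technique can measure orthographic similarity between words, the sketch below applies Needleman-Wunsch global alignment to two spellings. It is not the system described in the thesis: the abstract does not give its scoring scheme, so the fixed `match`, `mismatch`, and `gap` values, the function names, and the normalization are all assumptions made for this example. A data driven system would presumably learn its substitution scores from training data rather than fix them by hand.

```python
def alignment_score(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score of two strings.

    The scoring parameters are illustrative assumptions, not learned values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the table: substitute, or open a gap in either string.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    return score[-1][-1]


def similarity(a: str, b: str) -> float:
    """Alignment score normalized by the larger self-alignment score.

    Hypothetical normalization chosen for this sketch; identical
    non-empty strings score 1.0.
    """
    if not a or not b:
        return 0.0
    return alignment_score(a, b) / max(alignment_score(a, a),
                                       alignment_score(b, b))


if __name__ == "__main__":
    # English "night" and German "Nacht" are a commonly cited cognate pair.
    print(similarity("night", "nacht"))
    print(similarity("night", "table"))
```

A cognate identifier built on such a measure would typically rank or threshold word pairs by their similarity; the threshold itself would need to be tuned on labeled data.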
