Author: Antonella Delmestri
DOI:
Keywords: Ancestor, Language family, Cognate, Tree (data structure), Natural language processing, Natural language, String metric, Identification (biology), Artificial intelligence, Computer science, Variation (linguistics)
Abstract: Natural languages that originate from a common ancestor are genetically related. Words are the core of any language, and cognates are words sharing the same etymology. Cognate identification, therefore, represents the foundation upon which the evolutionary history of languages may be discovered, while linguistic phylogenetic inference aims to estimate the genetic relationships that exist between them. In this thesis, using several techniques originally developed for biological sequence analysis, we have designed a data-driven orthographic learning system for measuring string similarity and have successfully applied it to the tasks of cognate identification and phylogenetic inference. Our system has outperformed the best comparable phonetic models previously reported in the literature, with results that are statistically significant and remarkably stable, regardless of variation in the training dataset dimension. When applied to the Indo-European language family, whose higher structure does not yet have consensus, our method estimated phylogenies compatible with the benchmark tree and correctly reproduced all established major groups and subgroups present in the dataset.