Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists

Authors: Gerhard Jäger, Johann-Mattis List, Pavel Sofroniev

DOI: 10.18653/V1/E17-1113

Keywords: Cognate; Identification (information); Phonetic transcription; Scope (computer science); Artificial intelligence; Natural language processing; Word (computer architecture); Support vector machine; State (computer science); Machine learning; Computer science

Abstract: Most current approaches in phylogenetic linguistics require as input multilingual word lists partitioned into sets of etymologically related words (cognates). Cognate identification is so far done manually by experts, which is time consuming and as yet only available for a small number of well-studied language families. Automatizing this step will greatly expand the empirical scope of phylogenetic methods in linguistics, as raw wordlists (in phonetic transcription) are much easier to obtain than wordlists in which cognate words have been fully identified and annotated, even for under-studied languages. A couple of different methods have been proposed in the past, but they are either disappointing regarding their performance or not applicable to larger datasets. Here we present a new approach that uses support vector machines to unify different state-of-the-art methods for phonetic alignment and cognate detection within a single framework. Training and evaluating this method on a typologically broad collection of gold-standard data shows it to be superior to the existing state of the art.
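The general idea in the abstract (score word pairs with string-similarity features, let a support vector machine decide which pairs are cognate, then group the words into cognate sets) can be illustrated with a minimal sketch. This is not the authors' implementation. The two features, the hinge-loss subgradient training loop, the connected-components clustering, all function names, and the toy word forms are assumptions made for illustration only; the paper itself builds on phonetic alignment scores rather than plain edit distance.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance; each character stands in for one sound segment."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def features(a, b):
    """Hypothetical pairwise features: normalized edit distance, shared initial segment."""
    ned = edit_distance(a, b) / max(len(a), len(b))
    return [ned, 1.0 if a[0] == b[0] else 0.0]

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Linear SVM fitted by subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            for k in range(len(w)):
                # Regularization always applies; the data term only inside the margin.
                grad = lam * w[k] - (yi * xi[k] if margin < 1 else 0.0)
                w[k] -= lr * grad
            if margin < 1:
                b += lr * yi
    return w, b

def is_cognate(model, a, b):
    """Classify a word pair with the linear decision function w.x + b > 0."""
    w, bias = model
    return sum(wk * xk for wk, xk in zip(w, features(a, b))) + bias > 0

def cluster(words, model):
    """Group words into sets as connected components of pairs predicted cognate."""
    parent = list(range(len(words)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(words)), 2):
        if is_cognate(model, words[i], words[j]):
            parent[find(i)] = find(j)
    groups = {}
    for i, word in enumerate(words):
        groups.setdefault(find(i), []).append(word)
    return sorted(groups.values())

# Toy training pairs (invented for illustration): label +1 = cognate, -1 = not.
train_pairs = [("hand", "hant", 1), ("naxt", "nait", 1),
               ("hand", "mano", -1), ("naxt", "notʃe", -1)]
X = [features(a, b) for a, b, _ in train_pairs]
y = [label for _, _, label in train_pairs]
model = train_linear_svm(X, y)
print(cluster(["hand", "hant", "mano", "main"], model))
```

Connected components is the simplest possible way to turn pairwise decisions into sets; a real system would more plausibly use a dedicated clustering algorithm and a tested SVM library rather than this hand-written training loop.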
