Similarity Dependent Chinese Restaurant Process for Cognate Identification in Multilingual Wordlists

Author: Taraka Rama

DOI: 10.18653/V1/K18-1027

Keywords: Task (project management), Word list, Computer science, Artificial intelligence, Cognate, Chinese restaurant process, Similarity (network science), Identification (information), Natural language processing, Cluster analysis, Language family

Abstract: We present and evaluate two similarity dependent Chinese Restaurant Process (sd-CRP) algorithms on the task of automated cognate detection. The sd-CRP clustering algorithms do not require any predefined threshold for detecting cognate sets in a multilingual word list. We evaluate their performance on six language families (more than 750 languages) and find that both variants perform as well as InfoMap and better than UPGMA at inferring cognate clusters. The algorithms presented in this paper are family agnostic and can be applied to any linguistically under-studied language family.
