Can characters reveal your native language? A language-independent approach to native language identification

Authors: Radu Tudor Ionescu, Marius Popescu, Aoife Cahill

DOI: 10.3115/V1/D14-1142

Abstract: A common approach in text mining tasks such as text categorization, authorship identification or plagiarism detection is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. In this work, an approach that uses character n-grams as features is proposed for the task of native language identification. Instead of doing standard feature selection, the proposed approach combines several string kernels using multiple kernel learning. Kernel Ridge Regression and Kernel Discriminant Analysis are independently used in the learning stage. The empirical results obtained in all the experiments conducted in this work indicate that the proposed approach achieves state-of-the-art performance in native language identification, reaching an accuracy that is 1.7% above the top scoring system of the 2013 NLI Shared Task. Furthermore, the proposed approach has an important advantage in that it is language independent and linguistic theory neutral. In the cross-corpus experiment, the proposed approach shows that it can also be topic independent, improving the state-of-the-art system by 32.3%.
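To make the idea of a string kernel over character n-grams concrete, here is a minimal sketch in Python. The abstract does not say which string kernels are combined or how they are weighted, so everything below is an assumption made for illustration: the choice of a p-spectrum kernel (inner product of n-gram counts), the cosine-style normalization, the unweighted mean used as a stand-in for multiple kernel learning, and all function names.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Count every contiguous character n-gram in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(s, t, n):
    """p-spectrum kernel: inner product of the two n-gram count vectors."""
    cs, ct = char_ngrams(s, n), char_ngrams(t, n)
    if len(cs) > len(ct):
        cs, ct = ct, cs  # iterate over the smaller table
    # Counter returns 0 for n-grams that do not occur in the other text.
    return sum(count * ct[gram] for gram, count in cs.items())


def normalized_kernel(s, t, n):
    """Scale the kernel so that a non-empty text has similarity 1 with itself."""
    denom = sqrt(spectrum_kernel(s, s, n) * spectrum_kernel(t, t, n))
    return spectrum_kernel(s, t, n) / denom if denom else 0.0


def combined_kernel(s, t, n_values=(2, 3, 4)):
    """Unweighted mean of normalized kernels over several n-gram lengths.

    Assumption: equal weights are a simplification; a multiple kernel
    learning method would learn the weights from training data.
    """
    return sum(normalized_kernel(s, t, n) for n in n_values) / len(n_values)
```

A matrix of `combined_kernel` values over all pairs of training essays could then be passed to any kernel-based learner that accepts a precomputed kernel. No tokenizer, tagger, or stemmer is involved, which is what makes an approach of this kind language independent.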
