Syntactic N-grams as machine learning features for natural language processing

Authors: Grigori Sidorov, Francisco Velasquez, Efstathios Stamatatos, Alexander Gelbukh, Liliana Chanona-Hernández

DOI: 10.1016/J.ESWA.2013.08.015

Keywords: Natural language processing, Classifier (linguistics), Syntax, Computer science, C4.5 algorithm, Part of speech, Support vector machine, Classifier (UML), Machine learning, Tree (data structure), Syntactic predicate, Parsing, Naive Bayes classifier, Artificial intelligence

Abstract: In this paper we introduce and discuss the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how they are constructed, i.e., in what elements are considered neighbors. In the case of sn-grams, the neighbors are taken by following syntactic relations in syntactic trees, not by taking words as they appear in a text; that is, sn-grams are constructed by following paths in syntactic trees. In this manner, sn-grams allow bringing syntactic knowledge into machine learning methods; still, previous parsing is necessary for their construction. Sn-grams can be applied in any natural language processing (NLP) task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As a baseline we used traditional n-grams of words, part of speech (POS) tags, and characters; three classifiers were applied: support vector machines (SVM), naive Bayes (NB), and the tree classifier J48. Sn-grams give better results with the SVM classifier.
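The difference between the two kinds of n-grams can be illustrated with a minimal sketch. It is not the authors' implementation: the sentence, the hand-built dependency tree, and the function names `word_ngrams`, `paths_from`, and `sn_grams` are all hypothetical, and the sketch assumes that every word in the sentence is unique so that words can serve as tree nodes. In practice the tree would come from a syntactic parser.

```python
from typing import Dict, List, Tuple

# Hypothetical example sentence and a hand-built dependency tree for it.
# Each head word maps to its direct dependents. This is an assumption for
# illustration; a real pipeline would obtain the tree from a parser.
SENTENCE = ["the", "cat", "eats", "fresh", "fish"]
TREE: Dict[str, List[str]] = {
    "eats": ["cat", "fish"],
    "cat": ["the"],
    "fish": ["fresh"],
}


def word_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Traditional n-grams: neighbors are words adjacent in the text."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def paths_from(tree: Dict[str, List[str]], word: str, n: int) -> List[Tuple[str, ...]]:
    """All downward paths of exactly n words that start at `word`."""
    if n == 1:
        return [(word,)]
    result = []
    for dep in tree.get(word, []):
        for tail in paths_from(tree, dep, n - 1):
            result.append((word,) + tail)
    return result


def sn_grams(tree: Dict[str, List[str]], n: int) -> List[Tuple[str, ...]]:
    """Syntactic n-grams: neighbors are linked by head-to-dependent relations."""
    # Collect every node, whether it appears as a head or only as a dependent.
    words = set(tree)
    for deps in tree.values():
        words.update(deps)
    grams = []
    for word in sorted(words):  # sorted only to make the output order stable
        grams.extend(paths_from(tree, word, n))
    return grams


if __name__ == "__main__":
    print(word_ngrams(SENTENCE, 2))
    print(sn_grams(TREE, 2))
    print(sn_grams(TREE, 3))
```

In this toy tree, ("eats", "fish") is a syntactic bigram even though the two words are not adjacent in the text, while the traditional bigram ("eats", "fresh") has no counterpart among the sn-grams because no syntactic relation links those words.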
