Mining language variation using word using and collocation characteristics

Authors: Peng Tang, Tommy W.S. Chow

DOI: 10.1016/J.ESWA.2014.05.018

Keywords: Entropy (information theory); Closeness; Collocation; Computer science; Artificial intelligence; Sentence; Text mining; Modern English; Natural language processing

Abstract: Two textual metrics, "Frequency Rank" (FR) and "Intimacy", are proposed in this paper to measure word using and collocation characteristics, which are two important aspects of text style. The FR, derived from the local index numbers of terms in a sentence ordered by the global frequency of terms, provides single-term-level information. Intimacy models the relationship between a word and others, i.e. the closeness of a term to the other terms in the same sentence. Two textual features, "Frequency Rank Ratio (FRR)" and "Overall Intimacy (OI)", for capturing language variation are derived by employing the two proposed metrics. Using the derived features, language variation among documents can be visualized in a text space. Three corpora consisting of documents of diverse topics, genres, regions, and dates of writing are designed and collected to evaluate the proposed algorithms. Extensive simulations are conducted to verify the feasibility and performance of our implementation. Both theoretical analyses based on entropy and the simulations demonstrate the feasibility of our method. We also show that the proposed algorithm can be used for visualizing the closeness of several western languages. Variation of modern English over time is also recognizable when using our analysis method. Finally, our method is compared to conventional text classification implementations. The comparative results indicate that our method outperforms the others.
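
The abstract gives only a one-sentence description of FR, so the following Python sketch is one possible reading of it, not the authors' implementation. It assumes that a term's "local index number" is its 1-based position once the distinct terms of a sentence are sorted by descending global (corpus-wide) frequency. The function names, the alphabetical tie-breaking rule, and the toy corpus are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def global_frequencies(sentences):
    """Count how often each term occurs across all tokenized sentences."""
    return Counter(term for sentence in sentences for term in sentence)


def frequency_ranks(sentence, freq):
    """Assumed reading of FR: order the distinct terms of one sentence by
    descending global frequency and return each term's 1-based local index.
    Ties are broken alphabetically (an arbitrary choice for this sketch)."""
    ordered = sorted(set(sentence), key=lambda term: (-freq[term], term))
    return {term: index + 1 for index, term in enumerate(ordered)}


# Toy corpus, invented for illustration only.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["a", "cat", "and", "a", "dog"],
]

freq = global_frequencies(corpus)
ranks = frequency_ranks(corpus[1], freq)
print(ranks)
```

Under this reading, a sentence built mostly from globally common terms has its frequent terms at low local indices. How the paper normalizes these indices into the FRR feature is not stated in the abstract and is not attempted here.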
