Authors: Peng Tang, Tommy W.S. Chow
DOI: 10.1016/J.ESWA.2014.05.018
Keywords: Entropy (information theory), Closeness, Collocation, Computer science, Artificial intelligence, Sentence, Text mining, Modern English, Natural language processing
Abstract: Two textual metrics, "Frequency Rank" (FR) and "Intimacy", are proposed in this paper to measure word usage and collocation characteristics, which are two important aspects of text style. The FR, derived from the local index numbers of terms in a sentence ordered by the global frequency of terms, provides single-term-level information. Intimacy models the relationship between a term and others, i.e. how close a term is to the other terms in the same sentence. Two textual features, "Frequency Rank Ratio" (FRR) and "Overall Intimacy" (OI), for capturing language variation are derived by employing the two metrics. Using these features, language variation among documents can be visualized in a text space. Three corpora consisting of documents of diverse topics, genres, regions, and dates of writing are designed and collected to evaluate the proposed algorithms. Extensive simulations are conducted to verify the feasibility and performance of our implementation. Both theoretical analyses based on entropy and the simulations demonstrate the feasibility of our method. We also show that the proposed algorithm can be used for visualizing the closeness of several western languages. Variation of modern English over time is also recognizable when using our analysis method. Finally, our method is compared with conventional text classification implementations. The comparative results indicate that our method outperforms the others.