Authors: Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Emil Stefanov
DOI: 10.1109/SP.2012.46
Keywords: Natural language processing, Computer science, Speech recognition, Stylometry, Writing style, Context (language use), Classifier (linguistics), Handwriting recognition, Identification (information), Variety (linguistics), Artificial intelligence
Abstract: We study techniques for identifying an anonymous author via linguistic stylometry, i.e., comparing the writing style against a corpus of texts of known authorship. We experimentally demonstrate the effectiveness of our techniques with as many as 100,000 candidate authors. Given the increasing availability of writing samples online, our result has serious implications for anonymity and free speech: an anonymous blogger or whistleblower may be unmasked unless they take steps to obfuscate their writing style. While there is a huge body of literature on authorship recognition based on writing style, almost none of it has studied corpora of more than a few hundred authors. The problem becomes qualitatively different at a large scale, as we show, and techniques from prior work fail to scale, both in terms of accuracy and performance. We study a variety of classifiers, both "lazy" and "eager," and show how to handle the huge number of classes. We also develop novel techniques for confidence estimation of classifier outputs. Finally, we demonstrate stylometric authorship recognition on texts written in different contexts. In over 20% of cases, our classifiers can correctly identify an anonymous author given a corpus of texts from 100,000 authors; in about 35% of cases the correct author is one of the top 20 guesses. If we allow the classifier the option of not making a guess, via confidence estimation we are able to increase the precision of the top guess from 20% to over 80% with only a halving of recall.