On the Feasibility of Internet-Scale Author Identification

Authors: Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Emil Stefanov

DOI: 10.1109/SP.2012.46

Keywords: Natural language processing, Computer science, Speech recognition, Stylometry, Writing style, Context (language use), Classifier (linguistics), Handwriting recognition, Identification (information), Variety (linguistics), Artificial intelligence

Abstract: We study techniques for identifying an anonymous author via linguistic stylometry, i.e., comparing the writing style against a corpus of texts of known authorship. We experimentally demonstrate the effectiveness of our techniques with as many as 100,000 candidate authors. Given the increasing availability of writing samples online, our result has serious implications for anonymity and free speech: an anonymous blogger or whistleblower may be unmasked unless they take steps to obfuscate their writing style. While there is a huge body of literature on authorship recognition based on writing style, almost none of it has studied corpora of more than a few hundred authors. The problem becomes qualitatively different at a large scale, as we show, and techniques from prior work fail to scale, both in terms of accuracy and performance. We study a variety of classifiers, both "lazy" and "eager," and show how to handle the huge number of classes. We also develop novel techniques for confidence estimation of classifier outputs. Finally, we demonstrate stylometric authorship recognition on texts written in different contexts. In over 20% of cases, our classifiers can correctly identify an anonymous author given a corpus of texts from 100,000 authors; in about 35% of cases the correct author is one of the top 20 guesses. If we allow the classifier the option of not making a guess, via confidence estimation we are able to increase the precision of the top guess from 20% to over 80% with only a halving of recall.
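The precision/recall trade-off described at the end of the abstract can be illustrated with a minimal sketch. It assumes that precision is the fraction of correct guesses among the guesses actually made, and that recall is the fraction of all documents that are correctly attributed; the function name, the confidence scores, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def precision_recall_with_abstention(
    predictions: List[Tuple[str, str, float]],
    threshold: float,
) -> Tuple[float, float]:
    """Compute precision and recall when the classifier may abstain.

    Each prediction is (true_author, guessed_author, confidence).
    The classifier abstains whenever confidence < threshold.
    Assumed definitions (not confirmed by the source):
      precision = correct guesses / guesses made
      recall    = correct guesses / all documents
    """
    # Keep only the documents on which a guess is actually made.
    made = [(true, guess) for true, guess, conf in predictions if conf >= threshold]
    correct = sum(1 for true, guess in made if true == guess)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(predictions) if predictions else 0.0
    return precision, recall


# Toy data invented purely for illustration.
toy = [
    ("alice", "alice", 0.9),
    ("bob", "bob", 0.8),
    ("carol", "dave", 0.3),
    ("dave", "erin", 0.2),
    ("erin", "erin", 0.1),
]

always_guess = precision_recall_with_abstention(toy, threshold=0.0)
selective = precision_recall_with_abstention(toy, threshold=0.5)
```

On this toy data, raising the threshold makes the classifier guess less often: precision rises because the low-confidence mistakes are skipped, while recall falls because one correct but low-confidence guess is skipped too.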
