KiloGrams: Very Large N-Grams for Malware Classification

Authors: Charles Nicholas, Hyrum S. Anderson, Edward Raff, Richard Zak, Mark McLean

DOI: 10.13016/M2KIB7-UPXE

Abstract: N-grams have been a common tool for information retrieval and machine learning applications for decades. In nearly all previous works, only a few values of $n$ are tested, with $n > 6$ being exceedingly rare. Larger values of $n$ are not tested due to computational burden or the fear of overfitting. In this work, we present a method to find the top-$k$ most frequent $n$-grams that is 60$\times$ faster for small $n$, and can tackle large $n \geq 1024$. Despite the unprecedented size of $n$ considered, we show how these features still have predictive ability for malware classification tasks. More important, large $n$-grams provide benefits in producing features that are interpretable by malware analysis, and can be used to create general purpose signatures compatible with industry standard tools like Yara. Furthermore, the counts of common $n$-grams in a file may be added as features to publicly available human-engineered features that rival the efficacy of professionally-developed features when used to train gradient-boosted decision tree models on the EMBER dataset.
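
To make the two ideas in the abstract concrete (selecting the top-$k$ most frequent $n$-grams, then using their per-file counts as features), here is a minimal Python sketch. It is a naive exact-counting baseline, not the method the paper presents: it keeps every distinct $n$-gram in memory, which is exactly the cost that becomes prohibitive for large corpora. The function names `top_k_ngrams` and `ngram_count_features` are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_ngrams(blobs: Iterable[bytes], n: int, k: int) -> List[Tuple[bytes, int]]:
    """Return the k most frequent byte n-grams across all blobs (exact counts).

    Naive baseline for illustration only; memory grows with the number of
    distinct n-grams seen.
    """
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive")
    counts: Counter = Counter()
    for blob in blobs:
        # Slide a window of width n over the bytes; blobs shorter than n yield nothing.
        for i in range(len(blob) - n + 1):
            counts[blob[i:i + n]] += 1
    # Sort by descending count, then by the n-gram bytes, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def ngram_count_features(blob: bytes, vocab: List[bytes]) -> List[int]:
    """Count overlapping occurrences of each vocabulary n-gram in one blob.

    Assumes every entry in vocab has the same length.
    """
    index = {gram: j for j, gram in enumerate(vocab)}
    n = len(vocab[0]) if vocab else 0
    features = [0] * len(vocab)
    for i in range(len(blob) - n + 1):
        j = index.get(blob[i:i + n])
        if j is not None:
            features[j] += 1
    return features
```

A feature vector built this way could be appended to an existing per-file feature set before training a classifier; the choice of classifier and of $n$ and $k$ is left open here.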
