Authors: Sandeep Bhatkar, Kang G. Shin, Kent Griffin, Xin Hu
DOI:
Keywords:
Abstract: The current lack of automatic and speedy labeling of the large number (thousands) of malware samples seen every day delays the generation of malware signatures and has become a major challenge for the anti-virus industry. In this paper, we design, implement, and evaluate a novel, scalable framework, called MutantX-S, that can efficiently cluster a large number of samples into families based on the programs' static features, i.e., code instruction sequences. MutantX-S is a unique combination of several novel techniques that address the practical challenges of malware clustering. Specifically, it exploits the instruction format of the x86 architecture and represents a program as a sequence of opcodes, facilitating the extraction of N-gram features. It also exploits the hashing trick recently developed in the machine learning community to reduce the dimensionality of the extracted feature vectors, thus significantly lowering the memory requirement and computation costs. Our comprehensive evaluation of a MutantX-S prototype using a database of more than 130,000 malware samples has shown its ability to correctly cluster over 80% of the samples within 2 hours, achieving a good balance between accuracy and scalability. Applying MutantX-S to malware samples created at different times, we also demonstrate that it achieves high accuracy in predicting labels for previously unknown malware.
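The two feature-engineering ideas named in the abstract, opcode N-grams and the hashing trick, can be sketched roughly as follows. This is a minimal illustration and not the MutantX-S implementation: the function names `opcode_ngrams` and `hashed_features`, the choice of MD5 as the hash, the unsigned counting variant of the hashing trick, the values of `n` and `dim`, and the sample opcode list are all assumptions made for the example. The abstract does not state which hash function, N, or vector size the authors use.

```python
import hashlib
from collections import Counter


def opcode_ngrams(opcodes, n=4):
    """Return every contiguous n-gram of an opcode sequence as a tuple."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def hashed_features(opcodes, n=4, dim=4096):
    """Fold n-gram counts into a fixed-length vector (unsigned hashing trick).

    Each n-gram is hashed to a bucket index in [0, dim); colliding n-grams
    share a bucket, which is the price paid for a bounded vector size.
    """
    vec = [0] * dim
    for gram, count in Counter(opcode_ngrams(opcodes, n)).items():
        # MD5 is used here only as a stable, process-independent hash
        # (an assumption for this sketch, not the paper's choice).
        digest = hashlib.md5(" ".join(gram).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % dim
        vec[index] += count
    return vec


# Hypothetical opcode sequence, as a disassembler might produce it.
sample = ["push", "mov", "sub", "call", "mov", "add", "pop", "ret"]
vector = hashed_features(sample, n=2, dim=16)
```

The vector length depends only on `dim`, not on how many distinct N-grams occur across the corpus, which is what keeps memory bounded when clustering a large sample set. Vectors produced this way could then be passed to any distance-based clustering routine.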