Automatic Application Identification from Billions of Files

Authors: Kyle Soska, Chris Gates, Kevin A. Roundy, Nicolas Christin

DOI: 10.1145/3097983.3098196

Abstract: Understanding how to group a set of binary files into the piece of software they belong to is highly desirable for software profiling, malware detection, or enterprise audits, among many other applications. Unfortunately, it is also extremely challenging: there is absolutely no uniformity in the ways different applications rely on different files, in how binaries are signed, or in the versioning schemes used across different pieces of software. In this paper, we show that, by combining information gleaned from a large number of endpoints (millions of computers), we can accomplish large-scale application identification automatically and reliably. Our approach relies on collecting metadata on billions of files every day, summarizing it into much smaller "sketches", and performing approximate k-nearest neighbor clustering on non-metric space representations derived from these sketches. We design and implement our proposed system using Apache Spark, show that it can process billions of files in a matter of hours, and thus could be used for daily processing. We further show that our system manages to successfully identify which files belong to which application with very high precision, and adequate recall.
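The abstract names two building blocks: summarizing per-file metadata into small sketches, and nearest-neighbor search under a dissimilarity that is not a metric. The sketch below illustrates both ideas in plain Python and is not the authors' implementation. It assumes, purely for illustration, that each file is described by a stream of identifiers of the machines it was seen on. It uses the Space-Saving summary (from the Metwally et al. reference) as the sketch, one minus the overlap coefficient as an example of a non-metric dissimilarity, and exact brute-force search as a stand-in for an approximate k-nearest neighbor index. All function names and the sample data are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List


def space_saving(stream: Iterable[Hashable], capacity: int) -> Dict[Hashable, int]:
    """Space-Saving summary: track at most `capacity` counters over a stream."""
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # Evict the smallest counter; the newcomer inherits its count plus one.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters


def overlap_dissimilarity(a: Dict[Hashable, int], b: Dict[Hashable, int]) -> float:
    """1 - overlap coefficient of the sketch keys.

    Not a metric: for {1}, {1, 2}, {2} the triangle inequality fails.
    """
    keys_a, keys_b = set(a), set(b)
    if not keys_a or not keys_b:
        return 1.0
    return 1.0 - len(keys_a & keys_b) / min(len(keys_a), len(keys_b))


def nearest_neighbors(query: str, sketches: Dict[str, Dict[Hashable, int]], k: int) -> List[str]:
    """Exact brute-force k-NN (assumption: stands in for an approximate index)."""
    ranked = sorted(
        (overlap_dissimilarity(sketches[query], sketch), name)
        for name, sketch in sketches.items()
        if name != query
    )
    return [name for _, name in ranked[:k]]


# Hypothetical observations: file name -> machines on which it was seen.
observations = {
    "app.exe": ["m1", "m2", "m3", "m1"],
    "app.dll": ["m1", "m2"],
    "other.exe": ["m9"],
}
sketches = {name: space_saving(seen, capacity=8) for name, seen in observations.items()}
neighbors = nearest_neighbors("app.exe", sketches, k=1)
```

Under these assumptions, files that tend to be installed on the same machines end up with overlapping sketches and become each other's neighbors, which is the kind of signal a clustering step could then group into applications.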

References (27)
Leonid Boytsov, Bilegsaikhan Naidan. Engineering Efficient and Effective Non-metric Space Library. Similarity Search and Applications, pp. 280-293 (2013). DOI: 10.1007/978-3-642-41062-8_28
Feng Cao, Martin Ester, Weining Qian, Aoying Zhou. Density-Based Clustering over an Evolving Data Stream with Noise. SIAM International Conference on Data Mining, pp. 328-339 (2006).
Kyle Soska, Nicolas Christin. Automatically detecting vulnerable websites before they turn malicious. USENIX Security Symposium, pp. 625-640 (2014).
Moses Charikar, Kevin Chen, Martin Farach-Colton. Finding Frequent Items in Data Streams. International Colloquium on Automata, Languages and Programming, vol. 312, pp. 693-703 (2002). DOI: 10.1016/S0304-3975(03)00400-6
Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams. Database Theory - ICDT 2005, pp. 398-412 (2004). DOI: 10.1007/978-3-540-30570-5_27
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Knowledge Discovery and Data Mining, pp. 226-231 (1996).
Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, David P. Woodruff. Beating CountSketch for heavy hitters in insertion streams. Symposium on the Theory of Computing, pp. 740-753 (2016). DOI: 10.1145/2897518.2897558
Thanh N. Tran, Ron Wehrens, Lutgarde M.C. Buydens. KNN-kernel density-based clustering for high-dimensional multivariate data. Computational Statistics & Data Analysis, vol. 51, pp. 513-525 (2006). DOI: 10.1016/J.CSDA.2005.10.001
Graham Cormode, Marios Hadjieleftheriou. Finding frequent items in data streams. Very Large Data Bases, vol. 1, pp. 1530-1541 (2008). DOI: 10.14778/1454159.1454225
Jeffrey K. Uhlmann. Satisfying general proximity / similarity queries with metric trees. Information Processing Letters, vol. 40, pp. 175-179 (1991). DOI: 10.1016/0020-0190(91)90074-R