Automatic Application Identification from Billions of Files

Authors: Kyle Soska, Chris Gates, Kevin A. Roundy, Nicolas Christin


Abstract: Understanding how to group a set of binary files into the piece of software they belong to is highly desirable for profiling, malware detection, or enterprise audits, among many other applications. Unfortunately, it is also extremely challenging: there is absolutely no uniformity in the ways different applications rely on different files, in how binaries are signed, or in the versioning schemes used across different pieces of software. In this paper, we show that, by combining information gleaned from a large number of endpoints (millions of computers), we can accomplish large-scale application identification automatically and reliably. Our approach relies on collecting metadata on billions of files every day, summarizing it into much smaller "sketches", and performing approximate k-nearest neighbor clustering on non-metric space representations derived from these sketches. We design and implement our proposed system using Apache Spark, and show that it can process billions of files in a matter of hours, and thus could be used for daily processing. We further show that our system manages to successfully identify which files belong to which application with very high precision, and adequate recall.
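The abstract names two building blocks, small "sketches" and nearest-neighbor search in a non-metric space, without giving their details. The toy sketch below illustrates the general idea only and is not the authors' method. It assumes each file is described by the set of machines it was seen on, uses a bottom-k hash sample as a stand-in sketch, uses an overlap-coefficient dissimilarity as an example of a non-metric measure, and replaces approximate search with exact brute-force search. All function names and the sample data are hypothetical.

```python
import hashlib
import heapq


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def bottom_k_sketch(machine_ids, k=4):
    """Summarize a set of machine IDs by its k smallest hash values.

    Assumption: this stands in for the paper's unspecified "sketches".
    """
    return frozenset(heapq.nsmallest(k, {stable_hash(m) for m in machine_ids}))


def overlap_dissimilarity(a, b):
    """1 - |A & B| / min(|A|, |B|).

    This measure is not a metric: it can violate the triangle inequality,
    e.g. A={1}, B={1,2}, C={2} gives d(A,B)=0, d(B,C)=0 but d(A,C)=1.
    """
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / min(len(a), len(b))


def k_nearest(query, sketches, k=2):
    """Exact brute-force k-nearest neighbors of `query` among all sketches.

    A real system at this scale would use an approximate index instead.
    """
    candidates = (
        (overlap_dissimilarity(sketches[query], sketch), name)
        for name, sketch in sketches.items()
        if name != query
    )
    return [name for _, name in heapq.nsmallest(k, candidates)]


if __name__ == "__main__":
    # Hypothetical observations: file name -> machines it was seen on.
    observations = {
        "app_a.exe": ["m1", "m2", "m3"],
        "app_a.dll": ["m1", "m2", "m3"],
        "app_b.exe": ["m7", "m8", "m9"],
    }
    sketches = {f: bottom_k_sketch(ms, k=8) for f, ms in observations.items()}
    print(k_nearest("app_a.exe", sketches, k=1))
```

Files that are installed together tend to appear on the same machines, so their sketches overlap heavily and they end up as each other's nearest neighbors; grouping such neighbors is one plausible way to recover applications from file-level observations.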




