Authors: Kyle Soska, Chris Gates, Kevin A. Roundy, Nicolas Christin
Keywords:
Abstract: Understanding how to group a set of binary files into the piece of software they belong to is highly desirable for profiling, malware detection, or enterprise audits, among many other applications. Unfortunately, it is also extremely challenging: there is absolutely no uniformity in the ways different applications rely on different files, in how binaries are signed, or in the versioning schemes used across different pieces of software. In this paper, we show that, by combining information gleaned from a large number of endpoints (millions of computers), we can accomplish large-scale application identification automatically and reliably. Our approach relies on collecting metadata on billions of files every day, summarizing it into much smaller "sketches", and performing approximate k-nearest neighbor clustering on non-metric space representations derived from these sketches. We design and implement our proposed system using Apache Spark, show that it can process billions of files in a matter of hours, and thus could be used for daily processing. We further show that our system manages to successfully identify which files belong to which application with very high precision and adequate recall.
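The sketch-then-nearest-neighbor idea in the abstract can be illustrated with a small toy example. The abstract does not say which sketch or similarity measure the system uses, so everything below is an assumption made for illustration: MinHash stands in for the "sketches", estimated Jaccard similarity over the machines on which a file was seen stands in for the similarity measure, and an exact brute-force search stands in for the approximate k-nearest neighbor step. The file names, machine IDs, and function names are all hypothetical.

```python
import hashlib

NUM_HASHES = 64  # sketch size; an arbitrary illustrative choice


def _hash(seed: int, item: str) -> int:
    """Seeded 64-bit hash built from SHA-256 (deterministic across runs)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(items, num_hashes=NUM_HASHES):
    """Summarize a set as the minimum hash value under each seeded hash."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_similarity(a, b):
    """Fraction of matching sketch slots, which estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def nearest_neighbors(sketches, k=1):
    """Exact brute-force k nearest neighbors by estimated similarity.

    A real deployment at the scale described would need an approximate
    search; this exhaustive loop is only a stand-in.
    """
    result = {}
    for name, sk in sketches.items():
        scored = sorted(
            (
                (estimated_similarity(sk, other_sk), other)
                for other, other_sk in sketches.items()
                if other != name
            ),
            key=lambda pair: (-pair[0], pair[1]),
        )
        result[name] = [other for _, other in scored[:k]]
    return result


# Hypothetical toy data: file name -> set of machine IDs where it was observed.
observations = {
    "editor.exe": {f"m{i}" for i in range(0, 100)},
    "editor.dll": {f"m{i}" for i in range(5, 100)},
    "game.exe": {f"m{i}" for i in range(200, 300)},
    "game.dat": {f"m{i}" for i in range(200, 290)},
}

sketches = {name: minhash_sketch(machines) for name, machines in observations.items()}
neighbors = nearest_neighbors(sketches, k=1)
```

In this toy data, files that are installed on largely the same machines end up as each other's nearest neighbors, which is the intuition behind grouping files into applications from endpoint co-occurrence. Each sketch has a fixed size of `NUM_HASHES` values no matter how many machines reported the file, which is what makes the summaries much smaller than the raw observations.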