Authors: Roberto Perdisci, ManChon U
Keywords: Majority rule, Cluster analysis, Quality (business), Set (abstract data type), Malware, Unsupervised learning, Ground truth, Computer science, The Internet, Data mining
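Abstract: Malware clustering is commonly applied by malware analysts to cope with the growing number of distinct malware variants collected every day from the Internet. While malware clustering systems can be useful for a variety of applications, assessing the quality of their results is intrinsically hard. In fact, clustering can be viewed as an unsupervised learning process over a dataset for which complete ground truth is usually not available. Previous studies propose to evaluate malware clustering results by leveraging the labels assigned to the malware samples by multiple anti-virus scanners (AVs). However, the methods proposed thus far require a (semi-)manual adjustment and mapping between the labels generated by different AVs, and are limited to selecting a reference sub-set of samples for which an agreement regarding their labels can be reached across a majority of AVs. This approach may bias the reference set towards "easy to cluster" malware samples, potentially resulting in an overoptimistic estimate of the accuracy of the malware clustering results. In this paper we propose VAMO, a system that provides a fully automated quantitative analysis of the validity of malware clustering results. Unlike previous work, VAMO does not seek a voting-based consensus across different AV labels, and does not discard the malware samples for which such a consensus cannot be reached. Rather, VAMO explicitly deals with the inconsistencies typical of multiple AV labels to build a more representative reference set, compared to previous approaches. Furthermore, VAMO avoids the need for the (semi-)manual mapping between AV labels from different scanners that was required in previous work. Through an extensive evaluation in a controlled setting and a real-world application, we show that VAMO outperforms majority voting-based approaches, and provides a better way for malware analysts to automatically assess the quality of their malware clustering results.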
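To make the bias concern concrete, the following is a minimal sketch of the majority voting-based reference set selection that the abstract contrasts VAMO against. It is not VAMO's method and is not taken from the paper: the function name `majority_vote_reference`, the `min_fraction` threshold, and the sample data are hypothetical, and the sketch assumes AV labels have already been mapped to common family names (the (semi-)manual step that previous work required).

```python
from collections import Counter


def majority_vote_reference(av_labels, min_fraction=0.5):
    """Illustrative baseline (hypothetical helper, not VAMO).

    av_labels maps a sample id to the list of family labels assigned by
    different AV scanners, assumed to be already normalized to a common
    naming scheme. A sample enters the reference set only if its most
    frequent label is shared by more than `min_fraction` of the scanners.
    """
    reference = {}
    discarded = []
    for sample, labels in av_labels.items():
        if not labels:
            # No scanner labeled this sample, so no consensus is possible.
            discarded.append(sample)
            continue
        label, votes = Counter(labels).most_common(1)[0]
        if votes > min_fraction * len(labels):
            reference[sample] = label
        else:
            # No majority agreement: the sample is dropped from the
            # reference set, which is the source of the "easy to cluster" bias.
            discarded.append(sample)
    return reference, discarded


# Made-up labels for three samples from three scanners.
av_labels = {
    "s1": ["zbot", "zbot", "zbot"],
    "s2": ["zbot", "zbot", "sality"],
    "s3": ["virut", "sality", "zbot"],
}
reference, discarded = majority_vote_reference(av_labels)
print(reference)
print(discarded)
```

In this toy input, the sample on which the scanners disagree completely is discarded, so the reference set retains only samples with consistent labels. That is the behavior the abstract argues can lead to an overoptimistic accuracy estimate.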