Authors: Brad Miller, Alex Kantchelian, Michael Carl Tschantz, Sadia Afroz, Rekha Bachwani
DOI: 10.1007/978-3-319-40667-1_7
Keywords:
Abstract: We present and evaluate a large-scale malware detection system integrating machine learning with expert reviewers, treating reviewers as a limited labeling resource. We demonstrate that even in small numbers, reviewers can vastly improve the system's ability to keep pace with evolving threats. We conduct our evaluation on a sample of VirusTotal submissions spanning 2.5 years and containing 1.1 million binaries with 778 GB of raw feature data. Without reviewer assistance, we achieve 72% detection at a 0.5% false positive rate, performing comparably to the best vendors on VirusTotal. Given a budget of 80 accurate reviews daily, we improve detection to 89% and are able to detect 42% of malicious binaries undetected upon initial submission to VirusTotal. Additionally, we identify a previously unnoticed temporal inconsistency in the labeling of training datasets. We compare the impact of training labels obtained at the same time the training data is first seen with training labels obtained months later. We find that using training labels obtained well after samples appear, and thus unavailable in practice for current training data, inflates measured detection by almost 20 percentage points. We release our cluster-based implementation, as well as a list of all hashes in our evaluation and 3% of our entire dataset.