Authors: Hugo Leonardo Duarte-Garcia, Carlos Domenick Morales-Medina, Aldo Hernandez-Suarez, Gabriel Sanchez-Perez, Karina Toscano-Medina
DOI: 10.1109/EUROSPW.2019.00033
Keywords: Categorization, Feature extraction, Machine learning, Word2vec, Computer science, Artificial intelligence, Word embedding, Semi-supervised learning, Unsupervised learning, Cluster analysis, Malware
Abstract: Due to the vertiginous growth of malicious actors, malware has been crafted, distributed and propagated around the world with new sophisticated techniques. Classical detection procedures, mostly based on signatures and heuristic searches, are now being replaced by machine learning-based (ML) solutions. However, some challenges are still present. Firstly, supervised approaches use anti-virus tags to create hand-crafted datasets, resulting in a lack of taxonomy and uncertainty about whether a given observation is classified with a proper label. Secondly, off-line feed-forward approaches may result in complex and time-consuming feature extraction tasks. In this work, we propose a novel method that reinforces malware characterization by capturing rich relevance and contextual patterns into an n-dimensional weighted word embedding vector (WEV) space. Results prove that, by clustering similar WEVs via unsupervised learning, malware can be categorized into four major families, improving categorization with fewer resources.
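
The idea of weighting word embeddings and clustering the resulting vectors can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not specify the weighting scheme or the clustering algorithm, so the sketch assumes IDF weights and k-means, swaps Word2vec for simple co-occurrence count vectors to stay dependency-free, and uses two clusters instead of four families. The traces in `SAMPLES` and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy traces: each sample is a sequence of API-call-like tokens.
SAMPLES = [
    ["get_time", "open_file", "encrypt", "delete_file"],
    ["open_file", "encrypt", "delete_file", "encrypt", "get_time"],
    ["get_time", "connect", "log_keys", "send_keys"],
    ["connect", "log_keys", "send_keys", "log_keys", "get_time"],
]

def cooccurrence_embeddings(samples, window=1):
    """Count-based token vectors (a stand-in for Word2vec, by assumption)."""
    vocab = sorted({tok for s in samples for tok in s})
    index = {tok: i for i, tok in enumerate(vocab)}
    vecs = {tok: [0.0] * len(vocab) for tok in vocab}
    for s in samples:
        for i, tok in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    vecs[tok][index[s[j]]] += 1.0
    return vecs

def idf_weights(samples):
    """Assumed weighting: tokens present in every sample get the lowest weight."""
    df = Counter(tok for s in samples for tok in set(s))
    return {tok: math.log(len(samples) / df[tok]) + 1.0 for tok in df}

def weighted_vector(sample, vecs, idf):
    """IDF-weighted average of the token vectors of one sample."""
    total = sum(idf[t] for t in sample)
    dim = len(next(iter(vecs.values())))
    return [sum(idf[t] * vecs[t][k] for t in sample) / total for k in range(dim)]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

vecs = cooccurrence_embeddings(SAMPLES)
idf = idf_weights(SAMPLES)
wevs = [weighted_vector(s, vecs, idf) for s in SAMPLES]
labels = kmeans(wevs, k=2)
print(labels)
```

On this toy input the two file-encrypting traces fall into one cluster and the two key-logging traces into the other, while the shared `get_time` token contributes less to each vector because of its lower weight.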