Authors: Charles Nicholas, Hyrum S. Anderson, Edward Raff, Richard Zak, Mark McLean
DOI: 10.13016/M2KIB7-UPXE
Keywords:
Abstract: N-grams have been a common tool for information retrieval and machine learning applications for decades. In nearly all previous works, only a few values of $n$ are tested, with $n > 6$ being exceedingly rare. Larger values of $n$ are not tested due to computational burden or the fear of overfitting. In this work, we present a method to find the top-$k$ most frequent $n$-grams that is 60$\times$ faster for small $n$, and can tackle large $n\geq1024$. Despite the unprecedented size of $n$ considered, we show how these features still have predictive ability for malware classification tasks. More importantly, large $n$-grams provide benefits in producing features that are interpretable by malware analysis, and can be used to create general purpose signatures compatible with industry standard tools like Yara. Furthermore, the counts of common $n$-grams in a file may be added as features to publicly available human-engineered features that rival the efficacy of professionally-developed features when used to train gradient-boosted decision tree models on the EMBER dataset.
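To make the task concrete, the following is a minimal sketch of what "top-$k$ most frequent $n$-grams" and "$n$-gram counts as features" mean for raw bytes. It is a naive exact baseline that stores every distinct $n$-gram in memory, not the faster method the abstract refers to; the function names `top_k_ngrams` and `ngram_count_features` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_ngrams(files: Iterable[bytes], n: int, k: int) -> List[Tuple[bytes, int]]:
    """Naive baseline (assumption, not the paper's algorithm):
    count every byte n-gram across all files, then keep the k most frequent."""
    counts: Counter = Counter()
    for data in files:
        # Slide a window of width n over the byte string.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts.most_common(k)


def ngram_count_features(data: bytes, vocabulary: List[bytes]) -> List[int]:
    """Count occurrences (overlaps included) of each selected n-gram in one file.
    Assumes every entry in `vocabulary` has the same length n."""
    features = [0] * len(vocabulary)
    if not vocabulary:
        return features
    n = len(vocabulary[0])
    index = {gram: j for j, gram in enumerate(vocabulary)}
    for i in range(len(data) - n + 1):
        j = index.get(data[i:i + n])
        if j is not None:
            features[j] += 1
    return features


corpus = [b"abcabc", b"abcd"]
top = top_k_ngrams(corpus, n=3, k=1)
vocab = [gram for gram, _ in top]
feats = ngram_count_features(b"xabcabc", vocab)
```

The memory used by this baseline grows with the number of distinct $n$-grams in the corpus, which illustrates the computational burden the abstract gives as a reason larger $n$ is rarely tested.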