NIPS Demonstration

Cynomix: A Machine Learning Aided Workbench for Rapid Comprehension of Large Malware Corpora

Joshua D Saxe · David Mentis · Chris Greamo

Harrah's Special Events Center, 2nd Floor - Tahoe A & B

Abstract:

Although the number of malware samples active on the Internet has risen above ten million and is growing at an exponential rate, in operational contexts today most analysis of malware is still done by hand, sample by sample, by expert reverse engineers. As a result, most malware samples have not been analyzed or understood.

We will demonstrate a novel intelligent workbench for analyzing large malware corpora that sees past code obfuscation to identify code-sharing relationships between malware samples. When code is reused across samples, the workbench propagates analyst annotations from one sample to the others. We believe our project can help bring about a paradigm shift in how the malware landscape is studied, giving the security community a broader and deeper understanding of the nature and evolution of malware.

Our system includes four components: a feature extraction component, a code-sharing estimation component, a malware behavioral trait identification component, and a visual interface that ties these components together. Our feature extraction component identifies semantically meaningful subsequences of malware system call behavior logs through a novel Markovian sequence extraction method that runs in linear time. We also extract instruction n-grams from each sample's control flow graph, declared library and function imports, and printable strings parsed from the binary sample files.
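As an illustration of two of the simpler feature types named above, the following minimal Python sketch extracts instruction n-grams and printable strings. It is not the Cynomix implementation: the function names `ngrams` and `printable_strings` are hypothetical, the instructions are assumed to arrive as a flat list of mnemonics rather than a real control flow graph, and the Markovian sequence extraction method is not reproduced here because the abstract does not describe it.

```python
import re
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous n-grams in a token sequence.

    Assumption: tokens is a flat list of instruction mnemonics; walking
    an actual control flow graph is out of scope for this sketch.
    """
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


# Hypothetical inputs, made up for illustration only.
mnemonics = ["push", "mov", "call", "mov", "call"]
blob = b"\x00\x01kernel32.dll\x00ab\x00CreateFileA\xff"

bigrams = ngrams(mnemonics, n=2)    # repeated ("mov", "call") collapses to one entry
strings = printable_strings(blob)   # "ab" is dropped because it is shorter than min_len
```

Representing each sample as a set of such features makes the samples directly comparable with set-based similarity measures.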

To estimate code sharing we use a novel ensemble similarity function that incorporates each sample's control flow graph information, system call log subsequence features, and binary file metadata. To compute pairwise similarities we use a locality-sensitive hashing technique that allows our system to scale to tens of thousands of samples. Our code-sharing detection approach was evaluated last month by a test team at MIT Lincoln Laboratory and scored extremely well, both in absolute terms and relative to the other algorithms under test.
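The abstract does not specify which hashing scheme or ensemble function is used, so the sketch below substitutes a common textbook combination: MinHash signatures with banded locality-sensitive hashing (LSH) to find candidate pairs without comparing every pair, plus a plain weighted average of per-feature Jaccard scores standing in for the ensemble. All names, the number of hash functions, the band count, and the weights are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of item under a given seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def minhash_signature(features: Set[str], num_hashes: int = 64) -> List[int]:
    """MinHash signature of a non-empty feature set."""
    return [min(_hash(seed, f) for f in features) for seed in range(num_hashes)]


def candidate_pairs(signatures: Dict[str, List[int]],
                    bands: int = 16) -> Set[Tuple[str, str]]:
    """Banded LSH: samples that agree on every row of any band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for name, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity, used to score the candidate pairs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ensemble_similarity(scores: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Weighted average of per-feature-type scores (a stand-in, not the authors' function)."""
    total = sum(weights[k] for k in scores)
    return sum(weights[k] * scores[k] for k in scores) / total


# Hypothetical samples described by made-up feature sets.
samples = {
    "a": {"push|mov", "mov|call", "kernel32.dll"},
    "b": {"push|mov", "mov|call", "kernel32.dll"},
    "c": {"xor|jmp", "ws2_32.dll"},
}
sigs = {name: minhash_signature(feats) for name, feats in samples.items()}
pairs = candidate_pairs(sigs)  # identical sets "a" and "b" always share a bucket
```

Only the candidate pairs returned by the banding step need an exact similarity computation, which is what keeps the work well below a full pairwise comparison as the corpus grows.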
