Learning a Portfolio-Based Checker for Provenance-Similarity of Binaries

This is an ongoing Independent Research & Development (IRAD) project at the Software Engineering Institute, Carnegie Mellon University. The goal of this project is to explore the use of supervised learning (a.k.a. classification) in detecting provenance-similarity between binaries (executables). Broadly, two binaries are provenance-similar if they have been compiled from similar source code with similar compilers. Detecting provenance-similarity is a challenging area of research, with important applications that include code clone detection, understanding the impact of software updates, judging the provenance of untrusted software, and fighting malware.

The project is being led by Sagar Chaki, Arie Gurfinkel, and Cory Cohen. Our current focus is on detecting similarity between functions. Intuitively, a function is a fragment of a binary derived by compiling a source-level procedure or method. We believe that functions are an ideal basis for judging binary similarity: they are the fundamental units of a binary's behavior. If two binaries have many functions in common, then they are very likely to be similar. The greater the share of common functions, the higher the degree of similarity. We have recently blogged about our work.
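To make the "share of common functions" idea concrete, here is a minimal sketch in Python. It is an illustration only, not the project's checker: it assumes each function has already been reduced to a comparable fingerprint (the strings below are hypothetical), and it uses exact set overlap (the Jaccard index) as a stand-in for the learned similarity decision described above.

```python
def shared_function_ratio(funcs_a, funcs_b):
    """Jaccard overlap between two collections of function fingerprints.

    Assumption: exact fingerprint equality stands in for "these two
    functions are similar". A learned checker would replace that test.
    """
    a, b = set(funcs_a), set(funcs_b)
    if not a and not b:
        return 0.0  # no functions on either side: report no similarity
    return len(a & b) / len(a | b)


# Hypothetical fingerprints for two binaries that share three functions.
binary_a = {"f1", "f2", "f3", "f4"}
binary_b = {"f2", "f3", "f4", "f5"}
score = shared_function_ratio(binary_a, binary_b)
```

A higher ratio means a larger share of common functions, which under the reasoning above suggests a higher degree of similarity between the two binaries.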

Benchmark

We are releasing a benchmark and some tools that we have developed and are using as part of our project. Once you download and unpack the distribution (using tar -xvjf), read the README.txt file for further instructions. The benchmark is derived from some of the most downloaded open-source software available from SourceForge. We compiled the source code using three versions of Microsoft Visual Studio: 2003 .NET, 2005, and 2008. We then extracted functions from the resulting binaries using IDA Pro, together with our custom extensions. Finally, we extracted features using a custom ROSE plugin. The benchmark is packaged as a SQLite3 database. The tools should run on a modern Linux distribution (we have used them on Ubuntu 8.04 and 9.04).
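Because the benchmark is a SQLite3 database, a quick way to get oriented is to list its tables before reading further. The sketch below uses only Python's standard sqlite3 module and makes no assumption about the benchmark's schema; the helper name list_tables is ours, and the path you pass in is whatever database file the distribution unpacks to (see README.txt).

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite database, sorted."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Note that sqlite3.connect creates an empty database file if the path does not exist, so an empty result may simply mean the path is wrong.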

Publications

Contact