Learning a Portfolio-Based Checker for Provenance-Similarity of Binaries

This is an ongoing Independent Research & Development (IRAD) project at the Software Engineering Institute, Carnegie Mellon University. The goal of this project is to explore the use of supervised learning (a.k.a. classification) in detecting provenance-similarity between binaries, or executables. Broadly, two binaries are provenance-similar if they have been compiled from similar source code with similar compilers. Detecting provenance-similarity is a challenging area of research, with important applications including code clone detection, understanding the impact of software updates, judging the provenance of untrusted software, and fighting malware.

The project is being led by Sagar Chaki, Arie Gurfinkel, and Cory Cohen. Our current focus is on detecting similarity between functions. Intuitively, a function is a fragment of a binary derived by compiling a source-level procedure or method. We believe that functions are an ideal basis for judging binary similarity: they are the fundamental units of a binary's behavior. If two binaries have many functions in common, then they are very likely to be similar. The greater the share of common functions, the higher the degree of similarity. We have recently blogged about our work.
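
The idea that a larger share of common functions means a higher degree of similarity can be illustrated with a minimal sketch. This is not the project's checker: the page gives no formula, so the Jaccard index used here is an assumption, and the exact-match string fingerprints (and the name shared_function_ratio) are hypothetical stand-ins for the learned, per-function similarity decision that the project investigates.

```python
def shared_function_ratio(functions_a, functions_b):
    """Return the Jaccard index of two collections of function fingerprints.

    Assumption: each function is represented by a hashable fingerprint and
    two functions count as "common" only when their fingerprints are equal.
    A learned classifier would replace this exact-match test.
    """
    set_a, set_b = set(functions_a), set(functions_b)
    union = set_a | set_b
    if not union:
        # Two binaries with no functions: define the score as 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical fingerprints for two builds of the same program.
old_build = {"f1", "f2", "f3", "f4"}
new_build = {"f2", "f3", "f4", "f5"}

# 3 shared functions out of 5 distinct ones.
score = shared_function_ratio(old_build, new_build)
print(score)
```

A score near 1 suggests that the two binaries share most of their functions, while a score near 0 suggests little overlap. Any threshold for calling two binaries similar would have to be chosen empirically.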


For more information, please contact Sagar Chaki.