Author: Sabrina Ho
In this multi-series blog, I will be reviewing the paper — Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. This paper talks about the application of AI to cybersecurity including malware detection, vulnerability search, and more. Before we get into that, there are some basic concepts we need to understand first:
♦ Control flow graph: What is a control flow graph? And how does it relate to software?
♦ Firmware: It’s a special type of software that you find in lots of places, but why is it important in a cybersecurity context?
♦ Graph embeddings: This is a new technique for taking highly unstructured data (graphs,) and transforming it to a compact vector. We will explore a popular technique to accomplish this.
What are control flow graphs? If you have ever written any programs, you can represent them as control flow graphs. When I first started learning Python, I had to make a graph for every single program I wrote. Here is a simple program with its control flow graph on the right:
But there are other control flow graphs for other simple programs, here are some additional examples:
However, in reality, control flow graphs can be super complex, just like the one below. This is a control flow graph for a simple piece of low-level software written in assembly language (for more information, go here):
Because code can be represented as graphs, you might consider using graph matching algorithms to assess code similarity. Code similarity techniques have been used to detect code plagiarism and here we are evaluating its utility in detecting malware or vulnerabilities. Unfortunately, graph matching can be very expensive. Its time complexity and cost can be as bad as O(n³), which is inevitably slow and can sometimes be inaccurate.
Instead, we can convert graphs into embeddings and perform code similarity and code search by just using those embeddings, which is much faster. What kind of code would be useful to turn into embeddings, and why?
Firmware is one kind of software that can be found in all kinds of devices — from your cell phone to IoT. Because these devices are cheap and many of them are not going to be secured, they have become the largest target for hackers. The image below shows the example devices that contain firmware:
It would be cool if there was a fast and simple way to determine whether a piece of firmware is similar to malware or to code that is validated as vulnerable. That way we can quickly determine if the software is malicious or vulnerable to hacking. We can do this by translating the code into embeddings via its control flow graph.
So, how do we get an embedding from a graph such as control flow graph? We’ve seen graph embeddings in my previous blogs where I turn molecules into vectors. Many kinds of graphs can be turned into graph embedding. It is a powerful technique to understand social networks, molecules, neural networks, and lots of other systems. The image below is an example of a molecule converted into a vector form:
Graph embeddings can help us quickly assess the similarity of graphs without the cost of graph matching. Here is another example of a graph—a social network:
The previous two images shown above are graphs. Yet, the social network graph is a lot more complicated than the molecule graph.
There are many techniques to convert a graph into embeddings, and DeepWalk is one of them. The image below shows the phases of DeepWalk approach:
First, we take random walks on the graph to produce fixed-length sample paths. If we take enough random walks, we can accumulate a data set that resembles the patterns in the graph. Then, we can use this as the training data for machine learning, which learns the pattern and encodes them into embeddings such as vectors. (For more information on DeepWalk, go here)
Now that we have a good understanding of basic graph concepts, we can dive deeper into the paper that applies artificial intelligence and similarity search to malware and vulnerability detection in the next blog of the series.