Neural Nets Can Learn Function Type Signatures from Binaries [pdf] (usenix.org)
213 points by bmc7505 on Aug 18, 2017 | 43 comments


TLDR:

> In this paper, we present a new system called EKLAVYA which trains a recurrent neural network to recover function type signatures from disassembled binary code. EKLAVYA assumes no knowledge of the target instruction set semantics to make such inference.

> [..] we find by analyzing its model that it auto-learns relationships between instructions, compiler conventions, stack frame setup instructions, use-before-write patterns, and operations relevant to identifying types directly from binaries.

> In our evaluation on Linux binaries compiled with clang and gcc, for two different architectures (x86 and x64), EKLAVYA exhibits accuracy of around 84% and 81% for function argument count and type recovery tasks respectively.
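
For a concrete feel for what this kind of model might look like, here's a minimal sketch of an RNN that reads instruction-token IDs and predicts a function's argument count. To be clear, this is illustrative only and not the authors' actual architecture; the vocabulary size, layer sizes, and argument-count cap are made-up placeholders.

  # Illustrative sketch only -- NOT EKLAVYA's real architecture.
  # Reads a sequence of instruction-token IDs, predicts the argument count.
  import torch
  import torch.nn as nn

  class ArgCountRNN(nn.Module):
      def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256, max_args=10):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, embed_dim)   # instruction token -> vector
          self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
          self.head = nn.Linear(hidden_dim, max_args)        # classify 0..max_args-1 arguments

      def forward(self, instr_ids):                          # instr_ids: (batch, seq_len)
          x = self.embed(instr_ids)
          _, h = self.rnn(x)                                 # h: (1, batch, hidden_dim)
          return self.head(h[-1])                            # (batch, max_args) logits

  model = ArgCountRNN()
  logits = model(torch.randint(0, 5000, (4, 200)))           # 4 functions, 200 instructions each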


> "On our x86 and x64 datasets, EKLAVYA exhibits comparable accuracy with traditional heuristics-based methods"


I've long had this idea to build a decompiler (a program that maps binaries back to source code) using machine learning. The problem in decompilation is that you lose information when you compile source code. Machine learning could even help recover things like the most likely variable names. There's also tons of training data that can be generated easily.


> recover things like most likely variable names

This is already being researched! See http://jsnice.org/ for recovering variable names from minified javascript.

I remember testing it by coding a binary search with single letter variable names. It managed to rename them to the more canonical "lo"/"mid"/"hi" names just by recognizing that the code pattern is typically used for a binary search.


I was just about to post a link! Very interesting project, and the folks behind jsnice are super cool, too.


Neural nets can't magically create information that's not there; all they can do is combine the data in the binary with their learned 'experience' and make what amounts to educated guesses about what the decompiled code might be. That has value, obviously, but it's not really 'recovering' anything, it's 'assuming'.


Or possibly improving upon. A variable being badly named isn't rare.


yep, that with an auto-spell checker would make a darn good tool! something like the MS Word spell checker that goes suggestion by suggestion and lets you click yes/no/manual for var names :)


Cast it as a reinforcement learning problem: you produce a candidate decompilation, try to compile it, and score it based on the similarity to the initial binary.
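
A rough sketch of what the reward signal could look like in that setup - purely illustrative, assuming gcc is on the PATH and using a crude byte-level similarity (a real system would more likely compare disassembly or control-flow graphs):

  # Hypothetical reward function for an RL-style decompiler (sketch only).
  import subprocess, tempfile, difflib
  from pathlib import Path

  def reward(candidate_source: str, target_object: bytes) -> float:
      """Compile a candidate decompilation and score it against the target .o bytes."""
      with tempfile.TemporaryDirectory() as d:
          src, obj = Path(d) / "cand.c", Path(d) / "cand.o"
          src.write_text(candidate_source)
          if subprocess.run(["gcc", "-c", "-O2", "-o", str(obj), str(src)]).returncode != 0:
              return 0.0                      # candidate doesn't even compile
          produced = obj.read_bytes()
          # crude byte-level similarity in [0, 1]
          return difflib.SequenceMatcher(None, produced, target_object).ratio()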


I've had the same idea, but instead it could "decompile" spaghetti code with horrible variable names into something readable.

It seems like a reasonable (almost believable even) first stage to software finally coming to eat the jobs of its creators.


Our jobs will just move an abstraction level higher. We used to have to worry about registers and stuff, then we automated that. We used to worry about memory, then we automated that. We used to worry about CPU stuff, then we automated that for most applications.

Eventually we're going to become technical PMs. As long as there are fuzzy, unclear problem descriptions, there will be jobs for people who can codify and standardize processes.


"First they came for the assembly programmers..."


"And then they ate the universe, and there was no-one left to care."

Plenty of paperclips, though.


> Our jobs will just move an abstraction level higher.

This seems a little optimistic in terms of the average software developer's ability to think at higher and higher levels of abstraction.

It's not an easy skill, and not that many people are very strong in it.

Just look at how programmers today react to mathematical paradigms and explanations (category theory, anyone) in any practical setting.


You say that, but I've written software with hand-unrolled loops in assembler on ARM because that's what we needed.


I'd really appreciate something that takes my hacked-together code and cleans it up and makes it presentable. That's usually the second phase of the job anyway, which could be automated.


It would be interesting to automatically generate the training data and feed it to the ML module in real time.

You could hard-code an interactive agent that periodically peeked at and coerced the input (sitting between itself and the ML function). This way the ML module is coercible.

It would be even cooler if the ML function itself learned the interactive agent function as part of itself.


In a completely different approach, a neural net was trained on input-output pairs and stack traces of a program. The neural net learned the algorithm. The advantage is that all algorithms could be "neuralized".


How is that an advantage?


The neuralized algorithms become really hard to debug? I think of it as a magician's curtain.


There have been dozens of such projects at this point. http://apk-deguard.com/ is one example.


I've often wondered about the theoretical limit of a neural net's ability to learn from examples - seems like a fascinating subject with lots of implications. From a quick search, I found this paper: https://experts.illinois.edu/en/publications/computational-l... which is already very interesting. Are there any other good pointers on this?


According to the universal approximation theorem [1], a multilayer feedforward net with a single hidden layer can approximate any continuous function on a compact subset of R^n to arbitrary accuracy. There's also a nice visual proof. [2]

[1]: http://www.sciencedirect.com/science/article/pii/08936080899...

[2]: http://neuralnetworksanddeeplearning.com/chap4.html
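
A toy illustration of the claim (not a proof), fitting sin(x) with a single hidden layer; the layer width and training details here are arbitrary:

  # One hidden layer approximating sin(x) on [-pi, pi] -- toy example only.
  import math
  import torch
  import torch.nn as nn

  x = torch.linspace(-math.pi, math.pi, 512).unsqueeze(1)
  y = torch.sin(x)
  net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))  # single hidden layer
  opt = torch.optim.Adam(net.parameters(), lr=1e-2)
  for _ in range(2000):
      opt.zero_grad()
      loss = ((net(x) - y) ** 2).mean()
      loss.backward()
      opt.step()
  print(loss.item())  # mean squared error should end up near zero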


Are there also theorems about what functions can be learned, or how quickly they can be learned, by a multilayer feed forward neural net using backpropagation?


Yes, they use what's called the VC-dimension. Chapter 20 of the book Understanding Machine Learning (available for free online http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning...) has a good introduction to the basics of this theory.

The short story is that VC dimension describes "worst case" behavior of a learning algorithm. This is a good predictor of performance on really hard problems, but lots of modern problems turn out to be theoretically easy. For example, neural networks trained on images work much better than we would guess based on the VC dimension of neural networks. This is because the distribution of images is closer to a best case scenario than a worst case scenario.

Lots of research is being done to determine when we are in a best case scenario. The "simplest" such measure is called the Rademacher complexity. (See Chapter 26 of the UML book.) Simplest is in quotes because most PhD-level students in machine learning that I've met don't even understand what Rademacher complexity is unless they are specializing in theory. And Rademacher complexity doesn't really come close to capturing why neural networks work well on images either.
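
For reference, one standard form of the VC generalization bound (constants differ between textbooks): with probability at least 1 - δ over m i.i.d. training examples, every hypothesis h in a class of VC dimension d satisfies

  err(h) <= err_train(h) + sqrt( (d * (ln(2m/d) + 1) + ln(4/δ)) / m )

which only becomes meaningful once m is comfortably larger than d.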


True, but that does not show that the approximation can always be learned from examples, does it?


No, it does not even guarantee an upper bound on the number of hidden units required to approximate the function with a certain accuracy (even if it can be learned). So its usefulness is very limited.


COTS = "Common off the shelf", in case anyone is equally mystified. Google wasn't immediately helpful.


Commercial Off The Shelf, I thought.


Indeed, it is "Commercial Off The Shelf". I've sometimes seen it in marketing speak, too.


Apparently "Consumer Off-The-Shelf" is another variant.


The authors have put out the datasets: https://github.com/shensq04/EKLAVYA


This would only be useful for binaries where you don't have corresponding source, but the binary does still have symbols. Is this a common situation?


Are you sure? AFAICT their method only uses debugging symbols for training. I haven't read it all, but it would be a big omission if it only worked on binaries with debugging symbols and they didn't mention that. When actually reversing, they work on numeric "symbols" for the assembly instructions.


It's usually all you get if you are reverse engineering someone else's executable or firmware image.


But they leave symbols in? They don't strip those also?


Indeed. If you start Emacs on OSX and run `symbols Emacs`, you get a lot of info:

  100139190 (    0x90) verror [FUNC, EXT...
  100139220 (   0x210) Fcommandp [FUNC, ...
  100139430 (    0xf0) Fautoload [FUNC, ...
  100139520 (    0x90) un_autoload [FUNC...
  1001395b0 (    0x80) Feval [FUNC, EXT,...
  100139630 (    0x50) record_in_backtra...
  100139680 (   0x1d0) apply_lambda [FUN...
  100139850 (   0x390) Fapply [FUNC, EXT...
  100139be0 (   0x5c0) Ffuncall [FUNC, E...
  10013a1a0 (    0x90) Frun_hooks [FUNC,...
  10013a230 (   0x1d0) run_hook_with_arg...
  10013a400 (    0x20) funcall_nil [FUNC...
  10013a420 (    0x20) Frun_hook_with_ar...
  10013a440 (    0x20) Frun_hook_with_ar...
  10013a460 (    0x30) Frun_hook_with_ar...
  10013a490 (    0x30) funcall_not [FUNC...
So why not strip these? Well, you can't. But the reason is interesting. Most programs run by using shared libraries, and these libraries need to provide a standard API that other programs can use. Hence the function names. Which of course is exactly the info that a hypothetical bad guy needs to make sense of the program.

But it goes beyond this. You could imagine statically linking a program against all of its dependencies, then stripping all possible symbols. The thing is, there is plenty of info in the binary to make sense of the program. Strings, for example. Error messages. Every piece of data you want to show to the user is an indication of what the surrounding function is doing.

IDA Pro is pretty incredible in that situation. Each time you figure out what a function is doing, you just give it whatever name you want. IDA updates the whole interface so that it uses that name everywhere, instead of the hex address. You end up with a neat, perfectly sensible program. Entire game engines have been meticulously reverse engineered this way.

However, viruses use a crafty technique to prevent analysis. You encrypt your program like an onion. Your outer program runs, and everyone can see that. But it's carrying around a blob of functionality that was encrypted using the target's information. E.g. their MAC address, or a subset of their list of installed programs, or anything that would uniquely identify your target separately from everyone else. Any time the virus is installed anywhere, it tries to decrypt itself using this information, which only succeeds when it's installed on the target machine.

This defeats all attempts at analysis. You can't analyze what you can't decrypt.


This is an interesting concept. Can you provide any links to practical info on this?

It seems to be quite easy to do something similar; especially if you're specifically targeting a single PC with a very specific configuration.

How do you ensure that the amount of entropy in the key is enough to stop a person from finding the key? Assuming the person reverse-engineering the virus already knows the key generation routine, since he has access to the binary of the virus.

Hostnames and lists of installed programs should be prone to a dictionary attack, and MAC addresses are not even close to having enough bytes.


Yeah! I have to run for a bit but I'll be back. There's an incredible article about Flame I detailed a bit here:

https://news.ycombinator.com/item?id=15046089

It depends entirely on whether you have persistent access to the target. If your target is airgapped, your only option is to know something about the target machine (i.e. have a spy on the inside) that gives you enough info to encrypt the virus. For example, they could install a special program so that it's listed in C:\Program Files, then the virus decrypts using the string "${mac_addr}${x}" for x in [list of installed programs]. So as an analyst, you won't have any idea what the magic program name was.
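
Purely as an illustration of that keying idea (nothing here is real malware code; the key derivation and the use of AES-GCM are made up for the example), the decryption attempt boils down to something like:

  # Sketch of "environmental keying": the key is derived from strings only the
  # target machine has, so the blob decrypts nowhere else. Illustrative only.
  import hashlib, os, uuid
  from cryptography.hazmat.primitives.ciphers.aead import AESGCM

  def try_decrypt(blob, nonce):
      mac = format(uuid.getnode(), "012x")                  # this machine's MAC address
      for prog in sorted(os.listdir(r"C:\Program Files")):  # candidate machine-specific strings
          key = hashlib.sha256((mac + prog).encode()).digest()
          try:
              return AESGCM(key).decrypt(nonce, blob, None) # succeeds only on the target
          except Exception:
              continue
      return None                                           # wrong machine: stays opaque

From the analyst's side, brute-forcing that means guessing both the derivation scheme and the machine-specific inputs, which is exactly the point.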

That technique was likely Stuxnet, not Flame, so that article might not contain any info about it. But it's amazing in its own way. If you have any other questions too, I love chatting about this stuff.


Heh, that's interesting regarding using the MAC address as the encryption key.

I'm a bit confused though: how would the virus get to the target with the encrypted payload? Do you mean that, on another computer, the MAC address of the target machine would be used to encrypt the main payload, and then the virus given to the target?

Using the MAC is interesting, because if you placed the virus in a VM it wouldn't run unless you used the same MAC for the VM's network adapter.


Yes, exactly.

If this kind of thing interests you, this article about Flame is absolutely fascinating:

http://www.symantec.com/content/en/us/enterprise/media/secur...

Hard to believe it was already almost six years ago.

> On both servers, command-and-control activity happens through a Web application called Newsforyou. It processes the W32.Flamer client interactions and provides a simple control panel. The control panel allows the attackers to upload packages of code to deliver to compromised clients, and download packages containing stolen client data. However, in a technique not previously seen before, the uploaded and downloaded packages are encrypted, so infiltrating the command-and-control server does not reveal the code or the stolen client data. The command-and-control server simply serves as a proxy for the data, and the data is encrypted and decrypted offline by the attackers using keys unique to each client. This application also contains functionality to communicate with clients compromised by malware other than Flamer. The Web application was designed to be a framework for supporting different malware campaigns.

> In addition to avoiding the compromise of their operations, preventing both the uploading of rogue code and viewing of stolen data, the setup also maintains a clear distinction of roles. The roles include those responsible for setting up the server (admins), those responsible for uploading packages and downloading stolen data through the control panel (operators), and those holding the private key with the ability to decrypt the stolen data (attack coordinators). The operators themselves may actually be completely unaware of the contents in the stolen data. This is due to design of the process to use data security compartmentalization techniques, as shown in Figure 1.

> Despite these techniques, we were still able to determine that one of the servers delivered a module instructing Flamer to commit suicide and wipe itself off computers in late May 2012, an action we also witnessed through compromised honeypots.

> Finally, access to the control panel required a password which is stored as a hash. Despite brute force attempts at reversing the hash to plain text, we were unable to determine the plain text password.

The whole thing is filled with gems like that. They controlled the virus with a PHP webapp called "news for you!" presumably with a sly winky face emoji appended to the <title> tag.


Cheers, that's very interesting!

I bought this book a while ago - 'Malicious Cryptography: Exposing Cryptovirology'. I still need to read it more thoroughly, but they present some interesting ideas.


Linux programs are notorious for containing a lot of symbol information, and the toolchain doesn't exactly make it easy to get rid of them. To start with, you have to compile executables with '-fvisibility=hidden'. Otherwise, strip cannot remove the symbols defined in the executable, because it's assumed that another executable might want to load your executable as a shared object and import its functions/global variables.

Next, if you are using any kind of library, even statically linked, changing the default visibility for your executable isn't enough to get rid of the libraries' symbols, because those were already defaulted to public when the respective library was compiled.

Also when statically linking, you probably want to compile all the libraries with '-ffunction-sections -fdata-sections' and pass '--gc-sections' to the linker to prevent bloating your executable with unused functionality.



