Neural Nets Can Learn Function Type Signatures from Binaries [pdf] (usenix.org)
213 points by bmc7505 on Aug 18, 2017 | 43 comments


TLDR:

> In this paper, we present a new system called EKLAVYA which trains a recurrent neural network to recover function type signatures from disassembled binary code. EKLAVYA assumes no knowledge of the target instruction set semantics to make such inference.

> [..] we find by analyzing its model that it auto-learns relationships between instructions, compiler conventions, stack frame setup instructions, use-before-write patterns, and operations relevant to identifying types directly from binaries.

> In our evaluation on Linux binaries compiled with clang and gcc, for two different architectures (x86 and x64), EKLAVYA exhibits accuracy of around 84% and 81% for function argument count and type recovery tasks respectively.
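
For a concrete feel for what this kind of model might look like, here's a minimal sketch of an RNN that reads instruction-token IDs and predicts a function's argument count. To be clear, this is illustrative only and not the authors' actual architecture; the vocabulary size, layer sizes, and argument-count cap are made-up placeholders.

  # Illustrative sketch only -- NOT EKLAVYA's real architecture.
  # Reads a sequence of instruction-token IDs, predicts the argument count.
  import torch
  import torch.nn as nn

  class ArgCountRNN(nn.Module):
      def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256, max_args=10):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, embed_dim)   # instruction token -> vector
          self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
          self.head = nn.Linear(hidden_dim, max_args)        # classify 0..max_args-1 arguments

      def forward(self, instr_ids):                          # instr_ids: (batch, seq_len)
          x = self.embed(instr_ids)
          _, h = self.rnn(x)                                 # h: (1, batch, hidden_dim)
          return self.head(h[-1])                            # (batch, max_args) logits

  model = ArgCountRNN()
  logits = model(torch.randint(0, 5000, (4, 200)))           # 4 functions, 200 instructions each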


> "On our x86 and x64 datasets, EKLAVYA exhibits comparable accuracy with traditional heuristics-based methods"


I've long had this idea to build a decompiler (a program that maps binaries back to source code) using machine learning. The problem in decompilation is that you lose information when you compile source code. Machine learning could even help recover things like the most likely variable names. There's also tons of training data that can be generated easily.


> recover things like most likely variable names

This is already being researched! See http://jsnice.org/ for recovering variable names from minified javascript.

I remember testing it by coding a binary search with single letter variable names. It managed to rename them to the more canonical "lo"/"mid"/"hi" names just by recognizing that the code pattern is typically used for a binary search.


I was just about to post a link! Very interesting project, and the folks behind jsnice are super cool, too.


Neural nets can't magically create information that's not there; all they can do is combine the data in the binary with their learned 'experience' and make what amounts to educated guesses about what the decompiled code might be. That has value, obviously, but it's not really 'recovering' anything, it's 'assuming'.


Or possibly improving upon. A variable being badly named isn't rare.


yep, that with an auto-spell checker would make a darn good tool! something like the MS Word spell checker that goes suggestion by suggestion and lets you click yes/no/manual for var names :)


Cast it as a reinforcement learning problem: you produce a candidate decompilation, try to compile it, and score it based on the similarity to the initial binary.
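
A rough sketch of what the reward signal could look like in that setup - purely illustrative, assuming gcc is on the PATH and using a crude byte-level similarity (a real system would more likely compare disassembly or control-flow graphs):

  # Hypothetical reward function for an RL-style decompiler (sketch only).
  import subprocess, tempfile, difflib
  from pathlib import Path

  def reward(candidate_source: str, target_object: bytes) -> float:
      """Compile a candidate decompilation and score it against the target .o bytes."""
      with tempfile.TemporaryDirectory() as d:
          src, obj = Path(d) / "cand.c", Path(d) / "cand.o"
          src.write_text(candidate_source)
          if subprocess.run(["gcc", "-c", "-O2", "-o", str(obj), str(src)]).returncode != 0:
              return 0.0                      # candidate doesn't even compile
          produced = obj.read_bytes()
          # crude byte-level similarity in [0, 1]
          return difflib.SequenceMatcher(None, produced, target_object).ratio()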


I've had the same idea, but instead it could "decompile" spaghetti code with horrible variable names into something readable.

It seems like a reasonable (almost believable even) first stage to software finally coming to eat the jobs of its creators.


Our jobs will just move an abstraction level higher. We used to have to worry about registers and stuff, then we automated that. We used to worry about memory, then we automated that. We used to worry about CPU stuff, then we automated that for most applications.

Eventually we're going to become technical PMs. As long as there are fuzzy, unclear problem descriptions, there will be jobs for people who can codify and standardize processes.


"First they came for the assembly programmers..."


"And then they ate the universe, and there was no-one left to care."

Plenty of paperclips, though.


> Our jobs will just move an abstraction level higher.

This seems a little optimistic in terms of the average software developer's ability to think at higher and higher levels of abstraction.

It's not an easy skill, and not that many people are very strong in it.

Just look at how programmers today react to mathematical paradigms and explanations (category theory, anyone) in any practical setting.


You say that, but I've written software with hand-unrolled loops in assembler on ARM because that's what we needed.


I'd really appreciate something that takes my hacked-together code and cleans it up and makes it presentable. That's usually the second phase of the job anyway, which could be automated.


It would be interesting to automatically generate the training data and feed it to the ML module in real time.

You could hard-code an interactive agent that periodically peeked at and coerced the input (sitting between itself and the ML function). This way the ML module is coercible.

It would be even cooler if the ML function itself learned the interactive agent function as part of itself.


In a completely different approach, a neural net was trained on input-output pairs and stack traces of a program. The neural net learned the algorithm. The advantage is that all algorithms could be "neuralized".


How is that an advantage?


The neuralized algorithms become really hard to debug? I think of it as a magician's curtain.


There have been dozens of such projects at this point. http://apk-deguard.com/ is one example.


I've often wondered about the theoretical limit of a neural net's ability to learn from examples - seems like a fascinating subject with lots of implications. From a quick search, I found this paper: https://experts.illinois.edu/en/publications/computational-l... which is already very interesting. Are there any other good pointers on this?


According to the universal approximation theorem [1], a multilayer feedforward net with a single hidden layer can approximate any continuous function on a compact subset of R^n to arbitrary accuracy. There's also a nice visual proof. [2]

[1]: http://www.sciencedirect.com/science/article/pii/08936080899...

[2]: http://neuralnetworksanddeeplearning.com/chap4.html
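
A toy illustration of the claim (not a proof), fitting sin(x) with a single hidden layer; the layer width and training details here are arbitrary:

  # One hidden layer approximating sin(x) on [-pi, pi] -- toy example only.
  import math
  import torch
  import torch.nn as nn

  x = torch.linspace(-math.pi, math.pi, 512).unsqueeze(1)
  y = torch.sin(x)
  net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))  # single hidden layer
  opt = torch.optim.Adam(net.parameters(), lr=1e-2)
  for _ in range(2000):
      opt.zero_grad()
      loss = ((net(x) - y) ** 2).mean()
      loss.backward()
      opt.step()
  print(loss.item())  # mean squared error should end up near zero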


Are there also theorems about what functions can be learned, or how quickly they can be learned, by a multilayer feed forward neural net using backpropagation?


Yes, they use what's called the VC-dimension. Chapter 20 of the book Understanding Machine Learning (available for free online http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning...) has a good introduction to the basics of this theory.

The short story is that VC dimension describes "worst case" behavior of a learning algorithm. This is a good predictor of performance on really hard problems, but lots of modern problems turn out to be theoretically easy. For example, neural networks trained on images work much better than we would guess based on the VC dimension of neural networks. This is because the distribution of images is closer to a best case scenario than a worst case scenario.

Lots of research is being done to determine when we are in a best case scenario. The "simplest" such measure is called the Rademacher complexity. (See Chapter 26 of the UML book.) Simplest is in quotes because most PhD-level students in machine learning that I've met don't even understand what Rademacher complexity is unless they are specializing in theory. And Rademacher complexity doesn't really come close to capturing why neural networks work well on images either.
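
For reference, one standard form of the VC generalization bound (constants differ between textbooks): with probability at least 1 - δ over m i.i.d. training examples, every hypothesis h in a class of VC dimension d satisfies

  err(h) <= err_train(h) + sqrt( (d * (ln(2m/d) + 1) + ln(4/δ)) / m )

which only becomes meaningful once m is comfortably larger than d.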


True, but that does not show that the approximation can always be learned from examples, does it?


No, it does not even guarantee an upper bound on the number of hidden units required to approximate the function with a certain accuracy (even if it can be learned). So its usefulness is very limited.


COTS = "Common off the shelf", in case anyone is equally mystified. Google wasn't immediately helpful.


Commercial Off The Shelf, I thought.


Indeed, it is "Commercial Off The Shelf". I've sometimes seen it in marketing speak, too.


Apparently "Consumer Off-The-Shelf" is another variant.


The authors have put out the datasets: https://github.com/shensq04/EKLAVYA


This would only be useful for binaries where you don't have corresponding source, but the binary does still have symbols. Is this a common situation?


Are you sure? AFAICT their method only uses debugging symbols for training. I haven't read it all, but it would be a big omission if it only worked on binaries with debugging symbols and they didn't mention that. When actually reversing, they work on numeric "symbols" for the assembly instructions.


It's usually all you get if you are reverse engineering someone else's executable or firmware image.


But they leave symbols in? They don't strip those also?


Indeed. If you start Emacs on OSX and run `symbols Emacs`, you get a lot of info:

  100139190 (    0x90) verror [FUNC, EXT...
  100139220 (   0x210) Fcommandp [FUNC, ...
  100139430 (    0xf0) Fautoload [FUNC, ...
  100139520 (    0x90) un_autoload [FUNC...
  1001395b0 (    0x80) Feval [FUNC, EXT,...
  100139630 (    0x50) record_in_backtra...
  100139680 (   0x1d0) apply_lambda [FUN...
  100139850 (   0x390) Fapply [FUNC, EXT...
  100139be0 (   0x5c0) Ffuncall [FUNC, E...
  10013a1a0 (    0x90) Frun_hooks [FUNC,...
  10013a230 (   0x1d0) run_hook_with_arg...
  10013a400 (    0x20) funcall_nil [FUNC...
  10013a420 (    0x20) Frun_hook_with_ar...
  10013a440 (    0x20) Frun_hook_with_ar...
  10013a460 (    0x30) Frun_hook_with_ar...
  10013a490 (    0x30) funcall_not [FUNC...
So why not strip these? Well, you can't. But the reason is interesting. Most programs run by using shared libraries, and these libraries need to provide a standard API that other programs can use. Hence the function names. Which of course is exactly the info that a hypothetical bad guy needs to make sense of the program.

But it goes beyond this. You could imagine statically linking a program against all of its dependencies, then stripping all possible symbols. The thing is, there is plenty of info in the binary to make sense of the program. Strings, for example. Error messages. Every piece of data you want to show to the user is an indication of what the surrounding function is doing.

IDA Pro is pretty incredible in that situation. Each time you figure out what a function is doing, you just give it whatever name you want. IDA updates the whole interface so that it uses that name everywhere, instead of the hex address. You end up with a neat, perfectly sensible program. Entire game engines have been meticulously reverse engineered this way.

However, viruses use a crafty technique to prevent analysis. You encrypt your program like an onion. Your outer program runs, and everyone can see that. But it's carrying around a blob of functionality that was encrypted using the target's information. E.g. their MAC address, or a subset of their list of installed programs, or anything that would uniquely identify your target separately from everyone else. Any time the virus is installed anywhere, it tries to decrypt itself using this information, which only succeeds when it's installed on the target machine.

This defeats all attempts at analysis. You can't analyze what you can't decrypt.


This is an interesting concept. Can you provide any links to practical info on this?

It seems to be quite easy to do something similar; especially if you're specifically targeting a single PC with a very specific configuration.

How do you ensure that the amount of entropy in the key is enough to stop a person from finding the key? Assuming the person reverse-engineering the virus already knows the key generation routine, since he has access to the binary of the virus.

Hostnames and lists of installed programs should be prone to a dictionary attack, and MAC addresses are not even close to having enough bytes.


Yeah! I have to run for a bit but I'll be back. There's an incredible article about Flame I detailed a bit here:

https://news.ycombinator.com/item?id=15046089

It depends entirely on whether you have persistent access to the target. If your target is airgapped, your only option is to know something about the target machine (i.e. have a spy on the inside) that gives you enough info to encrypt the virus. For example, they could install a special program so that it's listed in C:\Program Files, then the virus decrypts using the string "${mac_addr}${x}" for x in [list of installed programs]. So as an analyst, you won't have any idea what the magic program name was.
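
Purely as an illustration of that keying idea (nothing here is real malware code; the key derivation and the use of AES-GCM are made up for the example), the decryption attempt boils down to something like:

  # Sketch of "environmental keying": the key is derived from strings only the
  # target machine has, so the blob decrypts nowhere else. Illustrative only.
  import hashlib, os, uuid
  from cryptography.hazmat.primitives.ciphers.aead import AESGCM

  def try_decrypt(blob, nonce):
      mac = format(uuid.getnode(), "012x")                  # this machine's MAC address
      for prog in sorted(os.listdir(r"C:\Program Files")):  # candidate machine-specific strings
          key = hashlib.sha256((mac + prog).encode()).digest()
          try:
              return AESGCM(key).decrypt(nonce, blob, None) # succeeds only on the target
          except Exception:
              continue
      return None                                           # wrong machine: stays opaque

From the analyst's side, brute-forcing that means guessing both the derivation scheme and the machine-specific inputs, which is exactly the point.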

That technique was likely Stuxnet, not Flame, so that article might not contain any info about it. But it's amazing in its own way. If you have any other questions too, I love chatting about this stuff.


Heh, that's interesting regarding using the MAC address as the encryption key.

I'm a bit confused though: how would the virus get to the target with the encrypted payload? Do you mean that, on another computer, the MAC address of the target machine would be used to encrypt the main payload, and then the virus given to the target?

Using the MAC is interesting, because if you placed the virus in a VM it wouldn't run unless you used the same MAC for the VM's network adapter.


Yes, exactly.

If this kind of thing interests you, this article about Flame is absolutely fascinating:

http://www.symantec.com/content/en/us/enterprise/media/secur...

Hard to believe it was already almost six years ago.

> On both servers, command-and-control activity happens through a Web application called Newsforyou. It processes the W32.Flamer client interactions and provides a simple control panel. The control panel allows the attackers to upload packages of code to deliver to compromised clients, and download packages containing stolen client data. However, in a technique not previously seen before, the uploaded and downloaded packages are encrypted, so infiltrating the command-and-control server does not reveal the code or the stolen client data. The command-and-control server simply serves as a proxy for the data, and the data is encrypted and decrypted offline by the attackers using keys unique to each client. This application also contains functionality to communicate with clients compromised by malware other than Flamer. The Web application was designed to be a framework for supporting different malware campaigns.

> In addition to avoiding the compromise of their operations, preventing both the uploading of rogue code and viewing of stolen data, the setup also maintains a clear distinction of roles. The roles include those responsible for setting up the server (admins), those responsible for uploading packages and downloading stolen data through the control panel (operators), and those holding the private key with the ability to decrypt the stolen data (attack coordinators). The operators themselves may actually be completely unaware of the contents in the stolen data. This is due to design of the process to use data security compartmentalization techniques, as shown in Figure 1.

> Despite these techniques, we were still able to determine that one of the servers delivered a module instructing Flamer to commit suicide and wipe itself off computers in late May 2012, an action we also witnessed through compromised honeypots.

> Finally, access to the control panel required a password which is stored as a hash. Despite brute force attempts at reversing the hash to plain text, we were unable to determine the plain text password.

The whole thing is filled with gems like that. They controlled the virus with a PHP webapp called "news for you!" presumably with a sly winky face emoji appended to the <title> tag.


Cheers, that's very interesting!

I bought this book a while ago - 'Malicious Cryptography: Exposing Cryptovirology'. I still need to read it more thoroughly, but they present some interesting ideas.


Linux programs are notorious for containing a lot of symbol information, and the toolchain doesn't exactly make it easy to get rid of them. To start with, you have to compile executables with '-fvisibility=hidden'. Otherwise, strip cannot remove the symbols defined in the executable, because it's assumed that another executable might want to load your executable as a shared object and import its functions/global variables.

Next, if you are using any kind of library, even statically linked, changing the default visibility for your executable isn't enough to get rid of the libraries' symbols, because those were already defaulted to public when the respective library was compiled.

Also when statically linking, you probably want to compile all the libraries with '-ffunction-sections -fdata-sections' and pass '--gc-sections' to the linker to prevent bloating your executable with unused functionality.



