> In this paper, we present a new system called EKLAVYA which trains a recurrent neural network to recover function type signatures from disassembled binary code. EKLAVYA assumes no knowledge of the target instruction set semantics to make such inference.
> [..] we find by analyzing its model that it auto-learns relationships between instructions, compiler conventions, stack frame setup instructions, use-before-write patterns, and operations relevant to identifying types directly from binaries.
> In our evaluation on Linux binaries compiled with clang and gcc, for two different architectures (x86 and x64), EKLAVYA exhibits accuracy of around 84% and 81% for function argument count and type recovery tasks respectively.
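This isn't EKLAVYA's actual architecture (the paper has the real details), but for anyone wondering what this kind of model roughly looks like, here's a toy sketch in PyTorch: an embedding over instruction tokens, a GRU, and a classification head over argument counts. All of the sizes, names, and the token vocabulary are made up.

```python
# Toy sketch (not the paper's model): an RNN over instruction tokens
# that predicts how many arguments a function takes.
import torch
import torch.nn as nn

class ArgCountModel(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256, max_args=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # instruction-token embeddings
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, max_args + 1)    # classify 0..max_args arguments

    def forward(self, token_ids):                          # token_ids: (batch, seq_len)
        _, h = self.rnn(self.embed(token_ids))
        return self.head(h[-1])                            # logits over argument counts

logits = ArgCountModel()(torch.randint(0, 5000, (4, 200)))  # 4 functions, 200 tokens each
print(logits.shape)  # torch.Size([4, 10])
```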
I've long had this idea to build a decompiler (a program that maps binaries back to source code) using machine learning. The problem in decompilation is that you lose information when you compile source code. Machine learning could even help recover things like the most likely variable names. There are also tons of training data that can be easily generated.
This is already being researched! See http://jsnice.org/ for recovering variable names from minified javascript.
I remember testing it by coding a binary search with single letter variable names. It managed to rename them to the more canonical "lo"/"mid"/"hi" names just by recognizing that the code pattern is typically used for a binary search.
Neural nets can't magically create information that's not there; all they can do is combine the data in the binary with their learned 'experience' and make what amounts to educated guesses about what the decompiled code might be. That has value, obviously, but it's not 'recovering' anything, really; it's 'assuming'.
Yep, that with an auto-spell checker would make a darn good tool! Something like the MS Word spell checker that goes suggestion-by-suggestion, clicking yes/no/manual for var names :)
Cast it as a reinforcement learning problem: you produce a candidate decompilation, try to compile it, and score it based on the similarity to the initial binary.
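Something like this, maybe. The build command and the byte-level similarity metric below are placeholder assumptions; a real system would compare disassembly, CFGs, or instruction n-grams rather than raw bytes.

```python
# Hypothetical reward function for an RL-based decompiler:
# compile the candidate source and compare the result to the target binary.
import difflib
import subprocess
import tempfile
from pathlib import Path

def reward(candidate_source: str, target_binary: bytes) -> float:
    """Return a similarity score in [0, 1]; 0.0 if the candidate doesn't compile."""
    workdir = Path(tempfile.mkdtemp())
    src = workdir / "candidate.c"
    out = workdir / "candidate.o"
    src.write_text(candidate_source)

    # Assumed build command; flags would need to match the original toolchain.
    proc = subprocess.run(["gcc", "-O2", "-c", str(src), "-o", str(out)],
                          capture_output=True)
    if proc.returncode != 0:
        return 0.0  # failing to compile is the worst outcome

    produced = out.read_bytes()
    # Crude byte-level similarity as a stand-in for a real comparison metric.
    return difflib.SequenceMatcher(None, produced, target_binary).ratio()
```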
Our jobs will just move an abstraction level higher. We used to have to worry about registers and stuff, then we automated that. We used to worry about memory, then we automated that. We used to worry about CPU stuff, then we automated that for most applications.
Eventually we're going to become technical PMs. As long as there are fuzzy, unclear problem descriptions, there will be jobs for people who can codify and standardize processes.
I'd really appreciate something that takes my hacked-together code and cleans it up and makes it presentable. That's usually the second phase of the job anyway, which could be automated.
It would be interesting to automatically generate the training data and feed it to the ML module in real time.
You could hard-code an interactive agent that periodically peeks at and coerces the input (sitting between itself and the ML function). This way the ML module is coercible.
It would even be more cool if the ML function itself learned the interactive agent function as part of itself.
In a completely different approach, a neural net was trained on input-output pairs and stack traces of a program. The neural net learned the algorithm. The advantage is that all algorithms could be "neuralized".
I've often wondered about the theoretical limits of what a neural net can learn from examples - seems like a fascinating subject with lots of implications. From a quick search, I found this paper:
https://experts.illinois.edu/en/publications/computational-l...
which is already very interesting. Are there any other good pointers on this?
According to the universal approximation theorem [1], a multilayer feedforward net with a single hidden layer can approximate any continuous function on R^n. There's also a nice visual proof. [2]
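For reference, the standard single-hidden-layer formulation (Cybenko/Hornik style, for a sigmoidal activation $\sigma$):

```latex
% Universal approximation: for every continuous f : K \to \mathbb{R} on a
% compact K \subset \mathbb{R}^n and every \varepsilon > 0, there exist
% N, \alpha_i, b_i \in \mathbb{R}, w_i \in \mathbb{R}^n such that
\left| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right|
< \varepsilon \quad \text{for all } x \in K.
```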
Are there also theorems about what functions can be learned, or how quickly they can be learned, by a multilayer feed forward neural net using backpropagation?
The short story is that VC dimension describes "worst case" behavior of a learning algorithm. This is a good predictor of performance on really hard problems, but lots of modern problems turn out to be theoretically easy. For example, neural networks trained on images work much better than we would guess based on the VC dimension of neural networks. This is because the distribution of images is closer to a best case scenario than a worst case scenario.
Lots of research is being done to determine when we are in a best-case scenario. The "simplest" such measure is called Rademacher complexity. (See Chapter 26 of the UML book.) "Simplest" is in quotes because most PhD-level students in machine learning that I've met don't even understand what Rademacher complexity is unless they are specializing in theory. And Rademacher complexity doesn't really come close to capturing why neural networks work well on images either.
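For the curious, the empirical Rademacher complexity of a function class $\mathcal{F}$ over a sample $S = (x_1, \dots, x_m)$ is:

```latex
% The \sigma_i are i.i.d. uniform on \{-1, +1\} ("random sign" variables);
% the measure asks how well the class can correlate with pure noise on S.
\hat{\mathfrak{R}}_S(\mathcal{F}) \;=\;
\mathbb{E}_{\sigma}\!\left[\, \sup_{f \in \mathcal{F}}
\frac{1}{m} \sum_{i=1}^{m} \sigma_i \, f(x_i) \right]
```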
No, it does not even guarantee an upper bound on the number of hidden units required to approximate the function with a certain accuracy (even if it can be learned). So its usefulness is very limited.
Are you sure? AFAICT their method only uses debugging symbols for training. I haven't read it all, but it would be a big omission if they didn't mention that it only works on binaries with debugging symbols. When actually reversing, they work on numeric "symbols" for the assembly instructions.
So why not strip these? Well, you can't. But the reason is interesting. Most programs run by using shared libraries, and these libraries need to provide a standard API that other programs can use. Hence the function names, which of course are exactly the info that a hypothetical bad guy needs to make sense of the program.
But it goes beyond this. You could imagine statically linking a program against all of its dependencies, then stripping all possible symbols. The thing is, there is plenty of info in the binary to make sense of the program. Strings, for example. Error messages. Every piece of data you want to show to the user is an indication of what the surrounding function is doing.
IDA Pro is pretty incredible in that situation. Each time you figure out what a function is doing, you just give it whatever name you want. IDA updates the whole interface so that it uses that name everywhere, instead of the hex address. You end up with a neat, perfectly sensible program. Entire game engines have been meticulously reverse engineered this way.
However, viruses use a crafty technique to prevent analysis. You encrypt your program like an onion. Your outer program runs, and everyone can see that. But it's carrying around a blob of functionality that was encrypted using the target's information. E.g. their MAC address, or a subset of their list of installed programs, or anything that would uniquely identify your target separately from everyone else. Any time the virus is installed anywhere, it tries to decrypt itself using this information, which only succeeds when it's installed on the target machine.
This defeats all attempts at analysis. You can't analyze what you can't decrypt.
This is an interesting concept. Can you provide any links to practical info on this?
It seems like it would be quite easy to do something similar, especially if you're specifically targeting a single PC with a very specific configuration.
How do you ensure that the amount of entropy in the key is enough to stop a person from finding the key? Assuming the person reverse-engineering the virus already knows the key generation routine, since he has access to the binary of the virus.
Hostnames and lists of installed programs should be prone to a dictionary attack, and MAC addresses are not even close to having enough bytes.
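A quick back-of-envelope calculation supports that last point, assuming the key is derived directly from the 48-bit MAC address:

```latex
% A MAC address is a 24-bit vendor prefix (OUI) plus a 24-bit device part:
\underbrace{2^{24}}_{\text{OUI}} \times \underbrace{2^{24}}_{\text{device}}
= 2^{48} \approx 2.8 \times 10^{14},
\qquad\text{and with the vendor known, only } 2^{24} \approx 1.7 \times 10^{7}.
```

Even the full 2^48 space is within reach of an offline search when each trial is just a key derivation plus a decryption check; with the vendor prefix guessed, the search is trivial.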
It depends entirely on whether you have persistent access to the target. If your target is airgapped, your only option is to know something about the target machine (i.e. have a spy on the inside) that gives you enough info to encrypt the virus. For example, they could install a special program so that it's listed in C:\Program Files, and then the virus tries to decrypt using the string "${mac_addr}${x}" for each x in [list of installed programs]. So as an analyst, you won't have any idea what the magic program name was.
That technique was likely Stuxnet, not Flame, so that article might not contain any info about it. But it's amazing in its own way. If you have any other questions too, I love chatting about this stuff.
Heh, that's interesting regarding using the MAC address as the encryption key.
I'm a bit confused though: how would the virus get to the target with the encrypted payload? Do you mean that, say, on another computer, the MAC address of the target computer would be used to encrypt the main payload of the virus, and then the virus would be given to the target?
It's interesting using the MAC, because if you placed it in a VM it wouldn't run, unless you used the same MAC for the VM network adapter.
Hard to believe it was already almost six years ago.
> On both servers command-and-control activity happens through a Web application called Newsforyou. It processes the W32.Flamer client interactions and provides a simple control panel. The control panel allows the attackers to upload packages of code to deliver to compromised clients, and download packages containing stolen client data. However, in a technique not previously seen before the uploaded and downloaded packages are encrypted, so infiltrating the command-and-control server does not reveal the code or the stolen client data. The command-and-control server simply serves as a proxy for the data and the data is encrypted and decrypted offline by the attackers using keys unique to each client. This application also contains functionality to communicate with clients compromised by malware other than Flamer. The Web application was designed to be a framework for supporting different malware campaigns.
> In addition to avoiding the compromise of their operations, preventing both the uploading of rogue code and viewing of stolen data, the setup also maintains a clear distinction of roles. The roles include those responsible for setting up the server (admins), those responsible for uploading packages and downloading stolen data through the control panel (operators), and those holding the private key with the ability to decrypt the stolen data (attack coordinators). The operators themselves may actually be completely unaware of the contents in the stolen data. This is due to design of the process to use data security compartmentalization techniques, as shown in Figure 1.
> Despite these techniques, we were still able to determine that one of the servers delivered a module instructing Flamer to commit suicide and wipe itself off computers in late May 2012, an action we also witnessed through compromised honeypots.
> Finally, access to the control panel required a password which is stored as a hash. Despite brute force attempts at reversing the hash to plain text, we were unable to determine the plain text password.
The whole thing is filled with gems like that. They controlled the virus with a PHP webapp called "news for you!" presumably with a sly winky face emoji appended to the <title> tag.
I bought this book a while ago: 'Malicious Cryptography: Exposing Cryptovirology'. I still need to read it more thoroughly, but they present some interesting ideas.
Linux programs are notorious for containing a lot of symbol information, and the toolchain doesn't exactly make it easy to get rid of it. To start with, you have to compile with '-fvisibility=hidden'. Otherwise, strip cannot remove the symbols defined in the executable because it's assumed that another executable might want to load your executable as a shared object and import its functions/global variables.
Next, if you are using any kind of library, even statically linked, changing the default visibility for your executable isn't enough to get rid of the libraries' symbols, because those were already defaulted to public when the respective library was compiled.
Also when statically linking, you probably want to compile all the libraries with '-ffunction-sections -fdata-sections' and pass '--gc-sections' to the linker to prevent bloating your executable with unused functionality.
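To make that concrete, here's a small sketch (assuming a Linux box with gcc, strip, and nm on the PATH; the C snippet, file names, and helper function are made up for the demo) that builds the same tiny shared library with and without '-fvisibility=hidden' and lists what's still exported after stripping:

```python
# Demo: symbols exported from a shared object survive `strip` unless they were
# compiled with hidden visibility. Assumes gcc, strip, and nm are installed.
import subprocess
import tempfile
from pathlib import Path

C_SRC = """
int shared_counter = 0;                 /* exported by default */
int helper(int x) { return x + 1; }     /* exported by default */
"""

def dynamic_symbols(extra_flags):
    workdir = Path(tempfile.mkdtemp())
    src = workdir / "demo.c"
    lib = workdir / "libdemo.so"
    src.write_text(C_SRC)
    subprocess.run(["gcc", "-shared", "-fPIC", *extra_flags,
                    "-o", str(lib), str(src)], check=True)
    subprocess.run(["strip", str(lib)], check=True)
    # `nm -D` shows the dynamic symbol table, which strip does not touch.
    out = subprocess.run(["nm", "-D", str(lib)],
                         capture_output=True, text=True, check=True)
    return out.stdout

print("default visibility:\n", dynamic_symbols([]))
print("hidden visibility:\n", dynamic_symbols(["-fvisibility=hidden"]))
```

The point is that strip only drops the regular symbol table; whatever landed in the dynamic symbol table stays, so visibility has to be controlled at compile/link time.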