Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Decompilation is an important tool in reverse engineering computer software.
Researchers have proposed using techniques from neural machine translation to automate the process.
Existing neural decompilers require language-specific domain knowledge to build an abstract syntax tree.
This paper explores a different tradeoff that treats the assembly and source languages as plain text.
The prototype decompiler, Beyond The C (BTC), is evaluated on Go, Fortran, OCaml, and C.
Training data selection and tokenization impact the quality of decompilation.
The training data, trained decompilation models, and code will be released to encourage future research.

Paper Content

I. introduction

Decompilation is the process of reversing the compilation process
It is used in reverse engineering, malware analysis, security auditing, and re-engineering of legacy code
Decompilers for binary code are few and far between and require a lot of engineering
Traditional decompilation is based on lifting the binary into an architecture-independent intermediate language
Recently, Neural Machine Translation (NMT) has been used to try and simplify the process
NMT decompilers have all targeted C
NMT decompilers rely on domain knowledge and tools that may be difficult to come by for other languages
This paper explores a different approach to decompilation, treating the training data as text and avoiding domain knowledge
The paper demonstrates flexibility by training decompilation models for C, Fortran, Go, and OCaml
The paper evaluates the models in terms of speed and quality of decompilation
The paper will make training corpora, trained models and code available for further research

Ii. background

A. challenges of decompilation

Decompilation of binary code is a difficult problem.
Disassembling a binary executable is known as “reassembleable disassembly”.
This paper focuses on source language retargetability for the x86 architecture.
Decompilation from assembly is hard because much information is lost during the compilation process.

B. neural machine translation

NMT is a type of neural network used to translate text from one language to another
NMT systems use the transformer architecture
NMT models consist of an encoder and decoder
The decoder generates a probability distribution over the tokens in the target language
Greedy methods such as beam search and nucleus (top-p) sampling are used to find a solution
Text can be decomposed into tokens of varying granularity
The model is trained over a corpus of example translations
The loss function used to train the model is a differentiable measure of how the output probability distribution is from the ground truth

C. neural decompilation

Researchers have studied applying NMT methods to decompilation.
They focused on the C language and used RNN architectures.
RNNs have feedback loops that allow information to flow through the encoder and decoder.
More recent work uses transformer models, which have been successful in NLP.
Transformers remove the feedback loops, allowing training to be more efficiently parallelized.

D. classical vs. neural decompilation

Traditional decompilers use program analyses and domain knowledge to construct a source-level representation.
Neural decompilers use deep neural networks to learn a mapping between assembly and source code.
Neural decompilers are opaque and can produce nonsensical output when input is different from training set.
Neural decompilers don’t require handcrafted patterns and can easily support new architectures and source languages.
Neural decompilers generate code that looks more natural and is free of artifacts.

E. computer languages vs. human languages

Programming languages are different from natural language.
Sequence contraction is a parameter used to quantify the differences.
Figure 2 shows the distribution of sequence contraction for programming languages and human languages.
Decompilation is a distinctively different domain compared to NMT for human languages.

Iii. design

Problem of decompilation is formulated as a text-to-text translation
No source parsing, CPU instruction awareness or indication to the transformer about what a token means
Source and machine code are seen as a sequence of tokens
Retargetable approach that can be applied to any language
Fetch source of programs from a code repository, compile them into assembly, and then generate pairs of assembly and source code functions

A. model and data

Training the model requires source code and assembly snippets that are paired together.
The source code snippets are whole functions.
Using functions makes it easier to attribute source code to assembly.
CFI directives are used to extract the pairs of source and assembly.
CFI directives are shared between LLVM and GNU compilers.
No parsing or lexing is required.

B. preprocessing

Generate assembly files from source files
Extract functions from assembly files
Preprocess source lines corresponding to functions (normalize whitespace, remove comments, replace string literals)
Preprocessing implementation is simple and handles four languages in 166 lines of Python

C. tokenization

We need to tokenize source and assembly text.
We can’t use a common NLP tokenizer because source code contains many unique names.
We can’t tokenize at the byte level because it would limit the size of functions.
We use byte pair encoding (BPE) which has become popular in NLP.
We train the tokenizer on source code and assembly code separately.

Iv. implementation a. training data

Quality of training data is important for machine learning models
4 languages used in the paper
For each language, programs written in that language and assembly code obtained
C-Small: competitive programming and interview style problems
Go-Small: competitive programming and interview style repositories, plus general algorithm implementations and utility programs
Fortran-Small: realworld programs solving computational problems
OCaml-Small: 119 packages from Jane Street and CMU’s Binary Analysis Platform
Large Datasets: 10x more data for OCaml, 50K packages for C from Debian Linux
6 million assembly files generated

B. training the transformer

Used Facebook’s fairseq toolkit with transformer encoder/decoder architecture
Trained on HPC node with NVIDIA V100/RTX8000 GPU and 2x Intel Xeon Platinum 8268 CPUs
Model parameters in Table II, max sequence lengths and vocab sizes in Table I
Batched data with fairseq option to specify max number of tokens in batch
Used larger amount of data for C and OCaml, trained two larger models with twice as many encoder/decoder layers and larger embedding dimension
Small models trained until test accuracy stopped improving, large models took significantly longer to train

C. btc vscode extension

Created a Visual Studio Code extension to allow interactive exploration of results
Measured decompilation quality using Average Edit Distance
Average Edit Distance indicates how many changes a reverse engineer might have to make to the decompiled function
Best result (C-L-L) achieved an Average Edit Distance of 0.46, meaning changes are needed to half the tokens in a given function

B. effect of dataset and model size

Training on larger datasets improved model performance for C and OCaml.
Overfitting was reduced with larger datasets.

C. translation performance

Translation time is a metric of concern
All models translate more than 10 functions per second on average on a V100 GPU
This is not directly comparable to traditional decompilers
BTC has room to be paired with more costly techniques to fix candidate sketches

D. tokenization

Compared BPE tokenization to character-level tokenization on Fortran dataset
Limited training dataset for character-level model to functions less than 6,000 tokens
BPE model trained on full Fortran dataset
Reduced test set consists of smaller functions, making it easier to decompile

E. qualitative observations

BTC decompilations show general structure and major features like loops and conditionals correspond to original code
Variable names make sense in many instances
Models make small but semantically meaningful mistakes
Quality varies between languages, OCaml is hardest to decompile
Edit distance does not perfectly capture decompilation quality
Decompiled output not accurate enough to be trusted without additional fixups

F. quality of training data

Code used in the wild can be difficult to use for training data due to the variety of structures and idioms.
Extractor uses “.loc” directives in assembly to attribute source lines to assembly procedures.
Mapping between source and assembly can be difficult in complex languages, such as OCaml, Haskell, and C.

Reverse engineering of binary code has been studied in many research works
Traditional methods have been used to make decompilers more human-readable and structured
Automation of decompilers has been proposed using a formal specification of the hardware platform
Learning-based methods have been developed to automate the lifting of binary to an intermediate representation
Specific sub-tasks have been studied, such as detecting functions and suggesting variable names
Representation learning methods have been applied to assembly code
Transfer learning has been used to learn execution semantics from execution traces
Prior efforts have focused on the C language
This work focuses on real-world code

Vii. limitations and future work

Traditional decompilers use multiple stages of parsing and intermediate-languages
Traditional decompilers rely on hardcoded idioms and data-flow and type analysis
Interpretability is a major drawback in deep learning methods
Techniques can enhance accuracy of neural decompilers
Quality of source to assembly mapping provided by compilers affects accuracy
NMT methods limited in length of input sequences

Viii. conclusion

Neural decompilation can be used to create retargetable decompilers for different programming languages.
Results were reported for C, Fortran, Go and OCaml.
BTC does not require compiler customization, complex parsing, or specifying keywords or operators.
Despite lack of explicit knowledge, model can replicate many features of the language.
Challenges remain with code size, semantic accuracy and data quality.

Link to paper#

Abstract#

Paper Content#

I. introduction#

Ii. background#

A. challenges of decompilation#

B. neural machine translation#

C. neural decompilation#

D. classical vs. neural decompilation#

E. computer languages vs. human languages#

Iii. design#

A. model and data#

B. preprocessing#

C. tokenization#

Iv. implementation a. training data#

B. training the transformer#

C. btc vscode extension#

B. effect of dataset and model size#

C. translation performance#

D. tokenization#

E. qualitative observations#

F. quality of training data#

Vi. related work#

Vii. limitations and future work#

Viii. conclusion#

Link to paper

Abstract

Paper Content

I. introduction

Ii. background

A. challenges of decompilation

B. neural machine translation

C. neural decompilation

D. classical vs. neural decompilation

E. computer languages vs. human languages

Iii. design

A. model and data

B. preprocessing

C. tokenization

Iv. implementation a. training data

B. training the transformer

C. btc vscode extension

B. effect of dataset and model size

C. translation performance

D. tokenization

E. qualitative observations

F. quality of training data

Vi. related work

Vii. limitations and future work

Viii. conclusion