Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Decompilation is an important tool in reverse engineering computer software.
- Researchers have proposed using techniques from neural machine translation to automate the process.
- Existing neural decompilers require language-specific domain knowledge to build an abstract syntax tree.
- This paper explores a different tradeoff that treats the assembly and source languages as plain text.
- The prototype decompiler, Beyond The C (BTC), is evaluated on Go, Fortran, OCaml, and C.
- Training data selection and tokenization impact the quality of decompilation.
- The training data, trained decompilation models, and code will be released to encourage future research.
Paper Content
I. introduction
- Decompilation is the process of reversing the compilation process
- It is used in reverse engineering, malware analysis, security auditing, and re-engineering of legacy code
- Decompilers for binary code are few and far between and require a lot of engineering
- Traditional decompilation is based on lifting the binary into an architecture-independent intermediate language
- Recently, Neural Machine Translation (NMT) has been used to try and simplify the process
- NMT decompilers have all targeted C
- NMT decompilers rely on domain knowledge and tools that may be difficult to come by for other languages
- This paper explores a different approach to decompilation, treating the training data as text and avoiding domain knowledge
- The paper demonstrates flexibility by training decompilation models for C, Fortran, Go, and OCaml
- The paper evaluates the models in terms of speed and quality of decompilation
- The paper will make training corpora, trained models and code available for further research
Ii. background
A. challenges of decompilation
- Decompilation of binary code is a difficult problem.
- Disassembling a binary executable is known as “reassembleable disassembly”.
- This paper focuses on source language retargetability for the x86 architecture.
- Decompilation from assembly is hard because much information is lost during the compilation process.
B. neural machine translation
- NMT is a type of neural network used to translate text from one language to another
- NMT systems use the transformer architecture
- NMT models consist of an encoder and decoder
- The decoder generates a probability distribution over the tokens in the target language
- Greedy methods such as beam search and nucleus (top-p) sampling are used to find a solution
- Text can be decomposed into tokens of varying granularity
- The model is trained over a corpus of example translations
- The loss function used to train the model is a differentiable measure of how the output probability distribution is from the ground truth
C. neural decompilation
- Researchers have studied applying NMT methods to decompilation.
- They focused on the C language and used RNN architectures.
- RNNs have feedback loops that allow information to flow through the encoder and decoder.
- More recent work uses transformer models, which have been successful in NLP.
- Transformers remove the feedback loops, allowing training to be more efficiently parallelized.
D. classical vs. neural decompilation
- Traditional decompilers use program analyses and domain knowledge to construct a source-level representation.
- Neural decompilers use deep neural networks to learn a mapping between assembly and source code.
- Neural decompilers are opaque and can produce nonsensical output when input is different from training set.
- Neural decompilers don’t require handcrafted patterns and can easily support new architectures and source languages.
- Neural decompilers generate code that looks more natural and is free of artifacts.
E. computer languages vs. human languages
- Programming languages are different from natural language.
- Sequence contraction is a parameter used to quantify the differences.
- Figure 2 shows the distribution of sequence contraction for programming languages and human languages.
- Decompilation is a distinctively different domain compared to NMT for human languages.
Iii. design
- Problem of decompilation is formulated as a text-to-text translation
- No source parsing, CPU instruction awareness or indication to the transformer about what a token means
- Source and machine code are seen as a sequence of tokens
- Retargetable approach that can be applied to any language
- Fetch source of programs from a code repository, compile them into assembly, and then generate pairs of assembly and source code functions
A. model and data
- Training the model requires source code and assembly snippets that are paired together.
- The source code snippets are whole functions.
- Using functions makes it easier to attribute source code to assembly.
- CFI directives are used to extract the pairs of source and assembly.
- CFI directives are shared between LLVM and GNU compilers.
- No parsing or lexing is required.
B. preprocessing
- Generate assembly files from source files
- Extract functions from assembly files
- Preprocess source lines corresponding to functions (normalize whitespace, remove comments, replace string literals)
- Preprocessing implementation is simple and handles four languages in 166 lines of Python
C. tokenization
- We need to tokenize source and assembly text.
- We can’t use a common NLP tokenizer because source code contains many unique names.
- We can’t tokenize at the byte level because it would limit the size of functions.
- We use byte pair encoding (BPE) which has become popular in NLP.
- We train the tokenizer on source code and assembly code separately.
Iv. implementation a. training data
- Quality of training data is important for machine learning models
- 4 languages used in the paper
- For each language, programs written in that language and assembly code obtained
- C-Small: competitive programming and interview style problems
- Go-Small: competitive programming and interview style repositories, plus general algorithm implementations and utility programs
- Fortran-Small: realworld programs solving computational problems
- OCaml-Small: 119 packages from Jane Street and CMU’s Binary Analysis Platform
- Large Datasets: 10x more data for OCaml, 50K packages for C from Debian Linux
- 6 million assembly files generated
B. training the transformer
- Used Facebook’s fairseq toolkit with transformer encoder/decoder architecture
- Trained on HPC node with NVIDIA V100/RTX8000 GPU and 2x Intel Xeon Platinum 8268 CPUs
- Model parameters in Table II, max sequence lengths and vocab sizes in Table I
- Batched data with fairseq option to specify max number of tokens in batch
- Used larger amount of data for C and OCaml, trained two larger models with twice as many encoder/decoder layers and larger embedding dimension
- Small models trained until test accuracy stopped improving, large models took significantly longer to train
C. btc vscode extension
- Created a Visual Studio Code extension to allow interactive exploration of results
- Measured decompilation quality using Average Edit Distance
- Average Edit Distance indicates how many changes a reverse engineer might have to make to the decompiled function
- Best result (C-L-L) achieved an Average Edit Distance of 0.46, meaning changes are needed to half the tokens in a given function
B. effect of dataset and model size
- Training on larger datasets improved model performance for C and OCaml.
- Overfitting was reduced with larger datasets.
C. translation performance
- Translation time is a metric of concern
- All models translate more than 10 functions per second on average on a V100 GPU
- This is not directly comparable to traditional decompilers
- BTC has room to be paired with more costly techniques to fix candidate sketches
D. tokenization
- Compared BPE tokenization to character-level tokenization on Fortran dataset
- Limited training dataset for character-level model to functions less than 6,000 tokens
- BPE model trained on full Fortran dataset
- Reduced test set consists of smaller functions, making it easier to decompile
E. qualitative observations
- BTC decompilations show general structure and major features like loops and conditionals correspond to original code
- Variable names make sense in many instances
- Models make small but semantically meaningful mistakes
- Quality varies between languages, OCaml is hardest to decompile
- Edit distance does not perfectly capture decompilation quality
- Decompiled output not accurate enough to be trusted without additional fixups
F. quality of training data
- Code used in the wild can be difficult to use for training data due to the variety of structures and idioms.
- Extractor uses “.loc” directives in assembly to attribute source lines to assembly procedures.
- Mapping between source and assembly can be difficult in complex languages, such as OCaml, Haskell, and C.
Vi. related work
- Reverse engineering of binary code has been studied in many research works
- Traditional methods have been used to make decompilers more human-readable and structured
- Automation of decompilers has been proposed using a formal specification of the hardware platform
- Learning-based methods have been developed to automate the lifting of binary to an intermediate representation
- Specific sub-tasks have been studied, such as detecting functions and suggesting variable names
- Representation learning methods have been applied to assembly code
- Transfer learning has been used to learn execution semantics from execution traces
- Prior efforts have focused on the C language
- This work focuses on real-world code
Vii. limitations and future work
- Traditional decompilers use multiple stages of parsing and intermediate-languages
- Traditional decompilers rely on hardcoded idioms and data-flow and type analysis
- Interpretability is a major drawback in deep learning methods
- Techniques can enhance accuracy of neural decompilers
- Quality of source to assembly mapping provided by compilers affects accuracy
- NMT methods limited in length of input sequences
Viii. conclusion
- Neural decompilation can be used to create retargetable decompilers for different programming languages.
- Results were reported for C, Fortran, Go and OCaml.
- BTC does not require compiler customization, complex parsing, or specifying keywords or operators.
- Despite lack of explicit knowledge, model can replicate many features of the language.
- Challenges remain with code size, semantic accuracy and data quality.