Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Decompilation is an important tool in reverse engineering computer software.
  • Researchers have proposed using techniques from neural machine translation to automate the process.
  • Existing neural decompilers require language-specific domain knowledge to build an abstract syntax tree.
  • This paper explores a different tradeoff that treats the assembly and source languages as plain text.
  • The prototype decompiler, Beyond The C (BTC), is evaluated on Go, Fortran, OCaml, and C.
  • Training data selection and tokenization impact the quality of decompilation.
  • The training data, trained decompilation models, and code will be released to encourage future research.

Paper Content

I. introduction

  • Decompilation is the process of reversing the compilation process
  • It is used in reverse engineering, malware analysis, security auditing, and re-engineering of legacy code
  • Decompilers for binary code are few and far between and require a lot of engineering
  • Traditional decompilation is based on lifting the binary into an architecture-independent intermediate language
  • Recently, Neural Machine Translation (NMT) has been used to try and simplify the process
  • NMT decompilers have all targeted C
  • NMT decompilers rely on domain knowledge and tools that may be difficult to come by for other languages
  • This paper explores a different approach to decompilation, treating the training data as text and avoiding domain knowledge
  • The paper demonstrates flexibility by training decompilation models for C, Fortran, Go, and OCaml
  • The paper evaluates the models in terms of speed and quality of decompilation
  • The paper will make training corpora, trained models and code available for further research

Ii. background

A. challenges of decompilation

  • Decompilation of binary code is a difficult problem.
  • Disassembling a binary executable is known as “reassembleable disassembly”.
  • This paper focuses on source language retargetability for the x86 architecture.
  • Decompilation from assembly is hard because much information is lost during the compilation process.

B. neural machine translation

  • NMT is a type of neural network used to translate text from one language to another
  • NMT systems use the transformer architecture
  • NMT models consist of an encoder and decoder
  • The decoder generates a probability distribution over the tokens in the target language
  • Greedy methods such as beam search and nucleus (top-p) sampling are used to find a solution
  • Text can be decomposed into tokens of varying granularity
  • The model is trained over a corpus of example translations
  • The loss function used to train the model is a differentiable measure of how the output probability distribution is from the ground truth

C. neural decompilation

  • Researchers have studied applying NMT methods to decompilation.
  • They focused on the C language and used RNN architectures.
  • RNNs have feedback loops that allow information to flow through the encoder and decoder.
  • More recent work uses transformer models, which have been successful in NLP.
  • Transformers remove the feedback loops, allowing training to be more efficiently parallelized.

D. classical vs. neural decompilation

  • Traditional decompilers use program analyses and domain knowledge to construct a source-level representation.
  • Neural decompilers use deep neural networks to learn a mapping between assembly and source code.
  • Neural decompilers are opaque and can produce nonsensical output when input is different from training set.
  • Neural decompilers don’t require handcrafted patterns and can easily support new architectures and source languages.
  • Neural decompilers generate code that looks more natural and is free of artifacts.

E. computer languages vs. human languages

  • Programming languages are different from natural language.
  • Sequence contraction is a parameter used to quantify the differences.
  • Figure 2 shows the distribution of sequence contraction for programming languages and human languages.
  • Decompilation is a distinctively different domain compared to NMT for human languages.

Iii. design

  • Problem of decompilation is formulated as a text-to-text translation
  • No source parsing, CPU instruction awareness or indication to the transformer about what a token means
  • Source and machine code are seen as a sequence of tokens
  • Retargetable approach that can be applied to any language
  • Fetch source of programs from a code repository, compile them into assembly, and then generate pairs of assembly and source code functions

A. model and data

  • Training the model requires source code and assembly snippets that are paired together.
  • The source code snippets are whole functions.
  • Using functions makes it easier to attribute source code to assembly.
  • CFI directives are used to extract the pairs of source and assembly.
  • CFI directives are shared between LLVM and GNU compilers.
  • No parsing or lexing is required.

B. preprocessing

  • Generate assembly files from source files
  • Extract functions from assembly files
  • Preprocess source lines corresponding to functions (normalize whitespace, remove comments, replace string literals)
  • Preprocessing implementation is simple and handles four languages in 166 lines of Python

C. tokenization

  • We need to tokenize source and assembly text.
  • We can’t use a common NLP tokenizer because source code contains many unique names.
  • We can’t tokenize at the byte level because it would limit the size of functions.
  • We use byte pair encoding (BPE) which has become popular in NLP.
  • We train the tokenizer on source code and assembly code separately.

Iv. implementation a. training data

  • Quality of training data is important for machine learning models
  • 4 languages used in the paper
  • For each language, programs written in that language and assembly code obtained
  • C-Small: competitive programming and interview style problems
  • Go-Small: competitive programming and interview style repositories, plus general algorithm implementations and utility programs
  • Fortran-Small: realworld programs solving computational problems
  • OCaml-Small: 119 packages from Jane Street and CMU’s Binary Analysis Platform
  • Large Datasets: 10x more data for OCaml, 50K packages for C from Debian Linux
  • 6 million assembly files generated

B. training the transformer

  • Used Facebook’s fairseq toolkit with transformer encoder/decoder architecture
  • Trained on HPC node with NVIDIA V100/RTX8000 GPU and 2x Intel Xeon Platinum 8268 CPUs
  • Model parameters in Table II, max sequence lengths and vocab sizes in Table I
  • Batched data with fairseq option to specify max number of tokens in batch
  • Used larger amount of data for C and OCaml, trained two larger models with twice as many encoder/decoder layers and larger embedding dimension
  • Small models trained until test accuracy stopped improving, large models took significantly longer to train

C. btc vscode extension

  • Created a Visual Studio Code extension to allow interactive exploration of results
  • Measured decompilation quality using Average Edit Distance
  • Average Edit Distance indicates how many changes a reverse engineer might have to make to the decompiled function
  • Best result (C-L-L) achieved an Average Edit Distance of 0.46, meaning changes are needed to half the tokens in a given function

B. effect of dataset and model size

  • Training on larger datasets improved model performance for C and OCaml.
  • Overfitting was reduced with larger datasets.

C. translation performance

  • Translation time is a metric of concern
  • All models translate more than 10 functions per second on average on a V100 GPU
  • This is not directly comparable to traditional decompilers
  • BTC has room to be paired with more costly techniques to fix candidate sketches

D. tokenization

  • Compared BPE tokenization to character-level tokenization on Fortran dataset
  • Limited training dataset for character-level model to functions less than 6,000 tokens
  • BPE model trained on full Fortran dataset
  • Reduced test set consists of smaller functions, making it easier to decompile

E. qualitative observations

  • BTC decompilations show general structure and major features like loops and conditionals correspond to original code
  • Variable names make sense in many instances
  • Models make small but semantically meaningful mistakes
  • Quality varies between languages, OCaml is hardest to decompile
  • Edit distance does not perfectly capture decompilation quality
  • Decompiled output not accurate enough to be trusted without additional fixups

F. quality of training data

  • Code used in the wild can be difficult to use for training data due to the variety of structures and idioms.
  • Extractor uses “.loc” directives in assembly to attribute source lines to assembly procedures.
  • Mapping between source and assembly can be difficult in complex languages, such as OCaml, Haskell, and C.
  • Reverse engineering of binary code has been studied in many research works
  • Traditional methods have been used to make decompilers more human-readable and structured
  • Automation of decompilers has been proposed using a formal specification of the hardware platform
  • Learning-based methods have been developed to automate the lifting of binary to an intermediate representation
  • Specific sub-tasks have been studied, such as detecting functions and suggesting variable names
  • Representation learning methods have been applied to assembly code
  • Transfer learning has been used to learn execution semantics from execution traces
  • Prior efforts have focused on the C language
  • This work focuses on real-world code

Vii. limitations and future work

  • Traditional decompilers use multiple stages of parsing and intermediate-languages
  • Traditional decompilers rely on hardcoded idioms and data-flow and type analysis
  • Interpretability is a major drawback in deep learning methods
  • Techniques can enhance accuracy of neural decompilers
  • Quality of source to assembly mapping provided by compilers affects accuracy
  • NMT methods limited in length of input sequences

Viii. conclusion

  • Neural decompilation can be used to create retargetable decompilers for different programming languages.
  • Results were reported for C, Fortran, Go and OCaml.
  • BTC does not require compiler customization, complex parsing, or specifying keywords or operators.
  • Despite lack of explicit knowledge, model can replicate many features of the language.
  • Challenges remain with code size, semantic accuracy and data quality.