

  • End-to-end neural approaches lack interpretability and robustness
  • Binder is a training-free neural-symbolic framework that maps task input to a program
  • A unified API of language model functionalities is bound to a programming language (e.g., SQL, Python) to extend its grammar coverage
  • GPT-3 Codex is used as the language model
  • Only a few in-context exemplar annotations are used
  • Binder achieves state-of-the-art results on WikiTableQuestions and TabFact datasets
  • No training required, only uses dozens of annotations as in-context exemplars

Paper Content


  • Performance on natural language processing tasks is dominated by neural end-to-end systems
  • Symbolic approaches produce explicit intermediate representations
  • Symbolic approaches are interpretable and robust
  • Coverage is limited by the grammar of the symbolic language
  • Neural-symbolic approaches combine neural modules and symbolic languages
  • Neural-symbolic approaches require human design and large training data
  • BINDER is a training-free neural-symbolic framework that maps task inputs to an executable program
  • BINDER requires few annotations and is more interpretable, scalable, and robust than end-to-end approaches


Binder framework

  • BINDER framework is used to solve NLP tasks
  • BINDER program is generated from natural language input and optional context
  • Output answer is derived by executing BINDER program with interpreter

Binder parsing

  • Input natural language is parsed into a BINDER program
  • BINDER program is an expression in a symbolic language that includes API calls
  • API call is a function that accepts a question and context to be queried
  • Output of API call is the answer to the question
  • Output is represented as a variable compatible with the symbolic language grammar
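As a rough illustration of the parsing output described above, the sketch below shows what a BINDER-style program could look like when the symbolic language is SQL, and how the embedded API calls might be located. The QA("question"; column) spelling, the table name w, and the helper find_api_calls are assumptions made for this example, not the paper's exact syntax.

```python
import re

# Hypothetical BINDER-style program: standard SQL plus one embedded API call.
# The QA("..."; column) spelling is an assumption made for this sketch.
program = (
    'SELECT name FROM w '
    'WHERE QA("Is this country in North America?"; country) = \'yes\''
)

# Matches QA("<question>"; <column>) and captures the two arguments.
API_CALL = re.compile(r'QA\("(?P<question>[^"]*)";\s*(?P<context>\w+)\)')


def find_api_calls(program_text):
    """Return (question, context) pairs for every API call in the program."""
    return [
        (m.group("question"), m.group("context"))
        for m in API_CALL.finditer(program_text)
    ]


calls = find_api_calls(program)
print(calls)
```

Each pair holds the question to ask and the context (here, a column name) to ask it about. Everything outside the API call is ordinary SQL.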

Binder execution

  • Program Z is executed by a BINDER interpreter to derive the answer A.
  • BINDER interpreter consists of a standard symbolic language interpreter and the model(s) realizing the API calls.
  • Lexical and syntax analysis includes adding f(Q; D) as a new identifier in the grammar.
  • Program evaluation involves evaluating the API calls by calling the underlying neural models.
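The execution steps above can be sketched as a small interpreter: API calls are evaluated first by a model, their outputs are stored as a new column, and the remaining program is run by a standard SQL engine. This is a minimal sketch under stated assumptions, not the paper's implementation: the QA("question"; column) syntax, the toy_model stand-in, and the qa_0 column naming are all hypothetical, and SQLite is used only because it ships with Python.

```python
import re
import sqlite3

# Assumed API-call syntax for this sketch: QA("<question>"; <column>)
API_CALL = re.compile(r'QA\("([^"]*)";\s*(\w+)\)')


def toy_model(question, value):
    """Stand-in for the neural model; a real system would query an LM here."""
    north_america = {"Canada", "Mexico", "USA"}
    return "yes" if value in north_america else "no"


def execute(program, conn, table="w", model=toy_model):
    """Evaluate each API call into a new column, then run plain SQL."""
    for i, match in enumerate(list(API_CALL.finditer(program))):
        question, column = match.groups()
        new_col = f"qa_{i}"  # hypothetical name for the derived column
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {new_col} TEXT")
        rows = conn.execute(f"SELECT rowid, {column} FROM {table}").fetchall()
        for rowid, value in rows:
            conn.execute(
                f"UPDATE {table} SET {new_col} = ? WHERE rowid = ?",
                (model(question, value), rowid),
            )
        # The API call is now an ordinary column the SQL engine understands.
        program = program.replace(match.group(0), new_col)
    return conn.execute(program).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE w (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO w VALUES (?, ?)",
    [("Alice", "Canada"), ("Bob", "France"), ("Carla", "Mexico")],
)
program = (
    'SELECT name FROM w '
    'WHERE QA("Is this country in North America?"; country) = \'yes\' '
    'ORDER BY name'
)
answer = execute(program, conn)
print(answer)
```

Storing the model output as a column keeps it compatible with the symbolic language, so filtering, sorting, and aggregation stay in the SQL engine.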

In-context learning for binder

  • Uses large language models for in-context learning
  • Only takes a few annotations/demonstrations as a prompt
  • Performs inference without training model parameters
  • Uses Codex as both semantic parser and model to perform API call functionalities
  • Takes advantage of few-shot generalization ability of Codex
  • Applies in-context learning for BINDER with k in-context exemplars
  • Outputs n candidate BINDER programs
  • Programs are executed by BINDER interpreter
  • Output answer is derived via majority voting strategy
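The final voting step can be sketched in a few lines. This assumes each of the n candidate programs has already been executed and produced a hashable answer, with None standing for a program that failed to run. The tie-breaking rule (first answer seen wins) is an assumption of this sketch, not something the summary specifies.

```python
from collections import Counter


def majority_vote(candidate_answers):
    """Return the most frequent answer among the executed candidates.

    Candidates whose program failed to execute are passed in as None and
    skipped. Ties go to the answer that appeared first.
    """
    valid = [a for a in candidate_answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Answers from five hypothetical candidate programs; one failed to execute.
answers = ["1994", "1996", "1994", None, "1994"]
final_answer = majority_vote(answers)
print(final_answer)
```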

Binder implementation

  • BINDER is designed to be extensible to various programming languages and API call functionalities.
  • Two APIs are implemented: f_col and f_val.
  • f_col calls a language model to answer a question over column data and maps the result into a new column.
  • f_val is used for more complex questions and maps the question and its context into a single value.
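The difference between the two APIs is mainly the shape of the output: one answer per cell versus one answer overall. The sketch below illustrates that contrast with a hypothetical toy_lm stand-in for the language model. The function signatures are assumptions made for illustration and are not taken from the paper's code.

```python
def toy_lm(question, values):
    """Stand-in for the language model: 'yes' if any value ends in 'ia'."""
    return "yes" if any(v.endswith("ia") for v in values) else "no"


def f_col(question, column_values, lm=toy_lm):
    """Ask the question once per cell and return the answers as a new column."""
    return [lm(question, [value]) for value in column_values]


def f_val(question, context_values, lm=toy_lm):
    """Ask the question once over the whole context and return one value."""
    return lm(question, list(context_values))


countries = ["India", "Peru", "Austria"]
new_column = f_col("Does the country name end in 'ia'?", countries)
single_value = f_val("Does any country name end in 'ia'?", countries)
print(new_column, single_value)
```

The column returned by f_col has one entry per input row, so it can be used like any other column in the host language, while the output of f_val can be used wherever a literal value is allowed.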


Experiment setup

  • Evaluated method on three knowledge grounding datasets
  • WIKITQ requires complex table reasoning skills
  • About 20% of WIKITQ questions are not answerable by pure SQL
  • TABFACT is a binary fact verification benchmark
  • Evaluation metrics are execution accuracy for WIKITQ and TABFACT
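  • A pre-matching check is applied before the official WIKITQ evaluator so that semantically correct answers that differ only in surface form are not counted as wrong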
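Execution accuracy is the fraction of examples whose executed answer matches the gold answer. The sketch below pairs it with a simple normalization step to show the idea behind a pre-matching check. The specific normalization rules (trimming, lowercasing, dropping thousands separators) are assumptions for illustration and are not the rules used in the paper.

```python
def normalize(answer):
    """Hypothetical normalization: trim, lowercase, drop thousands separators."""
    return str(answer).strip().lower().replace(",", "")


def execution_accuracy(predictions, golds):
    """Fraction of predictions that match the gold answer after normalization."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return correct / len(golds)


predictions = ["1,200", " Paris", "7"]
golds = ["1200", "paris", "8"]
accuracy = execution_accuracy(predictions, golds)
print(round(accuracy, 3))
```

Without the normalization step, the first two predictions would be scored as wrong even though they carry the same meaning as the gold answers.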


  • Compared with other strong published methods, Codex BINDER achieves 85.1 accuracy on the official small TABFACT test set without fine-tuning.
  • Codex is also evaluated in additional inference modes, including end-to-end QA and semantic parsing with standard SQL.

Implementation details

  • Used OpenAI Codex API model for experiments
  • Annotated 14 in-context exemplars with BINDER programs
  • Prompt format follows Rajkumar et al., 2022
  • Benefits of BINDER include interpretability and robustness


Ablation study

  • Binding neural module API calls into a programming language can help solve queries that are unsolvable in that language alone.
  • Codex BINDER outperforms Codex SQL by 10.1% on program-unsolvable questions.
  • BINDER has a much lower spurious rate than SQL (12% vs. 33%).


  • BINDER improves interpretability over end-to-end approaches
  • BINDER enables finding the source of errors and provides a way to fix them



  • BINDER is more scalable than end-to-end QA
  • BINDER can handle large knowledge sources, while end-to-end QA fails or degrades

Noisy content

  • End-to-end methods are more brittle to noisy inputs.
  • A noisy WIKITQ development subset was built with distractors.
  • BINDER is stable confronting distractors, while end-to-end QA is more likely to be confused.

Binder with python

  • BINDER is designed to be extensible to various programming languages
  • Python (with the Pandas package) is used as the BINDER language on WIKITQ
  • Neural API is incorporated into BINDER with Python
  • Evaluated on program-unsolvable subset of WIKITQ to test if method improves Python’s capability
  • BINDER with Python effectively improves Python coverage on difficult subset

Multimodal application

  • BINDER is applied to the multi-modal dataset MULTIMODALQA (MMQA) across text, tables, and images.
  • Images are converted into textual image captions with a vision-text pretrained model OFA.
  • BINDER achieves better performance than end-to-end QA and the fine-tuned baseline Implicit-Decomp.
  • With the oracle retriever, BINDER can achieve comparable performance with the state-of-the-art.

Related work

  • Semantic parsing is a symbolic method used to produce executable programs from natural language input
  • Neural-symbolic methods integrate neural modules with symbolic languages
  • BINDER is a training-free method that requires only dozens of annotations and is expressive and flexible enough to handle diverse real-world questions


  • Propose BINDER, a training-free neural-symbolic framework
  • Combines strengths of end-to-end and symbolic approaches
  • State-of-the-art performance on WIKITQ and TABFACT with only a few in-context demonstrations
  • No additional training required
  • Language model-focused attempt to integrate two widely-adopted paradigms in NLP
  • Can be extended to many more scenarios with the appropriate programming language and functionalities
  • Codex used as the LM for all neural modules
  • Image captions used for images
  • Majority vote used to ensemble multiple candidate answers
  • BINDER grammar adapted to SQL
  • Performance increases with more generations of BINDER programs
  • Annotation interface allows real-time execution with Hugging Face Spaces
  • Source code to be released