Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- End-to-end neural approaches lack interpretability and robustness
- Binder is a training-free neural-symbolic framework that maps task input to a program
- Unified API of language model functionalities is used to extend grammar coverage
- GPT-3 Codex is used as the language model
- Few in-context exemplar annotations are used
- Binder achieves state-of-the-art results on WikiTableQuestions and TabFact datasets
- No training required, only uses dozens of annotations as in-context exemplars
Paper Content
Introduction
- Performance on natural language processing tasks is dominated by neural end-to-end systems
- Symbolic approaches produce explicit intermediate representations
- Symbolic approaches are interpretable and robust
- Coverage is limited by the grammar of the symbolic language
- Neural-symbolic approaches combine neural modules and symbolic languages
- Neural-symbolic approaches require human design and large training data
- BINDER is a training-free neural-symbolic framework that maps task inputs to an executable program
- BINDER requires few annotations and is more interpretable, scalable, and robust than end-to-end approaches
Approach
Binder framework
- BINDER framework is used to solve NLP tasks
- BINDER program is generated from natural language input and optional context
- Output answer is derived by executing BINDER program with interpreter
Binder parsing
- Input natural language is parsed into a BINDER program
- BINDER program is an expression in a symbolic language that includes API calls
- API call is a function that accepts a question and context to be queried
- Output of API call is the answer to the question
- Output is represented as a variable compatible with the symbolic language grammar
Binder execution
- Program Z is executed by a BINDER interpreter to derive the answer A.
- BINDER interpreter consists of a standard symbolic language interpreter and the model(s) realizing the API calls.
- Lexical and syntax analysis includes adding f ( Q ; D) as a new identifier in the grammar.
- Program evaluation involves evaluating the API calls by calling the underlying neural models.
In-context learning for binder
- Uses large language models for in-context learning
- Only takes a few annotations/demonstrations as a prompt
- Performs inference without training model parameters
- Uses Codex as both semantic parser and model to perform API call functionalities
- Takes advantage of few-shot generalization ability of Codex
- Applies in-context learning for BINDER with k in-context exemplars
- Outputs n candidate BINDER programs
- Programs are executed by BINDER interpreter
- Output answer is derived via majority voting strategy
Binder implementation
- BINDER is designed to be extensible to various programming languages and API call functionalities.
- Two APIs are implemented: f col and f val.
- f col calls a language model to answer questions based on column data.
- f val is used for more complex questions and outputs a value as the answer.
Experiments
Experiment setup
- Evaluated method on three knowledge grounding datasets
- WIKITQ requires complex table reasoning skills
- 20% of WIKITQ questions not answerable by pure SQL
- TABFACT is a binary fact verification benchmark
- Evaluation metrics are execution accuracy for WIKITQ and TABFACT
- Pre-matching check for semantically correct cases in WIKITQ
Method
- Compared to other strong published methods, Codex BINDER (ours) achieved 85.1 accuracy on the official small test set without finetuning.
- Codex was also evaluated with additional inference models, including end-to-end QA and semantic parsing with the standard SQL language.
Implementation details
- Used OpenAI Codex API model for experiments
- Annotated 14 in-context exemplars with BINDER programs
- Prompt format follows Rajkumar et al., 2022
- Benefits of BINDER include interpretability and robustness
Analysis
Ablation study
- Binding neural module API calls into a programming language can help solve queries that are unsolvable in that language alone.
- Codex BINDER outperforms Codex SQL by 10.1% on program-unsolvable questions.
- BINDER has a much lower spurious rate than SQL (12% vs. 33%).
Interpretability
- BINDER improves interpretability over end-to-end approaches
- BINDER enables finding the source of errors and provides a way to fix them
Robustness
Scalability
- BINDER is more scalable than end-to-end QA
- BINDER can handle large knowledge sources, while end-to-end QA fails or degrades
Noisy content
- End-to-end methods are more brittle to noisy inputs.
- A noisy WIKITQ development subset was built with distractors.
- BINDER is stable confronting distractors, while end-to-end QA is more likely to be confused.
Binder with python
- BINDER is designed to be extensible to various programming languages
- Python (with the Pandas package) is used as the BINDER language on WIKITQ
- Neural API is incorporated into BINDER with Python
- Evaluated on program-unsolvable subset of WIKITQ to test if method improves Python’s capability
- BINDER with Python effectively improves Python coverage on difficult subset
Multimodal application
- BINDER is applied to the multi-modal dataset MULTIMODALQA (MMQA) across text, tables, and images.
- Images are converted into textual image captions with a vision-text pretrained model OFA.
- BINDER achieves better performance than end-to-end QA and the fine-tuned baseline Implicit-Decomp.
- With the oracle retriever, BINDER can achieve comparable performance with the state-of-the-art.
Related work
- Semantic parsing is a symbolic method used to produce executable programs from natural language input
- Neural-Symbolic methods integrate neural modules with symbolic languages
- BINDER is a training-free method that requires only dozens of annotations and is expressive and flexible to handle real-world diverse questions
Conclusion
- Propose BINDER, a training-free neural-symbolic framework
- Combines strengths of end-to-end and symbolic approaches
- State-of-the-art performance on WIKITQ and TABFACT with only a few in-context demonstrations
- No additional training required
- Language model-focused attempt to integrate two widely-adopted paradigms in NLP
- Can be extended to many more scenarios with the appropriate programming language and functionalities
- Codex used as the LM for all neural modules
- Image captions used for images
- Majority vote used to ensemble multiple candidate answers
- BINDER grammar adapted to SQL
- Performance increases with more generations of BINDER programs
- Annotation interface allows real-time executions with huggingface spaces2
- Source code to be released