Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Language models can be used for complex reasoning either end-to-end or compositionally.
Iterated decomposition is a workflow for developing and refining compositional LM programs.
ICE is an open-source tool for visualizing the execution traces of LM programs.
Iterated decomposition is applied to three real-world tasks and improves accuracy of LM programs.

Paper Content

Introduction

Language models are often trained using feedback on outcomes
Good outputs can be distinguished from bad ones
As model capabilities and task complexities scale up, outcome-based evaluation may run into alignment problems
Process supervision is an alternative to outcome-based training
Process supervision promises increased interpretability, trust, and alignment

Process supervision

Process supervision is a way to train and deploy machine learning models.
Human-understandable steps are used in the process.
Literature is reviewed and gaps are highlighted.
Iterated decomposition workflow and ICE visualizer are explained.

Prior work on process supervision

Significant advances in techniques and frameworks for process supervision
Real-world use cases are rare
Review of prior work and gaps in literature
Rapidly growing field
New task decompositions, training and finetuning techniques, workflows and tutorials, surveys and frameworks, tools, and theory
Creswell et al., Kazemi et al., Wu et al., Saha et al., Yang et al., ReAct, WebGPT, Gao et al., Trivedi et al., Khattab et al., Khot et al., Ozturkler et al., Jung et al., Wies et al., Press et al.
Compositionality gap
Taxonomy of process and supervision
Iterated decomposition workflow and ICE visualizer

Iterated decomposition

Start with minimal task decomposition
Apply decomposition to test inputs with gold standard answers
Evaluate results automatically or manually
Identify failures and improve components
Repeat steps 2-4 until performance is good enough or resources are exhausted
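The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `Example`, `evaluate`, `iterate`, and `patch_failures` are hypothetical names, exact string match stands in for whatever automatic or manual evaluation is used, and `improve` stands in for the developer who inspects failures and refines a component.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Program = Callable[[str], str]  # a decomposition maps a question to an answer


@dataclass
class Example:
    question: str
    gold: str  # gold-standard answer


def evaluate(program: Program, examples: List[Example]) -> Tuple[float, List[Example]]:
    """Run the program on every test input and collect the failures."""
    if not examples:
        raise ValueError("need at least one gold-standard example")
    failures = [
        ex for ex in examples
        if program(ex.question).strip().lower() != ex.gold.strip().lower()
    ]
    return 1 - len(failures) / len(examples), failures


def iterate(
    program: Program,
    examples: List[Example],
    improve: Callable[[Program, List[Example]], Program],
    target: float = 0.9,
    max_rounds: int = 5,  # stands in for "resources are exhausted"
) -> Tuple[Program, List[float]]:
    """Apply, evaluate, and improve until the target is met or rounds run out."""
    history: List[float] = []
    for _ in range(max_rounds):
        accuracy, failures = evaluate(program, examples)
        history.append(accuracy)
        if accuracy >= target or not failures:
            break
        program = improve(program, failures)
    return program, history


def patch_failures(program: Program, failures: List[Example]) -> Program:
    """Toy stand-in for a developer fix: it just memorizes the failed cases."""
    fixes = {ex.question: ex.gold for ex in failures}
    return lambda question: fixes.get(question, program(question))


examples = [Example("2+2", "4"), Example("3+3", "6")]
final_program, history = iterate(lambda question: "4", examples, patch_failures)
print(history)
```

In practice the `improve` step is where the human effort goes, and a fix that memorizes gold answers, as the toy one here does, would overfit to the test set.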

Interactive composition explorer

ICE provides a decorator to record async functions
ICE provides utilities to record custom values, strings, and structure
Data can be visualized in a browser for interactive exploration
ICE provides three views into the data: tree, table, and detail pane
Dropdown menu of all recorded functions and their call counts
Detail pane includes special support for rendering interpolated strings
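To make the idea of recording async functions concrete, here is a minimal sketch of a tracing decorator. It is not ICE's actual API: `recorded`, `TRACE`, `call_counts`, and the toy `classify` and `answer` functions are all hypothetical, and the real tool records far more (custom values, strings, and structure) and renders it in a browser.

```python
import asyncio
import contextvars
import functools
from collections import Counter

TRACE = []  # flat list of call records; "parent" ids link them into a tree
_current_call = contextvars.ContextVar("current_call", default=None)


def recorded(fn):
    """Record arguments, result, and parent call for an async function."""
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        record = {
            "id": len(TRACE),
            "parent": _current_call.get(),
            "name": fn.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": None,
        }
        TRACE.append(record)
        token = _current_call.set(record["id"])
        try:
            record["result"] = await fn(*args, **kwargs)
            return record["result"]
        finally:
            _current_call.reset(token)
    return wrapper


def call_counts():
    """Per-function call counts, like a dropdown of recorded functions."""
    return Counter(record["name"] for record in TRACE)


@recorded
async def classify(paragraph):
    # Toy sub-step; a real program would call a language model here.
    return "placebo" in paragraph.lower()


@recorded
async def answer(paragraphs):
    flags = await asyncio.gather(*(classify(p) for p in paragraphs))
    return any(flags)


result = asyncio.run(answer(["Patients received a placebo pill.", "No control arm."]))
print(result, dict(call_counts()))
```

Because each record stores its parent id, the flat `TRACE` list can be rendered either as a tree of nested calls or as a table of all calls to one function.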

Real-world context of case studies

Used iterated decomposition and ICE to improve performance on 3 real-world case studies
Placebo Classification & Description task helps researchers evaluate risk of bias
Participant Flow in RCTs task helps users understand what researchers did in a study
Evaluating participant adherence to an intervention helps assess risk of bias
Elicit supports 20 pre-specified questions and allows users to enter their own questions
QASPER NLP Q&A task tests generalizing participant flow decomposition to a different domain

Case study: placebo classification & description

Focused on domain-specific decompositions
Accuracy of generated placebo description improved from 25% to 65%

Setup

QASPER is a dataset of 5049 questions about NLP papers.
Answers to QASPER questions are in one of three formats: yes/no, excerpts from the text, or freeform answers.
The authors measured F1 scores of generated answers versus the expert answers.
The dataset comes with a ranked list of relevant documents.
The goal is to answer questions about NLP papers.
The baseline fails primarily by failing to make good use of the long context of the paper.
An approach of ranking the most relevant paragraphs and using them to generate an answer resulted in a big improvement on the Elicit baseline.
A regex keyword-matching algorithm was created to classify whether a trial used a placebo.
The regex keyword approach does not generalize to harder and more ambiguous tasks.
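A regex keyword classifier of the kind described above might look like the following sketch. The keyword list and the name `mentions_placebo` are assumptions for illustration; the source does not give the actual pattern.

```python
import re

# Hypothetical keyword list; the pattern used in the paper is not given here.
PLACEBO_PATTERN = re.compile(r"\b(placebos?|sham)\b", re.IGNORECASE)


def mentions_placebo(text: str) -> bool:
    """Return True if the text contains any placebo-related keyword."""
    return PLACEBO_PATTERN.search(text) is not None


for sentence in [
    "A double-blind, placebo-controlled trial.",
    "Participants were compared with usual care.",
    "No placebo was used.",
]:
    print(sentence, mentions_placebo(sentence))
```

The third sentence shows why keyword matching breaks down on harder cases: it contains the keyword, so the classifier fires even though the sentence says no placebo was used.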

Case study: participant flow in randomized controlled trials

Accuracy of extracting experiments improved from 40% to 70%.
Accuracy of trial arms improved from 55% to 86%.
Accuracy of adherence improved from 53% to 70%.

Evaluation

Experiments and arms can be evaluated easily
Adherence requires a narrative answer and is more subjective
Adherence may benefit from a nuanced decomposition
There are 135 adherence answers in the test set
Information about adherence is only available for 41% of the arms

Iterations

Tested select-then-generate baselines and best-performing decompositions on QASPER questions
Best approach was perplexity selection with few-shot demonstrations, which scored 69% accuracy
Baseline approach of taking the top-1 paragraph from the classifier and generating an answer scored 38%
Elicit-like perplexity classifier with generation without few-shot demos scored 55%
Robust to domain transfer
Baseline same as Elicit, scored 53% on adherence subtask
Finetuning with T-Few approach improved adherence subtask from 72% to 94%
Prompt engineering improved performance on specific subtasks
Zero-shot perplexity-based cross-encoder outperformed monoT5 classifier
Pruning step raised precision from 0.14 to 0.29 while lowering recall from 0.77 to 0.54, increasing F1 from 0.24 to 0.38 and accuracy from 0.55 to 0.6
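The F1 figures for the pruning step follow from the reported precision and recall, since F1 is their harmonic mean. A short check, with `f1_score` as an illustrative helper name:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


before_pruning = f1_score(0.14, 0.77)
after_pruning = f1_score(0.29, 0.54)
print(round(before_pruning, 2), round(after_pruning, 2))
```

The harmonic mean is pulled toward the smaller of the two values, so doubling the low precision outweighs the loss in recall.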

Conclusion

Iterated decomposition is a workflow for process supervision
Elicit is an AI Research Assistant
ICE is an open-source debugger for language model programs
Iteration speed needs to be increased
Decompositions need to be more complex
Developer tools need to be more sophisticated
Machine-accessible dev tools need to be created
Task design and execution need to be reduced
Cost of complex decompositions needs to be reduced
Supervision goal is to identify and fix failure modes, detect and correct errors, evaluate final task answer, or generate feedback signal
Supervisor is the developer
Ground truth is available
Placebo case study evaluates each trial
A placebo is an inactive substance or intervention that mimics the treatment given to the treatment group
Selection found mention of a placebo but failed to find the description
