Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Language models can be used for complex reasoning either end-to-end or compositionally.
  • Iterated decomposition is a workflow for developing and refining compositional LM programs.
  • ICE is an open-source tool for visualizing the execution traces of LM programs.
  • Iterated decomposition is applied to three real-world tasks and improves accuracy of LM programs.

Paper Content

Introduction

  • Language models are often trained using feedback on outcomes
  • Good outputs can be distinguished from bad ones
  • As model capabilities and task complexities scale up, outcome-based evaluation may run into alignment problems
  • Process supervision is an alternative to outcome-based training
  • Process supervision promises increased interpretability, trust, and alignment

Process supervision

  • Process supervision is a way to train and deploy machine learning models.
  • Human-understandable steps are used in the process.
  • Literature is reviewed and gaps are highlighted.
  • Iterated decomposition workflow and ICE visualizer are explained.

Prior work on process supervision

  • Significant advances in techniques and frameworks for process supervision
  • Real-world use cases are rare
  • Review of prior work and gaps in literature
  • Rapidly growing field
  • New task decompositions, training and finetuning techniques, workflows and tutorials, surveys and frameworks, tools, and theory
  • Creswell et al., Kazemi et al., Wu et al., Saha et al., Yang et al., ReAct, WebGPT, Gao et al., Trivedi et al., Khattab et al., Khot et al., Ozturkler et al., Jung et al., Wies et al., Press et al.
  • Compositionality gap
  • Taxonomy of process and supervision
  • Iterated decomposition workflow and ICE visualizer

Iterated decomposition

  • Start with minimal task decomposition
  • Apply decomposition to test inputs with gold standard answers
  • Evaluate results automatically or manually
  • Identify failures and improve components
  • Repeat the apply, evaluate, and improve steps until performance is good enough or resources are exhausted
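The workflow above can be sketched as a small evaluation loop. This is a minimal illustration, not the authors' code: the names `Example`, `evaluate`, `iterate`, `baseline`, and `refine` are all hypothetical, exact-match scoring stands in for whatever automatic or manual evaluation is used, and the toy `refine` function stands in for the developer's manual work of inspecting failures and improving a component.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Program = Callable[[str], str]


@dataclass
class Example:
    question: str
    gold: str  # gold-standard answer


def evaluate(program: Program, examples: List[Example]) -> Tuple[float, List[Example]]:
    """Run the program on every example; return accuracy and the failed examples."""
    failures = [
        ex for ex in examples
        if program(ex.question).strip().lower() != ex.gold.strip().lower()
    ]
    return 1 - len(failures) / len(examples), failures


def iterate(program, refine, examples, target=0.9, budget=5):
    """Refine `program` until accuracy reaches `target` or `budget` rounds are used."""
    accuracy, failures = evaluate(program, examples)
    history = [accuracy]
    rounds = 0
    while accuracy < target and rounds < budget:
        program = refine(program, failures)          # improve a failing component
        accuracy, failures = evaluate(program, examples)  # re-test on gold answers
        history.append(accuracy)
        rounds += 1
    return program, history


def baseline(question: str) -> str:
    # Minimal "decomposition": always give the same answer.
    return "yes"


def refine(program: Program, failures: List[Example]) -> Program:
    # Toy stand-in for a manual fix: special-case the first observed failure.
    bad = failures[0]

    def patched(question: str) -> str:
        return bad.gold if question == bad.question else program(question)

    return patched


examples = [
    Example("Was a placebo used in trial A?", "yes"),
    Example("Was a placebo used in trial B?", "no"),
    Example("Was a placebo used in trial C?", "no"),
]
final_program, history = iterate(baseline, refine, examples)
print(history)
```

In this toy run the accuracy rises with each round until every example passes. In practice the refinement step is where a developer would add or change sub-steps of the decomposition rather than patch individual answers.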

Interactive composition explorer

  • ICE provides a decorator to record async functions
  • ICE provides utilities to record custom values, strings, and structure
  • Data can be visualized in a browser for interactive exploration
  • ICE provides three views into the data: tree, table, and detail pane
  • Dropdown menu of all recorded functions and their call counts
  • Detail pane includes special support for rendering interpolated strings
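To make the idea of recording async functions concrete, here is a minimal stand-in written with the standard library only. It is not ICE's API: the `record` decorator, the `ROOTS` list, and the `call_counts` helper are hypothetical names chosen for this sketch. It shows how a decorator can capture arguments, results, and the parent/child structure of calls, which is the kind of data a tree view and a per-function call count could be built from.

```python
import asyncio
import contextvars
import functools

# The call currently executing, so nested calls can attach themselves to it.
_current = contextvars.ContextVar("current_call", default=None)
ROOTS = []  # top-level recorded calls


def record(fn):
    """Record the arguments, result, and child calls of an async function."""
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        node = {"name": fn.__name__, "args": args, "kwargs": kwargs, "children": []}
        parent = _current.get()
        (parent["children"] if parent is not None else ROOTS).append(node)
        token = _current.set(node)
        try:
            node["result"] = await fn(*args, **kwargs)
            return node["result"]
        finally:
            _current.reset(token)
    return wrapper


@record
async def classify(paragraph):
    # Toy sub-step: does this paragraph mention a placebo?
    return "placebo" in paragraph.lower()


@record
async def answer(paragraphs):
    # Toy top-level step: run the sub-step on every paragraph concurrently.
    results = await asyncio.gather(*(classify(p) for p in paragraphs))
    return any(results)


def call_counts(nodes, counts=None):
    """Count how many times each recorded function was called."""
    counts = {} if counts is None else counts
    for node in nodes:
        counts[node["name"]] = counts.get(node["name"], 0) + 1
        call_counts(node["children"], counts)
    return counts


result = asyncio.run(answer([
    "Patients received a placebo pill.",
    "Outcomes were measured at week 12.",
]))
counts = call_counts(ROOTS)
print(result, counts)
```

A context variable is used here because each task started by `asyncio.gather` gets its own copy of the context, so concurrent child calls still attach to the correct parent.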

Real-world context of case studies

  • Used iterated decomposition and ICE to improve performance on 3 real-world case studies
  • Placebo Classification & Description task helps researchers evaluate risk of bias
  • Participant Flow in RCTs task helps users understand what researchers did in the study
  • Evaluating participant adherence to an intervention helps assess risk of bias
  • Elicit supports 20 pre-specified questions and allows users to enter their own questions
  • QASPER NLP Q&A task tests generalizing participant flow decomposition to a different domain

Case study: placebo classification & description

  • Focused on domain-specific decompositions
  • Accuracy of generated placebo description improved from 25% to 65%

Setup

  • QASPER is a dataset of 5049 questions about NLP papers.
  • Answers to QASPER questions are in one of three formats: yes/no, excerpts from the text, or freeform answers.
  • The authors measured F1 scores of generated answers versus the expert answers.
  • The dataset comes with a ranked list of relevant documents.
  • The goal is to answer questions about NLP papers.
  • The baseline fails primarily by failing to make good use of the long context of the paper.
  • An approach of ranking the most relevant paragraphs and using them to generate an answer resulted in a big improvement on the Elicit baseline.
  • A regex keyword-matching algorithm was created to classify whether a trial used a placebo.
  • The regex keyword approach does not generalize to harder and more ambiguous tasks.
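A regex keyword matcher of the kind mentioned above might look like the following sketch. The pattern is an assumption made for illustration; the keywords actually used in the paper are not given in this summary, and `PLACEBO_PATTERN` and `mentions_placebo` are hypothetical names.

```python
import re

# Hypothetical keyword pattern; the paper's actual keywords are not listed here.
PLACEBO_PATTERN = re.compile(
    r"\b(placebo(?:s|-controlled)?|sham|dummy (?:pill|tablet))\b",
    re.IGNORECASE,
)


def mentions_placebo(text: str) -> bool:
    """Return True if the text contains any placebo-related keyword."""
    return PLACEBO_PATTERN.search(text) is not None


print(mentions_placebo("Participants received a matching placebo tablet."))
print(mentions_placebo("The control group received standard care."))
print(mentions_placebo("No placebo was used in this open-label trial."))
```

The third sentence is flagged even though it says no placebo was used. That is one plausible example of why pure keyword matching struggles on harder or more ambiguous inputs: it detects mentions, not meaning.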

Case study: participant flow in randomized controlled trials

  • Accuracy of extracting experiments improved from 40% to 70%.
  • Accuracy of trial arms improved from 55% to 86%.
  • Accuracy of adherence improved from 53% to 70%.

Evaluation

  • Experiments and arms can be evaluated easily
  • Adherence requires a narrative answer and is more subjective
  • Adherence may benefit from a nuanced decomposition
  • There are 135 adherence answers in the test set
  • Information about adherence is only available for 41% of the arms

Iterations

  • Tested select-then-generate baselines and best-performing decompositions on QASPER questions
  • Best approach was perplexity-based selection combined with few-shot examples drawn from demonstrations, which scored 69% accuracy
  • Baseline approach (take the classifier's top-1 paragraph, then generate an answer) scored 38%
  • Elicit-like approach (perplexity classifier, then generation without few-shot demos) scored 55%
  • Robust to domain transfer
  • Baseline same as Elicit, scored 53% on adherence subtask
  • Finetuning with T-Few approach improved adherence subtask from 72% to 94%
  • Prompt engineering improved performance on specific subtasks
  • Zero-shot perplexity-based cross-encoder outperformed monoT5 classifier
  • Pruning step raised precision from 0.14 to 0.29 while recall fell from 0.77 to 0.54; the net effect was an increase in F1 from 0.24 to 0.38 and in accuracy from 0.55 to 0.6
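The precision, recall, and F1 figures reported for the pruning step can be checked against each other using the standard definition of F1 as the harmonic mean of precision and recall. The function name `f1` is just a label for this sketch.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


before = f1(0.14, 0.77)  # before pruning
after = f1(0.29, 0.54)   # after pruning
print(round(before, 2), round(after, 2))  # 0.24 0.38
```

The drop in recall is outweighed by the gain in precision, which is why F1 rises even though fewer relevant items are retrieved.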

Conclusion

  • Iterated decomposition is a workflow for process supervision
  • Elicit is an AI Research Assistant
  • ICE is an open-source debugger for language model programs
  • Iteration speed needs to be increased
  • Decompositions need to be more complex
  • Developer tools need to be more sophisticated
  • Machine-accessible dev tools need to be created
  • Effort spent on task design and execution needs to be reduced
  • Cost of complex decompositions needs to be reduced
  • Supervision goal is to identify and fix failure modes, detect and correct errors, evaluate final task answer, or generate feedback signal
  • Supervisor is the developer
  • Ground truth is available
  • Placebo case study evaluates each trial
  • Placebo is a substance or intervention that mimics the treatment group
  • Selection found mention of a placebo but failed to find the description