Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Large language models can predict structured text such as code.
  • These models can make mistakes that users must fix or introduce subtle bugs.
  • R-U-SURE is an approach to build uncertainty-aware suggestions based on a decision-theoretic model.
  • R-U-SURE can be applied to different user interaction patterns without retraining the model.

Paper Content

Introduction

  • Large language models can generate natural language and source code
  • Used in developer assistance tools and services
  • Tendency to guess or “hallucinate” unwanted outputs
  • Can slow development and lead to undetected problems
  • Automation bias can cause users to miss issues in automated systems
  • Some parts of user intent can be predicted more easily than others
  • Approach proposed to predict parts of generated programs that may need editing
  • Approach uses a pretrained language model to generate a suggestion prototype
  • Combinatorial optimization used to insert annotations into the prototype
  • Indicators of model uncertainty or user-fillable “holes” can be included
  • Utility-driven framework proposed to produce uncertainty-aware suggestions
  • Dual decomposition used to optimize utility functions
  • Variants of utility functions constructed to incorporate tree structure, account for deletions and insertions, and respond to uncertainty
  • Demonstrated across three developer-assistance-inspired tasks

Problem statement

  • Problem of providing contextual, uncertainty-aware suggestions to assist users of ML-integrated tools
  • Focus on assisting software development
  • Not enough information to fully determine user’s intent
  • Augment space of possible suggestions to account for uncertainty
  • Insert visual markers into code-completion suggestion to draw attention to parts of suggestion user may wish to change
  • Formalize intuition using decision theoretic framework
  • Maximize utility of suggestion given context and uncertainty about user’s goal

Approach

  • We do not have access to the distribution in Equation (1).
  • R-U-SURE is a procedure for approximating s* by combining samples from a model using combinatorial optimization.

Approximating true intents with model samples

  • Assume access to a generative model that predicts plausible goals in a given context
  • Model can be used as a proxy for the true conditional distribution
  • Use Monte-Carlo estimate to find suggestion that is reliably useful across high-likelihood goals
  • Search over a restricted space derived from one of the samples

Decomposing into independent subproblems

  • Dual decomposition is a standard technique for optimizing sums over combinatorial discrete spaces.
  • The search space is embedded into the space of binary vectors.
  • The problem is reinterpreted as a constrained optimization problem.
  • Lagrange multipliers are used to relax the equality constraints.
  • Maxmarginal-averaging is an efficient optimization algorithm.
  • Coordinate descent is used to update the Lagrange multipliers.

Expanding utility functions to decision diagrams

  • Objective is to efficiently compute updates in Equation (7)
  • Focus on family of utility functions that can be computed using edit-distance-like dynamic program
  • Rewrite to simultaneously search over edit sequences and confidence annotations
  • Algorithm 1 computes edit-distance-based utility
  • Algorithm 2 extends to search over confidence annotations
  • Maximum-utility path simultaneously computes optimal alignment and optimal confidence annotations
  • Max-marginals incorporated by modifying costs of edges in subproblem
  • Decode approximate solution to Equation (4) by greedily committing to most promising assignment for each element of b

Extending the utility function

  • Method can be applied to any space of suggestions and utility function
  • Incorporating tree structure with hierarchical edits
  • Constraining locations of UNSURE regions
  • Adding localization and insertion penalties
  • Decoding by maximizing utility
  • Sampling-based decoding strategies used to minimize Bayes risk
  • Combining parts of multiple samples to construct a single combined sample
  • Self-consistency decoding uses samples to identify most likely correct answer
  • Reinforcement-learning and sequential-decision-making techniques used to maximize conditional expected utility
  • Maximizing quality metrics such as BLEU or error-rate
  • Selective classification to minimize overall risk
  • Output multiple predictions with confidence above a threshold
  • Representing multiple sequences as a lattice
  • Identifying common patterns in source code
  • Generating programs with holes to aid in program synthesis
  • Comparing model-generated sequences to ground truth
  • Studying calibration of deep models
  • Proposing new mechanisms for training better-calibrated models
  • Language models fine-tuned to output calibrated uncertainty estimates
  • Visualizing per-token probabilities to summarize model uncertainty
  • Dual decomposition and block coordinate descent/ascent solvers applied to optimization problems

Experiments

  • Evaluated approach by applying it to 3 developer assistance tasks
  • Generated suggestion prototypes and hypothetical intents using a 5B-parameter decoder-only LM
  • Parsed them into trees using an error-tolerant bracket-matching pseudo-parser
  • Compared approach to task-specific baselines
  • Evaluated how well each method can predict changes necessary to obtain final code state from dataset

Localizing edits in code suggestions

  • R-U-SURE is used to insert confidence annotations around parts of code completion suggestions
  • Utility function is configured similarly to Algorithm 1
  • 5000 held-out code files for each of the languages JAVA, JAVASCRIPT, C++ and PYTHON are assembled
  • 31 completions are sampled from the language model
  • Results are compared to heuristics based on token probabilities
  • Utility for true file contents, estimated utility and leave-one-out utility are reported
  • Estimated and true utilities are highly correlated
  • Sensitivity and specificity are computed to evaluate binary classification problem
  • F1 scores are reported and Pareto curve is obtained

Selecting suggestion lengths

  • ML models can be used to show inline grey “ghost text” suggestions as users type in the editor.
  • Longer correct suggestions can accelerate developer productivity, but long incorrect suggestions can slow developers down.
  • An approach was developed to search over prefixes of the prototype suggestion.
  • This approach was compared to a number of heuristics and achieved the highest utility.

Api discovery

  • Our method can be used to identify sequences of function calls that the user is likely to write, even if the control flow structures around them are not predictable.
  • We compare our approach against baselines which use all calls in the file or which only predict calls that use identifiers that are not in the context. Results show that our approach has high utility and gives a favorable tradeoff between correct and incorrect predictions.

Discussion

  • R-U-SURE can incorporate uncertainty annotations into model suggestions.
  • Annotations lead to improved performance and accurate predictions.
  • Approach does not require retraining or fine-tuning the base model.
  • Limitation is that it is restricted to utility functions that can be efficiently decomposed.

A. example outputs of r-u-sure

B. decision diagrams: definitions and algorithms

  • Definition of decision diagrams
  • Mappings h, t, w, and α
  • Source and sink nodes
  • Computation paths for binary vectors
  • Weight of paths
  • Ordered BDDs
  • System of BDDs to represent binary functions
  • Utility functions represented as decision diagrams
  • Size of decision diagrams grows quadratically with number of tokens

B.2. a comparison to other definitions of decision diagrams

  • Decision diagrams have been used for a variety of problems
  • Deterministic decision diagrams have two edges from each node and assign a variable
  • Nondeterministic decision diagrams may need to search for a computation path
  • Unambiguous decision diagrams have one computation path for each vector
  • Ambiguous decision diagrams take the max over edges
  • Binary and multivalued decision diagrams can be used

B.3. efficient algorithms for max-marginal message passing on bdds

  • Lagrangian relaxation of BDD system can be optimized efficiently
  • Number of decision diagrams and model samples may not be equal
  • Penalty acts independently on each variable
  • Algorithm uses two dynamic programming tables
  • Algorithm updates λ and λ in alternating forward and backward sweeps
  • Algorithm has time complexity proportional to size of decision diagram
  • Algorithm not guaranteed to find primal solution if dual gap is nonzero

B.4. extension to multivalued decision diagrams

  • Algorithms are optimized over binary variables, but utility functions are written in terms of assignments to an arbitrary finite set of values.
  • Simultaneous block updates are performed over all indicator variables.
  • Entries for m (j) (i,v):=0 are not computed in the implementation of Algorithm 5.
  • When decoding a heuristic primal solution, variables are iterated through and the best assignment is selected.

C. utility functions and tree representation

  • Represent model samples and user intents as sequences of nodes
  • Token nodes represent programming language tokens
  • Decoration nodes denote whitespace
  • Group nodes contain an arbitrary number of child nodes
  • Region start nodes represent locations where we may start a confidence region
  • Region end nodes represent locations where we may end a confidence region
  • Truncation nodes represent locations where we may decide to truncate the suggestion
  • Configured by a set of edit penalties
  • Nodes in decision diagram associated with a tuple of positions
  • Grouped into types to track progress of edits
  • PROCESS-PROTOTYPE, MAY-DELETE, MAY-INSERT, MATCH, RECURSIVELY-DELETING
  • Each node associated with a confidence level
  • Token nodes handled depending on the state
  • Group nodes handled using a recursive call
  • Parameterize the space of suggestions using a binary decision for each control node
  • Introduce UNSURE regions into a suggestion
  • Build a second decision diagram to enforce constraints on annotated regions
  • Tracks a more fine-grained set of confidence types

D. overview of our pseudo-parser

  • Our pseudo-parser converts code fragments into abstract syntax tree (AST) like structures
  • Pairs of these pseudo-parse trees are used by R-U-SURE to construct matching graphs
  • Representing the source code by a syntactically meaningful tree structure allows R-U-SURE to produce completion results that respect the nature of source code
  • Language independence: our system handles JAVA, JAVASCRIPT, C++ and PYTHON
  • Error tolerance: our system can handle syntactically invalid code
  • Tokenization: source code text is converted into a sequence of tokens
  • Basic Bracket Matching: a bracket-matching algorithm produces a nested structure
  • Python code block indents and dedents are handled with a second pass of the pseudo-parser
  • Average utility (higher is better) for R-U-SURE and baseline methods for the uncertainty-regions task
  • Performance increases dramatically with the number of base model samples combined by R-U-SURE
  • Token level sensitivity / specificity trade-off across methods
  • Mean utility for R-U-SURE w.r.t the ground-truth user intent as a function of the number of samples used
  • Stateful dynamic programming algorithm initialization step
  • Stateful dynamic programming cache invalidation step
  • Stateful dynamic programming algorithm for m (j) i:=β
  • Per-token conditional probability heatmap and output of token-probability-based baselines