Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Focuses on a novel optimization problem where the objective function can only be evaluated through a ranking oracle.
  • Reinforcement Learning with Human Feedback (RLHF) is an example of this problem, used to improve the quality of LLMs with human guidance.
  • Proposes ZO-RankSGD, a zeroth-order optimization algorithm with a theoretical guarantee.
  • Can be applied to policy search problems in reinforcement learning with only a ranking oracle of the episode reward.
  • Demonstrates effectiveness of ZO-RankSGD in improving the quality of images generated by a diffusion generative model with human ranking feedback.

Paper Content

Introduction

  • Ranking data is used in many online platforms and applications.
  • Ranking information is appealing to humans as it allows them to express their preferences.
  • Ranking-based approaches are more natural and straightforward for human evaluators.
  • This paper studies an optimization problem where the objective function can only be accessed via a ranking oracle.

Problem formulation

  • Optimization problem with objective function $f: \mathbb{R}^d \to \mathbb{R}$
  • Ranking oracle can sort every input based on values of f
  • Focus on particular family of ranking oracles where only sorted indexes of top elements are returned
  • (m, k)-ranking oracle: given a function f and m query points, returns the k points with the smallest values of f, sorted in order
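
The oracle described above can be simulated when f is known, which is handy for testing. The following is a minimal sketch, not the paper's code: the names `ranking_oracle` and `sphere` are hypothetical, and returning indices (rather than the points themselves) is an assumption made for convenience.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def ranking_oracle(f: Callable[[Vector], float],
                   points: List[Vector], k: int) -> List[int]:
    """Simulated (m, k)-ranking oracle (hypothetical helper).

    Returns the indices of the k points with the smallest f values,
    best first. The caller never sees the values of f themselves.
    """
    m = len(points)
    if not 1 <= k <= m:
        raise ValueError("k must satisfy 1 <= k <= m")
    order = sorted(range(m), key=lambda i: f(points[i]))
    return order[:k]


def sphere(x: Vector) -> float:
    # Toy objective used only for illustration.
    return sum(v * v for v in x)


queries = [[3.0, 0.0], [0.5, 0.5], [1.0, 1.0], [0.0, 0.1]]
top2 = ranking_oracle(sphere, queries, k=2)
print(top2)  # [3, 1]
```

With m = 4 and k = 2, only the two best indices and their relative order are revealed; the other two points are known only to be worse than both.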

Applications

  • The optimization problem above with an (m, k)-ranking oracle is common in real-world applications
  • Example of this type of problem is found in Reinforcement Learning with Human Feedback
  • Proposed application uses human feedback to enhance quality of images generated by Stable Diffusion
  • Ranking oracles can be useful in cases where information must remain private
  • Obtaining rank data may be cheaper and easier than obtaining exact function values
  • Zeroth-order optimization has been studied for decades
  • Most existing works assume objective function value is directly accessible
  • Heuristic algorithms exist that rely solely on ranking information
  • Recent study investigates zeroth-order optimization via comparison oracle
  • Approach limited to convex objective functions
  • Our work considers a more general (m, k)-ranking oracle and extends scope to non-convex functions
  • Our approach does not rely on compressive sensing techniques
  • Relationship to RLHF: existing RLHF requires a large amount of ranking data, and interaction with humans occurs only once
  • The proposed algorithm allows ranking data to be collected in an online fashion, so the model can be improved continuously without pre-collected data

Contributions in this work

  • Presents a rank-based zeroth-order optimization algorithm with a theoretical guarantee
  • Applicable to policy search in reinforcement learning and improving the quality of images generated by Stable Diffusion
  • Algorithm facilitates the real-time acquisition of ranking data and enhances the details of generated images

Notations and assumptions

  • Sign operator is defined as 1 if x is greater than or equal to 0, and -1 otherwise
  • Standard Gaussian distribution is denoted by $\mathcal{N}(0, I_d)$
  • Assumption 1 states that the objective function f is lower bounded by a value f*

Paper organization

  • Introduces a novel approach for estimating descent direction based on ranking information
  • Presents main algorithm, ZO-RankSGD, along with corresponding convergence analysis
  • Demonstrates effectiveness of ZO-RankSGD through experiments on synthetic and real-world data

A comparison-based estimator for descent direction

  • Estimate descent direction of objective function without solving compressive sensing problem
  • Estimator ĝ(x) is an effective estimator for descent direction
  • With a suitably chosen step size γ and constant µ, the expected value $\mathbb{E}_{\xi_1,\xi_2}[f(x - \gamma \hat{g}(x))]$ is strictly smaller than $f(x)$
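
The summary does not spell out the estimator, so the sketch below assumes a common comparison-based form: draw two Gaussian directions ξ1 and ξ2, compare f at the two perturbed points, and return sign(f(x + µξ1) − f(x + µξ2)) · (ξ1 − ξ2). The function names and the toy objective are hypothetical; only the outcome of the comparison is used, never the function values.

```python
import random
from typing import Callable, List


def sign(v: float) -> int:
    # Sign convention from the notation section: 1 if v >= 0, else -1.
    return 1 if v >= 0 else -1


def comparison_estimator(f: Callable[[List[float]], float],
                         x: List[float], mu: float,
                         rng: random.Random) -> List[float]:
    """Assumed form: sign(f(x + mu*xi1) - f(x + mu*xi2)) * (xi1 - xi2)."""
    d = len(x)
    xi1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    xi2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    f1 = f([xj + mu * a for xj, a in zip(x, xi1)])
    f2 = f([xj + mu * b for xj, b in zip(x, xi2)])
    # A real comparison oracle would return only this sign.
    s = sign(f1 - f2)
    return [s * (a - b) for a, b in zip(xi1, xi2)]


def sphere(x: List[float]) -> float:
    # Toy objective with known gradient 2x, for illustration only.
    return sum(v * v for v in x)


rng = random.Random(0)
x = [1.0, -2.0, 0.5]
true_grad = [2.0 * v for v in x]

# Average many estimates and check alignment with the true gradient.
n = 2000
avg = [0.0] * len(x)
for _ in range(n):
    g = comparison_estimator(sphere, x, 0.01, rng)
    avg = [a + gi / n for a, gi in zip(avg, g)]

alignment = sum(a * b for a, b in zip(avg, true_grad))
print(alignment > 0)
```

A positive inner product with the true gradient means that stepping along −ĝ(x) decreases f on average, which is the sense in which the estimator gives a descent direction.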

From ranking information to pairwise comparison

  • Ranking information can be translated into pairwise comparisons
  • Represent input and output of (m, k)-ranking oracles as a directed acyclic graph (DAG)
  • Rank-based gradient estimator proposed
  • Variance of the gradient estimator depends on the number of pairwise comparisons available
  • Graph topology of DAG determined by m and k
  • Number of edges and neighboring edge pairs related to m and k
  • Variance of estimator bounded by two metrics
  • First variance term vanishes as m → ∞
  • Second variance term vanishes when both k and m tend to infinity
  • Non-diminishing term M1(f, µ) smaller than 2d and can be bounded by a dimension-independent constant
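
The translation from a ranking to pairwise comparisons can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: an edge (i, j) is taken to mean "point i is ranked better than point j", every top-k point is assumed to beat every unranked point, and the estimator is assumed to average ξ_j − ξ_i over all edges. Both function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def ranking_to_edges(m: int, ranked: Sequence[int]) -> List[Tuple[int, int]]:
    """Edges (i, j) meaning point i has smaller f than point j.

    `ranked` is the output of an (m, k)-ranking oracle, best first.
    """
    top = list(ranked)
    in_top = set(top)
    rest = [i for i in range(m) if i not in in_top]
    edges = []
    for pos, i in enumerate(top):
        for j in top[pos + 1:]:   # ordered pairs inside the top k
            edges.append((i, j))
        for j in rest:            # each top-k point beats each unranked point
            edges.append((i, j))
    return edges


def rank_based_estimator(xis: List[List[float]],
                         edges: List[Tuple[int, int]]) -> List[float]:
    """Average of (xi_j - xi_i) over all edges (assumed form)."""
    d = len(xis[0])
    g = [0.0] * d
    for i, j in edges:
        for c in range(d):
            g[c] += (xis[j][c] - xis[i][c]) / len(edges)
    return g


# One-dimensional example with f(x) = x, so smaller xi means better rank.
xis = [[0.0], [1.0], [-1.0], [2.0]]
ranked = [2, 0]                      # (4, 2)-oracle output: indices 2 then 0
edges = ranking_to_edges(4, ranked)
g = rank_based_estimator(xis, edges)
print(len(edges))  # 5
```

Under these assumptions the graph has k(k − 1)/2 + k(m − k) edges, which is 5 for m = 4 and k = 2, so a single oracle call yields several comparisons at once.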

ZO-RankSGD: zeroth-order rank-based stochastic gradient descent

  • ZO-RankSGD is a proposed algorithm
  • Algorithm 1 outlines the pseudocode for ZO-RankSGD
  • Theorem 1 gives a convergence guarantee for Algorithm 1 after a given number of iterations
  • Corollary 1 states that m and k affect the convergence speed of Algorithm 1
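
Algorithm 1 is not reproduced in this summary, so the loop below is only a sketch of what a rank-based zeroth-order SGD iteration might look like: sample m Gaussian directions, query a simulated (m, k)-ranking oracle, average the implied pairwise differences, and take a step. The function names, the toy objective, and the parameter values are assumptions for illustration.

```python
import random
from typing import Callable, List


def ranking_oracle(f: Callable[[List[float]], float],
                   points: List[List[float]], k: int) -> List[int]:
    # Simulated oracle: indices of the k best points, best first.
    order = sorted(range(len(points)), key=lambda i: f(points[i]))
    return order[:k]


def zo_rank_sgd(f: Callable[[List[float]], float], x0: List[float],
                eta: float, mu: float, m: int, k: int,
                steps: int, seed: int = 0) -> List[float]:
    """Sketch of a rank-based zeroth-order SGD loop (requires m >= 2)."""
    rng = random.Random(seed)
    x = list(x0)
    d = len(x)
    for _ in range(steps):
        xis = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
        points = [[xj + mu * c for xj, c in zip(x, xi)] for xi in xis]
        top = ranking_oracle(f, points, k)
        in_top = set(top)
        rest = [i for i in range(m) if i not in in_top]
        g = [0.0] * d
        n_edges = 0
        for pos, i in enumerate(top):
            # Point i is better than later top-k points and all unranked ones.
            for j in top[pos + 1:] + rest:
                n_edges += 1
                for c in range(d):
                    g[c] += xis[j][c] - xis[i][c]
        x = [xj - eta * gc / n_edges for xj, gc in zip(x, g)]
    return x


def sphere(x: List[float]) -> float:
    # Toy objective for illustration only.
    return sum(v * v for v in x)


x0 = [2.0] * 5
x_final = zo_rank_sgd(sphere, x0, eta=0.05, mu=0.01, m=8, k=4, steps=300)
print(sphere(x_final) < sphere(x0))
```

Because the estimate is built from signs and Gaussian directions, its size does not shrink near a minimizer, so with a fixed step size the iterates settle in a neighborhood of the optimum rather than converging exactly.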

Line search via ranking oracle

  • The step size of Algorithm 1 can be difficult to tune
  • Algorithm 1 does not provide feedback on progress of objective function
  • Proposed line search method uses an (l, 1)-ranking oracle to determine the optimal step size
  • Line search technique can be applied to any gradient-based optimization algorithm
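
A line search driven by an (l, 1)-ranking oracle can be sketched as below. The candidate set is an assumption: the current point plus l − 1 trial points with geometrically shrinking step sizes η, ηγ, ηγ², and so on. Including the current point is a design choice of this sketch so that a step is never worse than staying put; the function names are hypothetical.

```python
from typing import Callable, List


def ranking_oracle(f: Callable[[List[float]], float],
                   points: List[List[float]], k: int) -> List[int]:
    # Simulated oracle: indices of the k best points, best first.
    order = sorted(range(len(points)), key=lambda i: f(points[i]))
    return order[:k]


def line_search_step(f: Callable[[List[float]], float], x: List[float],
                     g: List[float], eta: float, gamma: float,
                     l: int) -> List[float]:
    """Pick the best of l candidates using one (l, 1)-ranking query."""
    candidates = [list(x)]
    for i in range(l - 1):
        step = eta * gamma ** i
        candidates.append([xj - step * gj for xj, gj in zip(x, g)])
    best = ranking_oracle(f, candidates, k=1)[0]
    return candidates[best]


def sphere(x: List[float]) -> float:
    # Toy objective for illustration only.
    return sum(v * v for v in x)


x = [1.0, 0.0]
g = [1.0, 0.0]  # an ascent direction at x for the toy objective
x_next = line_search_step(sphere, x, g, eta=4.0, gamma=0.5, l=4)
print(x_next)  # [0.0, 0.0]
```

Here the trial steps 4, 2, and 1 land at −3, −1, and 0 in the first coordinate, and the oracle selects the last one without revealing any function value.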

Simple functions

  • Algorithm 1 is compared to existing algorithms
  • Experiments are conducted to demonstrate effectiveness of Algorithm 1
  • Results show that Algorithm 1 outperforms direct search and heuristic algorithms
  • Algorithm 1 performs similarly to ZO-SGD, indicating that ranking oracle is almost as informative as valuing oracle for zeroth-order optimization

Reinforcement learning with ranking oracle

  • Policy optimization problem in reinforcement learning with only a ranking oracle of the episode reward available
  • Existing RLHF approaches collect ranking feedback from humans for every single action
  • In contrast, ranking feedback in this setting is collected directly on the episode level
  • ZO-RankSGD is compared to the CMA-ES algorithm on simulated robot control problems from the MuJoCo suite of benchmarks
  • ZO-RankSGD outperforms CMA-ES by a significant margin
  • Draw inspiration from successes in aligning Language Models with human feedback
  • Task of text-to-image generation using Stable Diffusion model
  • Optimize initial latent embedding using human ranking feedback
  • Optimizing latent embedding requires fewer rounds of human feedback than fine-tuning entire model
  • The denoising diffusion process is treated as a continuous mapping
  • Compare generated images obtained by optimizing human preference with those obtained by optimizing the CLIP similarity score
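
For the policy-search setting described at the start of this section, episode-level ranking means the optimizer sees only which policies produced better episodes, not the rewards. The toy sketch below illustrates that interface. The one-dimensional environment, the linear policy a = −θs, and all function names are hypothetical and unrelated to the MuJoCo experiments.

```python
from typing import List


def episode_return(theta: float, s0: float = 1.0, horizon: int = 10) -> float:
    """Total reward of one episode in a toy 1-D system (hypothetical)."""
    s = s0
    total = 0.0
    for _ in range(horizon):
        total += -s * s          # reward penalizes distance from the origin
        s = s + (-theta * s)     # linear policy: action a = -theta * s
    return total


def episode_ranking_oracle(thetas: List[float], k: int) -> List[int]:
    """Indices of the k policies with the highest episode return, best first.

    Only this ranking is exposed; the returns themselves stay hidden.
    """
    order = sorted(range(len(thetas)), key=lambda i: -episode_return(thetas[i]))
    return order[:k]


thetas = [0.0, 0.5, 1.0, 1.8]
best_two = episode_ranking_oracle(thetas, k=2)
print(best_two)  # [2, 1]
```

Minimizing the negative episode return through such an oracle fits the (m, k)-ranking formulation, with policy parameters playing the role of x.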

Conclusion

  • Rigorously studied a novel optimization problem where only a ranking oracle of the objective function is available
  • Proposed ZO-RankSGD, the first provable zeroth-order optimization algorithm for this problem
  • Demonstrated efficacy across a wide range of applications
  • Presented how different ranking oracles can impact optimization performance
  • Used to improve the detail of images generated by Stable Diffusion with human guidance
  • More efficient alternative to existing Reinforcement Learning with Human Feedback (RLHF) methods
  • Future work: extending the algorithm to handle noise and uncertainty in the ranking feedback
  • Future work: combining ZO-RankSGD with a model-based approach like Bayesian Optimization
  • Future work: applying it to other scenarios beyond human feedback
  • Constructed DAG from the ranking information of $O_f^{(m,k)}$
  • Denoted input degrees and output degrees of $x_i \in N$
  • Used second-order Taylor expansion with Cauchy remainders
  • Defined two important regions $R_{11}$ and $R_{12}$
  • Defined an important function h(v, r, d)
  • Proved that a constant $C_d$ exists for any $d \in \mathbb{Z}_+$
  • Computed the mean vector and the integrals
  • Used L-smoothness to bound the four terms in (88)
  • Used convexity of $\|\cdot\|^2$ and Jensen's inequality to bound $M_1(f, \mu)$
  • Used Cauchy-Schwarz inequality to bound $M_2(f, \mu)$
  • Used the well-known fact that $\int_{x>0} x\,p(x)\,dx = \frac{1}{\sqrt{2\pi}}$
  • Used a rotation matrix $R \in \mathbb{R}^{2d \times 2d}$
  • Defined the function q(u, d)
  • Proved important properties of q(u, d)
  • Modified ZO-RankSGD algorithm for optimizing latent embeddings of Stable Diffusion
  • Designed user interface for collecting human feedback
  • Evaluated latent embeddings by passing them to the DPM-solver with Stable Diffusion
  • Set the number of rounds for human feedback between 10 and 20
  • Fixed the number of querying rounds to 50
  • Used the same parameters for Algorithm 3: η = 1, µ = 0.1, and γ = 0.5
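
One of the proof ingredients listed above, the half-line first moment of a Gaussian density, can be checked directly. The derivation below assumes that p denotes the standard normal density, which the summary does not state explicitly.

```latex
\int_{0}^{\infty} x\,p(x)\,dx
  = \int_{0}^{\infty} \frac{x}{\sqrt{2\pi}}\, e^{-x^{2}/2}\,dx
  = \frac{1}{\sqrt{2\pi}} \Bigl[ -e^{-x^{2}/2} \Bigr]_{0}^{\infty}
  = \frac{1}{\sqrt{2\pi}}
```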