Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Focuses on a novel optimization problem where the objective function can only be evaluated through a ranking oracle.
- RLHF is an example of this problem, used to improve the quality of LLMs with human guidance.
- Proposes ZO-RankSGD, a zeroth-order optimization algorithm with a theoretical guarantee.
- Can be applied to policy search problems in reinforcement learning with only a ranking oracle of the episode reward.
- Demonstrates effectiveness of ZO-RankSGD in improving the quality of images generated by a diffusion generative model with human ranking feedback.
Paper Content
Introduction
- Ranking data is used in many online platforms and applications.
- Ranking information is appealing to humans as it allows them to express their preferences.
- Ranking-based approaches are more natural and straightforward for human evaluators.
- This paper studies an optimization problem where the objective function can only be accessed via a ranking oracle.
Problem formulation
- Optimization problem with objective function $f: \mathbb{R}^d \to \mathbb{R}$
- A ranking oracle sorts the query points by their values of f
- Focus on a particular family of ranking oracles that return only the sorted indices of the top elements
- (m, k)-ranking oracle: given a function f and m query points, returns the indices of the k points with the smallest values of f, sorted in increasing order of f
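The (m, k)-ranking oracle described above can be illustrated with a minimal Python sketch. The function name `ranking_oracle` and the test function are hypothetical and not taken from the paper; the sketch only assumes that the oracle exposes sorted indices and never the values of f.

```python
import numpy as np

def ranking_oracle(f, points, k):
    """Return the indices of the k points with the smallest f-values, best first.

    Only the sorted indices leave this function; the values of f stay hidden.
    (Hypothetical helper, written for illustration.)
    """
    values = np.array([f(p) for p in points])
    order = np.argsort(values, kind="stable")  # ascending: smallest f first
    return order[:k].tolist()

# Usage: m = 4 query points, k = 2 returned indices.
f = lambda x: float(np.sum(x ** 2))
points = [np.array([3.0, 0.0]), np.array([0.5, 0.5]),
          np.array([1.0, 1.0]), np.array([0.0, 0.1])]
top2 = ranking_oracle(f, points, k=2)
```

In a human-feedback setting the call to `f` would be replaced by a person ordering the candidates, which is why the optimizer may only consume the returned indices.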
Applications
- Optimization problem (1) with an (m, k)-ranking oracle is common in real-world applications
- Example of this type of problem is found in Reinforcement Learning with Human Feedback
- Proposed application uses human feedback to enhance quality of images generated by Stable Diffusion
- Ranking oracles can be useful in cases where information must remain private
- Obtaining rank data may be cheaper and easier than obtaining exact function values
Related works
- Zeroth-order optimization has been studied for decades
- Most existing works assume objective function value is directly accessible
- Heuristic algorithms exist that rely solely on ranking information
- Recent study investigates zeroth-order optimization via comparison oracle
- Approach limited to convex objective functions
- Our work considers a more general (m, k)-ranking oracle and extends scope to non-convex functions
- Our approach does not rely on compressive sensing techniques
- Relationship to RLHF: existing RLHF pipelines require a large amount of pre-collected ranking data, and interaction with humans occurs only once
- Our proposed algorithm allows for collection of ranking data in an online fashion, can continuously improve model without pre-collected data
Contributions in this work
- Presents a rank-based zeroth-order optimization algorithm with a theoretical guarantee
- Applicable to policy search in reinforcement learning and improving the quality of images generated by Stable Diffusion
- Algorithm facilitates the real-time acquisition of ranking data and enhances the details of generated images
Notations and assumptions
- Sign operator is defined as 1 if x is greater than or equal to 0, and -1 otherwise
- Standard Gaussian distribution is denoted by $\mathcal{N}(0, I_d)$
- Assumption 1 states that the objective function f is lower bounded by a value f*
Paper organization
- Introduces a novel approach for estimating descent direction based on ranking information
- Presents main algorithm, ZO-RankSGD, along with corresponding convergence analysis
- Demonstrates effectiveness of ZO-RankSGD through experiments on synthetic and real-world data
A comparison-based estimator for descent direction
- Estimate descent direction of objective function without solving compressive sensing problem
- Estimator ĝ(x) is an effective estimator for descent direction
- With a suitable step size γ and constant µ, the expected value $\mathbb{E}_{\xi_1,\xi_2}[f(x - \gamma \hat{g}(x))]$ can be made strictly smaller than $f(x)$
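A comparison-based direction estimator of this kind can be sketched as follows. The exact form used here, the sign of a single comparison multiplied by the difference of two Gaussian perturbations, is an assumption made for illustration, as are the name `comparison_direction` and the quadratic test function.

```python
import numpy as np

def comparison_direction(f, x, mu, rng):
    """One-comparison estimate of an ascent direction (assumed form).

    Draw two Gaussian perturbations, compare f at the two perturbed points,
    and return the difference oriented from the better point to the worse one.
    """
    xi1 = rng.standard_normal(x.shape)
    xi2 = rng.standard_normal(x.shape)
    # Only the outcome of the comparison is used, never the function values.
    s = 1.0 if f(x + mu * xi1) - f(x + mu * xi2) >= 0 else -1.0
    return s * (xi1 - xi2)

rng = np.random.default_rng(0)
f = lambda x: float(np.sum(x ** 2))
x = np.array([1.0, -2.0, 0.5])
# Averaging many estimates should align with the true gradient 2x.
g_mean = np.mean([comparison_direction(f, x, 0.01, rng) for _ in range(2000)], axis=0)
```

Because the estimate points uphill on average, stepping along its negative with a small step size is expected to decrease f.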
From ranking information to pairwise comparison
- Ranking information can be translated into pairwise comparisons
- Represent input and output of (m, k)-ranking oracles as a directed acyclic graph (DAG)
- Rank-based gradient estimator proposed
- The number of pairwise comparisons determines the variance of the gradient estimator
- Graph topology of DAG determined by m and k
- Number of edges and neighboring edge pairs related to m and k
- Variance of estimator bounded by two metrics
- First variance term vanishes as m → ∞
- Second variance term vanishes when both k and m tend to infinity
- Non-diminishing term $M_1(f, \mu)$ is smaller than $2d$ and can be bounded by a dimension-independent constant
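The translation from a ranking to pairwise comparisons can be sketched in Python. The construction below is an assumption: each top-k point gets an edge to every point ranked after it and to every point left out of the top k. The names `ranking_to_edges` and `rank_based_direction` are hypothetical. Under this construction the number of edges is k(k − 1)/2 + k(m − k), so it grows with both m and k.

```python
import numpy as np

def ranking_to_edges(ranked, m):
    """Turn a top-k ranking (indices, best first) into edges (better, worse)."""
    ranked_set = set(ranked)
    unranked = [i for i in range(m) if i not in ranked_set]
    edges = []
    for pos, i in enumerate(ranked):
        # i beats every later-ranked point and every point outside the top k.
        edges += [(i, j) for j in ranked[pos + 1:]]
        edges += [(i, j) for j in unranked]
    return edges

def rank_based_direction(ranked, xis):
    """Average the perturbation differences over all edges of the DAG."""
    edges = ranking_to_edges(ranked, len(xis))
    # Each edge contributes xi_worse - xi_better, which points uphill.
    return sum(xis[j] - xis[i] for i, j in edges) / len(edges)

# Usage: m = 5 points, the oracle reports that point 3 is best and point 1 second.
edges = ranking_to_edges([3, 1], m=5)
xis = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
g = rank_based_direction([0], xis)
```

With m = 5 and k = 2 the ranking yields 1 + 6 = 7 comparisons, compared with a single comparison for a pairwise oracle.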
ZO-RankSGD: zeroth-order rank-based stochastic gradient descent
- ZO-RankSGD is a proposed algorithm
- Algorithm 1 outlines the pseudocode for ZO-RankSGD
- Theorem 1 gives a convergence guarantee for Algorithm 1 in terms of the number of iterations
- Corollary 1 states that m and k affect the convergence speed of Algorithm 1
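A loop in the spirit of ZO-RankSGD can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's Algorithm 1: the function name `zo_rank_sgd`, the parameter names `eta` (step size) and `mu` (perturbation scale), their default values, and the edge construction are all chosen here for the example.

```python
import numpy as np

def zo_rank_sgd(f, x0, steps, m=8, k=4, mu=0.1, eta=0.1, seed=0):
    """Rank-based zeroth-order descent sketch (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        xis = rng.standard_normal((m, x.size))
        # Ranking oracle: only the ordering of the candidates is used below.
        order = np.argsort([f(x + mu * xi) for xi in xis])
        ranked, unranked = order[:k], order[k:]
        g = np.zeros_like(x)
        n_edges = 0
        for pos, i in enumerate(ranked):
            # Points ranked after i, plus all points outside the top k.
            worse = np.concatenate([ranked[pos + 1:], unranked])
            g += np.sum(xis[worse] - xis[i], axis=0)
            n_edges += len(worse)
        # Step against the averaged "worse minus better" direction.
        x -= eta * g / n_edges
    return x

f = lambda x: float(np.sum(x ** 2))
x_start = np.array([2.0, -1.5, 1.0])
x_final = zo_rank_sgd(f, x_start, steps=300, eta=0.05)
```

Larger m and k give more comparisons per round and therefore a less noisy direction, at the price of asking the oracle to rank more points.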
Line search via ranking oracle
- The step size of Algorithm 1 can be difficult to tune
- Algorithm 1 does not provide feedback on the progress of the objective function
- Proposed line search method uses an (l, 1)-ranking oracle to determine the optimal step size
- Line search technique can be applied to any gradient-based optimization algorithm
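One way such a line search could look is sketched below. The geometric grid of candidate step sizes, the shrink factor `gamma`, and the name `line_search_step` are assumptions made for illustration; the only property taken from the summary is that an (l, 1)-ranking oracle reveals which of l candidates is best.

```python
import numpy as np

def line_search_step(f, x, g, eta0=1.0, gamma=0.5, l=5):
    """Pick a step size among l candidates using only a best-of-l query."""
    etas = [eta0 * gamma ** i for i in range(l)]
    candidates = [x - eta * g for eta in etas]
    # (l, 1)-ranking oracle: only the index of the best candidate is revealed.
    best = int(np.argmin([f(c) for c in candidates]))
    return candidates[best], etas[best]

f = lambda x: float(np.sum(x ** 2))
x = np.array([1.0, 1.0])
g = np.array([4.0, 4.0])  # a descent direction estimate with a poor scale
x_new, eta = line_search_step(f, x, g)
```

In this example the full step overshoots the minimizer, and the oracle's choice of the third candidate recovers a step size that lands on it.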
Simple functions
- Algorithm 1 is compared to existing algorithms
- Experiments are conducted to demonstrate effectiveness of Algorithm 1
- Results show that Algorithm 1 outperforms direct search and heuristic algorithms
- Algorithm 1 performs similarly to ZO-SGD, indicating that ranking oracle is almost as informative as valuing oracle for zeroth-order optimization
Reinforcement learning with ranking oracle
- Policy optimization problem in reinforcement learning with only a ranking oracle of the episode reward available
- Existing RLHF approaches collect ranking feedback from humans for every single action
- In contrast, ranking feedback in this setting is collected directly on the episode level
- ZO-RankSGD compared to CMA-ES algorithm on simulated robot control with several problems from the MuJoCo suite of benchmarks
- ZO-RankSGD outperforms CMA-ES by a significant margin
- Draw inspiration from successes in aligning Language Models with human feedback
- Task of text-to-image generation using Stable Diffusion model
- Optimize initial latent embedding using human ranking feedback
- Optimizing latent embedding requires fewer rounds of human feedback than fine-tuning entire model
- Denoising diffusion process as a continuous mapping
- Compare generated images obtained by optimizing human preference with those obtained by optimizing the CLIP similarity score
Conclusion
- Rigorously studied a novel optimization problem where only a ranking oracle of the objective function is available
- Proposed ZO-RankSGD, the first provable zeroth-order optimization algorithm for this setting
- Demonstrated efficacy across a wide range of applications
- Presented how different ranking oracles can impact optimization performance
- Used to improve the detail of images generated by Stable Diffusion with human guidance
- More efficient alternative to existing Reinforcement Learning with Human Feedback (RLHF) methods
- Future work: extending the algorithm to handle noise and uncertainty in the ranking feedback
- Future work: combining ZO-RankSGD with a model-based approach like Bayesian Optimization
- Future work: applying it to other scenarios beyond human feedback
- Constructed DAG from the ranking information of $O_f^{(m,k)}$
- Denoted input degrees and output degrees of $x_i \in \mathcal{N}$
- Used second-order Taylor expansion with Cauchy remainders
- Defined two important regions $R_{11}$ and $R_{12}$
- Defined an important function $h(v, r, d)$
- Proved a constant $C_d$ exists for any $d \in \mathbb{Z}_+$
- Computed the mean vector and the integrals
- Used L-smoothness to bound the four terms in (88)
- Used convexity of $\|\cdot\|^2$ and Jensen's inequality to bound $M_1(f, \mu)$
- Used Cauchy-Schwarz inequality to bound $M_2(f, \mu)$
- Used the well-known fact that $\int_{x>0} x\,p(x)\,dx = \frac{1}{\sqrt{2\pi}}$, where $p$ is the standard Gaussian density
- Used a rotation matrix $R \in \mathbb{R}^{2d \times 2d}$
- Defined the function $q(u, d)$
- Proved important properties of $q(u, d)$
- Modified ZO-RankSGD algorithm for optimizing latent embeddings of Stable Diffusion
- Designed user interface for collecting human feedback
- Evaluated latent embeddings by passing them to the DPM-solver with Stable Diffusion
- Set the number of rounds for human feedback between 10 and 20
- Fixed the number of querying rounds to 50
- Used the same parameters for Algorithm 3: η = 1, µ = 0.1, and γ = 0.5