Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Offline reinforcement learning can be used to obtain a policy initialization from existing datasets.
- Existing offline RL methods tend to have poor online fine-tuning performance.
- Online RL methods struggle to incorporate offline data.
- Calibrated Q-learning (Cal-QL) provides a lower bound on the true value function of the learned policy and an upper bound on the value of some other (suboptimal) reference policy.
- Cal-QL outperforms state-of-the-art methods on 10/11 fine-tuning benchmark tasks.
Paper Content
Introduction
- Online RL ne-tuning is a recipe for machine learning success.
- Prior o ine RL methods tend to have slow performance improvement or initial performance degradation.
- Calibrated Q-learning (Cal-QL) is a method for acquiring an o ine initialization that facilitates online ne-tuning.
- Cal-QL aims to learn conservative value functions that are calibrated with respect to a reference policy.
- Cal-QL has stronger guarantees on cumulative regret during ne-tuning.
- Cal-QL can be implemented on top of conservative Q-learning without additional hyperparameters.
- Cal-QL is evaluated across a range of benchmark tasks.
- Cal-QL matches or outperforms the best methods on all tasks.
Related work
- Online RL methods require a large number of samples to learn from scratch
- Incorporating offline data into online RL can improve sample efficiency
- Prior methods do not eliminate the need to actively roll out poor policies for data collection
- Offline RL can be used to learn a good policy and value initialization
- Offline RL typically uses policy constraints or pessimism
- Using pessimism or constraints for fine-tuning slows down fine-tuning or leads to initial unlearning
- Cal-QL aims to learn a better offline initialization that enables standard online fine-tuning
- Cal-QL netunes without ensembles or exploration bonuses
Preliminaries and background
- Goal of RL is to learn optimal policy for MDP
- MDP consists of state and action spaces, dynamics and reward functions, and initial state distribution
- Discount factor is between 0 and 1
- Goal is to learn policy that maximizes cumulative discounted value function
- CQL algorithm imposes regularizer to prevent overestimation of Q-values for out-of-distribution actions
When can o line rl initializations enable fast online fine-tuning?
- Initializing the value function with an existing online RL method can lead to poor performance in online fine-tuning.
- Analyzing the limitations of explicit policy constraint methods for fine-tuning.
- Studying an implicit policy constraint method and a conservative method for fine-tuning a robot policy on a visual pick-and-place task.
Empirical analysis
- Two methods of online netuning are presented in Figure 2
- Neither of the methods performs well during netuning
- CQL unlearns the offline initialization and takes 20K steps to recover
- IQL’s slow speed of netuning is investigated in Appendix E
- Q-values learned by CQL in the offline phase are underestimated
- Q-values drastically jump and adjust in scale when online netuning begins
- Performance recovery coincides with a period where the range of Q-values changes to match the true range
- Conservative methods can attain good asymptotic performance, but “waste” samples to correct the learned Q-function
Conditions on o line initializations that enable fast fine-tuning
- Motivates two conditions on Q-function initialization for fast ne-tuning
- Conservative Q-functions can attain good asymptotic performance
- Q-function must be calibrated to prevent unlearning during ne-tuning
- Cal-QL enforces calibration with respect to policies whose Q-value can be estimated reliably
- Cal-QL corrects the scale of the learned Q-values
- CQL training objective is modified to mask out the push down of the learned Q-value on OOD actions
- Cal-QL trains on a mixture of offline and online data
- Analysis of Cal-QL shows favorable regret bound during online phase
- Operating in finite-horizon setting with horizon
- Concentrability coefficient quantifies the distribution shift between policy and dataset
- Bellman bilinear rank of Q-function class is ≤
Intuition
- Cal-QL can attain a smaller regret compared to not imposing either calibration or conservatism.
- Regret can be decomposed into two terms: (i) difference between ground-truth value of optimal policy and learned Q-function, and (ii) amount of over-estimation in the learned value function.
- To control regret, a balance must be struck between learning a calibrated and conservative Q-function.
Theorem statement
- Cal-QL obtains a bound on total regret accumulated during online ne-tuning
- Song et al. (2023) presents a regret bound
- Cal-QL can enable a tighter regret guarantee compared to Song et al. (2023)
- Cal-QL’s concentrability coefficient is no larger than the one that appears in Theorem 1 of Song et al. (2023)
- In the worst possible case, Cal-QL reverts back to the guarantee from Song et al. (2023)
Experimental evaluation
- Goal of experiment is to study how well Cal-QL facilitates sample-efficient online fine-tuning
- Performance of Cal-QL compared to other state-of-the-art fine-tuning methods on a variety of offline RL benchmark tasks
- Performance evaluated before and after online fine-tuning
- Effectiveness of Cal-QL on higher-dimensional tasks studied
- Empirical studies to understand efficacy of Cal-QL with different dataset compositions and impact of errors in reference function value estimation
- Cal-QL compared to running online SAC, CQL, IQL, O3F, and SAC + offline data
- Learning curves for online fine-tuning presented
- Quantitatively evaluate each method on its ability to improve initialization learned from offline data
- Compare Cal-QL to RLPD, a more sample-efficient version of “SAC + offline data”
Empirical results
- Cal-QL consistently achieves a smaller regret than other methods
- Cal-QL brings the benefits of fast online learning together with a good offline initialization
- Cal-QL improves drastically on the regret metric
- Cal-QL improves over the best prior ne-tuning method and attains a much larger performance improvement over the course of online ne-tuning
- Cal-QL can achieve higher sample efficiency by using a higher UTD ratio
- Cal-QL generally attains similar or higher asymptotic performance as RLPD
- Cal-QL stably transitions to online ne-tuning with no unlearning
- Utilizing Cal-QL to calibrate the Q-function against the behavior policy can be significantly helpful
- Cal-QL performance largely remains unaltered when reference value functions must be estimated using the offline dataset itself
- Errors in the reference Q-function do not affect the performance of Cal-QL significantly
Discussion
- Cal-QL is a method for acquiring initializations that facilitate fast online fine-tuning
- Cal-QL learns conservative value functions and is larger than the value function of a reference policy
- Cal-QL avoids initial unlearning in online fine-tuning with conservative methods
- Cal-QL enables fast online fine-tuning and outperforms prior methods
- Future work could adjust calibration and conservatism or extend Cal-QL to real-world problems
- Policy fine-tuning has been studied in different settings
- Cal-QL uses a pessimistic functional class to utilize offline data efficiently
- Cal-QL uses a Bellman operator and a weighted 2 norm
- Cal-QL uses a Bilinear model and a matrix norm
C. environment details
- Antmaze navigation tasks involve controlling an 8-DoF ant quadruped robot to move from a starting point to a goal in a maze.
- Rewards are given depending on whether the goal is reached or not.
- The maze is divided into “medium” and “hard” sections.
- Kitchen tasks involve controlling a 9-DoF Franka robot to arrange a kitchen environment into a desired configuration.
- Rewards are given depending on how many subtasks are solved.
- Adroit domain involves controlling a 24-DoF shadow hand robot with 3 tasks.
- Dataset is narrow and action space is large.
- Agent is pre-trained for 1M steps for Antmaze, 500K steps for Kitchen, and 20K steps for Adroit.
D. experiment details
D.1. normalized scores for all tasks
- Visual-manipulation, adroit, and antmaze domains are goal-oriented, sparse reward tasks.
- Normalized metric is goal achieved rate for each method.
- Visual manipulation environment: +1 reward if object placed in bin.
- Adroit tasks: success rate of opening door.
- Kitchen task: normalized score is #tasks solved total tasks.
D.2. mixing ratio hyperparameter
- Mixing ratio parameter is used during online tuning phase
- Mixing ratio is either a value in range [0, 1] or -1
- If mixing ratio is in range [0, 1], it represents percentage of offline and online data seen in each batch
- Hyperparameters for CQL and Cal-QL ablated over values of offline, online, and mixing ratio
- Variant of Bellman backup used in visual pick and place domain with = 4, and = 10 in other domains
- Dual version of CQL used in Antmaze domain with threshold of CQL regularizer ( )
- IQL parameters chosen from authors’ work and additional parameters swept over
D.5. hyperparameters for sac, sac + o line data
- Used standard hyperparameters from original SAC implementation
- Used same hyperparameters as CQL and Cal-QL
- Used automatic entropy tuning for policy and critic entropy terms
- Target entropy of negative action dimension
E. extended discussion on limitations of existing fine-tuning methods
- Aim to highlight potential reasons behind slow improvement of IQL
- Temperature values for IQL have little to no effect on sample efficiency
- Investigated IQL with more gradient steps and aggressive policy update
- Assumptions B.1 and B.2
- De nitions B.3 and B.4
- Theorem B.5
- Key Lemmas G.3.1 and G.3.2
- Online Suboptimality via Performance Di erence Lemma
- Lemma G.8 and G.9
- Cal-QL improves offline initialization significantly
- Cal-QL attains smallest regret in aggregate
- CQL and Cal-QL hyperparameters
- IQL hyperparameters