Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Offline reinforcement learning can be used to obtain a policy initialization from existing datasets.
  • Existing offline RL methods tend to have poor online fine-tuning performance.
  • Online RL methods struggle to incorporate offline data.
  • Calibrated Q-learning (Cal-QL) provides a lower bound on the true value function of the learned policy and an upper bound on the value of some other (suboptimal) reference policy.
  • Cal-QL outperforms state-of-the-art methods on 10/11 fine-tuning benchmark tasks.

Paper Content


  • Pre-training followed by online RL fine-tuning follows the common recipe behind machine learning successes.
  • Prior offline RL methods tend to have slow performance improvement or initial performance degradation.
  • Calibrated Q-learning (Cal-QL) is a method for acquiring an offline initialization that facilitates online fine-tuning.
  • Cal-QL aims to learn conservative value functions that are calibrated with respect to a reference policy.
  • Cal-QL has stronger guarantees on cumulative regret during fine-tuning.
  • Cal-QL can be implemented on top of conservative Q-learning without additional hyperparameters.
  • Cal-QL is evaluated across a range of benchmark tasks.
  • Cal-QL matches or outperforms the best methods on all tasks.
  • Online RL methods require a large number of samples to learn from scratch
  • Incorporating offline data into online RL can improve sample efficiency
  • Prior methods do not eliminate the need to actively roll out poor policies for data collection
  • Offline RL can be used to learn a good policy and value initialization
  • Offline RL typically uses policy constraints or pessimism
  • Using pessimism or constraints for fine-tuning slows down fine-tuning or leads to initial unlearning
  • Cal-QL aims to learn a better offline initialization that enables standard online fine-tuning
  • Cal-QL fine-tunes without ensembles or exploration bonuses

Preliminaries and background

  • Goal of RL is to learn optimal policy for MDP
  • MDP consists of state and action spaces, dynamics and reward functions, and initial state distribution
  • Discount factor is between 0 and 1
  • Goal is to learn policy that maximizes the expected cumulative discounted reward (the value function)
  • CQL algorithm imposes regularizer to prevent overestimation of Q-values for out-of-distribution actions
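For reference, the two objectives mentioned above can be written out as a sketch. The notation is an assumption, not taken from this summary: rewards r, discount γ, dataset D, Q-function Q_θ with target network Q̄, Bellman backup B^π, and regularizer weight α. The first term of the CQL objective pushes down Q-values on actions sampled from the learned policy, the second pushes up Q-values on dataset actions, and the last is the usual Bellman error.

```latex
% Discounted return objective (assumed notation)
J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]

% CQL training objective (sketch): conservative regularizer plus Bellman error
\min_{\theta}\;
\alpha \Big( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\big[Q_\theta(s,a)\big]
           - \mathbb{E}_{(s,a) \sim \mathcal{D}}\big[Q_\theta(s,a)\big] \Big)
+ \tfrac{1}{2}\, \mathbb{E}_{(s,a,s') \sim \mathcal{D}}
  \Big[\big(Q_\theta(s,a) - \mathcal{B}^{\pi}\bar{Q}(s,a)\big)^{2}\Big]
```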

When can offline RL initializations enable fast online fine-tuning?

  • Initializing the value function with one learned by an existing offline RL method can lead to poor performance in online fine-tuning.
  • Analyzing the limitations of explicit policy constraint methods for fine-tuning.
  • Studying an implicit policy constraint method and a conservative method for fine-tuning a robot policy on a visual pick-and-place task.

Empirical analysis

  • Two methods of online fine-tuning are presented in Figure 2
  • Neither of the methods performs well during fine-tuning
  • CQL unlearns the offline initialization and takes 20K steps to recover
  • IQL’s slow speed of fine-tuning is investigated in Appendix E
  • Q-values learned by CQL in the offline phase are underestimated
  • Q-values drastically jump and adjust in scale when online fine-tuning begins
  • Performance recovery coincides with a period where the range of Q-values changes to match the true range
  • Conservative methods can attain good asymptotic performance, but “waste” samples to correct the learned Q-function

Conditions on offline initializations that enable fast fine-tuning

  • Motivates two conditions on Q-function initialization for fast fine-tuning
  • Conservative Q-functions can attain good asymptotic performance
  • Q-function must be calibrated to prevent unlearning during fine-tuning
  • Cal-QL enforces calibration with respect to policies whose Q-value can be estimated reliably
  • Cal-QL corrects the scale of the learned Q-values
  • CQL training objective is modified to mask out the push down of the learned Q-value on OOD actions whenever the learned Q-value falls below the reference value
  • Cal-QL trains on a mixture of offline and online data
  • Analysis of Cal-QL shows favorable regret bound during online phase
  • Analysis operates in a finite-horizon setting with a fixed horizon
  • Concentrability coefficient quantifies the distribution shift between policy and dataset
  • Bellman bilinear rank of the Q-function class is assumed to be bounded
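The masking of the push-down term described above can be illustrated with a minimal sketch. It assumes the calibrated regularizer replaces each policy-action Q-value with the larger of that Q-value and a reference value for the same state; the function names, the per-sample list inputs, and the weight `alpha` are hypothetical and are not taken from the authors' code.

```python
def mean(xs):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(xs) / len(xs)


def cql_regularizer(q_policy, q_data, alpha):
    """Plain conservative regularizer: push down Q on policy actions,
    push up Q on dataset actions."""
    return alpha * (mean(q_policy) - mean(q_data))


def cal_ql_regularizer(q_policy, q_data, v_ref, alpha):
    """Calibrated variant (sketch): wherever Q on a policy action is already
    below the reference value v_ref for that state, the max() returns v_ref,
    so that sample no longer contributes a push-down on Q."""
    clipped = [max(q, v) for q, v in zip(q_policy, v_ref)]
    return alpha * (mean(clipped) - mean(q_data))


# Toy batch: the first policy-action Q-value sits below the reference value.
q_policy = [-5.0, 2.0]
q_data = [0.0, 0.0]
v_ref = [1.0, 1.0]
print(cql_regularizer(q_policy, q_data, alpha=1.0))
print(cal_ql_regularizer(q_policy, q_data, v_ref, alpha=1.0))
```

In the toy batch, lowering the first Q-value further changes the plain regularizer but leaves the calibrated one unchanged, which is the sense in which the push-down is masked once the Q-value is below the reference value.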


  • Cal-QL can attain a smaller regret compared to not imposing either calibration or conservatism.
  • Regret can be decomposed into two terms: (i) difference between ground-truth value of optimal policy and learned Q-function, and (ii) amount of over-estimation in the learned value function.
  • To control regret, a balance must be struck between learning a calibrated and conservative Q-function.
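The decomposition in the second bullet can be sketched by adding and subtracting the learned Q-value at the initial state. The notation is assumed rather than taken from this summary: K online rounds, initial-state distribution ρ, optimal value V*, the policy π^k and learned Q-function Q^k at round k.

```latex
\mathrm{Reg}(K)
= \mathbb{E}_{s_0 \sim \rho}\Big[\sum_{k=1}^{K} \big(V^{\star}(s_0) - V^{\pi^{k}}(s_0)\big)\Big]
= \underbrace{\mathbb{E}_{s_0 \sim \rho}\Big[\sum_{k=1}^{K}
    \big(V^{\star}(s_0) - \max_{a} Q^{k}(s_0, a)\big)\Big]}_{\text{(i)}}
+ \underbrace{\mathbb{E}_{s_0 \sim \rho}\Big[\sum_{k=1}^{K}
    \big(\max_{a} Q^{k}(s_0, a) - V^{\pi^{k}}(s_0)\big)\Big]}_{\text{(ii)}}
```

Term (i) grows when the learned Q-function sits far below the optimal value, and term (ii) grows when it over-estimates the value of the policy it induces, which is why both calibration and conservatism matter.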

Theorem statement

  • Cal-QL obtains a bound on total regret accumulated during online fine-tuning
  • Song et al. (2023) presents a regret bound
  • Cal-QL can enable a tighter regret guarantee compared to Song et al. (2023)
  • Cal-QL’s concentrability coefficient is no larger than the one that appears in Theorem 1 of Song et al. (2023)
  • In the worst possible case, Cal-QL reverts back to the guarantee from Song et al. (2023)

Experimental evaluation

  • Goal of experiment is to study how well Cal-QL facilitates sample-efficient online fine-tuning
  • Performance of Cal-QL compared to other state-of-the-art fine-tuning methods on a variety of offline RL benchmark tasks
  • Performance evaluated before and after online fine-tuning
  • Effectiveness of Cal-QL on higher-dimensional tasks studied
  • Empirical studies to understand efficacy of Cal-QL with different dataset compositions and impact of errors in reference function value estimation
  • Cal-QL compared to running online SAC, CQL, IQL, O3F, and SAC + offline data
  • Learning curves for online fine-tuning presented
  • Quantitatively evaluate each method on its ability to improve initialization learned from offline data
  • Compare Cal-QL to RLPD, a more sample-efficient version of “SAC + offline data”

Empirical results

  • Cal-QL consistently achieves a smaller regret than other methods
  • Cal-QL brings the benefits of fast online learning together with a good offline initialization
  • Cal-QL improves drastically on the regret metric
  • Cal-QL improves over the best prior fine-tuning method and attains a much larger performance improvement over the course of online fine-tuning
  • Cal-QL can achieve higher sample efficiency by using a higher UTD ratio
  • Cal-QL generally attains asymptotic performance similar to or higher than RLPD
  • Cal-QL stably transitions to online fine-tuning with no unlearning
  • Utilizing Cal-QL to calibrate the Q-function against the behavior policy can be significantly helpful
  • Cal-QL performance largely remains unaltered when reference value functions must be estimated using the offline dataset itself
  • Errors in the reference Q-function do not affect the performance of Cal-QL significantly


  • Cal-QL is a method for acquiring initializations that facilitate fast online fine-tuning
  • Cal-QL learns conservative value functions that are also constrained to be larger than the value function of a reference policy
  • Cal-QL avoids initial unlearning in online fine-tuning with conservative methods
  • Cal-QL enables fast online fine-tuning and outperforms prior methods
  • Future work could adjust calibration and conservatism or extend Cal-QL to real-world problems
  • Policy fine-tuning has been studied in different settings
  • Cal-QL uses a pessimistic functional class to utilize offline data efficiently
  • Cal-QL uses a Bellman operator and a weighted ℓ2 norm
  • Cal-QL uses a Bilinear model and a matrix norm

C. environment details

  • Antmaze navigation tasks involve controlling an 8-DoF ant quadruped robot to move from a starting point to a goal in a maze.
  • Rewards are given depending on whether the goal is reached or not.
  • The maze is divided into “medium” and “hard” sections.
  • Kitchen tasks involve controlling a 9-DoF Franka robot to arrange a kitchen environment into a desired configuration.
  • Rewards are given depending on how many subtasks are solved.
  • Adroit domain involves controlling a 24-DoF shadow hand robot with 3 tasks.
  • Dataset is narrow and action space is large.
  • Agent is pre-trained for 1M steps for Antmaze, 500K steps for Kitchen, and 20K steps for Adroit.

D. experiment details

D.1. normalized scores for all tasks

  • Visual-manipulation, Adroit, and Antmaze domains are goal-oriented, sparse reward tasks.
  • Normalized metric is goal achieved rate for each method.
  • Visual manipulation environment: +1 reward if object placed in bin.
  • Adroit tasks: success rate of opening door.
  • Kitchen task: normalized score is (# tasks solved) / (total tasks).
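The two normalized metrics above amount to simple ratios, sketched here. The function names and the boolean per-episode outcome list are hypothetical conveniences, not part of the paper's evaluation code.

```python
def success_rate(outcomes):
    """Goal-achieved rate: fraction of evaluation episodes that reached the goal.
    `outcomes` is a non-empty list of booleans, one per episode (assumed format)."""
    return sum(1 for reached in outcomes if reached) / len(outcomes)


def kitchen_score(tasks_solved, total_tasks):
    """Kitchen normalized score: number of subtasks solved over total subtasks."""
    return tasks_solved / total_tasks


print(success_rate([True, False, True, True]))
print(kitchen_score(3, 4))
```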

D.2. mixing ratio hyperparameter

  • Mixing ratio parameter is used during online tuning phase
  • Mixing ratio is either a value in range [0, 1] or -1
  • If mixing ratio is in range [0, 1], it represents percentage of offline and online data seen in each batch
  • Hyperparameters for CQL and Cal-QL ablated over values of offline, online, and mixing ratio
  • Variant of Bellman backup used, with its parameter set to 4 in the visual pick-and-place domain and 10 in other domains
  • Dual version of CQL used in Antmaze domain, with a threshold on the CQL regularizer
  • IQL parameters chosen from authors’ work and additional parameters swept over
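The mixing ratio described above can be sketched as a batch sampler. Two points are assumptions rather than facts from this summary: the ratio is read as the fraction of each batch drawn from offline data, and a value of -1 is read as sampling uniformly from the union of both buffers. The function name and the list-based buffers are hypothetical.

```python
import random


def sample_mixed_batch(offline, online, batch_size, mixing_ratio, rng=None):
    """Draw a batch (with replacement) from offline and online buffers.

    mixing_ratio in [0, 1]: assumed to be the offline fraction of the batch.
    mixing_ratio == -1:     assumed to mean uniform sampling over both buffers.
    """
    if rng is None:
        rng = random.Random()
    if mixing_ratio == -1:
        pool = offline + online
        return [rng.choice(pool) for _ in range(batch_size)]
    if not 0.0 <= mixing_ratio <= 1.0:
        raise ValueError("mixing_ratio must be in [0, 1] or equal to -1")
    n_offline = round(batch_size * mixing_ratio)
    n_online = batch_size - n_offline
    batch = [rng.choice(offline) for _ in range(n_offline)]
    batch += [rng.choice(online) for _ in range(n_online)]
    return batch


# Toy buffers of tagged transitions.
offline = [("offline", i) for i in range(100)]
online = [("online", i) for i in range(10)]
batch = sample_mixed_batch(offline, online, 8, 0.25, random.Random(0))
print(sum(1 for tag, _ in batch if tag == "offline"), len(batch))
```

A fixed ratio keeps the offline share of each batch constant even when the online buffer is still small, whereas uniform sampling over the union lets the share drift as online data accumulates.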

D.5. hyperparameters for SAC, SAC + offline data

  • Used standard hyperparameters from original SAC implementation
  • Used same hyperparameters as CQL and Cal-QL
  • Used automatic entropy tuning for policy and critic entropy terms
  • Target entropy of negative action dimension
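Automatic entropy tuning with a target entropy equal to the negative action dimension can be sketched as follows. This uses the standard SAC temperature objective as an assumption about what the bullets refer to; the function names and the log-parameterization of the temperature are hypothetical, and some implementations scale by `log_alpha` instead of `alpha`.

```python
import math


def target_entropy(action_dim):
    """Target entropy heuristic: negative of the action dimensionality."""
    return -float(action_dim)


def temperature_loss(log_alpha, log_probs, target):
    """SAC-style temperature loss: mean of -alpha * (log pi(a|s) + target).

    When the policy's entropy is above the target, the loss increases with
    alpha, so minimizing it lowers alpha; below the target, it raises alpha.
    """
    alpha = math.exp(log_alpha)
    return sum(-alpha * (lp + target) for lp in log_probs) / len(log_probs)


target = target_entropy(24)  # e.g. a 24-dimensional action space
print(target)
print(temperature_loss(0.0, [0.0, 0.0], target))
```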

E. extended discussion on limitations of existing fine-tuning methods

  • Aim to highlight potential reasons behind slow improvement of IQL
  • Temperature values for IQL have little to no effect on sample efficiency
  • Investigated IQL with more gradient steps and aggressive policy update
  • Assumptions B.1 and B.2
  • Definitions B.3 and B.4
  • Theorem B.5
  • Key Lemmas G.3.1 and G.3.2
  • Online Suboptimality via Performance Difference Lemma
  • Lemma G.8 and G.9
  • Cal-QL improves offline initialization significantly
  • Cal-QL attains smallest regret in aggregate
  • CQL and Cal-QL hyperparameters
  • IQL hyperparameters