Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Offline reinforcement learning can be used to obtain a policy initialization from existing datasets.
Existing offline RL methods tend to have poor online fine-tuning performance.
Online RL methods struggle to incorporate offline data.
Calibrated Q-learning (Cal-QL) learns Q-values that lower-bound the true value function of the learned policy while upper-bounding the value of some other (suboptimal) reference policy.
Cal-QL outperforms state-of-the-art methods on 10/11 fine-tuning benchmark tasks.

Paper Content

Introduction

Online RL fine-tuning is a recipe for machine learning success.
Prior offline RL methods tend to have slow performance improvement or initial performance degradation.
Calibrated Q-learning (Cal-QL) is a method for acquiring an offline initialization that facilitates online fine-tuning.
Cal-QL aims to learn conservative value functions that are calibrated with respect to a reference policy.
Cal-QL has stronger guarantees on cumulative regret during fine-tuning.
Cal-QL can be implemented on top of conservative Q-learning without additional hyperparameters.
Cal-QL is evaluated across a range of benchmark tasks.
Cal-QL matches or outperforms the best methods on all tasks.

Online RL methods require a large number of samples to learn from scratch
Incorporating offline data into online RL can improve sample efficiency
Prior methods do not eliminate the need to actively roll out poor policies for data collection
Offline RL can be used to learn a good policy and value initialization
Offline RL typically uses policy constraints or pessimism
Using pessimism or constraints for fine-tuning slows down fine-tuning or leads to initial unlearning
Cal-QL aims to learn a better offline initialization that enables standard online fine-tuning
Cal-QL fine-tunes without ensembles or exploration bonuses

Preliminaries and background

Goal of RL is to learn optimal policy for MDP
MDP consists of state and action spaces, dynamics and reward functions, and initial state distribution
Discount factor is between 0 and 1
Goal is to learn policy that maximizes cumulative discounted value function
CQL algorithm imposes regularizer to prevent overestimation of Q-values for out-of-distribution actions
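The regularizer mentioned in the last item can be illustrated with a minimal sketch. It assumes one common form of the CQL penalty: the mean Q-value on actions sampled from the learned policy minus the mean Q-value on dataset actions, scaled by a coefficient. The function name `cql_regularizer`, the parameter `alpha`, and the sample numbers are illustrative assumptions, not taken from the paper's code.

```python
from statistics import mean


def cql_regularizer(q_policy_actions, q_dataset_actions, alpha=1.0):
    """Sketch of a CQL-style penalty (assumed form, not the authors' code).

    q_policy_actions:  Q(s, a) for actions sampled from the learned policy,
                       which may be out-of-distribution.
    q_dataset_actions: Q(s, a) for the actions stored in the offline dataset.
    Minimizing the returned value pushes Q down on policy actions and up on
    dataset actions.
    """
    return alpha * (mean(q_policy_actions) - mean(q_dataset_actions))


# Hypothetical Q-values for three states.
q_pi = [5.0, 7.0, 6.0]
q_data = [2.0, 3.0, 1.0]
penalty = cql_regularizer(q_pi, q_data, alpha=0.5)
```

In a full training loop this penalty would be added to the usual Bellman error, so a large gap between policy-action and dataset-action Q-values is discouraged.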

When can offline RL initializations enable fast online fine-tuning?

Initializing the value function with an existing offline RL method can lead to poor performance in online fine-tuning.
Analyzing the limitations of explicit policy constraint methods for fine-tuning.
Studying an implicit policy constraint method and a conservative method for fine-tuning a robot policy on a visual pick-and-place task.

Empirical analysis

Two methods of online fine-tuning are presented in Figure 2
Neither of the methods performs well during fine-tuning
CQL unlearns the offline initialization and takes 20K steps to recover
IQL’s slow speed of fine-tuning is investigated in Appendix E
Q-values learned by CQL in the offline phase are underestimated
Q-values drastically jump and adjust in scale when online fine-tuning begins
Performance recovery coincides with a period where the range of Q-values changes to match the true range
Conservative methods can attain good asymptotic performance, but “waste” samples to correct the learned Q-function

Conditions on offline initializations that enable fast fine-tuning

Motivates two conditions on Q-function initialization for fast fine-tuning
Conservative Q-functions can attain good asymptotic performance
Q-function must be calibrated to prevent unlearning during fine-tuning
Cal-QL enforces calibration with respect to policies whose Q-value can be estimated reliably
Cal-QL corrects the scale of the learned Q-values
CQL training objective is modified to mask out the push down of the learned Q-value on OOD actions
Cal-QL trains on a mixture of offline and online data
Analysis of Cal-QL shows favorable regret bound during online phase
Analysis operates in a finite-horizon setting
Concentrability coefficient quantifies the distribution shift between policy and dataset
Bellman bilinear rank of the Q-function class is bounded
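The masking of the push-down on OOD actions described above can be sketched as follows. The sketch assumes that masking is done by clipping the learned Q-value from below at a reference value before it enters a CQL-style penalty, so that no push-down signal remains once Q falls under the reference. The names `calql_regularizer`, `pushdown_mask`, and `ref_values` are hypothetical, as are the numbers.

```python
from statistics import mean


def pushdown_mask(q_policy_actions, ref_values):
    """True where the learned Q-value is still above the reference value,
    i.e. where the conservative push-down remains active."""
    return [q > v for q, v in zip(q_policy_actions, ref_values)]


def calql_regularizer(q_policy_actions, q_dataset_actions, ref_values, alpha=1.0):
    """Sketch of a calibrated penalty (assumed form, not the authors' code).

    Each policy-action Q-value is replaced by max(Q, V_ref). Where Q is
    already below the reference, the term becomes a constant with respect
    to Q, so it no longer drives Q further down.
    """
    clipped = [max(q, v) for q, v in zip(q_policy_actions, ref_values)]
    return alpha * (mean(clipped) - mean(q_dataset_actions))


# Hypothetical values: the second state's Q-value is below the reference.
q_pi = [5.0, 1.0, 6.0]
q_data = [2.0, 3.0, 1.0]
ref = [3.0, 3.0, 3.0]

mask = pushdown_mask(q_pi, ref)
penalty = calql_regularizer(q_pi, q_data, ref)
```

With a very low reference value the clipping never triggers and the penalty coincides with the unmasked CQL-style form, which matches the statement that Cal-QL can be built on top of CQL.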

Intuition

Cal-QL can attain a smaller regret compared to not imposing either calibration or conservatism.
Regret can be decomposed into two terms: (i) difference between ground-truth value of optimal policy and learned Q-function, and (ii) amount of over-estimation in the learned value function.
To control regret, a balance must be struck between learning a calibrated and conservative Q-function.
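The two-term decomposition can be written out as a simple add-and-subtract identity. The notation is assumed for illustration and is not taken from the source: $\pi^\star$ is the optimal policy, $\pi^k$ the policy in round $k$ of fine-tuning, $\hat{Q}^k$ the learned Q-function, and $s_0$ the initial state.

```latex
\begin{aligned}
\mathrm{Reg}(K)
  &= \sum_{k=1}^{K} \Big[ V^{\pi^\star}(s_0) - V^{\pi^k}(s_0) \Big] \\
  &= \underbrace{\sum_{k=1}^{K} \Big[ V^{\pi^\star}(s_0) - \hat{Q}^k\big(s_0, \pi^k(s_0)\big) \Big]}_{\text{(i) gap between optimal value and learned Q-function}}
   + \underbrace{\sum_{k=1}^{K} \Big[ \hat{Q}^k\big(s_0, \pi^k(s_0)\big) - V^{\pi^k}(s_0) \Big]}_{\text{(ii) over-estimation in the learned value}}
\end{aligned}
```

Under this reading, a Q-function that is too pessimistic inflates term (i), while one that is too optimistic inflates term (ii).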

Theorem statement

Cal-QL obtains a bound on total regret accumulated during online fine-tuning
Song et al. (2023) presents a regret bound
Cal-QL can enable a tighter regret guarantee compared to Song et al. (2023)
Cal-QL’s concentrability coefficient is no larger than the one that appears in Theorem 1 of Song et al. (2023)
In the worst possible case, Cal-QL reverts back to the guarantee from Song et al. (2023)

Experimental evaluation

Goal of experiment is to study how well Cal-QL facilitates sample-efficient online fine-tuning
Performance of Cal-QL compared to other state-of-the-art fine-tuning methods on a variety of offline RL benchmark tasks
Performance evaluated before and after online fine-tuning
Effectiveness of Cal-QL on higher-dimensional tasks studied
Empirical studies to understand efficacy of Cal-QL with different dataset compositions and impact of errors in reference function value estimation
Cal-QL compared to running online SAC, CQL, IQL, O3F, and SAC + offline data
Learning curves for online fine-tuning presented
Quantitatively evaluate each method on its ability to improve initialization learned from offline data
Compare Cal-QL to RLPD, a more sample-efficient version of “SAC + offline data”

Empirical results

Cal-QL consistently achieves a smaller regret than other methods
Cal-QL brings the benefits of fast online learning together with a good offline initialization
Cal-QL improves drastically on the regret metric
Cal-QL improves over the best prior fine-tuning method and attains a much larger performance improvement over the course of online fine-tuning
Cal-QL can achieve higher sample efficiency by using a higher UTD ratio
Cal-QL generally attains asymptotic performance similar to or higher than that of RLPD
Cal-QL stably transitions to online fine-tuning with no unlearning
Utilizing Cal-QL to calibrate the Q-function against the behavior policy can be significantly helpful
Cal-QL performance largely remains unaltered when reference value functions must be estimated using the offline dataset itself
Errors in the reference Q-function do not affect the performance of Cal-QL significantly
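The regret comparisons above can be made concrete with a small sketch. It assumes one plausible definition, not confirmed by the source: regret is the average gap between the best attainable normalized score (1.0) and the score measured at each evaluation point during fine-tuning. The function name and both learning curves are hypothetical.

```python
def cumulative_regret(normalized_scores):
    """Average shortfall from a perfect normalized score of 1.0.

    normalized_scores: evaluation scores in [0, 1], one per evaluation point
    during online fine-tuning. This definition is an assumption made for
    illustration.
    """
    if not normalized_scores:
        raise ValueError("need at least one evaluation point")
    return sum(1.0 - s for s in normalized_scores) / len(normalized_scores)


# Two hypothetical curves with the same start and end performance.
curve_with_unlearning = [0.6, 0.2, 0.5, 0.9]  # dips before recovering
curve_stable = [0.6, 0.7, 0.8, 0.9]           # improves monotonically

regret_unlearning = cumulative_regret(curve_with_unlearning)
regret_stable = cumulative_regret(curve_stable)
```

Both curves reach the same final score, but the one that dips accumulates more regret, which is why initial unlearning is penalized by this kind of metric.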

Discussion

Cal-QL is a method for acquiring initializations that facilitate fast online fine-tuning
Cal-QL learns conservative value functions that are also constrained to be larger than the value function of a reference policy
Cal-QL avoids initial unlearning in online fine-tuning with conservative methods
Cal-QL enables fast online fine-tuning and outperforms prior methods
Future work could adjust calibration and conservatism or extend Cal-QL to real-world problems
Policy fine-tuning has been studied in different settings
Cal-QL uses a pessimistic functional class to utilize offline data efficiently
Cal-QL uses a Bellman operator and a weighted ℓ2 norm
Cal-QL uses a Bilinear model and a matrix norm

C. Environment details

Antmaze navigation tasks involve controlling an 8-DoF ant quadruped robot to move from a starting point to a goal in a maze.
Rewards are given depending on whether the goal is reached or not.
The maze is divided into “medium” and “hard” sections.
Kitchen tasks involve controlling a 9-DoF Franka robot to arrange a kitchen environment into a desired configuration.
Rewards are given depending on how many subtasks are solved.
Adroit domain involves controlling a 24-DoF shadow hand robot with 3 tasks.
Dataset is narrow and action space is large.
Agent is pre-trained for 1M steps for Antmaze, 500K steps for Kitchen, and 20K steps for Adroit.

D. Experiment details

D.1. Normalized scores for all tasks

Visual-manipulation, Adroit, and Antmaze domains are goal-oriented, sparse reward tasks.
Normalized metric is goal achieved rate for each method.
Visual manipulation environment: +1 reward if object placed in bin.
Adroit tasks: success rate of opening door.
Kitchen task: normalized score is (number of tasks solved) / (total number of tasks).

D.2. Mixing ratio hyperparameter

Mixing ratio parameter is used during online tuning phase
Mixing ratio is either a value in range [0, 1] or -1
If mixing ratio is in range [0, 1], it represents percentage of offline and online data seen in each batch
Hyperparameters for CQL and Cal-QL ablated over values of offline, online, and mixing ratio
Variant of Bellman backup used, with its parameter set to 4 in the visual pick-and-place domain and 10 in other domains
Dual version of CQL used in Antmaze domain with a threshold on the CQL regularizer
IQL parameters chosen from authors’ work and additional parameters swept over
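The mixing ratio described at the start of this list can be sketched as a batch sampler. Two points are assumptions rather than statements from the source: a ratio in [0, 1] is read as the fraction of each batch drawn from offline data, and -1 is read as pooling both buffers and sampling uniformly from the union. The function name `sample_mixed_batch` and the toy buffers are hypothetical.

```python
import random


def sample_mixed_batch(offline_data, online_data, batch_size, mixing_ratio, rng=random):
    """Draw one training batch from offline and online buffers.

    mixing_ratio in [0, 1]: ASSUMED to be the offline fraction of the batch.
    mixing_ratio == -1:     ASSUMED to mean sampling uniformly from the
                            concatenation of both buffers.
    """
    if mixing_ratio == -1:
        pool = list(offline_data) + list(online_data)
        return [rng.choice(pool) for _ in range(batch_size)]
    if not 0.0 <= mixing_ratio <= 1.0:
        raise ValueError("mixing_ratio must be in [0, 1] or equal to -1")
    n_offline = round(batch_size * mixing_ratio)
    n_online = batch_size - n_offline
    batch = [rng.choice(offline_data) for _ in range(n_offline)]
    batch += [rng.choice(online_data) for _ in range(n_online)]
    return batch


# Toy buffers whose entries are tagged by their origin.
offline_buffer = [("offline", i) for i in range(10)]
online_buffer = [("online", i) for i in range(10)]

rng = random.Random(0)
batch = sample_mixed_batch(offline_buffer, online_buffer, 8, 0.25, rng)
pooled = sample_mixed_batch(offline_buffer, online_buffer, 8, -1, rng)
```

With a batch size of 8 and a ratio of 0.25, this sketch draws 2 offline and 6 online transitions per batch.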

D.5. Hyperparameters for SAC, SAC + offline data

Used standard hyperparameters from original SAC implementation
Used same hyperparameters as CQL and Cal-QL
Used automatic entropy tuning for policy and critic entropy terms
Target entropy of negative action dimension
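The entropy settings above can be illustrated with a short sketch of the standard SAC temperature objective, J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)], with the target set to the negative action dimension. The function names and log-probabilities are hypothetical; the action dimension of 24 matches the Adroit hand described earlier, and parameterizing the temperature through `log_alpha` is an assumed, common choice.

```python
import math


def target_entropy(action_dim):
    """Target entropy equal to the negative action dimension."""
    return -float(action_dim)


def temperature_loss(log_alpha, log_probs, action_dim):
    """SAC-style temperature loss averaged over a batch.

    log_alpha: log of the entropy temperature (kept in log space so that
               alpha stays positive; an assumed parameterization).
    log_probs: log pi(a|s) for actions sampled from the current policy.
    """
    alpha = math.exp(log_alpha)
    target = target_entropy(action_dim)
    return sum(-alpha * (lp + target) for lp in log_probs) / len(log_probs)


# Hypothetical batch of policy log-probabilities for a 24-dimensional action space.
log_probs = [-20.0, -30.0]
loss = temperature_loss(log_alpha=0.0, log_probs=log_probs, action_dim=24)
```

Minimizing this loss with respect to `log_alpha` raises the temperature when the policy entropy is below the target and lowers it otherwise.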

E. Extended discussion on limitations of existing fine-tuning methods

Aim to highlight potential reasons behind slow improvement of IQL
Temperature values for IQL have little to no effect on sample efficiency
Investigated IQL with more gradient steps and aggressive policy update
Assumptions B.1 and B.2
Definitions B.3 and B.4
Theorem B.5
Key Lemmas G.3.1 and G.3.2
Online Suboptimality via Performance Difference Lemma
Lemma G.8 and G.9
Cal-QL improves offline initialization significantly
Cal-QL attains smallest regret in aggregate
CQL and Cal-QL hyperparameters
IQL hyperparameters
