Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Offline RL methods reduce need for environment interaction by training agents using offline collected episodes.
Action information needs to be logged during data collection, which can be difficult or impossible in some cases.
AFP-RL investigates potential of using action-free offline datasets to improve online reinforcement learning.
AF-Guide consists of AFDT and Guided SAC to learn from offline dataset and guide online training.
AF-Guide improves sample efficiency and performance in online training.

Paper Content

Introduction

Training a reinforcement learning agent from scratch can be challenging and time consuming
Improving sample efficiency in RL is an important direction in the reinforcement learning community
Offline reinforcement learning methods use offline collected episodes to train RL agents
Actions need to be logged when collecting offline episodes, but this can be difficult or impossible in certain cases
Action-free offline reinforcement learning datasets can be utilized to guide online RL
AF-Guide is proposed to improve online training by learning to plan good target states from the action-free offline dataset
AF-Guide can significantly improve sample efficiency during online training

Offline reinforcement learning methods use pre-collected episodes from unknown behavior policies
Decision Transformer and Trajectory Transformer convert the offline RL problem as a context-conditioned sequential modeling problem
AFDT-Guide leverages action-free data to plan good target states and guide online training
Imitation Learning from observation methods infer expert actions given state transitions
Motion forecasting predicts future motion of agents given past and context information
Motion forecasting methods use RNN, convolution, and Transformer architectures

Background

Soft Actor-Critic (SAC) is an actor-critic RL approach
SAC is based on the maximum entropy framework
SAC is provided with an initial state and a desired initial return-to-go
Action-Free Offline Pretraining involves an action-free offline dataset
AF-Guide uses knowledge from action-free offline datasets to plan the next states
Guided SAC follows the planning with an additional Q function
AF-Guide is similar to the Learning-to-Think framework

Action-free decision transformer

AF-Guide is a variant of the UDRL model Decision Transformer (DT)
AFDT predicts the next state based on previous states and RTGs only
AFDT takes K steps of input, consisting of 2K tokens
AFDT uses token embedding, positional embedding, and layer normalization
AFDT predicts the state change ∆s t+1 and adds it back to s t to obtain s t+1
AFDT is trained autoregressively with L1 loss to predict the next state from the processed RTG token

Guided soft actor-critic

AFDT model can be used to benefit the learning of Soft Actor-Critic (SAC)
Guiding reward is designed to measure the discrepancy between the planned state and the actual state
Guiding Q function is used to prevent the guiding reward from misleading the agent
Combined Q function is used to guide the policy learning
Coefficient β determines whether to use the guiding reward or not

Experiments

AF-Guide is an approach for utilizing action-free offline reinforcement learning datasets in online reinforcement learning
Three ablation studies are conducted to provide evidence for the validity of the two components of AF-Guide
Action-Free D4RL is an adaptation of the widely-used offline reinforcement learning benchmark, D4RL
Action labels are removed from the original D4RL datasets to create action-free offline RL datasets
Six environments are evaluated, including three locomotion tasks, two ball maze environments, and one robot ant maze environment
AF-Guide training contains two stages: an offline stage and an online stage
Default hyperparameters are used for the architecture and training of AFDT and Guided SAC

Main experiments

AF-Guide outperforms SAC in all evaluated environments
AF-Guide shows a significant improvement of 50% in Halfcheetah and Walker2d
Different offline datasets do not result in significant performance differences

Ablation study

Guided SAC with an additional Q function is necessary to process the guiding reward
Ablation study conducted in Halfcheetah, Walker2d and Maze2d-Medium
Results show that AF-Guide [SAC] has similar performance as SAC in Maze2d-Medium and does not work in Halfcheetah and Walker2d
AF-Guide with Guided SAC benefits from the guiding reward by ignoring guiding rewards in future steps
AFDT plans better states than those from behavior policy
AF-Guide [Imi] performs worse than the original version in Walker2d and Maze2d-Large
AF-Guide [Imi] performs better than SAC in Halfcheetah and slightly better in Walker2d
AF-Guide [TT] had better performance in Hopper and Walker2d tasks, but worse in Halfcheetah

Limitations

AF-Guide is based on Decision Transformer and UDRL framework
UDRL may diverge from optimal policy in episodic setting with stochastic environments
AF-Guide is agnostic to sequential planning model
Guiding reward is based on L2 distance, which may not be optimal in some state spaces

Conclusion

Utilizing action-free offline datasets to guide online reinforcement learning
Proposed Action-Free Guide (AF-Guide) to learn to plan target state from offline datasets
AF-Guide has better sample efficiency than SAC in various locomotion and maze environments
Encourages further research in other areas where action-free offline pretraining can be an effective learning approach
Medium dataset collected using policy trained to 1/3 performance of expert
Medium-Replay uses training replay buffer of ‘Medium’ policy
Medium-Expert dataset contains 50% of data from Medium and remaining from expert policy
Antmaze-Umaze dataset has robot ant going from fixed start to fixed target
Antmaze-Umaze-Diverse has robot ant going to random target locations
AFDT follows default hyperparameters of DT
Guided SAC uses three-layer MLPs with ReLU activation and 256 hidden dimensions
AFTT based on ‘faster-trajectory-transformer’ repository
AF-Guide outperforms SAC in all evaluated locomotion and ball maze environments
AF-Guide [SAC] performs similarly to SAC in Maze2d-Medium, but not in Halfcheetah and Walker2d
AF-Guide [Imi] performs worse in Walker2d and Maze2d-Large
AF-Guide [TT] has better performance in Hopper and Walker2d, but worse in Halfcheetah
AF-Guide [TT] increases training time due to huge planning cost

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Background#

Action-free decision transformer#

Guided soft actor-critic#

Experiments#

Main experiments#

Ablation study#

Limitations#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Background

Action-free decision transformer

Guided soft actor-critic

Experiments

Main experiments

Ablation study

Limitations

Conclusion