Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Offline RL methods reduce need for environment interaction by training agents using offline collected episodes.
  • Action information needs to be logged during data collection, which can be difficult or impossible in some cases.
  • AFP-RL investigates potential of using action-free offline datasets to improve online reinforcement learning.
  • AF-Guide consists of AFDT and Guided SAC to learn from offline dataset and guide online training.
  • AF-Guide improves sample efficiency and performance in online training.

Paper Content

Introduction

  • Training a reinforcement learning agent from scratch can be challenging and time consuming
  • Improving sample efficiency in RL is an important direction in the reinforcement learning community
  • Offline reinforcement learning methods use offline collected episodes to train RL agents
  • Actions need to be logged when collecting offline episodes, but this can be difficult or impossible in certain cases
  • Action-free offline reinforcement learning datasets can be utilized to guide online RL
  • AF-Guide is proposed to improve online training by learning to plan good target states from the action-free offline dataset
  • AF-Guide can significantly improve sample efficiency during online training
  • Offline reinforcement learning methods use pre-collected episodes from unknown behavior policies
  • Decision Transformer and Trajectory Transformer convert the offline RL problem as a context-conditioned sequential modeling problem
  • AFDT-Guide leverages action-free data to plan good target states and guide online training
  • Imitation Learning from observation methods infer expert actions given state transitions
  • Motion forecasting predicts future motion of agents given past and context information
  • Motion forecasting methods use RNN, convolution, and Transformer architectures

Background

  • Soft Actor-Critic (SAC) is an actor-critic RL approach
  • SAC is based on the maximum entropy framework
  • SAC is provided with an initial state and a desired initial return-to-go
  • Action-Free Offline Pretraining involves an action-free offline dataset
  • AF-Guide uses knowledge from action-free offline datasets to plan the next states
  • Guided SAC follows the planning with an additional Q function
  • AF-Guide is similar to the Learning-to-Think framework

Action-free decision transformer

  • AF-Guide is a variant of the UDRL model Decision Transformer (DT)
  • AFDT predicts the next state based on previous states and RTGs only
  • AFDT takes K steps of input, consisting of 2K tokens
  • AFDT uses token embedding, positional embedding, and layer normalization
  • AFDT predicts the state change ∆s t+1 and adds it back to s t to obtain s t+1
  • AFDT is trained autoregressively with L1 loss to predict the next state from the processed RTG token

Guided soft actor-critic

  • AFDT model can be used to benefit the learning of Soft Actor-Critic (SAC)
  • Guiding reward is designed to measure the discrepancy between the planned state and the actual state
  • Guiding Q function is used to prevent the guiding reward from misleading the agent
  • Combined Q function is used to guide the policy learning
  • Coefficient β determines whether to use the guiding reward or not

Experiments

  • AF-Guide is an approach for utilizing action-free offline reinforcement learning datasets in online reinforcement learning
  • Three ablation studies are conducted to provide evidence for the validity of the two components of AF-Guide
  • Action-Free D4RL is an adaptation of the widely-used offline reinforcement learning benchmark, D4RL
  • Action labels are removed from the original D4RL datasets to create action-free offline RL datasets
  • Six environments are evaluated, including three locomotion tasks, two ball maze environments, and one robot ant maze environment
  • AF-Guide training contains two stages: an offline stage and an online stage
  • Default hyperparameters are used for the architecture and training of AFDT and Guided SAC

Main experiments

  • AF-Guide outperforms SAC in all evaluated environments
  • AF-Guide shows a significant improvement of 50% in Halfcheetah and Walker2d
  • Different offline datasets do not result in significant performance differences

Ablation study

  • Guided SAC with an additional Q function is necessary to process the guiding reward
  • Ablation study conducted in Halfcheetah, Walker2d and Maze2d-Medium
  • Results show that AF-Guide [SAC] has similar performance as SAC in Maze2d-Medium and does not work in Halfcheetah and Walker2d
  • AF-Guide with Guided SAC benefits from the guiding reward by ignoring guiding rewards in future steps
  • AFDT plans better states than those from behavior policy
  • AF-Guide [Imi] performs worse than the original version in Walker2d and Maze2d-Large
  • AF-Guide [Imi] performs better than SAC in Halfcheetah and slightly better in Walker2d
  • AF-Guide [TT] had better performance in Hopper and Walker2d tasks, but worse in Halfcheetah

Limitations

  • AF-Guide is based on Decision Transformer and UDRL framework
  • UDRL may diverge from optimal policy in episodic setting with stochastic environments
  • AF-Guide is agnostic to sequential planning model
  • Guiding reward is based on L2 distance, which may not be optimal in some state spaces

Conclusion

  • Utilizing action-free offline datasets to guide online reinforcement learning
  • Proposed Action-Free Guide (AF-Guide) to learn to plan target state from offline datasets
  • AF-Guide has better sample efficiency than SAC in various locomotion and maze environments
  • Encourages further research in other areas where action-free offline pretraining can be an effective learning approach
  • Medium dataset collected using policy trained to 1/3 performance of expert
  • Medium-Replay uses training replay buffer of ‘Medium’ policy
  • Medium-Expert dataset contains 50% of data from Medium and remaining from expert policy
  • Antmaze-Umaze dataset has robot ant going from fixed start to fixed target
  • Antmaze-Umaze-Diverse has robot ant going to random target locations
  • AFDT follows default hyperparameters of DT
  • Guided SAC uses three-layer MLPs with ReLU activation and 256 hidden dimensions
  • AFTT based on ‘faster-trajectory-transformer’ repository
  • AF-Guide outperforms SAC in all evaluated locomotion and ball maze environments
  • AF-Guide [SAC] performs similarly to SAC in Maze2d-Medium, but not in Halfcheetah and Walker2d
  • AF-Guide [Imi] performs worse in Walker2d and Maze2d-Large
  • AF-Guide [TT] has better performance in Hopper and Walker2d, but worse in Halfcheetah
  • AF-Guide [TT] increases training time due to huge planning cost