Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Offline RL methods reduce need for environment interaction by training agents using offline collected episodes.
- Action information needs to be logged during data collection, which can be difficult or impossible in some cases.
- AFP-RL investigates potential of using action-free offline datasets to improve online reinforcement learning.
- AF-Guide consists of AFDT and Guided SAC to learn from offline dataset and guide online training.
- AF-Guide improves sample efficiency and performance in online training.
Paper Content
Introduction
- Training a reinforcement learning agent from scratch can be challenging and time consuming
- Improving sample efficiency in RL is an important direction in the reinforcement learning community
- Offline reinforcement learning methods use offline collected episodes to train RL agents
- Actions need to be logged when collecting offline episodes, but this can be difficult or impossible in certain cases
- Action-free offline reinforcement learning datasets can be utilized to guide online RL
- AF-Guide is proposed to improve online training by learning to plan good target states from the action-free offline dataset
- AF-Guide can significantly improve sample efficiency during online training
Related work
- Offline reinforcement learning methods use pre-collected episodes from unknown behavior policies
- Decision Transformer and Trajectory Transformer convert the offline RL problem as a context-conditioned sequential modeling problem
- AFDT-Guide leverages action-free data to plan good target states and guide online training
- Imitation Learning from observation methods infer expert actions given state transitions
- Motion forecasting predicts future motion of agents given past and context information
- Motion forecasting methods use RNN, convolution, and Transformer architectures
Background
- Soft Actor-Critic (SAC) is an actor-critic RL approach
- SAC is based on the maximum entropy framework
- SAC is provided with an initial state and a desired initial return-to-go
- Action-Free Offline Pretraining involves an action-free offline dataset
- AF-Guide uses knowledge from action-free offline datasets to plan the next states
- Guided SAC follows the planning with an additional Q function
- AF-Guide is similar to the Learning-to-Think framework
Action-free decision transformer
- AF-Guide is a variant of the UDRL model Decision Transformer (DT)
- AFDT predicts the next state based on previous states and RTGs only
- AFDT takes K steps of input, consisting of 2K tokens
- AFDT uses token embedding, positional embedding, and layer normalization
- AFDT predicts the state change ∆s t+1 and adds it back to s t to obtain s t+1
- AFDT is trained autoregressively with L1 loss to predict the next state from the processed RTG token
Guided soft actor-critic
- AFDT model can be used to benefit the learning of Soft Actor-Critic (SAC)
- Guiding reward is designed to measure the discrepancy between the planned state and the actual state
- Guiding Q function is used to prevent the guiding reward from misleading the agent
- Combined Q function is used to guide the policy learning
- Coefficient β determines whether to use the guiding reward or not
Experiments
- AF-Guide is an approach for utilizing action-free offline reinforcement learning datasets in online reinforcement learning
- Three ablation studies are conducted to provide evidence for the validity of the two components of AF-Guide
- Action-Free D4RL is an adaptation of the widely-used offline reinforcement learning benchmark, D4RL
- Action labels are removed from the original D4RL datasets to create action-free offline RL datasets
- Six environments are evaluated, including three locomotion tasks, two ball maze environments, and one robot ant maze environment
- AF-Guide training contains two stages: an offline stage and an online stage
- Default hyperparameters are used for the architecture and training of AFDT and Guided SAC
Main experiments
- AF-Guide outperforms SAC in all evaluated environments
- AF-Guide shows a significant improvement of 50% in Halfcheetah and Walker2d
- Different offline datasets do not result in significant performance differences
Ablation study
- Guided SAC with an additional Q function is necessary to process the guiding reward
- Ablation study conducted in Halfcheetah, Walker2d and Maze2d-Medium
- Results show that AF-Guide [SAC] has similar performance as SAC in Maze2d-Medium and does not work in Halfcheetah and Walker2d
- AF-Guide with Guided SAC benefits from the guiding reward by ignoring guiding rewards in future steps
- AFDT plans better states than those from behavior policy
- AF-Guide [Imi] performs worse than the original version in Walker2d and Maze2d-Large
- AF-Guide [Imi] performs better than SAC in Halfcheetah and slightly better in Walker2d
- AF-Guide [TT] had better performance in Hopper and Walker2d tasks, but worse in Halfcheetah
Limitations
- AF-Guide is based on Decision Transformer and UDRL framework
- UDRL may diverge from optimal policy in episodic setting with stochastic environments
- AF-Guide is agnostic to sequential planning model
- Guiding reward is based on L2 distance, which may not be optimal in some state spaces
Conclusion
- Utilizing action-free offline datasets to guide online reinforcement learning
- Proposed Action-Free Guide (AF-Guide) to learn to plan target state from offline datasets
- AF-Guide has better sample efficiency than SAC in various locomotion and maze environments
- Encourages further research in other areas where action-free offline pretraining can be an effective learning approach
- Medium dataset collected using policy trained to 1/3 performance of expert
- Medium-Replay uses training replay buffer of ‘Medium’ policy
- Medium-Expert dataset contains 50% of data from Medium and remaining from expert policy
- Antmaze-Umaze dataset has robot ant going from fixed start to fixed target
- Antmaze-Umaze-Diverse has robot ant going to random target locations
- AFDT follows default hyperparameters of DT
- Guided SAC uses three-layer MLPs with ReLU activation and 256 hidden dimensions
- AFTT based on ‘faster-trajectory-transformer’ repository
- AF-Guide outperforms SAC in all evaluated locomotion and ball maze environments
- AF-Guide [SAC] performs similarly to SAC in Maze2d-Medium, but not in Halfcheetah and Walker2d
- AF-Guide [Imi] performs worse in Walker2d and Maze2d-Large
- AF-Guide [TT] has better performance in Hopper and Walker2d, but worse in Halfcheetah
- AF-Guide [TT] increases training time due to huge planning cost