Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Transformers are widely used in NLP and CV, mostly in supervised settings.
- Transformers are being used in reinforcement learning, but face unique design choices and challenges.
- This paper reviews motivations and progress on using Transformers in RL, provides a taxonomy, and discusses future prospects.
Paper Content
Introduction
- Reinforcement learning (RL) is a mathematical formalism for sequential decision-making
- RL can be used to acquire intelligent behaviors automatically
- Deep neural networks can be used to approximate functions with high capacity
- Deep reinforcement learning (DRL) has achieved tremendous developments in recent years
- Sample efficiency is an issue for DRL in real-world applications
- Inductive bias can be introduced into the DRL framework
- Choosing function approximator architectures is an important inductive bias
- Architectures developed for supervised learning (SL) have motivated architecture design for RL
- Convolutional neural networks (CNN) and recurrent neural networks (RNN) are common choices for DRL
- The Transformer architecture has revolutionized the learning paradigm across SL tasks
- Transformers have been applied to RL to extract relations between entities and capture multi-step temporal dependencies
- Offline RL has attracted attention due to its ability to leverage offline large-scale datasets
- Transformers can serve directly as a model for sequential decisions
- Transformer-based architectures often suffer from high computational and memory costs
Problem scope
Reinforcement learning
- Reinforcement Learning (RL) considers learning in a Markov Decision Process (MDP)
- RL aims to learn a policy to maximize the expected discounted return
- Topics in RL include meta RL, multi-task RL, and multi-agent RL
- Offline RL does not allow interaction with the environment during training
- Goal-conditioned RL extends the standard RL problem to a goal-augmented setting
- Model-based RL learns an auxiliary dynamics model of the environment
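The discounted return that the policy is meant to maximize can be made concrete with a small example. The sketch below computes it for a single trajectory; the function name, the reward values, and the discount factor are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum over t of gamma**t * rewards[t] for one trajectory."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Made-up rewards for illustration: 1.0 + 0.5 * 0.0 + 0.25 * 2.0
episode_rewards = [1.0, 0.0, 2.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 1.5
```

The expected discounted return is the average of this quantity over trajectories sampled by following the policy.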
Transformers
- The Transformer is a neural network architecture for modeling sequential data
- Self-attention mechanism captures dependencies within long sequences
- Inputs are mapped to queries, keys, and values by linear transformations
- Output of self-attention layer is a weighted sum of all values
- Multi-head attention and residual connection help Transformers learn expressive representations and model long-term interactions
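The self-attention steps listed above can be sketched in plain Python. This is a minimal single-head version under stated assumptions: the helper names (`matmul`, `softmax`, `self_attention`) and the toy inputs are hypothetical, the scores are scaled by the square root of the key dimension as in standard scaled dot-product attention, and multi-head attention and residual connections are left out.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    # Linear maps from the inputs to queries, keys, and values.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # scores[i][j]: how strongly position i attends to position j.
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted sum of all value vectors.
    return matmul(weights, v), weights


# Toy sequence of three 2-dimensional tokens with identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, identity, identity, identity)
print(len(outputs), len(outputs[0]))  # 3 2
```

Every position attends to every other position in a single step, which is why dependencies across long sequences can be captured without recurrence.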
Combination of Transformers and RL
- Transformers can be used as a component for RL algorithms
- Transformers can also be used as a whole sequential decision-maker
Network architecture in RL
- Early work on network architecture design in RL ran into challenges
- Techniques of neural networks (e.g., regularization, skip connection, batch normalization) can be applied to RL to improve performance and sample efficiency
Architectures for function approximators
- A deep dense architecture with skip connections has been proposed for DRL agents to make learning more efficient
- Ota et al. used DenseNet with decoupled representation learning to improve flows of information and gradients
- The Transformer architecture has been applied to policy optimization algorithms, but was found to fail in RL tasks
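To illustrate the skip-connection idea mentioned above, here is a minimal sketch of one common form of dense connection, where the raw input is concatenated onto the hidden activations before every layer after the first. This is an assumption chosen for illustration and not necessarily the exact design used in the cited works; the helper names and weights are hypothetical.

```python
def linear(x, weights, bias):
    """Affine layer: one output per (weight row, bias) pair."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def dense_skip_mlp(x, layers):
    """Forward pass where every layer after the first sees [hidden, x]."""
    hidden = relu(linear(x, *layers[0]))
    for weights, bias in layers[1:]:
        # List concatenation is the skip connection: the raw input
        # stays directly visible to deeper layers.
        hidden = relu(linear(hidden + x, weights, bias))
    return hidden


# Hypothetical weights: layer 2 reads only the first raw-input feature.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # 2 -> 2
    ([[0.0, 0.0, 1.0, 0.0]], [0.0]),         # (2 hidden + 2 input) -> 1
]
print(dense_skip_mlp([1.0, -2.0], layers))  # [1.0]
```

Because the input bypasses the earlier layers, information and gradients have a short path through the network even as depth grows.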
Challenges
- Transformer-based architectures have been successful in SL domains, but applying them in RL is difficult.
- RL algorithms are sensitive to design choices and can diverge when value estimates become unbounded.
- Transformer-based architectures have large memory footprints and high latency, which makes deployment and inference costly.
Transformers in RL
- Transformers have not been widely used in the RL community
- Early TransformRL attempts applied Transformers to state representation learning and to providing memory information
- Recent works treat the RL problem as a conditional sequence modeling problem on fixed experiences
- Existing methods are categorized into four classes: representation learning, model learning, sequential decision-making, and generalist agents
Transformers for representation learning
- Transformer encoder module used to process complex information from variable number of entities
- Entity Transformer encodes the observation as a set of per-entity inputs $e_i$
- Follow-up works enrich entity Transformer mechanisms
- Transformer used to process local per-timestep sequences
- Gated Transformer-XL used to process temporal sequence
- Follow-up works use auxiliary (self-)supervised tasks and pre-trained Transformer architecture to improve data efficiency
Transformers for model learning
- Transformers used as encoder for sequence embedding and backbone of environmental model in model-based algorithms
- Transformer enables prediction conditioned on historical information
- The success of Dreamer and subsequent algorithms demonstrates the benefits of a world model conditioned on history
- Transformer-based world model used for planning and goal-conditioned planning
- Transformer-based world models are reported to be more data-efficient than Dreamer and better suited to tasks requiring long-term memory
Transformers for sequential decision-making
- Transformer can be used for sequential decision-making
- Offline RL is a growing area of research
- Decision Transformer (DT) conditions on return-to-go
- Trajectory Transformer (TT) uses beam search for planning
- Behavior Transformer (BeT) uses a basic Transformer structure for behavior cloning
- Bootstrapped Transformer (BooT) uses data augmentation
- Hindsight Information Matching (HIM) uses arbitrary conditioning
- ESPER clusters trajectories and estimates average returns
- Dichotomy of Control (DoC) learns a representation agnostic to stochastic transitions
- Q-learning DT (QDT) relabels return-to-go in the dataset
- StARformer uses an additional Step Transformer for local per-timestep representation
- Contrastive Decision Transformer (ConDT) uses a return-dependent transformation
- SeParated Latent Trajectory Transformer (SPLT Transformer) uses two independent Transformer-based CVAE structures
- Online Decision Transformer (ODT) uses a trajectory-level policy entropy
- Multi-Agent Decision Transformer (MADT) uses a decentralized DT
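As a concrete illustration of conditioning on return-to-go, the sketch below computes undiscounted returns-to-go for a trajectory and interleaves them with states and actions into one token sequence, following the commonly described Decision Transformer input layout. The function names, the tuple-based token format, and the toy trajectory are assumptions made for illustration; a real implementation would embed each token and feed the sequence to a causal Transformer.

```python
def returns_to_go(rewards):
    """Suffix sums: rtg[t] = rewards[t] + rewards[t + 1] + ... (undiscounted)."""
    rtg = []
    running = 0.0
    for reward in reversed(rewards):
        running += reward
        rtg.append(running)
    rtg.reverse()
    return rtg


def build_sequence(rewards, states, actions):
    """Interleave (return-to-go, state, action) tokens per timestep."""
    tokens = []
    for rtg, state, action in zip(returns_to_go(rewards), states, actions):
        tokens += [("rtg", rtg), ("state", state), ("action", action)]
    return tokens


# Toy offline trajectory with made-up values.
rewards = [1.0, 0.0, 2.0]
states = ["s0", "s1", "s2"]
actions = ["a0", "a1", "a2"]
print(returns_to_go(rewards))  # [3.0, 2.0, 2.0]
print(build_sequence(rewards, states, actions)[:3])
```

At test time the first return-to-go token is set to a desired target return, and the model is asked to predict the action that follows each state token.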
Transformers for generalist agents
- Decision Transformer has been used in various tasks with offline data
- Several works explore whether Transformers can enable a generalist agent to solve multiple tasks
- Multi-Game Decision Transformer and Switch Trajectory Transformer are variants of DT that learn on diverse datasets and achieve close-to-human performance on multiple Atari games
- Baker et al. propose a semi-supervised scheme to utilize large-scale online data without action information
- Prompt-based Decision Transformer and Gato leverage prompting techniques for DT-based methods to enable fast adaptation
- Algorithm Distillation trains a Transformer on across-episode sequences of the learning progress of single-task RL algorithms
- Uni[MASK] unifies various commonly-studied domains as one mask inference problem
- Pre-training Transformer with language data and pre-trained large-scale language models can help improve the performance and convergence speed of DT
- RT-1 leverages large-scale datasets with diverse robotics experiences and language instructions to train a Transformer
Summary and future perspectives
- Transformers can be used as a powerful module in RL
- Transformers can serve as a sequential decision-maker
- Transformers can benefit generalization across tasks and domains
- Combining RL and (self-)supervised learning
- Bridging online and offline learning via Transformers
- Transformer structure tailored for decision-making problems
- Towards more generalist agents with Transformers
- RL for Transformers