Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Transformers are widely used in NLP and CV, mostly in supervised settings.
Transformers are being used in reinforcement learning, but face unique design choices and challenges.
This paper reviews motivations and progress on using Transformers in RL, provides a taxonomy, and discusses future prospects.

Paper Content

Introduction

Reinforcement learning (RL) is a mathematical formalism for sequential decision-making
RL can be used to acquire intelligent behaviors automatically
Deep neural networks can be used to approximate functions with high capacity
Deep reinforcement learning (DRL) has advanced rapidly in recent years
Sample efficiency is an issue for DRL in real-world applications
Inductive bias can be introduced into the DRL framework
Choosing function approximator architectures is an important inductive bias
Architectures developed for supervised learning (SL) have motivated architecture choices in RL
Convolutional neural networks (CNN) and recurrent neural networks (RNN) are common choices in DRL
The Transformer architecture has revolutionized the learning paradigm across SL tasks
Transformers have been applied to RL to extract relations between entities and capture multi-step temporal dependencies
Offline RL has attracted attention due to its ability to leverage large-scale offline datasets
Transformers can serve directly as a model for sequential decisions
Transformer-based architectures often suffer from high computational and memory costs

Problem scope

Reinforcement learning

Reinforcement learning (RL) is commonly formalized as learning in a Markov Decision Process (MDP)
RL aims to learn a policy to maximize the expected discounted return
Topics in RL include meta RL, multi-task RL, and multi-agent RL
Offline RL does not allow interaction with the environment during training
Goal-conditioned RL extends the standard RL problem to a goal-augmented setting
Model-based RL learns an auxiliary dynamics model of the environment
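
As a minimal sketch of the objective mentioned above, the discounted return of a single trajectory is the sum of its rewards weighted by powers of a discount factor. The function name `discounted_return` and the example values are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum over t of gamma**t * rewards[t] for one trajectory."""
    total = 0.0
    # Accumulate backwards so each step applies a single multiplication by gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical three-step trajectory with gamma = 0.5:
# 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

A policy is then judged by the expectation of this quantity over the trajectories it generates.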

Transformers

The Transformer is a neural network architecture for modeling sequential data
The self-attention mechanism captures dependencies within long sequences
Inputs are mapped to queries, keys, and values by linear transformations
The output of a self-attention layer is a weighted sum of all values
Multi-head attention and residual connection help Transformers learn expressive representations and model long-term interactions
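
The self-attention steps listed above can be sketched for a single head as follows. This is a simplified illustration in plain Python, assuming standard scaled dot-product attention; the helper names (`matmul`, `softmax`, `self_attention`) and the weight matrices are hypothetical, and multi-head attention, residual connections, and masking are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: n token vectors (n x d_model).
    w_q, w_k, w_v: projection matrices mapping inputs to queries, keys, values.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Score each query against every key, scaled by sqrt(d_k).
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted sum of all value vectors.
    return matmul(weights, v), weights


# Hypothetical input: three 2-dimensional tokens, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attn = self_attention(tokens, identity, identity, identity)
```

Each row of `attn` sums to one, so every output vector is a convex combination of the value vectors.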

Combination of Transformers and RL

Transformers can be used as a component for RL algorithms
Transformers can also be used as a whole sequential decision-maker

Network architecture in RL

Early work on network architecture design in RL made progress but also faced challenges
Neural network techniques (e.g., regularization, skip connections, batch normalization) can be applied to RL to improve performance and sample efficiency

Architectures for function approximators

A deep dense architecture with skip connections has been proposed for DRL agents to enable efficient learning
Ota et al. used DenseNet with decoupled representation learning to improve the flow of information and gradients
Transformer architectures have been applied to policy optimization algorithms, but were found to perform poorly on RL tasks

Challenges

Transformer-based architectures have been successful in SL domains, but applying them in RL is difficult.
RL algorithms are sensitive to design choices and can diverge when value estimates become unbounded.
Transformer-based architectures have large memory footprints and high latency, which makes deployment and inference difficult.

Transformers in RL

Transformers have not been widely used in the RL community
Early TransformRL attempts applied Transformers to state representation learning and to providing memory information
Recent works treat the RL problem as a conditional sequence modeling problem on fixed experiences
Existing methods are categorized into four classes: representation learning, model learning, sequential decision-making, and generalist agents

Transformers for representation learning

A Transformer encoder module is used to process complex information from a variable number of entities
The entity Transformer encodes the observation as a set of per-entity inputs e_i
Follow-up works enrich the entity Transformer mechanism
Transformers are used to process local per-timestep sequences
Gated Transformer-XL is used to process temporal sequences
Follow-up works use auxiliary (self-)supervised tasks and pre-trained Transformer architectures to improve data efficiency

Transformers for model learning

Transformers are used as the encoder for sequence embedding and as the backbone of the environment model in model-based algorithms
A Transformer enables predictions conditioned on historical information
The success of Dreamer and subsequent algorithms demonstrates the benefits of a world model conditioned on history
Transformer-based world models are used for planning and goal-conditioned planning
Transformer-based world models are reported to be more data-efficient than Dreamer and better suited to tasks requiring long-term memory

Transformers for sequential decision-making

Transformers can be used directly for sequential decision-making
Offline RL is a growing area of research
Decision Transformer (DT) conditions on return-to-go
Trajectory Transformer (TT) uses beam search for planning
Behavior Transformer (BeT) uses a basic Transformer structure for behavior cloning
Bootstrapped Transformer (BooT) uses data augmentation
Hindsight Information Matching (HIM) uses arbitrary conditioning
ESPER clusters trajectories and estimates average returns
Dichotomy of Control (DoC) learns a representation agnostic to stochastic transitions
Q-learning DT (QDT) relabels return-to-go in the dataset
StARformer uses an additional Step Transformer for local per-timestep representation
Contrastive Decision Transformer (ConDT) uses a return-dependent transformation
SeParated Latent Trajectory Transformer (SPLT Transformer) uses two independent Transformer-based CVAE structures
Online Decision Transformer (ODT) uses a trajectory-level policy entropy
Multi-Agent Decision Transformer (MADT) uses a decentralized DT
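
To make the return-to-go conditioning used by Decision Transformer more concrete, the sketch below computes the return-to-go for each timestep and interleaves it with states and actions into one token sequence. It assumes an undiscounted return-to-go and a (return-to-go, state, action) ordering, as commonly described for DT; the function names and the string placeholders for states and actions are illustrative only, and the Transformer itself is not shown.

```python
def returns_to_go(rewards):
    """Undiscounted return-to-go: R_t = r_t + r_{t+1} + ... + r_T."""
    rtg = []
    running = 0.0
    # Walk backwards so each entry adds one reward to the running suffix sum.
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def interleave_tokens(rtg, states, actions):
    """Order tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...)."""
    sequence = []
    for g, s, a in zip(rtg, states, actions):
        sequence.extend([("rtg", g), ("state", s), ("action", a)])
    return sequence


# Hypothetical three-step trajectory.
rewards = [1.0, 0.0, 2.0]
rtg = returns_to_go(rewards)  # [3.0, 2.0, 2.0]
tokens = interleave_tokens(rtg, ["s0", "s1", "s2"], ["a0", "a1", "a2"])
```

At evaluation time, the first return-to-go token would be set to a desired target return rather than computed from data, which is what makes the model's action predictions conditional on the requested outcome.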

Transformers for generalist agents

Decision Transformer has been used in various tasks with offline data
Several works explore whether Transformers can enable a generalist agent to solve multiple tasks
Multi-Game Decision Transformer and Switch Trajectory Transformer are variants of DT that learn on diverse datasets and achieve close-to-human performance on multiple Atari games
Baker et al. propose a semi-supervised scheme to utilize large-scale online data that lacks action information
Prompt-based Decision Transformer and Gato leverage prompting techniques for DT-based methods to enable fast adaptation
Algorithm Distillation trains a Transformer on across-episode sequences of the learning progress of single-task RL algorithms
Uni[MASK] unifies various commonly studied domains as one mask inference problem
Pre-training Transformers on language data, or starting from pre-trained large-scale language models, can improve the performance and convergence speed of DT
RT-1 leverages large-scale datasets with diverse robotics experiences and language instructions to train a Transformer

Summary and future perspectives

Transformers can be used as a powerful module in RL
Transformers can serve as a sequential decision-maker
Transformers can benefit generalization across tasks and domains
Future directions include:
Combining RL and (self-)supervised learning
Bridging online and offline learning via Transformers
Designing Transformer structures tailored for decision-making problems
Moving towards more generalist agents with Transformers
Using RL to improve Transformers
