Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Transformers are widely used in NLP and CV, mostly in supervised settings.
  • Transformers are being used in reinforcement learning, but face unique design choices and challenges.
  • This paper reviews motivations and progress on using Transformers in RL, provides a taxonomy, and discusses future prospects.

Paper Content

Introduction

  • Reinforcement learning (RL) is a mathematical formalism for sequential decision-making
  • RL can be used to acquire intelligent behaviors automatically
  • Deep neural networks can be used to approximate functions with high capacity
  • Deep reinforcement learning (DRL) has achieved tremendous developments in recent years
  • Sample efficiency is an issue for DRL in real-world applications
  • Inductive bias can be introduced into the DRL framework
  • Choosing function approximator architectures is an important inductive bias
  • Supervised learning (SL) has been used to motivate architecture for RL
  • Convolutional neural networks (CNN) and recurrent neural networks (RNN) are common practices for DRL
  • Transformer architecture has revolutionized learning paradigm across SL tasks
  • Transformers have been applied to RL to extract relations between entities and capture multi-step temporal dependencies
  • Offline RL has attracted attention due to its ability to leverage offline large-scale datasets
  • Transformers can serve directly as a model for sequential decisions
  • Transformer-based architectures often suffer from high computational and memory costs

Problem scope

Reinforcement learning

  • Reinforcement Learning (RL) is a type of learning in a Markov Decision Process (MDP)
  • RL aims to learn a policy to maximize the expected discounted return
  • Topics in RL include meta RL, multi-task RL, and multi-agent RL
  • Offline RL does not allow interaction with the environment during training
  • Goal-conditioned RL extends the standard RL problem to goal-augmented setting
  • Model-based RL learns an auxiliary dynamic model of the environment

Transformers

  • Transformer is a neural network for modeling sequential data
  • Self-attention mechanism captures dependencies within long sequences
  • Inputs, queries, keys, and values are mapped to linear transformations
  • Output of self-attention layer is a weighted sum of all values
  • Multi-head attention and residual connection help Transformers learn expressive representations and model long-term interactions

Combination of transformers and rl

  • Transformers can be used as a component for RL algorithms
  • Transformers can also be used as a whole sequential decision-maker

Network architecture in rl

  • Early progress of network architecture design in RL has challenges
  • Techniques of neural networks (e.g., regularization, skip connection, batch normalization) can be applied to RL to improve performance and sample efficiency

Architectures for function approximators

  • Proposed deep dense architecture for DRL agents with skip connections for efficient learning
  • Ota et al. used DenseNet with decoupled representation learning to improve flows of information and gradients
  • Transformers architecture applied to policy optimization algorithms, but found to fail in RL tasks

Challenges

  • Transformer-based architectures have been successful in SL domains, but applying them in RL is difficult.
  • RL algorithms are sensitive to design choices and can diverge when value estimates become unbounded.
  • Transformer-based architectures have large memory footprints and high latency, making them difficult to deploy and infer.

Transformers in rl

  • Transformer has not been widely used in the RL community
  • Early attempts of TransformRL applied Transformers for state representation learning and providing memory information
  • Recent works treat the RL problem as a conditional sequence modeling problem on fixed experiences
  • Existing methods are categorized into four classes: representation learning, model learning, sequential decisionmaking, and generalist agents

Transformers for representation learning

  • Transformer encoder module used to process complex information from variable number of entities
  • Entity Transformer encodes observation in form of e i
  • Follow-up works enrich entity Transformer mechanisms
  • Transformer used to process local per-timestep sequences
  • Gated Transformer-XL used to process temporal sequence
  • Follow-up works use auxiliary (self-)supervised tasks and pre-trained Transformer architecture to improve data efficiency

Transformers for model learning

  • Transformers used as encoder for sequence embedding and backbone of environmental model in model-based algorithms
  • Transformer enables prediction conditioned on historical information
  • Success of Dreamer and subsequent algorithms demonstrate benefits of world model conditioned on history
  • Transformer-based world model used for planning and goal-conditioned planning
  • Transformer architecture is more data-efficient than Dreamer and better for tasks requiring long-term memory

Transformers for sequential decision-making

  • Transformer can be used for sequential decision-making
  • Offline RL is a growing area of research
  • Decision Transformer (DT) conditions on return-to-go
  • Trajectory Transformer (TT) uses beam search for planning
  • Behavior Transformer (BeT) uses a basic Transformer structure for behavior cloning
  • Bootstrapped Transformer (BooT) uses data augmentation
  • Hindsight Information Matching (HIM) uses arbitrary conditioning
  • ESPER clusters trajectories and estimates average returns
  • Dichotomy of Control (DoC) learns a representation agnostic to stochastic transitions
  • Q-learning DT (QDT) relabels return-to-go in the dataset
  • StARformer uses an additional Step Transformer for local per-timestep representation
  • Contrastive Decision Transformer (ConDT) uses a return-dependent transformation
  • SeParated Latent Trajectory Transformer (SPLT Transformer) uses two independent Transformer-based CVAE structures
  • Online Decision Transformer (ODT) uses a trajectory-level policy entropy
  • Multi-Agent Decision Transformer (MADT) uses a decentralized DT

Transformers for generalist agents

  • Decision Transformer has been used in various tasks with offline data
  • Several works explore whether Transformers can enable a generalist agent to solve multiple tasks
  • Multi-Game Decision Transformer and Switch Trajectory Transformer are variants of DT that learn on diverse datasets and achieve close-to-human performance on multiple Atari games
  • Baker et al. propose a semi-supervised scheme to utilize large-scale online data without action information
  • Prompt-based Decision Transformer and Gato leverage prompting techniques for DT-based methods to enable fast adaptation
  • Algorithm Distillation trains a Transformer on across-episode sequences of the learning progress of single-task RL algorithms
  • Uni [MASK] unifies various commonly-studied domains as one mask inference problem
  • Pre-training Transformer with language data and pre-trained large-scale language models can help improve the performance and convergence speed of DT
  • RT-1 leverages large-scale datasets with diverse robotics experiences and language instructions to train a Transformer

Summary and future perspectives

  • Transformers can be used as a powerful module in RL
  • Transformers can serve as a sequential decision-maker
  • Transformers can benefit generalization across tasks and domains
  • Combining RL and (self-)supervised learning
  • Bridging online and offline learning via Transformers
  • Transformer structure tailored for decision-making problems
  • Towards more generalist agents with Transformers
  • RL for Transformers