Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Reviews key results in single-agent reinforcement learning
  • Intended audience are those with some familiarity with RL

Paper Content

Fundamentals 2.1 the rl paradigm

  • Reinforcement learning (RL) is a field of machine learning
  • RL has a history in psychology, neuroscience, economics, engineering, and mathematics
  • RL is an interdisciplinary field

Agent and environment

Observability

  • Agent receives observation identical to environment state
  • Environment is partially observable

Markov processes and markov reward processes

  • Markov process is a sequence of random states with the Markov property
  • Defined in terms of a finite set of states and a state transition probability kernel
  • Markov Reward Process (MRP) extends the Markov process by including a reward function and a discount factor
  • Immediate expected reward in a given state is defined as a product of the state transition probability and the reward function
  • Discount factor determines the present value of future rewards
  • Cumulative sum of discounted rewards is a quantity RL agents often seek to maximize

Markov decision processes

  • Single-agent RL can be formalized using Markov decision processes (MDPs).
  • MDPs capture the components available to the learning agent: state of environment, actions, and rewards.
  • MDPs are described using a 5-tuple: states, actions, state transition probability kernel, immediate reward function, and discount factor.
  • Expected immediate reward for a given state and action is defined as r(s, a).

Policies, values and models

  • Reinforcement learning agents have a policy, value function and model
  • Policy is a behaviour function that determines the probability of taking an action in a state
  • Value function is a prediction of expected discounted future reward
  • Bellman expectation equation is used to calculate value and reward
  • Action-value function is expected discounted future reward for executing an action
  • Policy evaluation is used to learn value function for a given policy
  • Optimal policy is better than or equal to all other policies
  • Bellman optimality equation is used to calculate optimal value and state-value functions

Dynamic programming

  • Dynamic programming (DP) is a collection of algorithms used to compute optimal policies given a perfect model of the environment.
  • DP breaks complex problems into subproblems and combines the solutions.
  • DP is useful for overlapping subproblems, which can be cached and reused.
  • DP assumes the MDP is fully known and can be used to find the optimal value function and policy.

Prediction

  • Dynamic programming can be used to solve known MDPs.
  • Model-free approach seeks to solve MDPs without learning transitions or rewards.
  • Monte-Carlo (MC) methods estimate expected discounted future reward using complete episodes of experience.
  • Temporal-difference (TD) learning methods learn from incomplete episodes by bootstrapping.

Control with action-value functions

Value function approximation

  • Tabular representation of states and actions can be improved by using function approximators such as deep neural networks
  • Mean square error can be minimized using stochastic gradient descent
  • Linear function approximators can be used to update weights in proportion to the activity of their corresponding features
  • Non-linear function approximators can be used, but have weaker convergence guarantees
  • Policy gradient theorem enables model-free learning
  • REINFORCE and SARSA actor-critic are two approaches for determining q π

Baselines

  • REINFORCE and actor-critic based approaches can reduce variance in policy gradients.
  • A state-dependent baseline can be used to introduce no bias.
  • The value function can be used as a state-dependent baseline.

Compatible function approximation

  • Introducing bias when approximating q π can prevent convergence to a local optimum.
  • When the critic’s function approximator reaches a minimum in the mean-squared error, no bias is introduced.
  • For a Boltzmann policy with a linear combination of features, a compatible value function must be linear in the same features as the policy.

Deterministic policy gradients

  • RL agents can learn to take actions directly from experiences without modelling transitions or reward functions (model-free RL)
  • Model-based RL attempts to learn transitions and reward functions so the agent can predict the environment (model-based RL)

Model learning

  • MDPs are defined by 5-tuple S, A, P, r, γ
  • Models can be used to approximate state transition and reward functions
  • Dynamic programming can be used to learn optimal policy for approximate MDP
  • Supervised methods can be used to learn models from fixed set of experiences
  • Function approximators such as neural networks and Gaussian processes can be used

Combining model-free and model-based approaches

  • Model-based RL is used for planning
  • Dynamic programming is computationally infeasible in many situations
  • Dyna architecture combines model-based and model-free RL
  • Forward search starts rollouts from the current state
  • Monte-Carlo Tree search uses MC return to estimate action-value function

Latent variable models

  • Hidden or ’latent’ variables are not directly observed but influence observed variables.
  • In reinforcement learning, inferring latent variables can help agents make better predictions and control.
  • Latent variable models are used in unsupervised learning.
  • Unsupervised learning aims to capture high-dimensional correlations with fewer parameters, generate samples from a data distribution, and describe an underlying generative process.

Partially observable markov decision processes

  • POMDPs are a generalization of MDPs
  • POMDPs are defined by a 7-tuple
  • Agents in POMDPs seek to learn a policy that maximizes cumulative reward
  • Agents maintain a belief state over the latent environment state
  • Beliefs are updated according to a formula
  • Maintaining belief states in POMDPs is computationally intractable
  • Approximate solutions or function approximators can be used to address this

Deep reinforcement learning

  • Reinforcement learning can be learned using artificial neural networks.
  • Deep networks are typically trained on large amounts of data.
  • Deep networks were successfully used to learn Atari games from scratch.
  • Convolutional neural networks are used to learn from pixels.
  • Supervised image classification tasks use convolutional neural networks.

Experience replay

  • Agent interacts with environment and receives experiences for learning.
  • Experiences can be stored in a ‘replay buffer’ and sampled later for learning.
  • Mnih et al. (2013) introduced ‘deep Q-learning’ algorithm.
  • At each timestep, experiences are stored in a replay buffer over many episodes.
  • Q-learning updates are applied to randomly sampled experiences from the buffer.
  • Prioritised sampling of experiences can be used to reduce variance and overfitting.

Target networks

  • Temporal difference learning with deep function approximators can lead to instability of learning.
  • To address this, deep RL algorithms often use a separate target network that remains stable.
  • Parameters of the standard network can be copied to the target network at fixed intervals.
  • Alternatively, transition can be made more slowly using Polyak averaging.

Experience replay

  • Agent interacts with environment and receives experiences for learning.
  • Experiences can be stored in a ‘replay buffer’ and sampled later for learning.
  • Mnih et al. (2013) introduced ‘deep Q-learning’ algorithm.
  • At each timestep, experiences are stored in a replay buffer over many episodes.
  • Q-learning updates are applied to randomly sampled experiences from the buffer.
  • Prioritised sampling of experiences can be used to reduce variance and overfitting.

Target networks

  • Temporal difference learning with deep function approximators can lead to instability of learning.
  • To address this, deep RL algorithms often use a separate target network that remains stable.
  • Parameters of the standard network can be copied to the target network at fixed intervals.
  • Alternatively, transition can be made more slowly using Polyak averaging.