Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Reviews key results in single-agent reinforcement learning
- The intended audience is readers with some familiarity with RL
Paper Content
Fundamentals: the RL paradigm
- Reinforcement learning (RL) is a field of machine learning
- RL has a history in psychology, neuroscience, economics, engineering, and mathematics
- RL is an interdisciplinary field
Agent and environment
Observability
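- In a fully observable environment, the agent receives an observation identical to the environment state
- In a partially observable environment, the observation reveals only part of the environment state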
Markov processes and Markov reward processes
- Markov process is a sequence of random states with the Markov property
- Defined in terms of a finite set of states and a state transition probability kernel
- Markov Reward Process (MRP) extends the Markov process by including a reward function and a discount factor
- Immediate expected reward in a given state is the sum, over successor states, of the state transition probability multiplied by the reward function
- Discount factor determines the present value of future rewards
- Cumulative sum of discounted rewards is a quantity RL agents often seek to maximize
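The discounted return mentioned above can be sketched in a few lines. This is a minimal illustration, assuming a finite list of rewards; the function name `discounted_return` is hypothetical and not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence.

    Assumption: rewards[0] is the first reward received after the
    starting state. The sum is accumulated backwards so that each
    reward is discounted once per step of delay.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With `gamma = 0` only the first reward counts, and with `gamma` close to 1 distant rewards count almost as much as immediate ones.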
Markov decision processes
- Single-agent RL can be formalized using Markov decision processes (MDPs).
- MDPs capture the components available to the learning agent: state of environment, actions, and rewards.
- MDPs are described using a 5-tuple: states, actions, state transition probability kernel, immediate reward function, and discount factor.
- Expected immediate reward for a given state and action is defined as r(s, a).
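One way to picture the 5-tuple is as a small container for a tabular MDP. The sketch below assumes states and actions are integer indices and that rewards are stored per transition; the class name `TabularMDP` and its field layout are illustrative choices, not the paper's notation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TabularMDP:
    # Assumed layout: states and actions are implicit integer indices.
    P: List[List[List[float]]]  # P[s][a][s2] = Pr(s2 | s, a)
    R: List[List[List[float]]]  # R[s][a][s2] = reward for transition (s, a, s2)
    gamma: float                # discount factor

    def expected_reward(self, s: int, a: int) -> float:
        """r(s, a): reward averaged over successor states."""
        return sum(p * r for p, r in zip(self.P[s][a], self.R[s][a]))
```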
Policies, values and models
- Reinforcement learning agents have a policy, value function and model
- Policy is a behaviour function that determines the probability of taking an action in a state
- Value function is a prediction of expected discounted future reward
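- Bellman expectation equation expresses the value of a state recursively, as the expected immediate reward plus the discounted value of the successor state
- Action-value function is expected discounted future reward for executing an action in a state and following the policy thereafter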
- Policy evaluation is used to learn value function for a given policy
- Optimal policy is better than or equal to all other policies
- Bellman optimality equation is used to calculate the optimal state-value and action-value functions
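Policy evaluation can be illustrated by applying the Bellman expectation equation repeatedly until the values stop changing. The following is a minimal sketch for a small tabular MDP with a known model; the nested-list layout of `P`, `r` and `policy`, and the name `policy_evaluation`, are assumptions made for this example.

```python
def policy_evaluation(P, r, policy, gamma, tol=1e-10):
    """Iteratively compute the state-value function of a fixed policy.

    Assumed layout: P[s][a][s2] = Pr(s2 | s, a), r[s][a] = expected
    immediate reward, policy[s][a] = probability of action a in state s.
    Assumes gamma < 1 so the iteration converges.
    """
    n = len(P)
    v = [0.0] * n
    while True:
        new_v = []
        for s in range(n):
            total = 0.0
            for a, pi_sa in enumerate(policy[s]):
                expected_next = sum(p * v[s2] for s2, p in enumerate(P[s][a]))
                # Bellman expectation backup for (s, a), weighted by the policy.
                total += pi_sa * (r[s][a] + gamma * expected_next)
            new_v.append(total)
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            return v
```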
Dynamic programming
- Dynamic programming (DP) is a collection of algorithms used to compute optimal policies given a perfect model of the environment.
- DP breaks complex problems into subproblems and combines the solutions.
- DP is useful for overlapping subproblems, which can be cached and reused.
- DP assumes the MDP is fully known and can be used to find the optimal value function and policy.
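Value iteration is one well-known DP algorithm of this kind. The sketch below assumes a fully known tabular MDP stored as nested lists (`P[s][a][s2]` and `r[s][a]`, both hypothetical layouts) and a discount factor below 1; it is an illustration, not necessarily the variant presented in the paper.

```python
def value_iteration(P, r, gamma, tol=1e-10):
    """Return (optimal state values, greedy policy) for a known tabular MDP."""
    n = len(P)

    def q(s, a, v):
        # One-step lookahead: immediate reward plus discounted next value.
        return r[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(P[s][a]))

    v = [0.0] * n
    while True:
        # Bellman optimality backup: take the best action in every state.
        new_v = [max(q(s, a, v) for a in range(len(P[s]))) for s in range(n)]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < tol:
            break
    # Extract a policy that is greedy with respect to the converged values.
    policy = [max(range(len(P[s])), key=lambda a: q(s, a, v)) for s in range(n)]
    return v, policy
```

Each sweep reuses the cached values of the previous sweep, which is the overlapping-subproblem structure that DP exploits.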
Prediction
- Dynamic programming can be used to solve known MDPs.
- Model-free approaches seek to solve MDPs without learning the transitions or rewards.
- Monte-Carlo (MC) methods estimate expected discounted future reward using complete episodes of experience.
- Temporal-difference (TD) learning methods learn from incomplete episodes by bootstrapping.
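The difference between the two model-free prediction methods shows up in their update targets. The sketch below assumes a tabular value function stored in a dict and a constant step size `alpha`; the function names and the `(state, reward)` episode format are assumptions for illustration.

```python
def mc_update(v, episode, gamma, alpha):
    """Every-visit Monte-Carlo update from one complete episode.

    Assumption: episode is a list of (state, reward) pairs, where reward
    is the reward received on leaving that state.
    """
    g = 0.0
    for state, reward in reversed(episode):
        g = reward + gamma * g              # actual return from this state
        v[state] += alpha * (g - v[state])


def td0_update(v, state, reward, next_state, gamma, alpha, terminal=False):
    """TD(0) update from a single transition, bootstrapping on v[next_state]."""
    target = reward + (0.0 if terminal else gamma * v[next_state])
    v[state] += alpha * (target - v[state])
```

The MC update has to wait until the episode ends, while the TD(0) update can be applied after every step because it substitutes the current estimate of the next state's value for the rest of the return.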
Control with action-value functions
Value function approximation
- Tabular representation of states and actions can be improved by using function approximators such as deep neural networks
- Mean square error can be minimized using stochastic gradient descent
- Linear function approximators can be used to update weights in proportion to the activity of their corresponding features
- Non-linear function approximators can be used, but have weaker convergence guarantees
- Policy gradient theorem enables model-free learning
- REINFORCE and SARSA actor-critic are two approaches for determining q_π
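The linear case from the list above can be made concrete with a single stochastic gradient step on the squared error. This is a minimal sketch assuming the value estimate is a dot product of weights and features and that a scalar `target` (for example an MC return or a TD target) is supplied by the caller; the names are hypothetical.

```python
def linear_value(weights, features):
    """Value estimate as a linear combination of features."""
    return sum(w * x for w, x in zip(weights, features))


def sgd_step(weights, features, target, alpha):
    """One SGD step on 0.5 * (target - w.x)**2.

    For a linear approximator the gradient with respect to each weight is
    its feature value, so each weight moves in proportion to the activity
    of its feature.
    """
    error = target - linear_value(weights, features)
    return [w + alpha * error * x for w, x in zip(weights, features)]
```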
Baselines
- REINFORCE and actor-critic based approaches can suffer from high variance in their policy gradient estimates; subtracting a baseline reduces it.
- A state-dependent baseline introduces no bias.
- The value function can be used as a state-dependent baseline.
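The claim that a state-dependent baseline leaves the gradient unbiased can be checked numerically for a single state. The sketch below assumes a softmax policy over action preferences and computes the exact expectation over actions; all function names are hypothetical, and it demonstrates only the absence of bias, not the size of the variance reduction.

```python
import math


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_softmax(prefs, action):
    """Gradient of log pi(action) w.r.t. the preferences: one_hot - pi."""
    pi = softmax(prefs)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]


def expected_policy_gradient(prefs, q_values, baseline=0.0):
    """Exact E_{a ~ pi}[grad log pi(a) * (q(a) - baseline)] for one state."""
    pi = softmax(prefs)
    grad = [0.0] * len(prefs)
    for a, p_a in enumerate(pi):
        g = grad_log_softmax(prefs, a)
        for i in range(len(prefs)):
            grad[i] += p_a * g[i] * (q_values[a] - baseline)
    return grad
```

Because the baseline does not depend on the action, its contribution averages to zero over the policy, so `expected_policy_gradient` returns the same vector for any baseline value.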
Compatible function approximation
- Introducing bias when approximating q_π can prevent convergence to a local optimum.
- When the critic’s function approximator reaches a minimum in the mean-squared error, no bias is introduced.
- For a Boltzmann policy with a linear combination of features, a compatible value function must be linear in the same features as the policy.
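The conditions above can be written out explicitly. The notation here is an assumption made for illustration: θ are the policy parameters, w the critic parameters, and φ(s, a) a feature vector; the paper may use different symbols.

```latex
% Compatibility conditions for a critic q_w and policy \pi_\theta
\nabla_w q_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s),
\qquad
w = \arg\min_{w} \; \mathbb{E}_{\pi_\theta}\!\left[ \big( q_\pi(s, a) - q_w(s, a) \big)^2 \right]

% Boltzmann policy with a linear combination of features
\pi_\theta(a \mid s) = \frac{\exp\!\big(\theta^\top \phi(s, a)\big)}{\sum_{b} \exp\!\big(\theta^\top \phi(s, b)\big)},
\qquad
\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s, a) - \sum_{b} \pi_\theta(b \mid s)\, \phi(s, b)

% Resulting compatible value function, linear in the same features
q_w(s, a) = w^\top \Big( \phi(s, a) - \sum_{b} \pi_\theta(b \mid s)\, \phi(s, b) \Big)
```

The features of the compatible critic are the policy features centred under the policy, which is why it must be linear in the same features.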
Deterministic policy gradients
- RL agents can learn to take actions directly from experiences without modelling transitions or reward functions (model-free RL)
- Model-based RL attempts to learn transitions and reward functions so the agent can predict the environment (model-based RL)
Model learning
- MDPs are defined by the 5-tuple (S, A, P, r, γ)
- Models can be used to approximate state transition and reward functions
- Dynamic programming can be used to learn optimal policy for approximate MDP
- Supervised methods can be used to learn models from fixed set of experiences
- Function approximators such as neural networks and Gaussian processes can be used
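For small discrete problems, the simplest supervised model is a table of counts. The sketch below estimates transition probabilities and expected rewards by maximum likelihood from a fixed set of experiences; the class `TabularModel` and its methods are hypothetical, and a neural network or Gaussian process would replace the tables in larger problems.

```python
from collections import defaultdict


class TabularModel:
    """Count-based estimate of transitions and rewards (an assumed design)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.reward_sums = defaultdict(float)

    def update(self, s, a, r, s2):
        """Record one observed transition (s, a, r, s2)."""
        self.counts[(s, a)][s2] += 1
        self.reward_sums[(s, a)] += r

    def _visits(self, s, a):
        n = sum(self.counts[(s, a)].values())
        if n == 0:
            raise ValueError("no experience for this state-action pair")
        return n

    def transition_probs(self, s, a):
        """Empirical Pr(s2 | s, a) as a dict keyed by successor state."""
        n = self._visits(s, a)
        return {s2: c / n for s2, c in self.counts[(s, a)].items()}

    def expected_reward(self, s, a):
        """Empirical mean reward for (s, a)."""
        return self.reward_sums[(s, a)] / self._visits(s, a)
```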
Combining model-free and model-based approaches
- Model-based RL is used for planning
- Dynamic programming is computationally infeasible in many situations
- Dyna architecture combines model-based and model-free RL
- Forward search starts rollouts from the current state
- Monte-Carlo tree search uses MC returns to estimate the action-value function
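The Dyna idea of interleaving real and simulated experience can be sketched with tabular Q-learning. This is a simplified illustration assuming a deterministic environment, a dict-based model, and a `defaultdict(float)` Q-table in which unseen (including terminal) state-action pairs have value 0; the function names and default hyperparameters are hypothetical.

```python
import random
from collections import defaultdict


def q_update(q, s, a, r, s2, actions, gamma, alpha):
    """One Q-learning backup on the table q (a defaultdict(float))."""
    best_next = max(q[(s2, b)] for b in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])


def dyna_q_step(q, model, s, a, r, s2, actions,
                gamma=0.9, alpha=0.5, planning_steps=5, rng=random):
    """Process one real transition, then plan with simulated ones."""
    # 1. Model-free learning from the real transition.
    q_update(q, s, a, r, s2, actions, gamma, alpha)
    # 2. Model learning: remember the outcome (deterministic assumption).
    model[(s, a)] = (r, s2)
    # 3. Planning: replay transitions sampled from the learned model.
    for _ in range(planning_steps):
        (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
        q_update(q, ps, pa, pr, ps2, actions, gamma, alpha)
```

Setting `planning_steps=0` reduces this to ordinary Q-learning, which makes the contribution of the model easy to isolate.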
Latent variable models
- Hidden or ‘latent’ variables are not directly observed but influence observed variables.
- In reinforcement learning, inferring latent variables can help agents make better predictions and improve control.
- Latent variable models are used in unsupervised learning.
- Unsupervised learning aims to capture high-dimensional correlations with fewer parameters, generate samples from a data distribution, and describe an underlying generative process.
Partially observable Markov decision processes
- POMDPs are a generalization of MDPs
- POMDPs are defined by a 7-tuple
- Agents in POMDPs seek to learn a policy that maximizes cumulative reward
- Agents maintain a belief state over the latent environment state
- Beliefs are updated after each action and observation using Bayes’ rule
- Maintaining exact belief states in POMDPs is often computationally intractable
- Approximate solutions or function approximators can be used to address this
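For a small discrete POMDP the belief update can be computed exactly, which helps show why it becomes expensive as the state space grows. The sketch below assumes nested-list transition and observation models (`P[s][a][s2]` and `O[s2][a][o]`, both hypothetical layouts) and applies the standard Bayes filter: predict with the transition model, weight by the observation likelihood, then normalise.

```python
def belief_update(belief, action, observation, P, O):
    """Return the new belief after taking `action` and seeing `observation`.

    Assumed layout: P[s][a][s2] = Pr(s2 | s, a) and
    O[s2][a][o] = Pr(o | s2, a). belief[s] = Pr(state = s).
    """
    n = len(belief)
    unnormalised = []
    for s2 in range(n):
        # Prediction step: probability of reaching s2 under the old belief.
        predicted = sum(P[s][action][s2] * belief[s] for s in range(n))
        # Correction step: weight by how well s2 explains the observation.
        unnormalised.append(O[s2][action][observation] * predicted)
    z = sum(unnormalised)
    if z == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / z for u in unnormalised]
```

Each update costs on the order of the number of states squared, and the belief itself is a continuous vector, which is one reason approximate methods are commonly used.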
Deep reinforcement learning
- RL agents can learn using artificial neural networks as function approximators.
- Deep networks are typically trained on large amounts of data.
- Deep networks were successfully used to learn Atari games from scratch.
- Convolutional neural networks are used to learn from pixels.
- Supervised image classification tasks use convolutional neural networks.
Experience replay
- Agent interacts with environment and receives experiences for learning.
- Experiences can be stored in a ‘replay buffer’ and sampled later for learning.
- Mnih et al. (2013) introduced the ‘deep Q-learning’ algorithm.
- At each timestep, experiences are stored in a replay buffer over many episodes.
- Q-learning updates are applied to randomly sampled experiences from the buffer.
- Prioritised sampling of experiences can be used to reduce variance and overfitting.
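A replay buffer with uniform sampling can be sketched with the standard library alone. The class below is a minimal illustration, assuming a fixed capacity with oldest-first eviction and transitions stored as tuples; the name `ReplayBuffer` and its interface are assumptions, and prioritised sampling is not implemented here.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest item once capacity is reached.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Sample a batch without replacement, uniformly at random."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling at random from many episodes breaks up the correlation between consecutive transitions that would otherwise appear in each update batch.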
Target networks
- Temporal difference learning with deep function approximators can lead to instability of learning.
- To address this, deep RL algorithms often use a separate target network that remains stable.
- Parameters of the standard network can be copied to the target network at fixed intervals.
- Alternatively, the target network can track the standard network more slowly using Polyak averaging.
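The two ways of refreshing a target network can be sketched on plain lists of parameters. This is an illustration only: real networks hold tensors rather than Python floats, and the function names and the mixing coefficient `tau` are assumptions.

```python
def hard_update(target_params, online_params):
    """Copy the online parameters into the target in place (periodic copy)."""
    target_params[:] = online_params


def polyak_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau towards the online value."""
    for i, (t, o) in enumerate(zip(target_params, online_params)):
        target_params[i] = (1.0 - tau) * t + tau * o
```

A hard update would be called every fixed number of steps, whereas a Polyak update with a small `tau` would be called after every learning step so that the target changes gradually.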