Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Significant progress has been made in developing reinforcement learning training systems.
Parallel environment execution is often the slowest part of the system but receives little attention.
EnvPool improves the RL environment simulation speed across different hardware setups.
EnvPool is compatible with existing RL training libraries.
EnvPool allows researchers to iterate their ideas quickly.

Paper Content

Introduction

Deep Reinforcement Learning (RL) has made remarkable progress in the past years
Leveraging the computation power of large-scale distributed systems and advanced AI chips like TPUs has improved training throughput
DQN took 8 days and 200 million frames to train an agent to play a single Atari game
IMPALA and Seed RL have shortened this process
Parallel environment execution is a common bottleneck in the RL training system
Inference and learning with the agent policy network can leverage experience and performance optimization techniques
Interaction between agents and environments is the unique technical difficulty in RL systems
EnvPool is a highly parallel RL environment execution engine
Supports OpenAI gym and DeepMind dm_env APIs
Synchronous and asynchronous execution modes
EnvPool covers many standard RL environments
Targeted users are RL researchers and practitioners, and developers familiar with RL environment implementation

Most implementations of RL systems use Python-level parallelization
Python-level parallelization is computationally inefficient
Ray and RLlib have to trade-off communication costs with other components
Sample Factory focuses on optimizing the entire RL system for a single-machine setup
EnvPool has both properties of high throughput and great compatibility with existing APIs and RL algorithms
Recent works use accelerators like GPUs and TPUs for the environment engine, but they cannot handle general environments
PodRacer architecture implements the C++ batched environment interface, but only supports synchronous execution mode
EnvPool uses asynchronous execution mode as a default and is not tied to any specific computing architectures

Methods

EnvPool is a computer science tool for developers and RL researchers/practitioners
It contains three components optimized in C++: ActionBufferQueue, ThreadPool and StateBufferQueue
It uses pybind11 to expose user interface to Python
It has an overall system architecture
It has optimizations in individual components
It has instructions on how to add new RL environments

Overview

EnvPool follows an asynchronous event-driven pattern.
Actions are sent to the environment via the send function.
Threads in the ThreadPool take action from the ActionBufferQueue and perform the corresponding environment execution.
The execution result is added to the StateBufferQueue.
The RL algorithm receives a batch of states by taking from the StateBufferQueue via the recv function.

Synchronous vs. asynchronous

Vectorized environments are executed synchronously in worker threads.
Number of environments is denoted as N.
RL agent receives N observation arrays and predicts N actions.
Synchronous step is determined by slowest environment execution time.
EnvPool introduces batch_size (M) to receive environment outputs.
M cannot be greater than N.
EnvPool waits for outputs of first M environment steps.
Asynchronous mode has advantage when environment execution time has large variance.

Threadpool

ThreadPool uses a fixed number of threads to execute tasks.
Number of threads is usually limited to the number of CPU cores to reduce context switch overhead.
Threads can be pinned to pre-determined CPU cores to further speed up execution.
Number of environments should be 2-3 times greater than the number of threads to keep threads fully loaded.

Adding new rl environments

EnvPool is a platform for adding reinforcement learning environments
Developers need to implement the RL environment in a C++ header file and write a Bazel BUILD file
Generate a dynamically linked binary and register the environment in Python
Adding new RL environments does not require a deep understanding of the core infrastructure

Experiments

Evaluated simulation performance of reinforcement learning environment execution engines
Tested EnvPool with CleanRL, rl_games, and DeepMind’s Acme framework
Demonstrated value of EnvPool for improving efficiency and scalability of RL research
EnvPool allows researchers to train agents more quickly and effectively

Pure environment simulation

Evaluated EnvPool against established baselines on RL environment execution component
Three hardware setups used for benchmark: laptop, workstation, NVIDIA DGX-A100
EnvPool outperforms all baselines with significant margins
Subprocess implementation has poor scalability
Even with single environment in EnvPool, ∼2× speedup
Synchronous modes have significant performance disadvantages against asynchronous systems
Profiled CleanRL’s PPO in Atari games with three parallelization paradigms
EnvPool ameliorates bottleneck, end-to-end training time decreased
Easy integration with popular deep RL libraries
High throughput training
Can be extended to distributed use case with remote execution

Conclusion

Introduced EnvPool, a highly parallel reinforcement learning environment execution engine
Leveraged techniques of a general asynchronous execution model, implemented with a C++ thread pool
Designed BufferQueue tailored for the RL environments
Extensive study with various setups to demonstrate scale-up ability
Significant improvements in existing RL training libraries’ speed when integrated with EnvPool
Trained Atari Pong and MuJoCo Ant in five minutes
Limitation: EnvPool cannot speed up RL environments originally written in Python
CPU specifications for experiments: 12 Intel CPU cores, 32 AMD CPU cores, 256 CPU cores with 8 NUMA nodes
NVIDIA DGX-A100 with AMD EPYC 7742 64-Core Processor
ActionBufferQueue and StateBufferQueue tailored for RL environments
Memory copies saved thanks to StateBufferQueue
Integrate EnvPool with Acme for experiments of PPO in MuJoCo tasks
Google Cloud TPUv3-8 machine with Intel Xeon CPU
Comparison of EnvPool and DummyVecEnv using Acme’s PPO implementation on MuJoCo HalfCheetah-v3 environment
Tuning num_envs can reduce training time while maintaining sample efficiency
Jitting environment simulation code with neural networks supported
Numeric results for benchmarking
Single environment simulation speed on different hardware setups

Link to paper#

Abstract#

Paper Content#

Introduction#

Related works#

Methods#

Overview#

Synchronous vs. asynchronous#

Threadpool#

Adding new rl environments#

Experiments#

Pure environment simulation#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related works

Methods

Overview

Synchronous vs. asynchronous

Threadpool

Adding new rl environments

Experiments

Pure environment simulation

Conclusion