Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Significant progress has been made in developing reinforcement learning training systems.
- Parallel environment execution is often the slowest part of the system but receives little attention.
- EnvPool improves the RL environment simulation speed across different hardware setups.
- EnvPool is compatible with existing RL training libraries.
- EnvPool allows researchers to iterate their ideas quickly.
Paper Content
Introduction
- Deep Reinforcement Learning (RL) has made remarkable progress in the past years
- Leveraging the computation power of large-scale distributed systems and advanced AI chips like TPUs has improved training throughput
- DQN took 8 days and 200 million frames to train an agent to play a single Atari game
- IMPALA and Seed RL have shortened this process
- Parallel environment execution is a common bottleneck in the RL training system
- Inference and learning with the agent policy network can leverage experience and performance optimization techniques
- Interaction between agents and environments is the unique technical difficulty in RL systems
- EnvPool is a highly parallel RL environment execution engine
- Supports OpenAI gym and DeepMind dm_env APIs
- Synchronous and asynchronous execution modes
- EnvPool covers many standard RL environments
- Targeted users are RL researchers and practitioners, and developers familiar with RL environment implementation
Related works
- Most implementations of RL systems use Python-level parallelization
- Python-level parallelization is computationally inefficient
- Ray and RLlib have to trade-off communication costs with other components
- Sample Factory focuses on optimizing the entire RL system for a single-machine setup
- EnvPool has both properties of high throughput and great compatibility with existing APIs and RL algorithms
- Recent works use accelerators like GPUs and TPUs for the environment engine, but they cannot handle general environments
- PodRacer architecture implements the C++ batched environment interface, but only supports synchronous execution mode
- EnvPool uses asynchronous execution mode as a default and is not tied to any specific computing architectures
Methods
- EnvPool is a computer science tool for developers and RL researchers/practitioners
- It contains three components optimized in C++: ActionBufferQueue, ThreadPool and StateBufferQueue
- It uses pybind11 to expose user interface to Python
- It has an overall system architecture
- It has optimizations in individual components
- It has instructions on how to add new RL environments
Overview
- EnvPool follows an asynchronous event-driven pattern.
- Actions are sent to the environment via the send function.
- Threads in the ThreadPool take action from the ActionBufferQueue and perform the corresponding environment execution.
- The execution result is added to the StateBufferQueue.
- The RL algorithm receives a batch of states by taking from the StateBufferQueue via the recv function.
Synchronous vs. asynchronous
- Vectorized environments are executed synchronously in worker threads.
- Number of environments is denoted as N.
- RL agent receives N observation arrays and predicts N actions.
- Synchronous step is determined by slowest environment execution time.
- EnvPool introduces batch_size (M) to receive environment outputs.
- M cannot be greater than N.
- EnvPool waits for outputs of first M environment steps.
- Asynchronous mode has advantage when environment execution time has large variance.
Threadpool
- ThreadPool uses a fixed number of threads to execute tasks.
- Number of threads is usually limited to the number of CPU cores to reduce context switch overhead.
- Threads can be pinned to pre-determined CPU cores to further speed up execution.
- Number of environments should be 2-3 times greater than the number of threads to keep threads fully loaded.
Adding new rl environments
- EnvPool is a platform for adding reinforcement learning environments
- Developers need to implement the RL environment in a C++ header file and write a Bazel BUILD file
- Generate a dynamically linked binary and register the environment in Python
- Adding new RL environments does not require a deep understanding of the core infrastructure
Experiments
- Evaluated simulation performance of reinforcement learning environment execution engines
- Tested EnvPool with CleanRL, rl_games, and DeepMind’s Acme framework
- Demonstrated value of EnvPool for improving efficiency and scalability of RL research
- EnvPool allows researchers to train agents more quickly and effectively
Pure environment simulation
- Evaluated EnvPool against established baselines on RL environment execution component
- Three hardware setups used for benchmark: laptop, workstation, NVIDIA DGX-A100
- EnvPool outperforms all baselines with significant margins
- Subprocess implementation has poor scalability
- Even with single environment in EnvPool, ∼2× speedup
- Synchronous modes have significant performance disadvantages against asynchronous systems
- Profiled CleanRL’s PPO in Atari games with three parallelization paradigms
- EnvPool ameliorates bottleneck, end-to-end training time decreased
- Easy integration with popular deep RL libraries
- High throughput training
- Can be extended to distributed use case with remote execution
Conclusion
- Introduced EnvPool, a highly parallel reinforcement learning environment execution engine
- Leveraged techniques of a general asynchronous execution model, implemented with a C++ thread pool
- Designed BufferQueue tailored for the RL environments
- Extensive study with various setups to demonstrate scale-up ability
- Significant improvements in existing RL training libraries’ speed when integrated with EnvPool
- Trained Atari Pong and MuJoCo Ant in five minutes
- Limitation: EnvPool cannot speed up RL environments originally written in Python
- CPU specifications for experiments: 12 Intel CPU cores, 32 AMD CPU cores, 256 CPU cores with 8 NUMA nodes
- NVIDIA DGX-A100 with AMD EPYC 7742 64-Core Processor
- ActionBufferQueue and StateBufferQueue tailored for RL environments
- Memory copies saved thanks to StateBufferQueue
- Integrate EnvPool with Acme for experiments of PPO in MuJoCo tasks
- Google Cloud TPUv3-8 machine with Intel Xeon CPU
- Comparison of EnvPool and DummyVecEnv using Acme’s PPO implementation on MuJoCo HalfCheetah-v3 environment
- Tuning num_envs can reduce training time while maintaining sample efficiency
- Jitting environment simulation code with neural networks supported
- Numeric results for benchmarking
- Single environment simulation speed on different hardware setups