Acceleration for Deep Reinforcement Learning using Parallel and Distributed Computing: A Survey
Subjects: cs.LG, cs.AI, cs.DC
Release Date: November 8, 2024
Authors: Zhihong Liu, Xin Xu, Peng Qiao, Dongsheng Li
Affiliation: National University of Defense Technology, China

Computing parallelism types: CC = cluster computing; MP/MT = multi-processing / multi-threading. Checkmark placement below is reconstructed from each method's implementation details.

| Methods | CC | MP/MT | GPU | FPGA | TPU | Implementation Details | Major Results |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gorila (Nair2015) | ✓ |  |  |  |  | 31 machines | 10× speedup over a GPU implementation |
| Ape-X (Horgan2018) | ✓ |  | ✓ |  |  | 360 CPU cores and 1 P100 GPU | 4× median scores over Gorila |
| R2D2 (kapturowski2018recurrent) | ✓ |  | ✓ |  |  | 256 actors and 1 GPU | 4× median scores over Ape-X |
| IMPALA (Espeholt2018) | ✓ | ✓ | ✓ |  |  | 500 CPU cores and 8 P100 GPUs | 250K FPS in a multi-task setting |
| Ray RLlib (liang2017ray) | ✓ | ✓ | ✓ |  |  | 8,192 CPU cores on EC2 | completes MuJoCo training in 3.7 minutes |
| ARS (Mania2018) | ✓ |  |  |  |  | 48 CPU cores on EC2 | 15× speedup over the ES-based method (salimans2017evolution) |
| A3C (Mnih2016) |  | ✓ |  |  |  | 16 CPU cores | 2× speedup over a K40 GPU implementation |
| Reactor (Gruslys2018) | ✓ | ✓ |  |  |  | 20 CPU cores | 4× speedup over A3C |
| DBA3C (Adamski2018) | ✓ | ✓ |  |  |  | 64 nodes with 768 CPU cores | completes Atari 2600 training in 21 minutes |
| DPPO (Heess2017) |  | ✓ |  |  |  | 64 actors | >20× speedup over A3C |
| D4PG (Radients2018) |  | ✓ |  |  |  | 64 CPU cores | 4× higher return than PPO |
| SampleFactory (Petrenko2020) |  | ✓ | ✓ |  |  | 36 CPU cores and a 2080 Ti GPU | 4× speedup over SEED_RL |
| GA3C (Babaeizadeh2017) |  | ✓ | ✓ |  |  | 16 CPU cores and 1 Titan X GPU | 45× speedup over A3C |
| PAAC (clemente2017efficient) |  | ✓ | ✓ |  |  | 4 CPU cores and a GTX 980 Ti GPU | >6× speedup over Gorila |
| rlpyt (Stooke, Stooke2018) |  | ✓ | ✓ |  |  | 8 P100 GPUs and 40 CPU cores | 6× speedup using 8 GPUs relative to 1 GPU |
| Dactyl (Andrychowicz2020) | ✓ | ✓ | ✓ |  |  | 384 nodes (6,144 cores and 8 GPUs) | 5.5× speedup over an implementation with 1 GPU and 768 CPU cores |
| DD-PPO (Wijmans2019) | ✓ |  | ✓ |  |  | 256 V100 GPUs | 196× speedup over 1 V100 GPU |
| MSRL (zhu2023msrl) | ✓ |  | ✓ |  |  | 64 GPUs | 3× speedup over Ray RLlib |
| SRL (mei2023srl) | ✓ |  | ✓ |  |  | 15K CPU cores and 32 A100 GPUs | 5× speedup over OpenAI Rapid (berner2019dota) |
| SpeedyZero (mei2023speedyzero) | ✓ |  | ✓ |  |  | 192 CPU cores and 20 A100 GPUs | masters the Atari benchmark within 35 minutes using only 300K samples |
| NNQL (Su2017) |  |  |  | ✓ |  | Arria 10 AX066 FPGA | 346× speedup over a GTX 760 GPU |
| TRPO_FPGA (Shao2017) |  |  |  | ✓ |  | Intel Stratix-V FPGA | 19.29× speedup over an i7 CPU |
| DDPG_FPGA (Guo2019) |  |  |  | ✓ |  | Intel Stratix-V FPGA | 4.53× speedup over an i7-6700 CPU core |
| FA3C (Cho2019) |  | ✓ |  | ✓ |  | Xilinx VCU1525 VU9P FPGA | 27.9% better performance than a Tesla P100 GPU |
| PPO_FPGA (Meng2020) |  | ✓ |  | ✓ |  | Xilinx Alveo U200 | 27.5× speedup over a Titan Xp GPU |
| On-chip replay (Meng2022) |  | ✓ |  | ✓ |  | Xilinx Alveo U200 accelerator | 4.3× higher IPS than an RTX 3090 GPU |
| AlphaZero (Silver2017) | ✓ | ✓ |  |  | ✓ | 5,000 first-generation TPUs and 64 second-generation TPUs | defeats a world-champion program after training within 24 hours |
| AlphaStar (vinyals2019grandmaster) | ✓ | ✓ |  |  | ✓ | 3,072 TPU v3 cores and 50,400 CPU cores | rated above 99.8% of ranked human players after 44 days of training |
| OpenAI Five (berner2019dota) | ✓ | ✓ | ✓ |  |  | 1,536 GPUs and 172,800 CPU cores | defeats the Dota 2 world champions (Team OG) after 10 months of training |
| GATO (Reed2022) | ✓ | ✓ |  |  | ✓ | 256 TPU v3 cores | handles 604 distinct tasks with a single network |
| SEED_RL (espeholt2020seed) | ✓ | ✓ |  |  | ✓ | 520 CPU cores and 8 TPU v3 cores | 11× faster than IMPALA with a P100 GPU |
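
Many of the MP/MT entries above (A3C, GA3C, SampleFactory, and the actor-learner side of Ape-X, IMPALA, and SEED_RL) share one structural idea: decouple experience collection (parallel actors) from gradient computation (a central learner). The sketch below is a minimal, hypothetical illustration of that pattern using Python's standard multiprocessing module; the toy random-walk environment, queue size, and incremental update rule are illustrative assumptions, not the implementation of any system in the table.

```python
# Hypothetical sketch of the actor-learner pattern behind many MP/MT systems above.
# Actors collect experience in parallel processes; one learner consumes it.
# The toy environment and update rule are placeholders, not any surveyed system's code.
import multiprocessing as mp
import random

NUM_ACTORS = 4          # stand-in for the "N CPU cores" configurations in the table
STEPS_PER_ACTOR = 250   # toy workload per actor


def actor(actor_id, queue):
    """Roll out a random policy in a toy 1-D random walk and ship transitions."""
    random.seed(actor_id)  # decorrelate the parallel actors
    state = 0.0
    for _ in range(STEPS_PER_ACTOR):
        action = random.choice([-1.0, 1.0])
        next_state = state + action
        reward = -abs(next_state)  # toy reward: stay near the origin
        queue.put((state, action, reward, next_state))
        state = next_state
    queue.put(None)  # sentinel marking this actor as finished


def learner(queue, num_actors):
    """Consume transitions centrally; a real learner would batch SGD on a GPU/TPU."""
    finished, seen, value = 0, 0, 0.0
    while finished < num_actors:
        item = queue.get()
        if item is None:
            finished += 1
            continue
        _, _, reward, _ = item
        seen += 1
        value += 0.01 * (reward - value)  # incremental stand-in for a gradient step
    print(f"learner consumed {seen} transitions; value estimate {value:.3f}")


if __name__ == "__main__":
    q = mp.Queue(maxsize=1024)
    workers = [mp.Process(target=actor, args=(i, q)) for i in range(NUM_ACTORS)]
    trainer = mp.Process(target=learner, args=(q, NUM_ACTORS))
    trainer.start()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    trainer.join()
```

The systems in the table differ mainly in what sits between the two roles: Ape-X-style architectures insert a shared replay buffer, while IMPALA and SEED_RL replace the simple queue with batched trajectory transfer (and, in SEED_RL, centralized inference) on a GPU/TPU learner.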