10 RL: Multi-Agent Warehouse Robots
Deliverable 3 – Multi-Agent Soft Actor-Critic (MASAC) on the Robotic Warehouse (RWARE) environment
11 Intro
Deliverable 3 explores sequential and parallel run times on both CPU and GPU. On Titan, the default partition used was sb (#SBATCH --partition=sb), which offers 16 CPUs and 128 GB of memory per node across 280 nodes, but no GPUs.
While researching, I concluded that switching to a GPU partition would be worthwhile. Using
sinfo -o "%P %G %c %m %D %N"
I listed the partitions available on Titan:
| PARTITION | GRES | CPUS | MEMORY (MB) | NODES | NODELIST |
|---|---|---|---|---|---|
| bw* | (null) | 40+ | 128000 | 38 | c[001-034,315-318] |
| il | (null) | 48 | 256000 | 1 | c319 |
| sb | (null) | 16 | 128000 | 280 | c[035-314] |
| t4 | gpu:t4:4 | 64 | 256000 | 1 | g000 |
| gpu | gpu:a30:4 | 64 | 512000 | 8 | g[001-008] |
| osg | (null) | 16+ | 128000 | 318 | c[001-318] |
Because GPUs were available and greatly accelerate batch forward/backward passes for neural networks, I switched to a GPU partition for the final experiments.
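As a rough sketch of what the GPU buys, assuming the MASAC implementation is in PyTorch (the layer and batch sizes below are arbitrary placeholders, not the real network):

```python
import torch

# Use the GPU when the job lands on a GPU partition (t4 or gpu); otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical actor network, standing in for the real MASAC networks, just to show
# where the batched forward pass runs.
actor = torch.nn.Linear(64, 5).to(device)
obs_batch = torch.randn(8, 64, device=device)  # one observation per parallel env
logits = actor(obs_batch)                      # a single batched forward pass on `device`
```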
12 Parallelization
I opted for synchronous parallelism using the SyncVectorEnv vectorization wrapper. All environment steps are grouped into a single call, and the main loop proceeds only once every environment has finished its step, so observations come back batched. In theory this dramatically increases training-data throughput, filling the replay buffer much faster. A minimal sketch of the setup follows.
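This sketch assumes a Gymnasium-compatible RWARE build; the env ID rware-tiny-2ag-v2 is a placeholder for the one actually used in the runs, and multi-agent observation handling is elided:

```python
import gymnasium as gym
from gymnasium.vector import SyncVectorEnv

NUM_ENVS = 8  # matches the parallel test cases in Section 13

def make_env():
    def thunk():
        import rware  # registers RWARE env IDs with Gymnasium (assumed side effect)
        return gym.make("rware-tiny-2ag-v2")  # placeholder env ID
    return thunk

# One step() call advances every sub-environment; control returns only after all
# environments have finished, and observations come back batched along axis 0.
envs = SyncVectorEnv([make_env() for _ in range(NUM_ENVS)])
obs, info = envs.reset(seed=0)
obs, rewards, terminations, truncations, infos = envs.step(envs.action_space.sample())
```

Swapping SyncVectorEnv for AsyncVectorEnv would run each environment in its own process, trading inter-process communication overhead for true concurrency.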
13 Tests
| Test Case | Episodes | num_envs | Focus |
|---|---|---|---|
| Sequential CPU | 50 | 1 | CPU overhead (single-instance processing) |
| Parallel CPU | 50 | 8 | Parallelization gain (efficient use of CPU cores) |
| Sequential GPU | 50 | 1 | GPU acceleration (single forward-pass speed) |
| Parallel GPU | 50 | 8 | Final optimized speed (best-case scenario) |
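The four cases vary only in device and num_envs, so a single run matrix can drive them; train() below is a hypothetical entry point standing in for the actual MASAC training loop:

```python
import time

def train(episodes: int, device: str, num_envs: int) -> None:
    """Placeholder for the MASAC training loop (assumed interface)."""

# One row per test case in the table above.
RUNS = [
    ("Sequential CPU", "cpu", 1),
    ("Parallel CPU", "cpu", 8),
    ("Sequential GPU", "cuda", 1),
    ("Parallel GPU", "cuda", 8),
]

for name, device, num_envs in RUNS:
    start = time.perf_counter()
    train(episodes=50, device=device, num_envs=num_envs)
    print(f"{name}: {time.perf_counter() - start:.1f} s")
```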
13.1 Measured times (from runs)
| Test Case | Episodes | num_envs | Wall-clock time (h:mm:ss) |
|---|---|---|---|
| Sequential CPU | 50 | 1 | 1:02:39 |
| Parallel CPU | 50 | 8 | 0:09:49 |
| Sequential GPU | 50 | 1 | 0:06:43 |
| Parallel GPU | 50 | 8 | 0:01:06 |
13.2 Summary
GPU acceleration combined with parallel environments yields dramatic reductions in wall-clock time for MASAC on RWARE. For the measured runs, sequential CPU (50 episodes) took 1:02:39 (3759 s), while parallel GPU (50 episodes, 8 envs) took 0:01:06 (66 s), a ~57× speedup.
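For reference, a small helper that converts the measured times to seconds and computes the speedup of each configuration over the sequential-CPU baseline (values taken from the table above):

```python
def to_seconds(t: str) -> int:
    """Convert 'h:mm:ss' or 'mm:ss' to seconds."""
    secs = 0
    for part in t.split(":"):
        secs = secs * 60 + int(part)
    return secs

baseline = to_seconds("1:02:39")  # sequential CPU, 3759 s
for name, t in [("Parallel CPU", "09:49"),
                ("Sequential GPU", "06:43"),
                ("Parallel GPU", "01:06")]:
    s = to_seconds(t)
    print(f"{name}: {s} s, {baseline / s:.1f}x faster than sequential CPU")
# Parallel CPU: 589 s, 6.4x; Sequential GPU: 403 s, 9.3x; Parallel GPU: 66 s, 57.0x
```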
13.2.1 References
https://pmc.ncbi.nlm.nih.gov/articles/PMC11059992/
https://github.com/JohannesAck/tf2multiagentrl
https://github.com/semitable/robotic-warehouse
OpenAI. (2024). ChatGPT (Oct 23 version) [Large language model]. https://chat.openai.com/chat