18  RL: Multi-Agent Warehouse Robots

Deliverable 5 – QMIX Final plot

Author

Lian Thang

Published

November 20, 2025

This report presents the performance evaluation of the QMIX multi-agent reinforcement learning model trained on the Unity-based warehouse environment. The objective of this analysis is to assess learning stability, convergence behavior, and overall policy quality through two core metrics: return_mean, the mean episode return collected during training, and test_return_mean, the mean return of the periodic evaluation episodes.

Both metrics were extracted from the Sacred experiment logs and visualized for clarity.
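
As a minimal sketch of how such curves can be pulled out of the logs (assuming a Sacred FileStorageObserver run directory containing an info.json, and assuming PyMARL-style keys return_mean, return_mean_T, test_return_mean, and test_return_mean_T; the run path is a placeholder):

```python
import json
from pathlib import Path

import matplotlib.pyplot as plt

# Placeholder path to one Sacred run directory.
RUN_DIR = Path("results/sacred/1")

# PyMARL-style loggers append each metric value to info.json together with a
# parallel "<key>_T" list holding the environment timestep of each entry.
info = json.loads((RUN_DIR / "info.json").read_text())

plt.plot(info["return_mean_T"], info["return_mean"], label="return_mean (training)")
plt.plot(info["test_return_mean_T"], info["test_return_mean"],
         marker="o", label="test_return_mean (evaluation)")
plt.xlabel("Environment timesteps")
plt.ylabel("Mean episode return")
plt.title("QMIX training vs. test return")
plt.legend()
plt.tight_layout()
plt.savefig("qmix_returns.png")
```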

19 QMIX Training & Test Return

19.1 Initial Configuration

| Parameter | Value | Purpose / Why Important |
|---|---|---|
| batch_size | 16 | Number of samples per learner update; small batch → faster but noisier learning. |
| buffer_size | 2000 | Replay buffer capacity storing past experiences. |
| epsilon_start | 1.0 | Initial exploration rate (full exploration). |
| epsilon_finish | 0.1 | Minimum exploration value when training stabilizes. |
| epsilon_anneal_time | 500000 | Timesteps over which epsilon decays from 1.0 → 0.1 (see the decay sketch after this table). |
| episode_limit | 500 | Maximum steps per episode. |
| env_args.no_graphics | false | Runs Unity with graphics enabled. |
| env_args.time_scale | 20.0 | Unity simulation runs 20× faster than real time. |
| t_max | 1000000 | Total number of training timesteps. |
| gamma | 0.99 | Standard discount factor prioritizing long-term rewards. |
| lr (learning rate) | 0.0005 | Learning rate for the optimizer, controlling update speed. |
| mixer | qmix | QMIX mixing network for cooperative value decomposition. |
| agent | rnn | Recurrent agent network to handle partial observability. |
| save_model | true | Model checkpoints are saved during training. |
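
The epsilon schedule in the table is a plain linear decay from epsilon_start to epsilon_finish over epsilon_anneal_time steps. A minimal sketch of that schedule (function and variable names here are illustrative, not the trainer's actual code):

```python
def epsilon_at(t, start=1.0, finish=0.1, anneal_time=500_000):
    """Linearly anneal epsilon from `start` to `finish` over `anneal_time` timesteps."""
    if t >= anneal_time:
        return finish
    return start + (finish - start) * (t / anneal_time)

# With the initial settings, exploration is still at 0.55 a quarter of the way
# into the 1,000,000-step run and only bottoms out at 0.1 by timestep 500,000.
print(epsilon_at(250_000))  # 0.55
print(epsilon_at(500_000))  # 0.1
```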

19.2 Optimized Configuration

| Parameter | Value | Purpose / Why Important |
|---|---|---|
| batch_size | 16 | Number of samples per learner update; small batch → faster but noisier learning. |
| buffer_size | 5000 | Replay buffer capacity storing past experiences. |
| epsilon_start | 1.0 | Initial exploration rate (full exploration). |
| epsilon_finish | 0.1 | Minimum exploration value when training stabilizes. |
| epsilon_anneal_time | 200000 | Timesteps over which epsilon decays from 1.0 → 0.1. |
| episode_limit | 300 | Maximum steps per episode. |
| env_args.no_graphics | true | Runs Unity headless → faster performance. |
| env_args.time_scale | 50.0 | Unity simulation runs 50× faster than real time (see the launch sketch after this table). |
| t_max | 500000 | Total number of training timesteps. |
| gamma | 0.99 | Standard discount factor prioritizing long-term rewards. |
| lr (learning rate) | 0.001 | Learning rate for the optimizer, controlling update speed. |
| mixer | qmix | QMIX mixing network for cooperative value decomposition. |
| agent | rnn | Recurrent agent network to handle partial observability. |
| save_model | true | Model checkpoints are saved during training. |
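
As a rough sketch of how the two environment flags above typically take effect through the Unity ML-Agents Python API (the build path is a placeholder, and the wrapper actually used for training may pass these settings differently):

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import (
    EngineConfigurationChannel,
)

# no_graphics=True launches the warehouse build headless;
# time_scale runs the simulation faster than real time.
engine_channel = EngineConfigurationChannel()
env = UnityEnvironment(
    file_name="builds/warehouse",  # placeholder path to the Unity build
    no_graphics=True,
    side_channels=[engine_channel],
)
engine_channel.set_configuration_parameters(time_scale=50.0)

env.reset()
# ... training loop interacts with the environment here ...
env.close()
```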

19.3 What changed

| Parameter | Old Value | New Value | Meaning of Change |
|---|---|---|---|
| buffer_size | 2000 | 5000 | Larger replay buffer → more diverse samples, more stable learning. |
| epsilon_anneal_time | 500000 | 200000 | Exploration decays faster → policy becomes greedy earlier. |
| episode_limit | 500 | 300 | Shorter episodes → agents must complete tasks faster. |
| env_args.no_graphics | false | true | Headless mode → significantly faster Unity simulation. |
| env_args.time_scale | 20.0 | 50.0 | Simulation now 2.5× faster. |
| t_max | 1000000 | 500000 | Training time cut in half. |
| lr (learning rate) | 0.0005 | 0.001 | Higher learning rate → faster but potentially less stable updates. |
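
The table above can also be produced mechanically. A small sketch, with the two configurations written out by hand as plain dictionaries from the tables in 19.1 and 19.2, that prints only the parameters whose values changed:

```python
initial = {
    "batch_size": 16, "buffer_size": 2000, "epsilon_start": 1.0,
    "epsilon_finish": 0.1, "epsilon_anneal_time": 500_000,
    "episode_limit": 500, "env_args.no_graphics": False,
    "env_args.time_scale": 20.0, "t_max": 1_000_000, "gamma": 0.99,
    "lr": 0.0005, "mixer": "qmix", "agent": "rnn", "save_model": True,
}
optimized = {
    **initial,
    "buffer_size": 5000, "epsilon_anneal_time": 200_000, "episode_limit": 300,
    "env_args.no_graphics": True, "env_args.time_scale": 50.0,
    "t_max": 500_000, "lr": 0.001,
}

# Print each parameter that differs between the two runs.
for key, old in initial.items():
    new = optimized[key]
    if old != new:
        print(f"{key}: {old} -> {new}")
```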

The plot below compares the learning curve during training with the returns from periodic evaluation episodes.
The training return increases steadily and stabilizes with occasional variance, while evaluation returns appear toward the final stage, showing a large improvement spike. The plot also shows the previous week's metrics and how they compare to this week's run with the changed parameters and increased training.

20 Summary

The QMIX agent demonstrates strong learning progression throughout training, with the following notable observations:

  • Stable upward trend in return_mean, confirming effective joint action-value decomposition under QMIX (see the mixer sketch after this list).
  • Late-stage surge in test_return_mean, indicating the final policy generalizes well when evaluated outside the training loop.
  • Variance near convergence suggests exploration and state-space complexity influence final policy stability.
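
For context, the monotonic mixing step behind that decomposition can be sketched as follows. This is a simplified PyTorch illustration of the QMIX idea (state-conditioned, non-negative mixing weights produced by hypernetworks), not the project's actual mixer module; layer sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class TinyQMIXMixer(nn.Module):
    """Combine per-agent Q-values into Q_tot with non-negative,
    state-conditioned mixing weights (QMIX's monotonicity constraint)."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks produce the mixing weights/biases from the global state.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2  # (batch, 1, 1)
        return q_tot.view(bs, 1)
```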

Overall, the QMIX model exhibits reliable multi-agent coordination behavior suitable for the warehouse robot control scenario.

