11 Week 3 Deliverable - RWARE tiny-4ag-hard-v2 with QMIX: Parallel Learning Experiment and Comparative Analysis
12 Executive Summary
This report documents the methodology and results of applying QMIX to the RWARE tiny-4ag-hard-v2 warehouse task and compares the outcomes and challenges with last week’s (Week 2) baseline setup. The goal is to assess the effect of parallel rollouts and increased coordination complexity on multi-agent learning performance.
13 Experimental Workflow
13.1 Environment and Tool Setup
Platform: Mac (M3, 16GB RAM), Python venv
Libraries: gymnasium, rware, torch, tensorboard (logging was not working this week for unknown reasons; see the Learning Curves & Visuals section), EPyMARL
Task Configuration: RWARE tiny-4ag-hard-v2 (four agents, two shelves), using the “parallel” runner instead of Week 2’s “episode” runner (a quick environment sanity-check sketch follows)
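Before launching training, a quick check confirms that the environment ID resolves and steps correctly. This is a minimal sketch, not the EPyMARL entry point: it assumes the installed rware package registers the ID below with gymnasium, and the exact reset/step return signatures can vary between rware versions.

```python
# Minimal environment sanity check (a sketch, not the EPyMARL training entry point).
# Assumes the installed rware version registers this ID with gymnasium;
# reset/step return shapes may differ slightly between rware versions.
import gymnasium as gym

env = gym.make("rware:rware-tiny-4ag-hard-v2")
obs, info = env.reset(seed=0)
print("number of agents:", len(obs))       # expect 4 per-agent observations
print("action space:", env.action_space)   # one discrete action space per agent

# One random joint step to confirm the loop runs end to end.
actions = env.action_space.sample()
obs, rewards, terminated, truncated, info = env.step(actions)
print("per-agent rewards:", rewards)
env.close()
```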
13.2 Treatment Design
Initial (Parallel, Baseline Config): Implemented parallel environment sampling with the default hyperparameters that produced a successful run in Week 2.
Secondary (Parallel, Aggressive Config): Doubled the number of parallel environments and the batch size, shortened the epsilon (exploration) anneal, and raised the learning rate.
Comparison: Week 2 baseline (episode setup, two agents, one shelf)
14 Configuration Evolution: Switching from Effective Episode QMIX to Parallel QMIX
This week focused on two main parallel configurations to address the increased difficulty of RWARE tiny-4ag-hard-v2 (4 agents, 2 shelves) and to compare their impact. The warehouse was deliberately not scaled up, so that the effect of four agents coordinating in the same tiny layout used in Week 2 could be isolated.
14.1 Initial Parallel Training Configuration
| Parameter | Value | Purpose / Why Important |
|---|---|---|
| runner | parallel | Runs multiple environments to speed up experience collection. |
| batch_size_run | 8 | Number of parallel environments running simultaneously. |
| batch_size | 256 | Number of samples per update. Larger batches stabilize learning but require more memory; balance learning speed and stability. |
| buffer_size | 200,000 | Maximum number of experiences; larger buffer increases sample diversity and long-term learning in sparse tasks. |
| epsilon_start | 1.0 | Maximizes early exploration. |
| epsilon_finish | 0.1 | Allows mostly greedy exploitation by end of training. |
| epsilon_anneal_time | 5,000,000 | Number of steps over which epsilon decays; a longer anneal keeps exploration high for longer, which helps with sparse rewards. |
| t_max | 20,000,000 | Total environment steps for this run. |
| gamma | 0.99 | Standard RL discount factor. |
| lr | 0.0005 | Stable learning rate for consistent training. |
| mixer | qmix | Essential for cooperative agent mixing. |
| agent | rnn | Handles partial observability in MARL. |
| env_args (key) | rware:rware-tiny-4ag-hard-v2 | Defines the experiment environment. |
| env_args (time_limit) | 100 | Maximum steps per episode. |
| save_model | True | Checkpoints enabled for reproducibility and recovery. |
14.1.1 Key Points
- Parallel runner and batch_size_run enable fast, varied sample collection.
- Large buffer and batch size promote stable, diverse learning for difficult, multi-agent tasks.
- Slow annealing (epsilon_anneal_time) and retained exploration (epsilon_finish) help agents reliably find sparse rewards (see the epsilon-schedule sketch after this list).
- Saving model checkpoints protects against failure and helps analyze training progress.
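To make the exploration schedule concrete, the sketch below computes a plain linear decay from epsilon_start to epsilon_finish over epsilon_anneal_time steps and prints the value at a few checkpoints. This is a generic illustration of the schedule described in the table; EPyMARL's own schedule implementation may differ in detail.

```python
# Linear epsilon decay for the initial parallel config (generic illustration only;
# EPyMARL's internal schedule class may differ in detail).

EPS_START = 1.0            # epsilon_start
EPS_FINISH = 0.1           # epsilon_finish
ANNEAL_STEPS = 5_000_000   # epsilon_anneal_time

def epsilon_at(t: int) -> float:
    """Linearly interpolate from EPS_START to EPS_FINISH, then hold flat."""
    frac = min(t / ANNEAL_STEPS, 1.0)
    return EPS_START + frac * (EPS_FINISH - EPS_START)

for t in (0, 1_000_000, 2_500_000, 5_000_000, 20_000_000):
    print(f"step {t:>12,}: epsilon = {epsilon_at(t):.3f}")
# epsilon stays at 0.100 for the final 15M of the 20M-step run,
# while the aggressive config (1M anneal) reaches 0.100 after only 1M steps.
```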
Result:
Agents began to learn, reaching a final return mean of 2.58 and target mean of 8.21, approaching solved status under the parallel configuration (Week 2’s benchmark for a solved run was a return mean above 2.5).
14.2 Secondary (Aggressive) Parallel Configuration
| Parameter | Value | Purpose / Why Important |
|---|---|---|
| runner | parallel | Same as above. |
| batch_size_run | 16 | Doubled from 8 to run more parallel environment instances. |
| batch_size | 512 | Doubled from 256; larger batches promote stable multi-agent updates. |
| buffer_size | 200,000 | Same buffer size as the initial run; remains critical for diverse experience replay. |
| epsilon_start | 1.0 | Full exploration at start. |
| epsilon_finish | 0.10 | Same final exploration rate as the initial run. |
| epsilon_anneal_time | 1,000,000 | Quicker shift from exploration to exploitation (five times faster than the initial run). |
| t_max | 20,000,000 | Same training horizon as the initial run; long enough for the sparse-reward setting. |
| gamma | 0.99 | Standard RL setup. |
| lr | 0.001 | Higher learning rate for quicker updates (monitor for stability). |
| mixer | qmix | Cooperative value mixing continues. |
| agent | rnn | Same as above. |
| env_args (key) | rware:rware-tiny-4ag-hard-v2 | Same task environment. |
| env_args (time_limit) | 100 | Consistent episode bound. |
| save_model | True | Checkpoints enabled for reproducibility and recovery. |
14.2.1 Key Points
- Doubling batch_size_run and batch_size increases sample and training throughput; this is most effective on machines with more compute resources (see the throughput sketch after this list).
- Faster epsilon annealing means agents start exploiting quicker, which can speed learning but may risk premature convergence in sparse environments.
- Higher learning rate speeds up model updates; requires monitoring for stability.
- Large buffer and model checkpoints protect learning stability and enable analysis/recovery.
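As a rough illustration of the throughput point above, the sketch below compares how many environment steps one rollout iteration collects under the two batch_size_run values. It assumes each parallel environment contributes one full episode of time_limit steps per iteration, which is a simplification of how the parallel runner actually batches episodes.

```python
# Rough rollout-throughput comparison between the two parallel configs (a sketch).
# Assumption: each parallel environment contributes one full episode
# (time_limit steps) per rollout iteration; the real runner may differ.

TIME_LIMIT = 100  # env_args time_limit in both configs

def steps_per_rollout(batch_size_run: int, time_limit: int = TIME_LIMIT) -> int:
    """Environment steps collected in one rollout iteration."""
    return batch_size_run * time_limit

for n_envs in (8, 16):  # initial vs. aggressive batch_size_run
    print(f"{n_envs:>2} parallel envs -> {steps_per_rollout(n_envs):,} env steps per rollout")
# 8 parallel envs  ->   800 env steps per rollout
# 16 parallel envs -> 1,600 env steps per rollout
```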
Result:
Agents reached a final return mean of 1.67 and target mean of 8.95, with a more stable plateau. The higher target mean and smoother curve suggest more stable value learning, but the lower return mean indicates weaker actual policy performance than the initial configuration (see the Quick Note in the results section below).
15 Why the Improvements Worked
Parallel rollouts combined with extended exploration, a large replay buffer, and large batch sizes enabled robust learning in this high-coordination, sparse-reward MARL task.
Checkpointing allowed for confident evaluation and repeatability.
16 Main Metrics & Results
| Week & Treatment | Agents | Shelves | Target Mean (final) | Return Mean (final) | Status |
|---|---|---|---|---|---|
| Week 2 (Episode) | 2 | 1 | 3.25 | 3.25 | Solved |
| Week 3 Initial (Parallel) | 4 | 2 | 8.21 | 2.58 | Near solved |
| Week 3 Secondary | 4 | 2 | 8.95 | 1.67 | Strong learning |
Increasing the agent and shelf count made the environment harder, requiring more sample diversity and longer exploration to approach optimal behavior.
The parallel runner enabled much faster experience collection (higher sample throughput) than the episode-based Week 2 training.
QUICK NOTE: Interpreting Return Mean vs. Target Mean in This Experiment
Although the secondary (aggressive) parallel run produced a higher target mean (8.95) than the initial parallel run (8.21), its return mean was actually lower (1.67 versus 2.58). This is a common scenario in deep multi-agent RL: the average Q-value targets used for learning (target mean) can become inflated, or more optimistic, during aggressive or unstable training, while the actual episode returns (return mean) show the true effectiveness of the policy in the environment. A small numerical sketch after the takeaways below illustrates the distinction.
Return mean is the main measure of practical agent performance: how much reward the agents truly achieved. Target mean is an internal metric reflecting the values predicted and bootstrapped by the neural network during training; it can run higher and show more volatility, but this does not always translate into better real-world behavior.
In the literature, discrepancies like this are often attributed to fast learning rates or aggressive parameter settings, which can lead to value overestimation or oscillation. As shown here, the initial parallel config ultimately delivered better real-world agent returns, so it represents the more robust configuration this week.
Key Takeaways:
- Always prioritize return mean for evaluating agent success.
- Target mean is best used to monitor learning stability and the relative scale of Q-value predictions; it should not be the sole measure of policy quality.
- If the two metrics diverge, aggressive settings may be producing optimistic value updates without corresponding gains in episode rewards.
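To make the distinction concrete, the sketch below contrasts an episode return (the quantity behind return mean) with a one-step bootstrapped TD target of the kind value-based methods such as QMIX train against. All numbers are made up for illustration and are not taken from this week's runs.

```python
# Illustrative contrast between an episode return and a bootstrapped TD target.
# All numbers are invented for illustration; they are not from this week's runs.

GAMMA = 0.99  # discount factor used in both configs

# Rewards actually collected in one short, hypothetical episode.
episode_rewards = [0.0, 0.0, 1.0, 0.0, 1.0]

# "Return mean" averages quantities like this: reward the agents really earned.
episode_return = sum(episode_rewards)
print("episode return:", episode_return)  # 2.0

# "Target mean" averages bootstrapped values like this one, which lean on the
# network's own next-state estimate and can therefore be optimistic.
reward = episode_rewards[2]    # reward from a single transition
q_next_estimate = 7.5          # hypothetical (possibly inflated) target-network value
td_target = reward + GAMMA * q_next_estimate
print("one-step TD target:", round(td_target, 3))  # 8.425, far above the realized return
```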
17 Learning Curves & Visuals
No learning-curve plots are included because of the TensorBoard issue noted above; the analysis relies on the quantitative metrics reported earlier, which were produced as JSON files during training. A sketch for plotting those files directly is shown below.
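As a workaround for the missing TensorBoard visuals, the JSON metric files can be plotted directly (for example with matplotlib). The sketch below assumes a file whose entries look like {"return_mean": {"steps": [...], "values": [...]}}; the actual path, file name, and key layout depend on the logger configuration, so treat them as placeholders.

```python
# Workaround plot for the missing TensorBoard visuals (a sketch).
# Assumes a metrics JSON file with entries shaped like
#   {"return_mean": {"steps": [...], "values": [...]}}
# The real path and key layout depend on the logger, so adjust as needed.
import json
import matplotlib.pyplot as plt

METRICS_PATH = "results/metrics.json"  # placeholder path

with open(METRICS_PATH) as f:
    metrics = json.load(f)

for name in ("return_mean", "target_mean"):
    if name in metrics:
        entry = metrics[name]
        plt.plot(entry["steps"], entry["values"], label=name)

plt.xlabel("environment steps")
plt.ylabel("value")
plt.legend()
plt.title("QMIX on rware-tiny-4ag-hard-v2 (parallel runner)")
plt.savefig("learning_curves.png", dpi=150)
```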
18 Changes from Week 2
- Shift from episode-based to parallel rollouts.
- Larger team size and coordination complexity.
- Tuning the configuration relative to Week 2 (buffer size, batch sizes, and epsilon scheduling) was essential for good results.
19 Practical Takeaways
- Parallel QMIX supports robust multi-agent learning even as coordination difficulty increases.
- Buffer size and exploration schedule must be scaled with environment complexity.
- Comparative metrics show QMIX remains adaptable across sampling protocols, but tuning is crucial as task demands grow.
20 References
- Rashid, T., Samvelyan, M., De Witt, C.S., Farquhar, G., Foerster, J., & Whiteson, S. (2018). QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
- Papoudakis, G., Christianos, F., Schäfer, L., & Albrecht, S.V. (2021). Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. NeurIPS, 2021.
- Farama Foundation. RWARE: Multi-Agent Warehouse Environment. https://github.com/Farama-Foundation/RWARE
- EPyMARL: Extended PyMARL Framework. https://github.com/oxwhirl/epymarl
- DI-engine documentation: QMIX. https://di-engine-docs.readthedocs.io/en/latest/12_policies/qmix.html
- Schaefer, L. (2021). Efficient Exploration in Single-Agent & Multi-Agent Deep RL. PhD Thesis, TU Darmstadt. https://www.lukaschaefer.com/files/phd_thesis.pdf
- SMAClite: Multi-agent Environments. https://openreview.net/pdf/9b76af423d461f3f4e700735a1d1ec87fa251db3.pdf
- ChatGPT. Portions of this report were formatted with assistance from ChatGPT.