16 Week 4 Deliverable – Extended Results Compilation & Presentation

Author

Salmon Riaz

Published

November 12, 2025

17 Week 4 – Extended Training Results & Research Paper Updates

17.1 1. Weekly Objectives

This week’s focus included:

  • Receiving extended training results from team members
  • Compiling comprehensive performance comparisons
  • Updating research paper results sections
  • Preparing materials for the second class presentation

17.2 2. Extended Training Results Compilation

17.2.1 2.1 QMIX Extended Training (Dre Simmons)

Extended training runs provided deeper insights into QMIX behavior:

| Experiment | Environment | Timesteps | Final Return | Notes |
|---|---|---|---|---|
| QMIX-tiny-long | tiny-2ag-v2 | 30M | 3.42 | Continued improvement |
| QMIX-small | small-4ag-v1 | 40M | 2.35 | Stable coordination |
| QMIX-hard | tiny-2ag-hard | 25M | 2.15 | Challenging but learnable |

Key Observations:

  • Extended training yields marginal improvements beyond 20M steps
  • Performance plateau suggests policy convergence
  • Hard environment requires ~25% more training
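As a rough illustration of how the plateau was identified, a rolling mean of logged evaluation returns can be compared across consecutive windows; the sketch below uses illustrative window and threshold values, not the exact settings of our logging pipeline:

```python
import numpy as np

def is_plateaued(returns, window=20, tol=0.02):
    """Return True if the rolling-mean improvement between the last two
    windows of evaluation returns falls below `tol` (illustrative check)."""
    if len(returns) < 2 * window:
        return False
    recent = np.mean(returns[-window:])                 # latest window
    previous = np.mean(returns[-2 * window:-window])    # window before it
    return (recent - previous) < tol

# Example: returns improve steadily, then stall near 3.4
eval_returns = list(np.linspace(0.1, 3.4, 180)) + [3.40, 3.41, 3.42] * 14
print(is_plateaued(eval_returns))  # True once late-stage gains fall below tol
```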

17.2.2 2.2 IPPO-LSTM Extended Results (Price Allman)

| Experiment | Environment | Timesteps | Final Return |
|---|---|---|---|
| IPPO-tiny | tiny-2ag-v2 | 15M | 2.92 |
| IPPO-small | small-4ag-v1 | 25M | 1.85 |
| IPPO-lstm-large | medium-6ag | 40M | 1.22 |

Insights:

  • LSTM memory helps in partially observable settings
  • Performance gap with QMIX narrows in larger environments
  • On-policy training provides more stable learning curves
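To make the first insight concrete, the sketch below shows the kind of per-agent recurrent network an IPPO-LSTM policy relies on: the LSTM hidden state carries information across timesteps, which is what helps under partial observability. Layer sizes, the observation dimension, and the action count are illustrative placeholders, not the exact configuration used in these runs:

```python
import torch
import torch.nn as nn

class RecurrentAgent(nn.Module):
    """Illustrative per-agent network: an LSTM cell carries a hidden state
    across timesteps so the policy can integrate partial observations."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim, hidden_dim)
        self.lstm = nn.LSTMCell(hidden_dim, hidden_dim)
        self.fc_out = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, hidden):
        x = torch.relu(self.fc_in(obs))
        h, c = self.lstm(x, hidden)       # memory of past observations
        return self.fc_out(h), (h, c)     # action logits + new hidden state

# One forward step for 4 agents (dimensions are placeholders)
agent = RecurrentAgent(obs_dim=71, n_actions=5)
obs = torch.randn(4, 71)
h0 = (torch.zeros(4, 64), torch.zeros(4, 64))
logits, h1 = agent(obs, h0)
```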

17.2.3 2.3 Unity Integration Progress (Price Allman)

Documented Unity ML-Agents integration status:

| Component | Status | Notes |
|---|---|---|
| Warehouse Scene | Complete | 3D visualization ready |
| Robot Prefabs | Complete | Physics-based movement |
| ML-Agents Connection | Complete | Python-Unity bridge working |
| QMIX Integration | In Progress | Action-observation mapping |
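For context on the "Python-Unity bridge" row, the snippet below is a hypothetical sketch of what the Python side of such a bridge typically looks like using the mlagents_envs low-level API; the editor connection, behavior lookup, and zero-action placeholder are illustrative and are not the team's actual integration code:

```python
import numpy as np
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple

# Connect to a running Unity editor (file_name=None) or a built executable.
env = UnityEnvironment(file_name=None)
env.reset()
behavior_name = list(env.behavior_specs)[0]

decision_steps, terminal_steps = env.get_steps(behavior_name)
obs = decision_steps.obs[0]            # per-agent observation batch

# Map trained-policy outputs to Unity's discrete action branch
# (zeros used here as a stand-in for the QMIX policy).
actions = np.zeros((len(decision_steps), 1), dtype=np.int32)
env.set_actions(behavior_name, ActionTuple(discrete=actions))
env.step()
env.close()
```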

17.3 3. Research Paper Results Section

17.3.1 3.1 Performance Summary Table

Added to paper:

| Algorithm | Final Return (tiny-2ag-v2) | Final Return (small-4ag-v1) | Training Time |
|---|---|---|---|
| QMIX | 3.42 | 2.35 | 20-30M steps |
| IPPO-LSTM | 2.92 | 1.85 | 15-25M steps |
| MASAC | 2.45 | 1.62 | 15-20M steps |
| Random | 0.05 | 0.02 | N/A |

17.3.2 3.2 Learning Curve Analysis

Documented learning progression patterns:

QMIX Learning Phases:
├── Phase 1 (0-5M): Exploration, minimal learning
├── Phase 2 (5-10M): Reward discovery, rapid improvement
├── Phase 3 (10-15M): Policy refinement, moderate gains
└── Phase 4 (15M+): Convergence, marginal improvements

17.3.3 3.3 Ablation Study Results

Compiled ablation studies on QMIX components:

| Component Modified | Performance Impact |
|---|---|
| No mixing network | -45% (VDN baseline) |
| Linear mixing | -18% |
| Smaller buffer (50K) | -22% |
| Faster epsilon decay | -65% (often fails) |
| Larger batch (512) | +3% |
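To make the first two ablation rows concrete: removing the mixing network reduces QMIX to VDN, which simply sums per-agent Q-values, whereas the full mixer combines them with state-conditioned weights kept non-negative so that the joint value is monotonic in each agent's value. The sketch below is illustrative only; layer sizes and names are ours, not the exact EPyMARL implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """Illustrative QMIX-style mixer: per-agent Q-values are combined with
    state-conditioned weights forced non-negative via abs(), enforcing
    dQ_tot/dQ_i >= 0 (the monotonicity constraint)."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = F.elu(agent_qs.unsqueeze(1) @ w1 + b1)          # (batch, 1, embed)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        return (hidden @ w2 + b2).view(-1)                       # Q_tot: (batch,)

def vdn_mix(agent_qs):
    """'No mixing network' row: VDN simply sums per-agent Q-values."""
    return agent_qs.sum(dim=-1)
```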

17.4 4. Second Class Presentation

17.4.1 4.1 Presentation Structure

  1. Project Recap (2 min)
    • Team roles and algorithm assignments
    • Environment progression (MPE → RWARE)
  2. Algorithm Deep Dive: QMIX (5 min)
    • Value decomposition explained
    • Mixing network architecture
    • Hyperparameter sensitivity
  3. Extended Results (5 min)
    • Performance comparison tables
    • Learning curves visualization
    • Scaling analysis findings
  4. Unity Integration Demo (3 min)
    • Live demonstration if available
    • Visual comparison: RWARE vs Unity
  5. Final Week Plans (2 min)
    • Research paper finalization
    • Industry expert interview preparation

17.4.2 4.2 Visual Materials Prepared

  • QMIX architecture diagram
  • Learning curve plots (all algorithms)
  • Performance comparison bar charts
  • Unity environment screenshots
  • Scaling analysis graphs

17.5 5. Research Paper Discussion Updates

17.5.1 5.1 Key Findings Section

Added to paper:

Key Findings: Our experiments reveal several critical insights for applying MARL algorithms to warehouse robotics:

  1. Hyperparameter Sensitivity: Default configurations consistently fail on RWARE. Extended exploration through slow epsilon decay (5M+ steps) is essential for sparse reward discovery.

  2. Value Decomposition Advantages: QMIX’s monotonic mixing network outperforms independent learning approaches by 15-25%, demonstrating the value of centralized coordination mechanisms.

  3. Scaling Challenges: Performance degrades sub-linearly with agent count, but training requirements increase super-linearly. A 3x increase in agents requires approximately 4-5x more training steps.

  4. Framework Importance: PyTorch-based implementations significantly outperform Keras alternatives for multi-agent scenarios, with fewer stability issues and better concurrency handling.

17.5.2 5.2 Implications for Warehouse Robotics

Added to paper:

Practical Implications: Our findings suggest that QMIX-based coordination is viable for small-scale warehouse deployments (2-4 robots). For larger fleets, hierarchical decomposition or domain-specific constraints may be necessary to maintain tractability. The significant hyperparameter tuning required indicates that production deployments should incorporate automated hyperparameter optimization and robust monitoring systems.


17.6 6. Comparative Analysis Tables

17.6.1 6.1 Algorithm Characteristics

| Characteristic | QMIX | IPPO-LSTM | MASAC |
|---|---|---|---|
| Training Type | Off-policy | On-policy | Off-policy |
| Coordination | Mixing network | Independent | Centralized critic |
| Memory | RNN optional | LSTM | None |
| Sample Efficiency | Moderate | Low | High |
| Stability | High | High | Moderate |

17.6.2 6.2 Training Requirements

| Algorithm | Min Steps (tiny) | Min Steps (small) | GPU Recommended |
|---|---|---|---|
| QMIX | 15M | 30M | Yes |
| IPPO-LSTM | 10M | 20M | Optional |
| MASAC | 8M | 15M | Yes |

17.7 7. Challenges Documented

17.7.1 7.1 Technical Challenges

| Challenge | Solution | Status |
|---|---|---|
| Sparse reward discovery | Extended epsilon annealing | Resolved |
| Memory constraints | Reduced buffer on CPU | Resolved |
| Training instability | Larger batch sizes | Resolved |
| Unity latency | Async communication | In Progress |

17.7.2 7.2 Research Challenges

| Challenge | Approach | Notes |
|---|---|---|
| Fair comparison | Same environment seeds | Implemented |
| Reproducibility | Fixed random seeds | Documented |
| Generalization | Multiple configurations | Tested |
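The seed handling referenced above can be as simple as the helper below, a minimal sketch that fixes the Python, NumPy, and PyTorch RNGs; the specific seed values in the loop are illustrative:

```python
import random
import numpy as np
import torch

def set_global_seed(seed: int) -> None:
    """Fix all relevant RNGs so repeated runs use identical randomness."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Each algorithm is run with the same set of seeds for a fair comparison.
for seed in (0, 1, 2):
    set_global_seed(seed)
    # ... launch one training run per seed ...
```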

17.8 8. Code Documentation Updates

17.8.1 8.1 Training Script Documentation

Documented training command structure:

# QMIX Training on RWARE
python src/main.py \
  --config=qmix \
  --env-config=gymma \
  with env_args.key="rware:rware-tiny-2ag-v2" \
  batch_size=256 \
  buffer_size=200000 \
  epsilon_anneal_time=5000000 \
  t_max=20000000 \
  save_model=True
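For context on the epsilon_anneal_time flag above, the linear schedule it implies looks roughly like the sketch below. The decay from 1.0 to a 0.05 floor reflects common PyMARL-style defaults and is an assumption here; the exact start and finish values come from the algorithm config:

```python
def epsilon_at(t, anneal_time=5_000_000, eps_start=1.0, eps_finish=0.05):
    """Linearly anneal exploration epsilon over `anneal_time` env steps,
    then hold it at the floor value (start/finish are assumed defaults)."""
    frac = min(t / anneal_time, 1.0)
    return eps_start + frac * (eps_finish - eps_start)

for t in (0, 1_000_000, 2_500_000, 5_000_000, 20_000_000):
    print(t, round(epsilon_at(t), 3))
# 0 -> 1.0, 1M -> 0.81, 2.5M -> 0.525, 5M -> 0.05, 20M -> 0.05
```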

17.8.2 8.2 Evaluation Script Documentation

# QMIX Evaluation with Visualization
python src/main.py \
  --config=qmix \
  --env-config=gymma \
  with env_args.key="rware:rware-tiny-2ag-v2" \
  checkpoint_path="results/models/qmix_seed0_..." \
  evaluate=True \
  render=True

17.9 9. Week 5 Preview

Final week objectives:

  • Finalize research paper
  • Synthesize insights across all experiments
  • Coordinate industry expert interview (if available)
  • Prepare final presentation materials

17.10 10. Deliverables Summary

| Deliverable | Status |
|---|---|
| Extended results compilation | Complete |
| Performance comparison tables | Complete |
| Research paper results section | Complete |
| Presentation preparation | Complete |
| Code documentation | Complete |

17.11 11. Research Paper Excerpt: Limitations

Added to Discussion section:

Limitations: Several limitations should be considered when interpreting our results:

  1. Environment Scope: Our experiments focus on RWARE simulations. Transfer to physical warehouse systems would require additional domain adaptation.

  2. Agent Count: Testing was limited to 2-6 agents due to computational constraints. Industrial warehouses may require 50+ coordinating robots.

  3. Task Complexity: RWARE’s simplified task structure (shelf movement) does not capture all warehouse operations (picking, packing, inventory management).

  4. Training Resources: Extended training requirements (20M+ steps) may be impractical for rapid deployment scenarios.


17.12 12. References

  1. Rashid, T., et al. (2018). QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ICML.

  2. Papoudakis, G., et al. (2021). Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. NeurIPS Datasets and Benchmarks Track.

  3. Unity ML-Agents Toolkit: https://github.com/Unity-Technologies/ml-agents

  4. EPyMARL: https://github.com/oxwhirl/epymarl

  5. TensorBoard: https://www.tensorflow.org/tensorboard