20 Week 5 Deliverable – Final Research Paper & Windows Server Deployment
21 Week 5 – Research Paper Finalization & Bonus Agile Task
21.1 1. Weekly Objectives
This week’s focus included:
- Integrating all compiled findings into a cohesive research paper
- Writing introduction, methodology, results, and conclusion sections
- Synthesizing insights across all team member implementations
- Preparing research paper for potential publication/submission
- Coordinating and consolidating final documentation
- Coordinating and scheduling Industry Expert Interview
- Bonus Agile Task: Windows Server QMIX Deployment
21.2 2. Research Paper Finalization
21.2.1 2.1 Sections Completed
The final research paper integrates findings from all five weeks:
| Section | Content |
|---|---|
| Introduction | Problem motivation, warehouse robotics challenges |
| Related Work | QMIX, IPPO-LSTM, MASAC literature review |
| Preliminaries | CTDE paradigm, mathematical formulation |
| Methods | Algorithm descriptions, hyperparameter configurations |
| Experiments | MPE, RWARE, Unity training results |
| Results | Performance comparisons, scaling analysis |
| Discussion | Key findings, limitations, practical implications |
| Conclusion | Summary and future directions |
21.2.2 2.2 Key Findings Synthesized
The research paper documents the following major findings:
- Algorithm Selection: QMIX emerged as the best-performing algorithm for warehouse robotics after systematic comparison with IPPO-LSTM and MASAC
- Hyperparameter Sensitivity: Default configurations fail on RWARE; extended epsilon annealing (5M+ steps) is critical for sparse rewards
- Value Decomposition Advantages: QMIX's monotonic mixing network outperforms independent learning by 15-25% (see the sketch after this list)
- Scaling Challenges: Performance degrades sub-linearly with agent count, but training requirements increase super-linearly
- Unity Integration Success: QMIX successfully deployed in the 3D Unity warehouse environment with LIDAR-equipped robots
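To make the value-decomposition finding concrete, the following is a minimal PyTorch sketch of a QMIX-style monotonic mixer. It is illustrative rather than EPyMARL's actual implementation: the layer sizes are assumptions, and the 108-dim state simply assumes the global state is the concatenation of the three 36-dim agent observations. Monotonicity of Q_tot in each agent's utility comes from taking the absolute value of the hypernetwork-generated mixing weights, which is what keeps greedy decentralized actions consistent with the joint greedy action.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """QMIX-style mixing network (illustrative sketch, not EPyMARL's exact code).
    Hypernetworks conditioned on the global state generate the mixing weights;
    taking their absolute value keeps Q_tot monotonic in every agent's Q-value."""

    def __init__(self, n_agents=3, state_dim=108, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, -1)  # non-negative
        b1 = self.hyper_b1(state).view(b, 1, -1)
        hidden = F.elu(torch.bmm(agent_qs.view(b, 1, -1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, -1, 1)              # non-negative
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)                   # Q_tot
```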
21.3 3. Industry Expert Interview
21.3.1 3.1 Interview Coordination
Successfully coordinated and scheduled an Industry Expert Interview:
- Expert: Mr. James Nelsen
- Position: CIO of 1aAi, a Tulsa-based AI company
- Purpose: Gather industry perspective on MARL applications in warehouse automation
21.3.2 3.2 Interview Topics
Prepared discussion topics including:
- Real-world challenges in multi-robot coordination
- Industry adoption of MARL algorithms
- Scalability requirements for production deployments
- Transfer from simulation to physical robots
21.4 4. Bonus Agile Task: Windows Server QMIX Deployment
21.4.1 4.1 Task Overview
Based on Dr. Valderrama’s question about scaling up the current setup, I was assigned to replicate Price’s work on an enterprise-level Windows Server to:
- Showcase scaling-up capabilities
- Expand the cross-platform compatibility instructions
- Validate GitHub documentation for different environments
- Complete long-horizon training that failed on local machines
21.4.2 4.2 Environment Specifications
| Component | Specification |
|---|---|
| OS | Windows Server 2022 Datacenter |
| CPU | Intel Xeon E5-2680 v4 (14c/28t) |
| RAM Allocated | 196 GB |
| Actual RAM Usage | ~20 GB |
| GPU | CUDA-enabled (PyTorch 2.8.0) |
| Python | 3.9.13 |
| Framework | EPyMARL (QMIX, RNN agents) |
| Unity Environment | unity_warehouse |
| Agents | 3 |
| Action Space | 6 discrete |
| Observation Space | 36-dim vector |
| Episode Limit | 200 steps |
| Simulation Mode | no-graphics, time_scale = 50 |
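The no-graphics, time-scale-50 simulation mode in the table corresponds to how a built Unity executable is launched headlessly from Python. The sketch below uses the low-level mlagents_envs API; the executable path is a placeholder, and the exact launch wrapper in our EPyMARL integration may differ.

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import (
    EngineConfigurationChannel,
)

# Speed/headless settings are sent through a side channel before reset.
channel = EngineConfigurationChannel()
env = UnityEnvironment(
    file_name="unity_warehouse",   # placeholder path to the built executable
    no_graphics=True,              # headless mode for server-side training
    side_channels=[channel],
    worker_id=0,
)
channel.set_configuration_parameters(time_scale=50.0)  # run the sim at 50x real time

env.reset()
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]
print(behavior_name, spec.action_spec.discrete_branches)  # expect one 6-way branch
env.close()
```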
21.5 5. First Training Run (500,000 Steps)
21.5.1 5.1 Run Summary
| Metric | Value |
|---|---|
| Status | Successfully completed |
| Duration | ~3 hours 2 minutes |
| Total Steps | 500,199 |
21.5.2 5.2 Key Metrics
| Metric | Value |
|---|---|
| Test Return Mean | 95.2314 |
| Test Return Std | 0.0078 |
| Episode Steps Mean | 200.0 |
| Q_taken_mean | 3.05 |
| target_mean | 213.03 |
| TD Error | 0.49 |
| Epsilon | 0.10 (fully annealed) |
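The epsilon value of 0.10 indicates the exploration schedule had finished annealing before the run ended. A linear schedule of the kind used for QMIX exploration is sketched below; the start value and annealing length are illustrative assumptions, and only the final value of 0.10 comes from the table above.

```python
def epsilon_at(step, eps_start=1.0, eps_finish=0.10, anneal_steps=50_000):
    """Linear epsilon annealing (values illustrative except eps_finish=0.10)."""
    frac = min(step / anneal_steps, 1.0)   # fraction of the anneal completed
    return eps_start + frac * (eps_finish - eps_start)

# Example: fully annealed well before the 500k-step run ends
print(epsilon_at(0), epsilon_at(25_000), epsilon_at(500_000))  # 1.0, 0.55, 0.10
```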
21.5.3 5.3 Unity Evidence
Console logs confirmed repeated successful deliveries:
Package delivered in zone 01 Total: 58
Package delivered in zone 01 Total: 59
Package delivered in zone 01 Total: 60
21.5.4 5.4 Interpretation
What this run showed:
- Agents survived full episodes (200 steps)
- Returns increased significantly from near zero
- Variance collapsed, indicating stable learned behavior
- TD error dropped, showing clean value learning (sketched after this list)
- Q-values stayed healthy with no divergence
- Agents actively delivered packages
- Training remained stable with only ~20 GB of RAM used
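For context on the Q_taken_mean, target_mean, and TD-error metrics, a QMIX-style update regresses the mixed joint value onto a bootstrapped target, and the reported TD error is the gap between the two. The sketch below is a simplified illustration rather than EPyMARL's training loop; it reuses the MonotonicMixer sketch from Section 2.2, and the tensor shapes and reward/done handling are assumptions.

```python
import torch

def qmix_td_loss(mixer, target_mixer, q_taken, q_target_max,
                 state, next_state, reward, done, gamma=0.99):
    """Simplified QMIX TD loss.
    q_taken / q_target_max: per-agent chosen and greedy-target Q-values, (batch, n_agents).
    reward, done: (batch, 1) tensors."""
    q_tot = mixer(q_taken, state)                          # (batch, 1)
    with torch.no_grad():
        target_q_tot = target_mixer(q_target_max, next_state)
        y = reward + gamma * (1.0 - done) * target_q_tot   # bootstrapped target
    td_error = y - q_tot                                   # what "TD Error" tracks
    return (td_error ** 2).mean()
```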
21.6 6. Second Training Run (1,000,019 Steps)
21.6.1 6.1 Run Summary
| Metric | Value |
|---|---|
| Status | Completed full 1M steps |
| Duration | ~6 hours 18 minutes |
| Total Steps | 1,000,019 |
21.6.2 6.2 Key Metrics (from Price’s analysis)
| Metric | Start | End | Change |
|---|---|---|---|
| Test Return | 0 | 238.6 | Major improvement |
| Training Return | 2.67 | 231.4 | ~87× improvement |
| Peak Return | - | 443 | at ~920k steps |
| Q-values | -0.11 | +8.67 | Healthy growth |
| TD Error Abs | 7.61 | Low, stable | Strong convergence |
| grad_norm | - | ~127 | Healthy for QMIX |
| return_std | - | ~0.01 | Nearly deterministic |
21.6.3 6.3 Learning Curve Breakdown
Learning Phases:
├── Steps 0-100k: Low returns (exploration)
├── Steps 100k-200k: Rapid learning spike
├── Steps 200k-1M: Consistent 200-250 returns
└── Peak performance: 443 at ~920k steps
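These phases were read off the logged learning curve. A minimal way to regenerate the plot from an EPyMARL run is sketched below; it assumes the default Sacred file logging, and both the results path and the metric key are assumptions that may need adjusting for a specific run.

```python
import json
import matplotlib.pyplot as plt

# Hypothetical path: Sacred's FileStorageObserver writes one metrics.json per
# run; adjust the run id and metric key to match your experiment.
with open("results/sacred/1/metrics.json") as f:
    metrics = json.load(f)

curve = metrics["test_return_mean"]            # assumed key name
plt.plot(curve["steps"], curve["values"])
plt.xlabel("environment steps")
plt.ylabel("test return mean")
plt.title("QMIX on unity_warehouse (1M-step run)")
plt.savefig("learning_curve.png", dpi=150)
```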
21.6.4 6.4 Behavioral Conclusions
- Policy became deterministic (very low variance)
- Value network stable across full million steps
- Q-values strong, no collapse
- Agents continually delivered packages during training
- Windows Server capable of completing long-horizon Unity runs
21.7 7. Run Comparison (500k vs 1M)
| Metric | First Run (500k) | Second Run (1M) | Verdict |
|---|---|---|---|
| Test Return Mean | 95.2 | 238.6 | Major improvement |
| Peak Return | ~110-150 (inferred) | 443 | Strong high-end learning |
| Variance (std) | 0.0078 | 0.01 | Both extremely stable |
| Episode Steps | 200 | 200 | Fully stable |
| TD Error | 0.49 | Lower & consistent | Strong convergence |
| Runtime | 3h | 6h18m | Linear scaling |
| Deliveries | Confirmed | Confirmed | Real behavior in both runs |
Conclusion: Both runs were successful, but the 1M-step run demonstrated full agent maturation.
21.8 8. Why This is a Success
21.8.1 8.1 Local Machine Limitations Removed
Personal Computer Issues:
- Could not complete long Unity runs
- Crashed frequently
- Struggled with memory & GPU load
Windows Server Advantages:
- Completed two full runs (500k & 1M)
- Stable execution
- Plenty of RAM (only ~20GB used)
- CUDA acceleration enabled
- Supports production-level Unity training
21.8.2 8.2 Key Insights (Price’s Validation)
- Agents learned significantly
  - Test returns soared from 0 to 238
  - Q-values healthy (+8.67)
  - TD error collapsed
- Stable and consistent
  - Variance near zero
  - Long episodes (200 steps)
  - No early deaths, no stuck states
  - Very clean training gradients
- Deliveries were real
  - “Package delivered…” in Unity console
  - Navigation and pickup behavior confirmed
- Production-quality behavior
  - Deterministic learned policy
  - No collapse, no divergence
  - Return curve matched textbook QMIX convergence
- Hard evidence
  - QMIX learned end-to-end in Unity
  - Windows Server can complete large-scale MARL runs
  - We now have repeatable, reliable, scalable training
21.9 9. Documentation Coordination
21.9.1 9.1 Tasks Completed
- Consolidated all team member deliverables (Weeks 1-5)
- Organized GitHub repository structure
- Created reproducible experiment configurations (see the sketch after this list)
- Documented installation instructions for multiple platforms
- Prepared final presentation materials
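As an illustration of what "reproducible experiment configurations" captures, the snippet below records the run settings reported in this document as a plain Python dict. EPyMARL itself reads YAML config files, so the field names here are illustrative rather than its exact keys, and the gamma and seed values are assumptions.

```python
# Hypothetical snapshot of the settings pinned per experiment; values marked
# "assumed" are placeholders, the rest are taken from the tables above.
experiment_config = {
    "algorithm": "qmix",
    "env": "unity_warehouse",
    "n_agents": 3,
    "obs_dim": 36,
    "n_actions": 6,
    "episode_limit": 200,
    "t_max": 1_000_000,      # total environment steps (second run)
    "epsilon_finish": 0.10,
    "gamma": 0.99,           # assumed discount factor
    "seed": 0,               # assumed; record the actual seed per run
    "time_scale": 50,
    "no_graphics": True,
}

if __name__ == "__main__":
    import json
    print(json.dumps(experiment_config, indent=2))
```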
21.9.2 9.2 Cross-Platform Validation
The Windows Server deployment validated that our GitHub instructions work across:
- macOS (original development)
- Windows Server 2022 (enterprise deployment)
- Linux (HPC clusters via team members)
21.10 10. Deliverables Summary
| Deliverable | Status |
|---|---|
| Research paper finalization | Complete |
| Introduction/methodology/results/conclusion | Complete |
| Industry expert interview coordination | Complete |
| Windows Server environment setup | Complete |
| First training run (500k steps) | Complete |
| Second training run (1M steps) | Complete |
| Documentation consolidation | Complete |
| Cross-platform validation | Complete |
21.11 11. Final Summary
21.11.1 11.1 Week 5 Accomplishments
- Research Paper: Integrated all findings into cohesive publication-ready document
- Industry Interview: Coordinated interview with Mr. James Nelsen (CIO of 1aAi)
- Bonus Agile Task: Successfully deployed and ran QMIX in Unity on Windows Server
21.11.2 11.2 Windows Server Deployment Results
- Both runs (500k and 1M) were successful
- Agents navigated properly, picked up packages, and delivered them
- Achieved strong returns with stable, deterministic behavior
- Windows Server solved the compute bottleneck
- Dr. Valderrama’s requirement for long training horizons is now met
- This validates the entire sim-to-Unity training pipeline
21.11.3 11.3 Project Impact
The Windows Server deployment significantly strengthens:
- The final research paper with production-scale evidence
- The demo video showing learned warehouse coordination
- The reproducibility of our work across different platforms
21.12 12. References
Rashid, T., et al. (2018). QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ICML.
Papoudakis, G., et al. (2021). Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. NeurIPS.
Unity ML-Agents Toolkit: https://github.com/Unity-Technologies/ml-agents
EPyMARL: https://github.com/oxwhirl/epymarl
RWARE: https://github.com/Farama-Foundation/RWARE