20 Week 5 Deliverable – Final Research Paper & Windows Server Deployment
21 Week 5 – Research Paper Finalization & Bonus Agile Task
21.1 1. Weekly Objectives
This week’s focus included:
- Integrating all compiled findings into a cohesive research paper
- Writing introduction, methodology, results, and conclusion sections
- Synthesizing insights across all team member implementations
- Preparing research paper for potential publication/submission
- Coordinating and consolidating final documentation
- Coordinating and scheduling Industry Expert Interview
- Bonus Agile Task: Windows Server QMIX Deployment
21.2 2. Research Paper Finalization
21.2.1 2.1 Sections Completed
The final research paper integrates findings from all five weeks:
| Section | Content |
|---|---|
| Introduction | Problem motivation, warehouse robotics challenges |
| Related Work | QMIX, IPPO-LSTM, MASAC literature review |
| Preliminaries | CTDE paradigm, mathematical formulation |
| Methods | Algorithm descriptions, hyperparameter configurations |
| Experiments | MPE, RWARE, Unity training results |
| Results | Performance comparisons, scaling analysis |
| Discussion | Key findings, limitations, practical implications |
| Conclusion | Summary and future directions |
21.2.2 2.2 Key Findings Synthesized
The research paper documents the following major findings:
- Algorithm Selection: QMIX emerged as the best-performing algorithm for warehouse robotics after systematic comparison with IPPO-LSTM and MASAC
- Hyperparameter Sensitivity: Default configurations fail on RWARE; extended epsilon annealing (5M+ steps) is critical for sparse rewards
- Value Decomposition Advantages: QMIX's monotonic mixing network outperforms independent learning by 15-25% (see the sketch after this list)
- Scaling Challenges: Performance degrades sub-linearly with agent count, but training requirements increase super-linearly
- Unity Integration Success: QMIX successfully deployed in the 3D Unity warehouse environment with LIDAR-equipped robots
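To make the value-decomposition finding concrete, the following is a minimal PyTorch sketch of a QMIX-style monotonic mixer. It is illustrative rather than EPyMARL's actual implementation: the layer sizes are assumptions, and the 108-dim state simply assumes the global state is the concatenation of the three 36-dim agent observations. Monotonicity of Q_tot in each agent's utility comes from taking the absolute value of the hypernetwork-generated mixing weights, which is what keeps greedy decentralized actions consistent with the joint greedy action.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """QMIX-style mixing network (illustrative sketch, not EPyMARL's exact code).
    Hypernetworks conditioned on the global state generate the mixing weights;
    taking their absolute value keeps Q_tot monotonic in every agent's Q-value."""

    def __init__(self, n_agents=3, state_dim=108, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, -1)  # non-negative
        b1 = self.hyper_b1(state).view(b, 1, -1)
        hidden = F.elu(torch.bmm(agent_qs.view(b, 1, -1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, -1, 1)              # non-negative
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)                   # Q_tot
```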
21.3 3. Industry Expert Interview
21.3.1 3.1 Interview Coordination
Successfully coordinated and scheduled an Industry Expert Interview:
- Expert: Mr. James Nelsen
- Position: CIO of 1aAi, a Tulsa-based AI company
- Purpose: Gather industry perspective on MARL applications in warehouse automation
21.3.2 3.2 Interview Topics
Prepared discussion topics including:
- Real-world challenges in multi-robot coordination
- Industry adoption of MARL algorithms
- Scalability requirements for production deployments
- Transfer from simulation to physical robots
21.4 4. Bonus Agile Task: Windows Server QMIX Deployment
21.4.1 4.1 Task Overview
Based on Dr. Valderrama’s question about scaling up the current setup, I was assigned to replicate Price’s work on an enterprise-level Windows Server to:
- Showcase scaling-up capabilities
- Expand the cross-platform compatibility instructions
- Validate GitHub documentation for different environments
- Complete long-horizon training that failed on local machines
21.4.2 4.2 Environment Specifications
| Component | Specification |
|---|---|
| OS | Windows Server 2022 Datacenter |
| CPU | Intel Xeon E5-2680 v4 (14c/28t) |
| RAM Allocated | 196 GB |
| Actual RAM Usage | ~20 GB |
| GPU | CUDA-enabled (PyTorch 2.8.0) |
| Python | 3.9.13 |
| Framework | EPyMARL (QMIX, RNN agents) |
| Unity Environment | unity_warehouse |
| Agents | 3 |
| Action Space | 6 discrete |
| Observation Space | 36-dim vector |
| Episode Limit | 200 steps |
| Simulation Mode | no-graphics, time_scale = 50 |
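The no-graphics, time-scale-50 simulation mode in the table corresponds to how a built Unity executable is launched headlessly from Python. The sketch below uses the low-level mlagents_envs API; the executable path is a placeholder, and the exact launch wrapper in our EPyMARL integration may differ.

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import (
    EngineConfigurationChannel,
)

# Speed/headless settings are sent through a side channel before reset.
channel = EngineConfigurationChannel()
env = UnityEnvironment(
    file_name="unity_warehouse",   # placeholder path to the built executable
    no_graphics=True,              # headless mode for server-side training
    side_channels=[channel],
    worker_id=0,
)
channel.set_configuration_parameters(time_scale=50.0)  # run the sim at 50x real time

env.reset()
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]
print(behavior_name, spec.action_spec.discrete_branches)  # expect one 6-way branch
env.close()
```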
21.5 5. First Training Run (500,000 Steps)
21.5.1 5.1 Run Summary
| Metric | Value |
|---|---|
| Status | Successfully completed |
| Duration | ~3 hours 2 minutes |
| Total Steps | 500,199 |
21.5.2 5.2 Key Metrics
| Metric | Value |
|---|---|
| Test Return Mean | 95.2314 |
| Test Return Std | 0.0078 |
| Episode Steps Mean | 200.0 |
| Q_taken_mean | 3.05 |
| target_mean | 213.03 |
| TD Error | 0.49 |
| Epsilon | 0.10 (fully annealed) |
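The epsilon value of 0.10 indicates the exploration schedule had finished annealing before the run ended. A linear schedule of the kind used for QMIX exploration is sketched below; the start value and annealing length are illustrative assumptions, and only the final value of 0.10 comes from the table above.

```python
def epsilon_at(step, eps_start=1.0, eps_finish=0.10, anneal_steps=50_000):
    """Linear epsilon annealing (values illustrative except eps_finish=0.10)."""
    frac = min(step / anneal_steps, 1.0)   # fraction of the anneal completed
    return eps_start + frac * (eps_finish - eps_start)

# Example: fully annealed well before the 500k-step run ends
print(epsilon_at(0), epsilon_at(25_000), epsilon_at(500_000))  # 1.0, 0.55, 0.10
```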
21.5.3 5.3 Unity Evidence
Console logs confirmed repeated successful deliveries:
Package delivered in zone 01 Total: 58
Package delivered in zone 01 Total: 59
Package delivered in zone 01 Total: 60
21.5.4 5.4 Interpretation
What this run showed:
- Agents survived full episodes (200 steps)
- Returns increased significantly from near zero
- Variance collapsed, indicating stable learned behavior
- TD error dropped, showing clean value learning (sketched after this list)
- Q-values stayed healthy with no divergence
- Agents actively delivered packages
- Training remained stable with only ~20 GB of RAM used
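For context on the Q_taken_mean, target_mean, and TD-error metrics, a QMIX-style update regresses the mixed joint value onto a bootstrapped target, and the reported TD error is the gap between the two. The sketch below is a simplified illustration rather than EPyMARL's training loop; it reuses the MonotonicMixer sketch from Section 2.2, and the tensor shapes and reward/done handling are assumptions.

```python
import torch

def qmix_td_loss(mixer, target_mixer, q_taken, q_target_max,
                 state, next_state, reward, done, gamma=0.99):
    """Simplified QMIX TD loss.
    q_taken / q_target_max: per-agent chosen and greedy-target Q-values, (batch, n_agents).
    reward, done: (batch, 1) tensors."""
    q_tot = mixer(q_taken, state)                          # (batch, 1)
    with torch.no_grad():
        target_q_tot = target_mixer(q_target_max, next_state)
        y = reward + gamma * (1.0 - done) * target_q_tot   # bootstrapped target
    td_error = y - q_tot                                   # what "TD Error" tracks
    return (td_error ** 2).mean()
```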
21.6 6. Second Training Run (1,000,019 Steps)
21.6.1 6.1 Run Summary
| Metric | Value |
|---|---|
| Status | Completed full 1M steps |
| Duration | ~6 hours 18 minutes |
| Total Steps | 1,000,019 |
21.6.2 6.2 Key Metrics (from Price’s analysis)
| Metric | Start | End | Change |
|---|---|---|---|
| Test Return | 0 | 238.6 | Major improvement |
| Training Return | 2.67 | 231.4 | ~87× improvement |
| Peak Return | - | 443 | at ~920k steps |
| Q-values | -0.11 | +8.67 | Healthy growth |
| TD Error Abs | 7.61 | Low, stable | Strong convergence |
| grad_norm | - | ~127 | Healthy for QMIX |
| return_std | - | ~0.01 | Nearly deterministic |
21.6.3 6.3 Learning Curve Breakdown
Learning Phases:
├── Steps 0-100k: Low returns (exploration)
├── Steps 100k-200k: Rapid learning spike
├── Steps 200k-1M: Consistent 200-250 returns
└── Peak performance: 443 at ~920k steps
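These phases were read off the logged learning curve. A minimal way to regenerate the plot from an EPyMARL run is sketched below; it assumes the default Sacred file logging, and both the results path and the metric key are assumptions that may need adjusting for a specific run.

```python
import json
import matplotlib.pyplot as plt

# Hypothetical path: Sacred's FileStorageObserver writes one metrics.json per
# run; adjust the run id and metric key to match your experiment.
with open("results/sacred/1/metrics.json") as f:
    metrics = json.load(f)

curve = metrics["test_return_mean"]            # assumed key name
plt.plot(curve["steps"], curve["values"])
plt.xlabel("environment steps")
plt.ylabel("test return mean")
plt.title("QMIX on unity_warehouse (1M-step run)")
plt.savefig("learning_curve.png", dpi=150)
```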
21.6.4 6.4 Behavioral Conclusions
- Policy became deterministic (very low variance)
- Value network stable across full million steps
- Q-values strong, no collapse
- Agents continually delivered packages during training
- Windows Server capable of completing long-horizon Unity runs
21.7 7. Run Comparison (500k vs 1M)
| Metric | First Run (500k) | Second Run (1M) | Verdict |
|---|---|---|---|
| Test Return Mean | 95.2 | 238.6 | Major improvement |
| Peak Return | ~110-150 (inferred) | 443 | Strong high-end learning |
| Variance (std) | 0.0078 | 0.01 | Both extremely stable |
| Episode Steps | 200 | 200 | Fully stable |
| TD Error | 0.49 | Lower & consistent | Strong convergence |
| Runtime | 3h | 6h18m | Linear scaling |
| Deliveries | Confirmed | Confirmed | Real behavior in both runs |
Conclusion: Both runs were successful, but the 1M-step run demonstrated full agent maturation.
21.8 8. Why This is a Success
21.8.1 8.1 Local Machine Limitations Removed
Personal Computer Issues:
- Could not complete long Unity runs
- Crashed frequently
- Struggled with memory & GPU load
Windows Server Advantages:
- Completed two full runs (500k & 1M)
- Stable execution
- Plenty of RAM (only ~20GB used)
- CUDA acceleration enabled
- Supports production-level Unity training
21.8.2 8.2 Key Insights (Price’s Validation)
- Agents learned significantly
  - Test returns soared from 0 to 238
  - Q-values healthy (+8.67)
  - TD error collapsed
- Stable and consistent
  - Variance near zero
  - Long episodes (200 steps)
  - No early deaths, no stuck states
  - Very clean training gradients
- Deliveries were real
  - “Package delivered…” in Unity console
  - Navigation and pickup behavior confirmed
- Production-quality behavior
  - Deterministic learned policy
  - No collapse, no divergence
  - Return curve matched textbook QMIX convergence
- Hard evidence
  - QMIX learned end-to-end in Unity
  - Windows Server can complete large-scale MARL runs
  - We now have repeatable, reliable, scalable training
21.9 9. Documentation Coordination
21.9.1 9.1 Tasks Completed
- Consolidated all team member deliverables (Weeks 1-5)
- Organized GitHub repository structure
- Created reproducible experiment configurations (see the sketch after this list)
- Documented installation instructions for multiple platforms
- Prepared final presentation materials
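As an illustration of what "reproducible experiment configurations" captures, the snippet below records the run settings reported in this document as a plain Python dict. EPyMARL itself reads YAML config files, so the field names here are illustrative rather than its exact keys, and the gamma and seed values are assumptions.

```python
# Hypothetical snapshot of the settings pinned per experiment; values marked
# "assumed" are placeholders, the rest are taken from the tables above.
experiment_config = {
    "algorithm": "qmix",
    "env": "unity_warehouse",
    "n_agents": 3,
    "obs_dim": 36,
    "n_actions": 6,
    "episode_limit": 200,
    "t_max": 1_000_000,      # total environment steps (second run)
    "epsilon_finish": 0.10,
    "gamma": 0.99,           # assumed discount factor
    "seed": 0,               # assumed; record the actual seed per run
    "time_scale": 50,
    "no_graphics": True,
}

if __name__ == "__main__":
    import json
    print(json.dumps(experiment_config, indent=2))
```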
21.9.2 9.2 Cross-Platform Validation
The Windows Server deployment validated that our GitHub instructions work across:
- macOS (original development)
- Windows Server 2022 (enterprise deployment)
- Linux (HPC clusters via team members)
21.10 10. Deliverables Summary
| Deliverable | Status |
|---|---|
| Research paper finalization | Complete |
| Introduction/methodology/results/conclusion | Complete |
| Industry expert interview coordination | Complete |
| Windows Server environment setup | Complete |
| First training run (500k steps) | Complete |
| Second training run (1M steps) | Complete |
| Documentation consolidation | Complete |
| Cross-platform validation | Complete |
21.11 11. Final Summary
21.11.1 11.1 Week 5 Accomplishments
- Research Paper: Integrated all findings into cohesive publication-ready document
- Industry Interview: Coordinated interview with Mr. James Nelsen (CIO of 1aAi)
- Bonus Agile Task: Successfully deployed and ran QMIX in Unity on Windows Server
21.11.2 11.2 Windows Server Deployment Results
- Both runs (500k and 1M) were successful
- Agents navigated properly, picked up packages, and delivered them
- Achieved strong returns with stable, deterministic behavior
- Windows Server solved the compute bottleneck
- Dr. Valderrama’s requirement for long training horizons is now met
- This validates the entire sim-to-Unity training pipeline
21.11.3 11.3 Project Impact
The Windows Server deployment significantly strengthens:
- The final research paper with production-scale evidence
- The demo video showing learned warehouse coordination
- The reproducibility of our work across different platforms
21.12 12. References
Rashid, T., et al. (2018). QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ICML.
Papoudakis, G., et al. (2021). Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. NeurIPS.
Unity ML-Agents Toolkit: https://github.com/Unity-Technologies/ml-agents
EPyMARL: https://github.com/oxwhirl/epymarl
RWARE: https://github.com/Farama-Foundation/RWARE