20  Week 5 Deliverable – Final Research Paper & Windows Server Deployment

Author

Salmon Riaz

Published

November 19, 2025

21 Week 5 – Research Paper Finalization & Bonus Agile Task

21.1 1. Weekly Objectives

This week’s focus included:

  • Integrating all compiled findings into a cohesive research paper
  • Writing introduction, methodology, results, and conclusion sections
  • Synthesizing insights across all team member implementations
  • Preparing research paper for potential publication/submission
  • Coordinating and consolidating final documentation
  • Coordinating and scheduling Industry Expert Interview
  • Bonus Agile Task: Windows Server QMIX Deployment

21.2 2. Research Paper Finalization

21.2.1 2.1 Sections Completed

The final research paper integrates findings from all five weeks:

| Section | Content |
|---|---|
| Introduction | Problem motivation, warehouse robotics challenges |
| Related Work | QMIX, IPPO-LSTM, MASAC literature review |
| Preliminaries | CTDE paradigm, mathematical formulation |
| Methods | Algorithm descriptions, hyperparameter configurations |
| Experiments | MPE, RWARE, Unity training results |
| Results | Performance comparisons, scaling analysis |
| Discussion | Key findings, limitations, practical implications |
| Conclusion | Summary and future directions |

21.2.2 2.2 Key Findings Synthesized

The research paper documents the following major findings:

  1. Algorithm Selection: QMIX emerged as the best-performing algorithm for warehouse robotics after systematic comparison with IPPO-LSTM and MASAC

  2. Hyperparameter Sensitivity: Default configurations fail on RWARE; extended epsilon annealing (5M+ steps) critical for sparse rewards

  3. Value Decomposition Advantages: QMIX’s monotonic mixing network outperforms independent learning by 15-25%

  4. Scaling Challenges: Performance degrades sub-linearly with agent count, but training requirements increase super-linearly

  5. Unity Integration Success: QMIX successfully deployed in 3D Unity warehouse environment with LIDAR-equipped robots
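Finding 3 rests on QMIX's monotonicity constraint: the mixing network's weights are generated by hypernetworks from the global state and forced non-negative via an absolute value, so Q_tot can never decrease when any agent's Q-value increases. A minimal NumPy sketch of that mixing step (layer sizes and the single hidden layer are illustrative, not the exact EPyMARL architecture, which also uses ELU rather than ReLU):

```python
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, STATE_DIM, EMBED = 3, 36, 32  # matches our 3-agent, 36-dim setup

# Hypernetwork parameters: map the global state to mixing weights/biases.
W1_hyper = rng.normal(size=(STATE_DIM, N_AGENTS * EMBED))
b1_hyper = rng.normal(size=(STATE_DIM, EMBED))
W2_hyper = rng.normal(size=(STATE_DIM, EMBED))
b2_hyper = rng.normal(size=(STATE_DIM, 1))

def mix(agent_qs, state):
    """Monotonic mixing: Q_tot as a state-conditioned network over agent Qs."""
    # abs() enforces non-negative mixing weights, so dQ_tot/dQ_i >= 0.
    w1 = np.abs(state @ W1_hyper).reshape(N_AGENTS, EMBED)
    b1 = state @ b1_hyper
    w2 = np.abs(state @ W2_hyper)                 # (EMBED,)
    b2 = state @ b2_hyper                         # (1,)
    hidden = np.maximum(agent_qs @ w1 + b1, 0.0)  # ReLU is monotone, so the
    return float(hidden @ w2 + b2)                # constraint is preserved

state = rng.normal(size=STATE_DIM)
qs = np.array([1.0, 2.0, 3.0])
q_tot = mix(qs, state)
# Monotonicity check: raising any one agent's Q never lowers Q_tot.
assert all(mix(qs + np.eye(N_AGENTS)[i], state) >= q_tot for i in range(N_AGENTS))
```

This is the structural reason QMIX can train centrally but act decentrally: each agent's greedy action with respect to its own Q is also greedy with respect to Q_tot.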


21.3 3. Industry Expert Interview

21.3.1 3.1 Interview Coordination

Successfully coordinated and scheduled an Industry Expert Interview:

  • Expert: Mr. James Nelsen
  • Position: CIO of 1aAi
  • Location: Tulsa-based AI company
  • Purpose: Gather industry perspective on MARL applications in warehouse automation

21.3.2 3.2 Interview Topics

Prepared discussion topics including:

  • Real-world challenges in multi-robot coordination
  • Industry adoption of MARL algorithms
  • Scalability requirements for production deployments
  • Transfer from simulation to physical robots

21.4 4. Bonus Agile Task: Windows Server QMIX Deployment

21.4.1 4.1 Task Overview

Based on Dr. Valderrama’s question about scaling up the current setup, I was assigned to replicate Price’s work on an enterprise-level Windows Server to:

  • Showcase scaling-up capabilities
  • Increase depth of compatibility/cross-platform instructions
  • Validate GitHub documentation for different environments
  • Complete long-horizon training that failed on local machines

21.4.2 4.2 Environment Specifications

| Component | Specification |
|---|---|
| OS | Windows Server 2022 Datacenter |
| CPU | Intel Xeon E5-2680 v4 (14 cores / 28 threads) |
| RAM allocated | 196 GB |
| Actual RAM usage | ~20 GB |
| GPU | CUDA-enabled (PyTorch 2.8.0) |
| Python | 3.9.13 |
| Framework | EPyMARL (QMIX, RNN agents) |
| Unity environment | unity_warehouse |
| Agents | 3 |
| Action space | 6 discrete |
| Observation space | 36-dim vector |
| Episode limit | 200 steps |
| Simulation mode | no-graphics, time_scale = 50 |
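EPyMARL runs are driven by YAML configs, so the table above roughly corresponds to an environment config like the following sketch. The key names here are illustrative — the exact schema depends on the team’s unity_warehouse Gym wrapper, not on stock EPyMARL:

```yaml
# Hypothetical env config mirroring the specifications above.
env: unity_warehouse
env_args:
  n_agents: 3
  episode_limit: 200     # steps per episode
  no_graphics: true      # headless run for the server
  time_scale: 50         # Unity engine speed-up
agent: rnn               # recurrent agents, as in the EPyMARL QMIX setup
```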

21.5 5. First Training Run (500,000 Steps)

21.5.1 5.1 Run Summary

| Metric | Value |
|---|---|
| Status | Successfully completed |
| Duration | ~3 hours 2 minutes |
| Total steps | 500,199 |

21.5.2 5.2 Key Metrics

| Metric | Value |
|---|---|
| Test return mean | 95.2314 |
| Test return std | 0.0078 |
| Episode steps mean | 200.0 |
| q_taken_mean | 3.05 |
| target_mean | 213.03 |
| TD error | 0.49 |
| Epsilon | 0.10 (fully annealed) |
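The “fully annealed” epsilon reflects a linear epsilon-greedy schedule: exploration decays from 1.0 to a floor of 0.10 over a fixed number of environment steps, then holds. A sketch of that schedule (the 50k-step anneal horizon here is illustrative — our RWARE finding above argues for much longer horizons on sparse-reward tasks):

```python
def epsilon_at(t, start=1.0, finish=0.10, anneal_time=50_000):
    """Linear epsilon annealing: decay from `start` to `finish` over
    `anneal_time` environment steps, then hold at `finish`."""
    if t >= anneal_time:
        return finish
    return start + (finish - start) * (t / anneal_time)

print(epsilon_at(0))        # 1.0  (pure exploration at the start)
print(epsilon_at(25_000))   # 0.55 (halfway through the decay)
print(epsilon_at(500_000))  # 0.1  (fully annealed, as logged)
```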

21.5.3 5.3 Unity Evidence

Console logs confirmed repeated successful deliveries:

Package delivered in zone 01 Total: 58
Package delivered in zone 01 Total: 59
Package delivered in zone 01 Total: 60

21.5.4 5.4 Interpretation

What this run showed:

  • Agents survived full 200-step episodes
  • Returns increased significantly from near zero
  • Variance collapsed, indicating stable learned behavior
  • TD error dropped, showing clean value learning
  • Q-values stayed healthy, with no divergence
  • Agents actively delivered packages
  • Training remained stable with only ~20 GB RAM usage

21.6 6. Second Training Run (1,000,019 Steps)

21.6.1 6.1 Run Summary

| Metric | Value |
|---|---|
| Status | Completed full 1M steps |
| Duration | ~6 hours 18 minutes |
| Total steps | 1,000,019 |

21.6.2 6.2 Key Metrics (from Price’s analysis)

| Metric | Start | End | Change |
|---|---|---|---|
| Test return | 0 | 238.6 | Major improvement |
| Training return | 2.67 | 231.4 | ~770% improvement |
| Peak return | – | 443 | Reached at ~920k steps |
| Q-values | -0.11 | +8.67 | Healthy growth |
| TD error (abs) | 7.61 | Low, stable | Strong convergence |
| grad_norm | – | ~127 | Healthy for QMIX |
| return_std | – | ~0.01 | Nearly deterministic |
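The q_taken, target, and TD-error metrics above are related by the one-step bootstrapped update that QMIX optimizes: the TD error is the gap between the mixed Q-value of the taken joint action and the reward-plus-discounted-next-value target. Schematically (toy numbers; GAMMA = 0.99 is an assumption, the common default, not a value confirmed in our logs):

```python
GAMMA = 0.99  # discount factor; assumed default, not read from our config

def td_error(q_taken, reward, q_tot_next_max, terminated):
    """One-step TD error for the mixed value, as logged during training."""
    target = reward + GAMMA * (1.0 - terminated) * q_tot_next_max
    return target - q_taken

# Toy mid-episode transition with a team reward of 2.0
err = td_error(q_taken=3.05, reward=2.0, q_tot_next_max=3.1, terminated=0.0)
```

A small, stable TD error with growing q_taken is exactly the “clean value learning” pattern both runs show.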

21.6.3 6.3 Learning Curve Breakdown

Learning Phases:
├── Steps 0-100k: Low returns (exploration)
├── Steps 100k-200k: Rapid learning spike
├── Steps 200k-1M: Consistent 200-250 returns
└── Peak performance: 443 at ~920k steps

21.6.4 6.4 Behavioral Conclusions

  • Policy became deterministic (very low variance)
  • Value network stable across full million steps
  • Q-values strong, no collapse
  • Agents continually delivered packages during training
  • Windows Server capable of completing long-horizon Unity runs

21.7 7. Run Comparison (500k vs 1M)

| Metric | First Run (500k) | Second Run (1M) | Verdict |
|---|---|---|---|
| Test return mean | 95.2 | 238.6 | Major improvement |
| Peak return | ~110-150 (inferred) | 443 | Strong high-end learning |
| Variance (std) | 0.0078 | 0.01 | Both extremely stable |
| Episode steps | 200 | 200 | Fully stable |
| TD error | 0.49 | Lower & consistent | Strong convergence |
| Runtime | 3h 2m | 6h 18m | Approximately linear scaling |
| Deliveries | Confirmed | Confirmed | Real behavior in both runs |
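The headline gain in that comparison can be quantified directly from the two test-return means (keeping in mind the runs differ only in training budget, so this measures the value of the extra 500k steps):

```python
run_500k, run_1m = 95.2, 238.6  # test return means from the two runs

improvement = (run_1m - run_500k) / run_500k * 100
print(f"{improvement:.0f}% higher test return at 1M steps")  # ~151%
```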

Conclusion: Both runs were successful, but the 1M-step run demonstrated full agent maturation.


21.8 8. Why This is a Success

21.8.1 8.1 Local Machine Limitations Removed

Personal Computer Issues:

  • Could not complete long Unity runs
  • Crashed frequently
  • Struggled with memory & GPU load

Windows Server Advantages:

  • Completed two full runs (500k & 1M)
  • Stable execution
  • Plenty of RAM (only ~20GB used)
  • CUDA acceleration enabled
  • Supports production-level Unity training

21.8.2 8.2 Key Insights (Price’s Validation)

  1. Agents learned significantly
    • Test returns soared from 0 to 238
    • Q-values healthy (+8.67)
    • TD error collapsed
  2. Stable and consistent
    • Variance near zero
    • Long episodes (200 steps)
    • No early deaths, no stuck states
    • Very clean training gradients
  3. Deliveries were real
    • “Package delivered…” in Unity console
    • Navigation and pickup behavior confirmed
  4. Production-quality behavior
    • Deterministic learned policy
    • No collapse, no divergence
    • Return curve matched textbook QMIX convergence
  5. Hard evidence
    • QMIX learned end-to-end in Unity
    • Windows Server can complete large-scale MARL runs
    • We now have repeatable, reliable, scalable training

21.9 9. Documentation Coordination

21.9.1 9.1 Tasks Completed

  • Consolidated all team member deliverables (Weeks 1-5)
  • Organized GitHub repository structure
  • Created reproducible experiment configurations
  • Documented installation instructions for multiple platforms
  • Prepared final presentation materials

21.9.2 9.2 Cross-Platform Validation

The Windows Server deployment validated that our GitHub instructions work across:

  • macOS (original development)
  • Windows Server 2022 (enterprise deployment)
  • Linux (HPC clusters via team members)

21.10 10. Deliverables Summary

| Deliverable | Status |
|---|---|
| Research paper finalization | Complete |
| Introduction/methodology/results/conclusion | Complete |
| Industry expert interview coordination | Complete |
| Windows Server environment setup | Complete |
| First training run (500k steps) | Complete |
| Second training run (1M steps) | Complete |
| Documentation consolidation | Complete |
| Cross-platform validation | Complete |

21.11 11. Final Summary

21.11.1 11.1 Week 5 Accomplishments

  • Research Paper: Integrated all findings into cohesive publication-ready document
  • Industry Interview: Coordinated interview with Mr. James Nelsen (CIO of 1aAi)
  • Bonus Agile Task: Successfully deployed and ran QMIX in Unity on Windows Server

21.11.2 11.2 Windows Server Deployment Results

  • Both runs (500k and 1M) were successful
  • Agents navigated properly, picked up packages, and delivered them
  • Achieved strong returns with stable, deterministic behavior
  • Windows Server solved the compute bottleneck
  • Dr. Valderrama’s requirement for long training horizons is now met
  • This validates the entire sim-to-Unity training pipeline

21.11.3 11.3 Project Impact

The Windows Server deployment significantly strengthens:

  1. The final research paper with production-scale evidence
  2. The demo video showing learned warehouse coordination
  3. The reproducibility of our work across different platforms

21.12 12. References

  1. Rashid, T., et al. (2018). QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ICML.

  2. Papoudakis, G., et al. (2021). Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. NeurIPS Datasets and Benchmarks Track.

  3. Unity ML-Agents Toolkit: https://github.com/Unity-Technologies/ml-agents

  4. EPyMARL: https://github.com/oxwhirl/epymarl

  5. RWARE: https://github.com/Farama-Foundation/RWARE