MARL Warehouse Robots - Team Deliverables
Multi-Agent Reinforcement Learning for Cooperative Warehouse Automation
Project Overview
This book compiles the weekly deliverables from our team’s 5-week multi-agent reinforcement learning (MARL) project focused on training cooperative warehouse robots.
Project Goal: Train multi-agent warehouse robots using reinforcement learning to coordinate package retrieval and delivery in increasingly complex environments.
Team Members:
- Price Allman: Unity integration, QMIX implementation, learning failure analysis
- Lian Thang: Visualization, comparative analysis, CPU/GPU performance studies
- Dre Simmons: Code documentation, implementation guides, setup instructions
- Salmon Riaz: Research paper compilation and integration
Project Timeline
Week 1: MPE Environment Training
All team members trained agents on the Multi-Particle Environment (MPE) Simple Spread task to establish baseline MARL skills with dense rewards.
Key Algorithms: IPPO-LSTM, Behavioral Cloning warm-start
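As a point of reference for the Week 1 setup, the sketch below loads the MPE Simple Spread task through PettingZoo's parallel API and runs a random-action rollout. The IPPO-LSTM training and behavioral-cloning warm-start are not shown, and the simple_spread_v3 version suffix is an assumption that depends on the installed PettingZoo release.

```python
# Minimal MPE Simple Spread rollout via PettingZoo's parallel API (random actions only;
# the IPPO-LSTM / behavioral-cloning training code is not shown here).
from pettingzoo.mpe import simple_spread_v3  # version suffix depends on the installed release

env = simple_spread_v3.parallel_env(N=3, max_cycles=25)  # 3 cooperating particles
observations, infos = env.reset(seed=42)

while env.agents:  # episode ends when all agents are done
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)

env.close()
```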
Week 2: RWARE Environment Training
Transitioned to the Robotic Warehouse (RWARE) environment, which combines sparse rewards with grid-based coordination.
Key Focus: Algorithm comparison (Vanilla vs Advanced IPPO), sample efficiency analysis
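For context on the Week 2 shift, the snippet below instantiates an RWARE warehouse and steps it with random actions; reward stays at zero until a requested shelf is actually delivered, which is the sparse-reward property discussed above. The environment id and the gym-vs-gymnasium import are assumptions that depend on the installed rware version.

```python
# Minimal RWARE rollout (random actions). The env id and gym import are assumptions;
# newer rware releases register "-v2" ids and use gymnasium instead of gym.
import gym
import rware  # registers the rware-* environment ids on import

env = gym.make("rware-tiny-2ag-v1")  # tiny grid layout with 2 agents (assumed id)
obs = env.reset()

for _ in range(500):
    actions = env.action_space.sample()  # one discrete action per agent
    obs, rewards, done, info = env.step(actions)
    # rewards are sparse: non-zero only when a requested shelf reaches a goal tile

env.close()
```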
Week 3: Unity Integration & Hard RWARE
- Price: Integrated QMIX with Unity ML-Agents 4.0 (see the mixing-network sketch after this list)
- Dre: Hard RWARE training
- Lian: CPU/GPU performance comparison
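For readers unfamiliar with QMIX, the sketch below shows the standard monotonic mixing network from the QMIX paper in PyTorch: hypernetworks conditioned on the global state generate non-negative mixing weights, so the joint value Q_tot is monotone in each agent's Q-value. This is a generic sketch, not the exact module from the Unity integration; the per-agent recurrent Q-networks and the ML-Agents plumbing are omitted.

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    """Monotonic mixing network: Q_tot is a state-conditioned, monotone
    combination of per-agent Q-values (weights kept non-negative via abs)."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks generate mixing weights/biases from the global state.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)  # Q_tot
```

During training, the target Q_tot comes from a target-network copy of the same mixer, and gradients flow back into the per-agent networks through the monotone mixing.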
Week 4: Extended Training
- Price: Deployed enhanced Unity warehouse with package queue system (see the connection sketch after this list)
  - Partial training run (530k/1M steps)
  - Emerging package delivery behaviors observed
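As a rough illustration of how a Python trainer talks to a Unity warehouse build, the snippet below uses the ML-Agents low-level Python API (mlagents_envs) to connect to a compiled executable and step it with random actions. The build file name is a placeholder, and the project's package-queue logic lives on the Unity side and is not shown.

```python
# Connecting a Python trainer to a compiled Unity warehouse build via the
# ML-Agents low-level API (random actions; the build path is a placeholder).
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="WarehouseBuild", seed=1, no_graphics=True)
env.reset()

behavior_name = list(env.behavior_specs)[0]  # single robot behavior assumed
spec = env.behavior_specs[behavior_name]

for _ in range(1000):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    if len(decision_steps) > 0:
        # Random actions stand in for the trained QMIX agent networks here.
        actions = spec.action_spec.random_action(len(decision_steps))
        env.set_actions(behavior_name, actions)
    env.step()

env.close()
```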
Week 5: Final Analysis & Deliverables
- Price: Critical learning failure analysis: discovered that agents rely on exploration, not on learned policies
- Lian: Created comparative visualizations across all experiments
- Dre: Finalized code documentation
- Salmon: Research paper integration
Key Findings
Environment Complexity Hierarchy: MPE (dense rewards, simple dynamics) → RWARE (sparse rewards, grid world) → Unity (sparse rewards, full physics)
Algorithm Performance:
- Advanced IPPO: 3× higher returns than vanilla IPPO and 4× better sample efficiency
- QMIX with Unity: Achieved high training returns (207.96) but failed to learn effective policies
Critical Discovery (Week 5): Pure greedy evaluation (ε=0.0) revealed the true learned performance (see the evaluation sketch after this list):
- Training with exploration: 207.96 return
- Pure greedy (ε=0.0): 0.21 return (near-zero)
- With 10% exploration (ε=0.1): 191-253 return, a 904-1207× improvement over pure greedy
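A minimal evaluation harness along these lines makes the ε=0.0 vs ε=0.1 comparison concrete. Here env and q_network are hypothetical stand-ins for the project's environment wrapper and trained per-agent Q-networks, so the exact interfaces are assumptions.

```python
import numpy as np

def evaluate(env, q_network, episodes=20, epsilon=0.0, seed=0):
    """Average episode return of the trained agents at a fixed exploration rate.
    epsilon=0.0 measures the learned policy alone; epsilon>0 mixes in random actions.
    `env` and `q_network` are hypothetical stand-ins for the project's wrappers."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        done, ep_return = False, 0.0
        while not done:
            actions = []
            for agent_id, agent_obs in enumerate(obs):
                if rng.random() < epsilon:
                    actions.append(env.action_space[agent_id].sample())  # random action
                else:
                    actions.append(int(np.argmax(q_network(agent_id, agent_obs))))  # greedy action
            obs, rewards, done, _ = env.step(actions)
            ep_return += float(np.sum(rewards))
        returns.append(ep_return)
    return float(np.mean(returns))

# A large gap between these two numbers means training-time returns were driven
# by exploration rather than by the learned greedy policy:
# greedy_return = evaluate(env, q_network, epsilon=0.0)
# mixed_return  = evaluate(env, q_network, epsilon=0.1)
```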
Root Causes of Learning Failure:
- Sparse rewards insufficient for credit assignment
- Random exploration masked learning failure in training metrics
- Hardware constraints (CPU-only, 16GB RAM) limited training duration
Lessons Learned
- Always test with ε=0.0: Pure greedy evaluation reveals true learned performance
- High training returns ≠ successful learning: Exploration can mask learning failures
- Hardware matters: Serious MARL research requires GPU computing and extended training
- Systematic debugging pays off: Hypothesis-driven testing (ε=0.0 vs ε=0.1) exposed the root cause
Repository
Full code, documentation, and training checkpoints: MARL-QMIX-Warehouse-Robots