15  Week 4 Deliverable - QMIX Training and Configuration in Unity

Author

Dre Simmons

Published

November 14, 2025

15.1 Project Overview

Our project focuses on training multi-agent warehouse robots using reinforcement learning to navigate a Unity-based warehouse environment, identify packages that are ready for pickup, and deliver them to the correct locations. The goal is to replicate realistic multi-agent coordination behaviors using cooperative RL algorithms.

15.2 Week 4 Accomplishments

15.2.1 Environment Development

  • Worked collaboratively on improving the Unity Warehouse Simulation:
    Helped refine the observation and reward design so that agents received informative observations and the reward structure aligned with the pickup-and-delivery behaviors we wanted to train (an illustrative reward breakdown follows this list).

  • Helped review the package-readiness system:
    Verified that blue-highlighted (ready) packages were correctly recognized by the agents and that the readiness logic fired at the intended points in an episode.
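
For context, the sketch below illustrates the kind of reward breakdown this environment work was aiming at: a small per-step cost, a bonus for picking up a ready (blue-highlighted) package, and a larger bonus for a correct delivery. The term names and values are hypothetical placeholders for illustration, not the settings actually used in our Unity project.

```python
# Illustrative reward terms only; the actual values used in the Unity
# environment are not reproduced here.
REWARD_TERMS = {
    "step_penalty": -0.01,    # small per-step cost to discourage idling
    "pickup_ready": 0.5,      # picking up a package flagged as ready (blue)
    "deliver_correct": 1.0,   # dropping the package at its designated zone
}

def step_reward(picked_up_ready: bool, delivered_correct: bool) -> float:
    """Combine the per-step penalty with any event bonuses for one agent."""
    reward = REWARD_TERMS["step_penalty"]
    if picked_up_ready:
        reward += REWARD_TERMS["pickup_ready"]
    if delivered_correct:
        reward += REWARD_TERMS["deliver_correct"]
    return reward
```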

15.2.2 Algorithm Integration

  • QMIX Integration Support:
    Contributed to validating that the QMIX implementation was functioning correctly inside the Unity environment, including confirming the data flow from the individual agents’ Q-values into the mixing network (a sketch of this mixing step appears after this list).

  • Hyperparameter Adjustments (My Primary Focus):
    My main contribution this week was tuning and evaluating several hyperparameters to improve training stability and learning performance (summarized in the configuration sketch after this list), including:

    • Learning rate
    • Batch size
    • Replay buffer size
    • Exploration schedule (ε-decay)
    • Hidden-layer dimensions
  • Assisted in transferring RWARE hyperparameters:
    Helped adapt previously successful settings from the RWARE domain to better match the Unity environment’s observation space and reward dynamics.
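
As a reference for the Q-value-to-mixer data flow mentioned above, here is a minimal PyTorch sketch of the standard QMIX mixing network: per-agent Q-values are combined into a joint Q_tot, with mixing weights produced by hypernetworks conditioned on the global state and constrained to be non-negative so that Q_tot is monotonic in each agent’s Q-value. This is a generic sketch of the published QMIX architecture, not our project’s actual code; the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """Monotonic mixing network: combines per-agent Q-values into Q_tot."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks generate the mixing weights from the global state.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        agent_qs = agent_qs.view(bs, 1, self.n_agents)
        # Absolute values keep dQ_tot/dQ_i >= 0 (monotonicity constraint).
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs, w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(bs, 1)
```

The configuration sketch below shows the shape of the settings I was iterating on, covering the hyperparameters listed above (learning rate, batch size, replay buffer size, ε-decay schedule, and hidden-layer dimensions). The values shown are placeholders for illustration, not the settings we ultimately used this week.

```python
# Placeholder values for illustration; not the tuned settings from this run.
qmix_hparams = {
    "lr": 5e-4,                     # learning rate
    "batch_size": 32,               # episodes sampled per update
    "buffer_size": 5000,            # replay buffer capacity (episodes)
    "epsilon_start": 1.0,           # epsilon-greedy exploration schedule
    "epsilon_finish": 0.05,
    "epsilon_anneal_time": 50_000,  # timesteps over which epsilon decays
    "hidden_dim": 64,               # hidden-layer width of the agent networks
}

def epsilon_at(t: int, h: dict = qmix_hparams) -> float:
    """Linearly anneal epsilon from start to finish over the anneal window."""
    frac = min(t / h["epsilon_anneal_time"], 1.0)
    return h["epsilon_start"] + frac * (h["epsilon_finish"] - h["epsilon_start"])
```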

15.3 Training Results

I participated in running and monitoring the extended QMIX training sessions alongside my teammate.

15.3.1 Long-Run QMIX Training (Planned: 1,000,000 Timesteps)

  • We executed and monitored the training run jointly.
  • The run stopped prematurely at around 530,000 timesteps (approximately 8.5 hours of wall-clock time) due to a system crash.

15.3.2 Observed Learning Progress

  • Test return increased from 0.00 to about 0.19
  • Training loss decreased by roughly 97%
  • Agents began showing coordinated behaviors, including:
    • Detecting “ready” packages
    • Moving toward and picking up packages
    • Delivering packages to designated drop-off zones

15.4 Next Steps

  • Continue training to reach the full 1,000,000 timesteps
  • Evaluate whether further hyperparameter refinement is needed
  • Support recording a video demonstration for the project results
  • Contribute to the write-up on hyperparameters, training behavior, and experimental methodology