3 Week 1 Deliverable – QMIX on MPE
4 Week 1 – MPE Training & QMIX Implementation
4.1 1. Environment Setup
- Python 3.10 environment created via Conda
- Installed PyTorch, NumPy, Matplotlib, TensorBoard
- PettingZoo environment installed and verified with simple_spread_v3
- Confirmed reproducibility with test scripts
This preparation provides a stable, reproducible base for the QMIX training runs; a minimal verification sketch is shown below.
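A quick check along these lines (a minimal sketch; the actual test scripts may differ, and with recent PettingZoo versions reset returns an (observations, infos) tuple) confirms that the parallel simple_spread_v3 environment loads and that seeded resets are reproducible:

import numpy as np
from pettingzoo.mpe import simple_spread_v3

# Create the parallel environment used throughout this report
env = simple_spread_v3.parallel_env(N=3, max_cycles=25)

obs_a, _ = env.reset(seed=0)
obs_b, _ = env.reset(seed=0)

# Seeded resets should yield identical starting observations
for agent_name in obs_a:
    assert np.allclose(obs_a[agent_name], obs_b[agent_name])
print("Reproducibility check passed for", len(obs_a), "agents")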
4.2 2. Task Objective
Train a multi-agent reinforcement learning model on the PettingZoo MPE simple_spread environment:
- Agents: 3 cooperative agents
- Landmarks: 3 targets to cover
- Observations: 18-dimensional continuous vector per agent (own velocity and position, relative positions of the landmarks and the other agents, and communication)
- Actions: 5 discrete actions: NOOP, UP, DOWN, LEFT, RIGHT
- Rewards: dense; the negative sum over landmarks of the distance to the nearest agent, with a penalty for agent collisions
Goal: Agents learn to spread out and cover all landmarks efficiently while minimizing collisions.
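These dimensions can be confirmed directly from the environment's spaces (a minimal sketch using the PettingZoo parallel API):

from pettingzoo.mpe import simple_spread_v3

env = simple_spread_v3.parallel_env(N=3, max_cycles=25)
env.reset(seed=0)

for agent_name in env.agents:
    # Expected per agent: Box observation of shape (18,) and Discrete(5) actions
    print(agent_name, env.observation_space(agent_name), env.action_space(agent_name))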
4.3 3. Algorithm Choice: QMIX
QMIX is a value-based MARL method that enables centralized training with decentralized execution (CTDE) by:
- Learning an individual Q-network per agent that estimates action values from that agent’s local observations
- Combining the individual Q-values through a mixing network conditioned on the global state
- Enforcing a monotonicity constraint so that increasing any agent’s local Q-value can never decrease the global Q-value
Global state information is used only during training (inside the mixing network); at execution time each agent acts greedily on its own Q-network, which is what enables decentralized execution.
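Formally, in the notation of Rashid et al. (2018), the factorisation is constrained so that

\[ \frac{\partial Q_{tot}(\boldsymbol{\tau}, \mathbf{u}, s)}{\partial Q_a(\tau^a, u^a)} \ge 0 \quad \forall a, \]

which guarantees that the joint action obtained by each agent greedily maximizing its own Q_a also maximizes Q_tot.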
4.4 4. Model Architecture
4.4.1 Agent Q-Networks
Each agent has its own deep neural network estimating Q-values from agent-specific observations:
import torch
import torch.nn as nn

class AgentQNetwork(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions)
        )

    def forward(self, x):
        return self.net(x)  # Q-values for all possible actions

- Input: agent’s local observation vector (size 18)
- Output: Q-values per discrete action (5)
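As a quick shape check (a minimal sketch; the batch size and the greedy argmax shown here are illustrative only):

# Assumes obs_dim = 18 and n_actions = 5 from the simple_spread configuration above
net = AgentQNetwork(obs_dim=18, n_actions=5)
dummy_obs = torch.randn(32, 18)          # batch of 32 local observations
q_values = net(dummy_obs)                # shape: (32, 5)
greedy_actions = q_values.argmax(dim=1)  # greedy action index per observation
print(q_values.shape, greedy_actions.shape)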
4.4.2 Mixing Network
Combines per-agent Q-values into a global Q-value conditioned on the full state, enforcing monotonicity through positive weight constraints:
class MixingNetwork(nn.Module):
    def __init__(self, n_agents, state_dim):
        super().__init__()
        # Hypernetworks produce the mixing weights and biases from the global state
        self.hyper_w_1 = nn.Linear(state_dim, n_agents * 64)
        self.hyper_w_2 = nn.Linear(state_dim, 64)
        self.hyper_b_1 = nn.Linear(state_dim, 64)
        self.hyper_b_2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        bs = agent_qs.size(0)
        state = state.view(bs, -1)
        w1 = torch.abs(self.hyper_w_1(state)).view(bs, -1, 64)  # positive weights enforce monotonicity
        b1 = self.hyper_b_1(state).view(bs, 1, 64)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w_2(state)).view(bs, 64, 1)
        b2 = self.hyper_b_2(state).view(bs, 1, 1)
        q_total = torch.bmm(hidden, w2) + b2
        return q_total.view(-1, 1)  # global Q-value, shape (batch, 1)

4.5 5. Training Pipeline
- Reset environment per episode; agents observe the environment state.
- Agents use ε-greedy action selection:
- With probability 1 − ε, selecting the action with the highest Q-value
- With probability ε, selecting a random action (for exploration)
- Environment steps forward with chosen actions, returning next observations, rewards, and done flags.
- Transitions (obs, actions, rewards, next obs, dones) saved to replay buffer.
- Random batches are sampled from the replay buffer to train the networks via gradient descent (a sketch of the loss and target update follows this list).
- Target networks are updated slowly for stability (soft updates).
- Epsilon decays over episodes from 1.0 to 0.05 to reduce exploration gradually.
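The per-batch update referenced above might look roughly like the following (a minimal sketch, not the exact agent.train_step used in the code: the batch layout, the helper names qmix_td_loss and soft_update, and the use of the mean per-agent reward as a shared team reward are assumptions made for illustration):

import torch
import torch.nn.functional as F

def qmix_td_loss(batch, agent_nets, mixer, target_agent_nets, target_mixer, gamma=0.99):
    # Batch tensors (shapes assumed): obs/next_obs (B, n_agents, obs_dim),
    # actions (B, n_agents), rewards (B, n_agents), dones (B, n_agents),
    # state/next_state (B, state_dim)
    obs, actions, rewards, next_obs, dones, state, next_state = batch
    n_agents = obs.size(1)

    # Q-values of the actions actually taken, one per agent
    q_all = torch.stack([agent_nets[i](obs[:, i]) for i in range(n_agents)], dim=1)
    chosen_q = q_all.gather(2, actions.long().unsqueeze(-1)).squeeze(-1)  # (B, n_agents)

    with torch.no_grad():
        # Greedy target Q per agent, mixed by the target mixing network
        next_q = torch.stack(
            [target_agent_nets[i](next_obs[:, i]).max(dim=1).values for i in range(n_agents)], dim=1)
        target_q_tot = target_mixer(next_q, next_state)                   # (B, 1)
        team_reward = rewards.mean(dim=1, keepdim=True)                   # shared-reward assumption
        not_done = 1.0 - dones.bool().any(dim=1, keepdim=True).float()
        y = team_reward + gamma * not_done * target_q_tot

    q_tot = mixer(chosen_q, state)                                        # (B, 1)
    return F.mse_loss(q_tot, y)

def soft_update(target, source, tau=0.005):
    # Polyak averaging: target <- (1 - tau) * target + tau * source
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)

In the episode loop shown next, loss = agent.train_step(buffer) would wrap a call like this together with an optimizer step and a periodic soft_update of the target networks.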
Training Step Example:
import numpy as np

for ep in range(episodes):
    obs, _ = env.reset(seed=ep)  # parallel API: reset returns (observations, infos)
    total_reward = 0
    for step in range(max_cycles):
        agents = list(env.agents)  # snapshot: the parallel API prunes finished agents
        obs_list = [obs[a] for a in agents]
        actions = agent.select_actions(obs_list, epsilon)
        assert len(actions) == len(agents)
        action_dict = {a: actions[i] for i, a in enumerate(agents)}
        next_obs, rewards, terminations, truncations, _ = env.step(action_dict)
        next_obs_list = [next_obs[a] for a in agents]
        reward_list = [rewards[a] for a in agents]
        done_list = [terminations[a] or truncations[a] for a in agents]
        state = np.concatenate(obs_list)  # global state = concatenated local observations
        next_state = np.concatenate(next_obs_list)
        buffer.push(state, np.array(actions), np.array(reward_list), next_state, np.array(done_list))
        loss = agent.train_step(buffer)
        total_reward += np.mean(reward_list)
        obs = next_obs
        if any(done_list):
            break
    epsilon = max(min_epsilon, epsilon * epsilon_decay)

4.6 6. Evaluation Plan
- Run 10 to 20 deterministic episodes (ε = 0, always picking the greedy action); a sketch of this loop follows the list.
- Report:
- Average reward per episode.
- Episode length.
- Minimum and average agent-to-landmark distances (landmark coverage).
- Number of collisions between agents.
- Baseline random agent performance to contextualize success.
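A deterministic evaluation loop might look roughly like the following (a minimal sketch: the evaluate helper, the evaluation seeds, and reporting only return and episode length are illustrative; collision counts and agent-to-landmark distances would require reading the environment's internal world state and are omitted here):

import numpy as np

def evaluate(env, agent, n_episodes=10, max_cycles=25):
    # Greedy rollouts (epsilon = 0); reports mean return and mean episode length
    returns, lengths = [], []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=1000 + ep)  # seeds disjoint from training
        ep_return, ep_len = 0.0, 0
        for step in range(max_cycles):
            agents = list(env.agents)
            obs_list = [obs[a] for a in agents]
            actions = agent.select_actions(obs_list, 0.0)  # epsilon = 0 -> fully greedy
            action_dict = {a: actions[i] for i, a in enumerate(agents)}
            obs, rewards, terminations, truncations, _ = env.step(action_dict)
            ep_return += float(np.mean([rewards[a] for a in agents]))
            ep_len += 1
            if all(terminations[a] or truncations[a] for a in agents):
                break
        returns.append(ep_return)
        lengths.append(ep_len)
    return np.mean(returns), np.mean(lengths)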
4.7 7. Deliverables
- Bug-free, working QMIX implementation on MPE simple_spread.
- Training curves visualizing reward improvements and losses.
- Deterministic evaluation report showing multi-agent coordination.
- Clear documentation covering:
- Algorithm details.
- Network architecture.
- Key hyperparameters.
- Bug fixes and resolved issues.
4.8 8. References
- Rashid, Tabish, et al. “QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning.” ICML (2018).
- PettingZoo Multi-Agent RL Environments. https://www.pettingzoo.ml
- PyTorch: https://pytorch.org/
- Lowe, Ryan, et al. “Multi-agent actor-critic for mixed cooperative-competitive environments.” NeurIPS (2017).
- OpenAI. ChatGPT — conversational assistance for brainstorming, drafting, code scaffolding, and debugging.