Test 10: Swarm Reinforcement Learning

Overview

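Test 10 implements a distributed multi-agent reinforcement learning system in which up to 100 autonomous cube agents (10 per bubble) learn to navigate a 3D planetary environment. All agents share one deep neural network policy, which is pre-trained on expert demonstrations and then fine-tuned online against expert torque labels.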

Technical Classification

Machine Learning Paradigm

Reinforcement Learning (RL): Agents learn through trial-and-error interaction with the environment
Deep Reinforcement Learning: Uses neural networks as function approximators
Multi-Agent Learning: 100 agents share a single policy network
Batched Processing: All agents processed simultaneously for efficiency

Training Method

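Policy Gradient: Direct optimization of the policy (action selection); the live loop computes and logs a shaped reward but does not use it for gradient updates
Experience Replay: Decorrelates samples by storing and sampling past experiences; ExperienceBuffer is defined in test10.go, but the live loop trains on current states instead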
Epsilon-Greedy Exploration: Balances exploitation vs exploration
Supervised Pre-training: Initializes network with expert demonstrations

System Architecture

Neural Network

Type: Feedforward Deep Neural Network (Multi-Layer Perceptron)

Architecture:

Input Layer:    15 neurons (state features)
Hidden Layer 1: 64 neurons (ScaledReLU activation)
Hidden Layer 2: 32 neurons (ScaledReLU activation)
Output Layer:   3 neurons (linear; Tanh applied at action time)

Batch Processing: 100 agents × 15 features = 1500 inputs processed per forward pass
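
As a rough size check, the layer widths above imply the parameter count computed below. This is a minimal sketch that assumes every dense layer carries one bias per output neuron; paramCount is a hypothetical helper, not part of loom/poly.

```go
package main

import "fmt"

// paramCount returns the number of weights and biases in a fully connected
// network whose layer widths are listed from input to output.
// Assumption: each layer has one bias per output neuron.
func paramCount(widths []int) int {
	total := 0
	for i := 0; i+1 < len(widths); i++ {
		total += widths[i]*widths[i+1] + widths[i+1]
	}
	return total
}

func main() {
	widths := []int{15, 64, 32, 3}
	agents := 100
	fmt.Println("parameters:", paramCount(widths))
	fmt.Println("inputs per pass:", agents*widths[0])
	fmt.Println("outputs per pass:", agents*widths[len(widths)-1])
}
```

Under that assumption the network has 3203 parameters, and a 100-agent pass moves 1500 floats in and 300 floats out.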

State Representation (15 features per agent)

Each agent observes:

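Position (3): offset from the planet center, divided by the planet radius
Rotation (3): pitch, yaw, roll Euler angles in degrees, divided by 180
Linear Velocity (3): vx, vy, vz, divided by 8 and clamped to [-1, 1]
Angular Velocity (3): ωx, ωy, ωz, clamped to [-1, 1] (the live loop never updates it, so it stays at zero)
Target Direction (3): unit vector from the agent toward its target bubble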

Action Space (3 outputs per agent)

Continuous Control: Torque commands for 3-axis rotation, plus automatic tangent-plane thrust.

pitch_torque, yaw_torque, roll_torque: network outputs (Tanh, scaled ×80 by torqueScale)
linear_velocity — applied each tick toward the next bubble along the planet tangent plane (construct sim is zero gravity; torque alone only spins cubes in place)

Output Range: [-1, 1] (Tanh activation) for torques
Thrust: up to ~10 m/s toward target when aligned (blended with alignment factor)
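
The mapping from network output to commands can be sketched as follows. The constants mirror torqueScale and thrustScale in test10.go; torqueCommand and thrustSpeed are illustrative names, and the m/s unit is taken from the description above rather than from the server.

```go
package main

import (
	"fmt"
	"math"
)

const (
	torqueScale = 80.0 // same value as torqueScale in test10.go
	thrustScale = 10.0 // same value as thrustScale in test10.go
)

// torqueCommand squashes one raw (linear) network output with tanh and
// scales it to a torque value.
func torqueCommand(raw float64) float64 {
	return math.Tanh(raw) * torqueScale
}

// thrustSpeed blends thrust with alignment, the dot product of the forward
// vector and the direction to the target. Negative alignment counts as zero,
// so a cube facing away still receives a quarter of full thrust.
func thrustSpeed(alignment float64) float64 {
	a := math.Max(0, alignment)
	return thrustScale * (0.25 + 0.75*a)
}

func main() {
	for _, align := range []float64{-1, 0, 0.5, 1} {
		fmt.Printf("alignment %+.1f -> thrust %.2f\n", align, thrustSpeed(align))
	}
	fmt.Printf("raw output 0.5 -> torque %.2f\n", torqueCommand(0.5))
}
```

Thrust therefore ranges from 2.5 (misaligned) to 10 (fully aligned), which is where the "up to ~10" figure comes from.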

State sync

Each tick, a background poll sends query_constructs and updates each agent's CurrentPos / Rotation from the physics server. Without this, the RL loop optimizes against frozen spawn positions and never sees movement.

Training Process

Training uses loom/poly Train() on CPU-MC (TrainingModeCPUMC — multi-core tiled CPU backward), not the old hand-rolled forward/backward loop.

Phase 1: Supervised Pre-training

5000 expert demos with normalized 15-D state (~[-1, 1])
Expert torque targets in [-1, 1] (linear output layer; tanh applied at action time)
Batched MSE via poly.Train — loss should decrease each epoch (see go test -run TestPretrainLossDecreases)

Phase 2: Online fine-tuning (live loop)

Every 4 simulation ticks:

Poll physics via query_constructs
Build normalized states for all agents
Label with expert torque from current forward vs target direction
One TrainOneBatch step (loom CPU-MC)

Epsilon-greedy exploration still randomizes actions; the network learns to imitate the expert on live planet states.
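
The labelling step can be illustrated with a standalone version of the expert rule: the torque hint is the cross product of the current forward vector and the direction to the target. Vec3, cross, and expertTorque are simplified stand-ins for the helpers in test10.go and assume unit-length inputs.

```go
package main

import (
	"fmt"
	"math"
)

// Vec3 is a stand-in for the repo's Vector3 type.
type Vec3 [3]float64

// cross returns the right-handed cross product a × b.
func cross(a, b Vec3) Vec3 {
	return Vec3{
		a[1]*b[2] - a[2]*b[1],
		a[2]*b[0] - a[0]*b[2],
		a[0]*b[1] - a[1]*b[0],
	}
}

// expertTorque returns the rotation hint used as a training label.
// For unit inputs its length is sin(angle), so every component is in [-1, 1].
func expertTorque(forward, targetDir Vec3) Vec3 {
	c := cross(forward, targetDir)
	if math.Sqrt(c[0]*c[0]+c[1]*c[1]+c[2]*c[2]) < 1e-4 {
		return Vec3{} // already aligned (or exactly opposed): no hint
	}
	return c
}

func main() {
	forward := Vec3{0, 0, -1}
	target := Vec3{1, 0, 0}
	fmt.Println(expertTorque(forward, target))
}
```

For forward (0, 0, -1) and target direction (1, 0, 0) the hint is (0, -1, 0). The cross product also vanishes when the agent faces exactly away from the target, so this rule gives no torque in that case.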

Metrics

Pre-train: JSON lines {"epoch":N,"loss":…} — loss should fall
Live: the Train Loss field in the tick and episode logs, which should trend down while cubes align and move

Technical Terms Explained

Experience Replay Buffer:

Stores past (state, action, reward, next state, done) tuples
Breaks temporal correlation in training data
Allows repeated use of rare experiences

Policy Gradient:

Directly optimizes the policy function π(a|s)
Adjusts policy in direction of higher rewards
Uses advantage function to amplify good actions

Batched Inference:

Processes all 100 agents in one forward pass
Replaces 100 sequential forward calls with one batched call (the speedup has not been measured here)
Shared weights across all agents

Epsilon-Greedy:

ε probability of random action (exploration)
(1-ε) probability of network action (exploitation)
Balances learning new strategies vs using known good actions
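
The exploration schedule can be sketched with the constants from test10.go (start 0.3, decay 0.995, floor 0.05). decayEpsilon and stepsToFloor are illustrative names; the guard matches the live loop, which only decays while the decayed value stays above the floor.

```go
package main

import "fmt"

// decayEpsilon applies one multiplicative decay step, but only while the
// decayed value would remain above the floor.
func decayEpsilon(eps, decay, floor float64) float64 {
	if eps*decay > floor {
		return eps * decay
	}
	return eps
}

// stepsToFloor counts how many decay steps actually change epsilon and
// returns that count together with the value epsilon settles at.
func stepsToFloor(eps, decay, floor float64) (int, float64) {
	steps := 0
	for {
		next := decayEpsilon(eps, decay, floor)
		if next == eps {
			return steps, eps
		}
		eps = next
		steps++
	}
}

func main() {
	steps, final := stepsToFloor(0.3, 0.995, 0.05)
	fmt.Printf("epsilon settles at %.4f after %d training steps\n", final, steps)
}
```

With these constants epsilon settles just above 0.05 after about 357 training steps. Because the live loop trains every 4 ticks at 50 ms per tick, that is roughly 71 seconds of simulation.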

Swarm Characteristics

Distributed Learning

Single Shared Policy: All 100 agents use the same neural network
Collective Experience: Each agent contributes to shared replay buffer
Emergent Behavior: Network learns generalizable navigation strategy

Data Efficiency

100 agents × 50 ticks/episode = 5000 experiences per episode
Faster learning compared to single-agent RL
Diverse situations encountered simultaneously

Performance Metrics

Training Indicators

Average Reward: Mean reward across all agents per episode
Epsilon: Current exploration rate
Buffer Size: Number of experiences stored
Loss: Training loss (MSE between predicted and target actions)

Checkpointing

Model saved every 500 ticks (every 10 episodes at 50 ticks per episode)
Format: swarm_model_checkpoint_XXXX.bin
Includes network weights and optimizer state
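
Checkpoint naming and discovery can be sketched as below. The file name pattern is the one used in test10.go; checkpointName, parseCheckpoint, and latestCheckpoint are hypothetical helpers written for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// checkpointPrefix is 23 bytes long.
const checkpointPrefix = "swarm_model_checkpoint_"

// checkpointName formats the file name for a given episode number.
func checkpointName(episode int) string {
	return fmt.Sprintf("%s%04d.bin", checkpointPrefix, episode)
}

// parseCheckpoint extracts the episode number from a checkpoint file name.
// The boolean is false when the name does not follow the pattern.
func parseCheckpoint(name string) (int, bool) {
	if !strings.HasPrefix(name, checkpointPrefix) {
		return 0, false
	}
	var ep int
	if _, err := fmt.Sscanf(name, checkpointPrefix+"%d.bin", &ep); err != nil {
		return 0, false
	}
	return ep, true
}

// latestCheckpoint returns the name with the highest episode number,
// or an empty string when no name matches.
func latestCheckpoint(names []string) (string, int) {
	best, bestEp := "", 0
	for _, n := range names {
		if ep, ok := parseCheckpoint(n); ok && ep > bestEp {
			best, bestEp = n, ep
		}
	}
	return best, bestEp
}

func main() {
	names := []string{checkpointName(10), "swarm_training_log.csv", checkpointName(30)}
	fmt.Println(latestCheckpoint(names))
}
```

Using strings.HasPrefix avoids hard-coding the prefix length when matching file names.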

Physical Environment

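Planet: Spherical body with gravity (construct parts are exempt; see Physics Integration)
Bubbles: Up to 10 target waypoints distributed on the surface
Spawn Locations: 10 agents per bubble, on a ring of radius 12 lifted 3 units above the bubble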
Task: Navigate from current bubble to next bubble

Physics Integration

Engine: Server-side rigid body physics (zero gravity on construct parts)
Control: Torque for orientation + linear_velocity thrust tangent to the planet toward the next bubble
Constraints: Cubes must be polled via query_constructs for RL state; thrust provides translation
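
The tangent-plane thrust direction can be illustrated with a standalone projection: remove the radial ("up") component from the vector to the target, then normalize what remains. Vec3 and the small vector helpers are stand-ins for the repo's own types.

```go
package main

import (
	"fmt"
	"math"
)

// Vec3 is a stand-in for the repo's Vector3 type.
type Vec3 [3]float64

func sub(a, b Vec3) Vec3           { return Vec3{a[0] - b[0], a[1] - b[1], a[2] - b[2]} }
func scale(a Vec3, s float64) Vec3 { return Vec3{a[0] * s, a[1] * s, a[2] * s} }
func dot(a, b Vec3) float64        { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] }

// normalize returns a unit-length copy of a, or the zero vector if a is zero.
func normalize(a Vec3) Vec3 {
	l := math.Sqrt(dot(a, a))
	if l == 0 {
		return Vec3{}
	}
	return scale(a, 1/l)
}

// tangentToward projects the direction to target onto the plane tangent to
// the planet at pos and returns it as a unit vector (zero if degenerate).
func tangentToward(pos, center, target Vec3) Vec3 {
	up := normalize(sub(pos, center))
	toTarget := sub(target, pos)
	tangent := sub(toTarget, scale(up, dot(toTarget, up)))
	if dot(tangent, tangent) < 1e-6 {
		return Vec3{}
	}
	return normalize(tangent)
}

func main() {
	// Agent at the "north pole" of a radius-100 planet, target on the equator.
	fmt.Println(tangentToward(Vec3{0, 100, 0}, Vec3{}, Vec3{100, 0, 0}))
}
```

In this example the agent sits at (0, 100, 0) and the target at (100, 0, 0), so the thrust direction is along +X. A target directly above or below the agent has no tangential component, which yields the zero vector.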

Key Implementation Details

Velocity Calculation

dt = 0.05 // 50ms per tick
velocity = (currentPos - lastPos) / dt
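
A worked instance of this finite difference, together with the velocity normalization used for the state vector (divide by 8, clamp to [-1, 1]), is sketched below. Vec3 and velocity are illustrative stand-ins; clampUnit mirrors the helper in test10.go.

```go
package main

import "fmt"

// Vec3 is a stand-in for the repo's Vector3 type.
type Vec3 [3]float64

// velocity estimates linear velocity from two positions sampled dt seconds apart.
func velocity(cur, last Vec3, dt float64) Vec3 {
	var v Vec3
	for i := range v {
		v[i] = (cur[i] - last[i]) / dt
	}
	return v
}

// clampUnit limits v to the range [-1, 1].
func clampUnit(v float64) float64 {
	if v > 1 {
		return 1
	}
	if v < -1 {
		return -1
	}
	return v
}

func main() {
	// The cube moved 0.5 units along X during one 50 ms tick.
	v := velocity(Vec3{0.5, 0, 0}, Vec3{0, 0, 0}, 0.05)
	fmt.Printf("speed %.1f, normalized feature %.2f\n", v[0], clampUnit(v[0]/8.0))
}
```

A cube moving at the full thrust speed of 10 saturates the normalized velocity feature at 1. The estimate also depends on the background poll having refreshed the position since the previous tick; if it has not, the computed velocity is zero for that tick.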

Batched Forward Pass

allStates = [agent0_state(15), agent1_state(15), ..., agent99_state(15)]
allOutputs = network.Forward(allStates) // 1500 in → 300 out

Output Extraction

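actions := make([]Vector3, len(agents))
for i := range agents {
    outputOffset := i * 3 // 3 outputs per agent
    // The output layer is linear, so tanh bounds each torque to [-1, 1] here.
    actions[i] = Vector3{
        float32(math.Tanh(float64(allOutputs[outputOffset+0]))), // pitch
        float32(math.Tanh(float64(allOutputs[outputOffset+1]))), // yaw
        float32(math.Tanh(float64(allOutputs[outputOffset+2]))), // roll
    }
}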

This system combines:

Multi-Agent Reinforcement Learning (MARL)
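Deep Q-Learning principles (experience replay; the buffer is defined but not sampled by the live loop)
Policy Gradient methods (direct policy optimization; the reward is logged but not used as a training signal)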
Behavioral Cloning (pre-training)
Batch Learning (efficient parallel processing)

Similar To:

OpenAI's multi-agent hide-and-seek
DeepMind's StarCraft II learning
Robotic swarm coordination
Distributed ML training

Future Enhancements

Potential improvements:

Attention Mechanisms: Allow agents to observe nearby agents
Hierarchical Policies: High-level strategy + low-level control
Curriculum Learning: Progressively harder navigation tasks
Mixed Objectives: Multiple simultaneous goals
Communication Protocols: Explicit agent-to-agent messages
Adversarial Training: Competitive multi-agent scenarios

Technical Stack

Language: Go
ML: loom/poly CPU-MC (ConfigureNetworkForMode + poly.Train)
Physics: Construct TCP server (zero-G parts + client thrust)
Checkpoints: swarm_model_checkpoint_XXXX.bin

Author: Swarm RL Test Suite
Date: 2026-02-04
Version: 1.0 - Batched Feedforward Architecture

Go source

test10.go — run with go run . 10 from the repo root

package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"math"
	"math/rand"
	"net"
	"os"
	"time"
)

// Agent represents a swarm agent learning to orient towards targets
type Agent struct {
	ID              string
	TargetPos       Vector3
	CurrentPos      Vector3
	PlanetCenter    Vector3
	PlanetRadius    float32
	Rotation        Vector3
	Velocity        Vector3 // Linear velocity
	AngularVelocity Vector3 // Angular velocity
	LastPos         Vector3 // For velocity calculation
	LastReward      float32
	Color           Vector3
	TotalReward     float32
	StepCount       int
}

func (a *Agent) GetForward() Vector3 {
	return GetForwardFromRotation(a.Rotation)
}

// GetForwardFromRotation converts pitch (rot[0]) and yaw (rot[1]), in degrees,
// into a unit forward vector; roll (rot[2]) is ignored.
func GetForwardFromRotation(rot Vector3) Vector3 {
	radX := float64(rot[0]) * math.Pi / 180.0
	radY := float64(rot[1]) * math.Pi / 180.0
	fx := float32(-math.Sin(radY))
	fy := float32(math.Sin(radX) * math.Cos(radY))
	fz := float32(-math.Cos(radX) * math.Cos(radY))
	return Vector3{fx, fy, fz}
}

func (a *Agent) GetState() []float32 {
	return normalizeRLState(a.CurrentPos, a.Rotation, a.Velocity, a.AngularVelocity, a.TargetPos, a.PlanetCenter, a.PlanetRadius)
}

// normalizeRLState scales features to ~[-1, 1] for stable loom CPU training.
func normalizeRLState(pos, rot, vel, angVel, target, planetCenter Vector3, planetRadius float32) []float32 {
	if planetRadius <= 0 {
		planetRadius = 100
	}
	rel := VecMul(VecSub(pos, planetCenter), 1.0/planetRadius)
	toTarget := VecNorm(VecSub(target, pos))
	return []float32{
		rel[0], rel[1], rel[2],
		rot[0] / 180.0, rot[1] / 180.0, rot[2] / 180.0,
		clampUnit(vel[0] / 8.0), clampUnit(vel[1] / 8.0), clampUnit(vel[2] / 8.0),
		clampUnit(angVel[0]), clampUnit(angVel[1]), clampUnit(angVel[2]),
		toTarget[0], toTarget[1], toTarget[2],
	}
}

func clampUnit(v float32) float32 {
	if v > 1 {
		return 1
	}
	if v < -1 {
		return -1
	}
	return v
}

// expertTorqueAction returns a torque hint, with each component in [-1, 1], that turns forward toward targetDir.
func expertTorqueAction(forward, targetDir Vector3) Vector3 {
	cross := Cross(forward, targetDir)
	strength := float32(math.Sqrt(float64(VecDot(cross, cross))))
	if strength < 1e-4 {
		return Vector3{0, 0, 0}
	}
	// Rotation axis times strength is the cross product itself; for unit inputs its length is sin(angle).
	out := cross
	maxC := float32(math.Max(math.Max(math.Abs(float64(out[0])), math.Abs(float64(out[1]))), math.Abs(float64(out[2]))))
	if maxC > 1 {
		out = VecMul(out, 1.0/maxC)
	}
	return out
}

// Experience represents a single (s, a, r, s', done) tuple for replay
type Experience struct {
	State     []float32
	Action    []float32 // Changed from Vector3 to support variable action sizes
	Reward    float32
	NextState []float32
	Done      bool
}

// ExperienceBuffer implements a circular buffer for experience replay
type ExperienceBuffer struct {
	Buffer   []Experience
	Capacity int
	Index    int
	Size     int
}

func NewExperienceBuffer(capacity int) *ExperienceBuffer {
	return &ExperienceBuffer{
		Buffer:   make([]Experience, capacity),
		Capacity: capacity,
		Index:    0,
		Size:     0,
	}
}

func (eb *ExperienceBuffer) Add(exp Experience) {
	eb.Buffer[eb.Index] = exp
	eb.Index = (eb.Index + 1) % eb.Capacity
	if eb.Size < eb.Capacity {
		eb.Size++
	}
}

func (eb *ExperienceBuffer) Sample(batchSize int) []Experience {
	if eb.Size < batchSize {
		batchSize = eb.Size
	}
	batch := make([]Experience, batchSize)
	for i := 0; i < batchSize; i++ {
		idx := rand.Intn(eb.Size)
		batch[i] = eb.Buffer[idx]
	}
	return batch
}

func (eb *ExperienceBuffer) IsFull() bool {
	return eb.Size >= eb.Capacity
}

// TrainingMetrics tracks training progress
type TrainingMetrics struct {
	Episode   int
	AvgReward float32
	Loss      float32
	Epsilon   float32
	Timestep  int
}

func saveMetrics(filename string, metrics []TrainingMetrics) {
	file, err := os.Create(filename)
	if err != nil {
		return
	}
	defer file.Close()

	w := csv.NewWriter(file)
	defer w.Flush()

	w.Write([]string{"episode", "avg_reward", "loss", "epsilon", "timestep"})
	for _, m := range metrics {
		w.Write([]string{
			fmt.Sprintf("%d", m.Episode),
			fmt.Sprintf("%.4f", m.AvgReward),
			fmt.Sprintf("%.6f", m.Loss),
			fmt.Sprintf("%.4f", m.Epsilon),
			fmt.Sprintf("%d", m.Timestep),
		})
	}
}

// Calculate shaped reward for better learning signal (alignment + tangent progress).
func calculateReward(agent *Agent) float32 {
	fwd := agent.GetForward()
	targetDir := VecNorm(VecSub(agent.TargetPos, agent.CurrentPos))
	alignment := VecDot(fwd, targetDir)
	moveDir := tangentToward(agent.CurrentPos, agent.PlanetCenter, agent.TargetPos)
	progress := VecDot(agent.Velocity, moveDir)
	reward := alignment + progress*0.3
	if alignment > 0.85 {
		reward += 0.3
	}
	return reward
}

// tangentToward returns unit direction on the planet tangent plane pointing at target.
func tangentToward(pos, planetCenter, target Vector3) Vector3 {
	up := VecNorm(VecSub(pos, planetCenter))
	toTarget := VecSub(target, pos)
	tangent := VecSub(toTarget, VecMul(up, VecDot(toTarget, up)))
	if VecDot(tangent, tangent) < 1e-6 {
		return Vector3{0, 0, 0}
	}
	return VecNorm(tangent)
}

func preTrainNetwork(network *Network) {
	fmt.Println("🛰️  Starting Pre-Training on Expert Demonstrations (loom CPU-MC)...")

	numSamples := 5000
	inputs, targets := generateExpertData(numSamples)

	config := DefaultTrainingConfig()
	config.Epochs = 20
	config.LearningRate = 0.005
	config.BatchSize = 128
	config.Verbose = false
	config.LossType = "mse"

	result, err := network.TrainFromSamples(inputs, targets, config)
	if err != nil {
		fmt.Printf("❌ Pre-training failed: %v\n", err)
		return
	}

	for i, loss := range result.LossHistory {
		verboseEpochJSON(i+1, loss)
	}

	fmt.Printf("✅ Pre-Training Complete - Final Loss: %.6f, Time: %v\n",
		result.FinalLoss, result.TotalTime)
}

func RunTest10() {
	fmt.Println("🧠 Starting Test 10: SWARM REINFORCEMENT LEARNING 🧠")

	// 1. Connect to PrimeCraft server
	conn, err := net.Dial("tcp", "localhost:17000")
	if err != nil {
		fmt.Printf("❌ Failed to connect to server: %v\n", err)
		return
	}
	defer conn.Close()

	// 2. Discover environment (bubbles)
	fmt.Println("📡 Querying world state for bubbles...")
	writePacket(conn, []byte(`{"type":"query_state"}`))

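	buf := make([]byte, 32768)
	n, err := conn.Read(buf)
	if err != nil {
		fmt.Printf("❌ Failed to read world state: %v\n", err)
		return
	}
	var state StateResponse
	if err := json.Unmarshal(buf[:n], &state); err != nil {
		fmt.Printf("❌ Failed to parse world state: %v\n", err)
		return
	}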

	if len(state.Bubbles) == 0 {
		fmt.Println("⚠️  No bubbles found. Cannot start RL training.")
		return
	}

	numBubbles := len(state.Bubbles)
	if numBubbles > 10 {
		numBubbles = 10
	}
	agentsPerBubble := 10
	numAgents := numBubbles * agentsPerBubble

	fmt.Printf("✅ Found %d bubbles. Initializing %d agents...\n", numBubbles, numAgents)

	planetCenter := Vector3{state.PlanetCenter[0], state.PlanetCenter[1], state.PlanetCenter[2]}

	// 3. Create neural network with BATCHED processing
	inputSize := 15 // [pos(3), rot(3), vel(3), angvel(3), target(3)]
	outputSize := 3 // [pitch_torque, yaw_torque, roll_torque]

	fmt.Println("🏗️  Building Batched Neural Network...")
	fmt.Printf("   - Input: %d features per agent\n", inputSize)
	fmt.Printf("   - Architecture: Dense(%d → 64 → 32 → %d)\n", inputSize, outputSize)
	fmt.Printf("   - Batch size: %d agents processed together\n", numAgents)

	// Simple feedforward: Input(15) → Dense(64) → Dense(32) → Dense(3)
	network := NewNetwork(inputSize, 1, 3, 1)
	network.BatchSize = numAgents // Process all agents in one batch

	layer0 := InitDenseLayer(inputSize, 64, ActivationScaledReLU)
	layer1 := InitDenseLayer(64, 32, ActivationScaledReLU)
	layer2 := InitDenseLayer(32, outputSize, ActivationLinear)

	network.SetLayer(0, 0, 0, layer0)
	network.SetLayer(0, 1, 0, layer1)
	network.SetLayer(0, 2, 0, layer2)

	fmt.Println("✅ Training backend: loom/poly CPU-MC (ConfigureNetworkForMode + Train)")

	// Try to load latest checkpoint FIRST
	checkpointLoaded := false
	latestCheckpoint := ""
	latestEpisode := 0

	// Search for checkpoint files
	files, err := os.ReadDir(".")
	if err == nil {
		const prefix = "swarm_model_checkpoint_" // 23 bytes
		for _, file := range files {
			if !file.IsDir() && len(file.Name()) > len(prefix) && file.Name()[:len(prefix)] == prefix {
				// Extract episode number from filename
				var ep int
				if _, err := fmt.Sscanf(file.Name(), "swarm_model_checkpoint_%d.bin", &ep); err == nil {
					if ep > latestEpisode {
						latestEpisode = ep
						latestCheckpoint = file.Name()
					}
				}
			}
		}
	}

	if latestCheckpoint != "" {
		fmt.Printf("📂 Found checkpoint: %s (Episode %d)\n", latestCheckpoint, latestEpisode)
		loadedNetwork, err := LoadModel(latestCheckpoint, fmt.Sprintf("swarm_ep_%d", latestEpisode))
		if err == nil {
			network = loadedNetwork
			network.GPU = false
			checkpointLoaded = true
			fmt.Println("✅ Checkpoint loaded - resuming training!")
		} else {
			fmt.Printf("⚠️  Failed to load checkpoint: %v\n", err)
			fmt.Println("Initializing fresh network...")
		}
	}

	// Only initialize weights if no checkpoint was loaded
	if !checkpointLoaded {
		network.InitializeWeights()
		fmt.Println("✅ Fresh network — CPU-MC pre-training next")
		// Pre-train on expert demonstrations
		preTrainNetwork(network)
	} else {
		fmt.Println("✅ Resuming from checkpoint (CPU-MC)")
	}

	// Metrics log (online training uses live expert labels, not replay buffer).
	metricsHistory := []TrainingMetrics{}

	// 6. Spawn agents
	agents := make([]*Agent, numAgents)
	spawnOffset := float32(3.0) // above bubble on planet surface

	fmt.Printf("🚀 Spawning %d agents across %d bubbles...\n", numAgents, numBubbles)

	agentIdx := 0
	for i := 0; i < numBubbles; i++ {
		b := state.Bubbles[i]
		bPos := Vector3{b.Pos[0], b.Pos[1], b.Pos[2]}
		up := VecNorm(VecSub(bPos, planetCenter))

		nextIdx := (i + 1) % numBubbles
		targetBubble := state.Bubbles[nextIdx]
		targetPos := Vector3{targetBubble.Pos[0], targetBubble.Pos[1], targetBubble.Pos[2]}

		for j := 0; j < agentsPerBubble; j++ {
			theta := (float64(j) / float64(agentsPerBubble)) * 2.0 * math.Pi
			ringDist := float32(12.0)
			right, _, forward := MakeBasis(up)
			localOffset := Vector3{
				float32(math.Cos(theta)) * ringDist,
				0,
				float32(math.Sin(theta)) * ringDist,
			}
			worldOffset := TransformPoint(Vector3{0, 0, 0}, right, up, forward, localOffset)

			spawnPos := Vector3{
				bPos[0] + up[0]*spawnOffset + worldOffset[0],
				bPos[1] + up[1]*spawnOffset + worldOffset[1],
				bPos[2] + up[2]*spawnOffset + worldOffset[2],
			}

			initialRot := GetEuler(GetBasis(up))
			id := fmt.Sprintf("rl_cube_%d", agentIdx)

			agents[agentIdx] = &Agent{
				ID:              id,
				CurrentPos:      spawnPos,
				LastPos:         spawnPos,
				PlanetCenter:    planetCenter,
				PlanetRadius:    state.PlanetRadius,
				Rotation:        Vector3{rand.Float32() * 360, rand.Float32() * 360, 0},
				TargetPos:       targetPos,
				Velocity:        Vector3{0, 0, 0}, // Start at rest
				AngularVelocity: Vector3{0, 0, 0}, // Start at rest
				Color:           Vector3{0.2, 0.5, 1.0},
			}

			createReq := ConstructRequest{
				Type:        "create_construct",
				ConstructID: id,
				Parts: []Part{
					{
						ID:     "core",
						Type:   "box",
						Size:   Vector3{2, 2, 2},
						Pos:    spawnPos,
						Rot:    initialRot,
						Color:  agents[agentIdx].Color,
						Locked: false,
						Groups: []string{"lasso_target", "rl_agent"},
					},
				},
			}

			data, _ := json.Marshal(createReq)
			writePacket(conn, data)
			agentIdx++
			time.Sleep(10 * time.Millisecond)
		}
	}

	// 7. Training loop
	fmt.Println("🚀 Starting Swarm Training Loop...")

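	// Poll physics state; without this read-back the RL loop only ever sees the spawn positions.
	// NOTE: this goroutine writes agent fields that the training loop reads without synchronization (a data race).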
	go func() {
		pollTicker := time.NewTicker(50 * time.Millisecond)
		defer pollTicker.Stop()
		pollBuf := make([]byte, 65536)
		for range pollTicker.C {
			writePacket(conn, []byte(`{"type":"query_constructs"}`))
			_ = conn.SetReadDeadline(time.Now().Add(30 * time.Millisecond))
			n, err := conn.Read(pollBuf)
			_ = conn.SetReadDeadline(time.Time{})
			if err != nil {
				continue
			}
			var response struct {
				Type       string `json:"type"`
				Constructs []struct {
					ID    string `json:"id"`
					Parts []struct {
						ID  string  `json:"id"`
						Pos Vector3 `json:"pos"`
						Rot Vector3 `json:"rot"`
					} `json:"parts"`
				} `json:"constructs"`
			}
			if json.Unmarshal(pollBuf[:n], &response) != nil {
				continue
			}
			for _, construct := range response.Constructs {
				for _, a := range agents {
					if a == nil || a.ID != construct.ID {
						continue
					}
					for _, part := range construct.Parts {
						if part.ID == "core" {
							a.CurrentPos = part.Pos
							a.Rotation = part.Rot
							break
						}
					}
					break
				}
			}
		}
	}()

	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()

	// Online fine-tune via loom CPU-MC (expert torque labels from live physics state).
	learningRate := float32(0.002)
	epsilon := float32(0.3)
	epsilonDecay := float32(0.995)
	epsilonMin := float32(0.05)
	torqueScale := float32(80.0)
	thrustScale := float32(10.0)

	tickCount := 0
	episode := 0
	trainSteps := 0
	lastLoss := float64(0)
	episodeReward := float32(0)

	for range ticker.C {
		tickCount++

		// Per-tick update requests and reward accumulator
		updates := []UpdateRequest{}
		tickReward := float32(0)

		// Collect all agent states into single input (1500 features)
		allStates := make([]float32, 0, numAgents*inputSize)
		for _, a := range agents {
			// Calculate velocity
			dt := float32(0.05) // 50ms per tick
			a.Velocity = VecMul(VecSub(a.CurrentPos, a.LastPos), 1.0/dt)
			a.LastPos = a.CurrentPos

			state := a.GetState()
			allStates = append(allStates, state...)
		}

		// Single forward pass for entire swarm (1500 in → 300 out)
		allOutputs, _ := network.Forward(allStates)

		// Process each agent's action
		for i, a := range agents {
			// Epsilon-greedy action selection
			var action Vector3
			if rand.Float32() < epsilon {
				// Random exploration
				action = Vector3{
					rand.Float32()*2 - 1,
					rand.Float32()*2 - 1,
					rand.Float32()*2 - 1,
				}
			} else {
				outputOffset := i * outputSize
				action = Vector3{
					float32(math.Tanh(float64(allOutputs[outputOffset+0]))),
					float32(math.Tanh(float64(allOutputs[outputOffset+1]))),
					float32(math.Tanh(float64(allOutputs[outputOffset+2]))),
				}
			}

			// Apply action: torque for orientation + tangent thrust for movement.
			// Construct parts use zero gravity — spinning in place is all torque gives you.
			torque := VecMul(action, torqueScale)
			moveDir := tangentToward(a.CurrentPos, a.PlanetCenter, a.TargetPos)
			align := max(0, VecDot(a.GetForward(), VecNorm(VecSub(a.TargetPos, a.CurrentPos))))
			thrust := VecMul(moveDir, thrustScale*(0.25+0.75*align))
			updates = append(updates, UpdateRequest{
				Type:        "update_construct",
				ConstructID: a.ID,
				Updates: []PartUpdate{
					{PartID: "core", Torque: &torque, LinearVelocity: &thrust},
				},
			})

			reward := calculateReward(a)
			tickReward += reward
			a.TotalReward += reward
			a.StepCount++
		}

		episodeReward += tickReward

		// Send all updates to server
		for _, u := range updates {
			d, _ := json.Marshal(u)
			writePacket(conn, d)
		}

		// Online training: supervised expert torques on current states (loom TrainOneBatch).
		if tickCount%4 == 0 {
			expertTargets := make([]float32, 0, numAgents*outputSize)
			for _, a := range agents {
				fwd := a.GetForward()
				targetDir := VecNorm(VecSub(a.TargetPos, a.CurrentPos))
				t := expertTorqueAction(fwd, targetDir)
				expertTargets = append(expertTargets, t[0], t[1], t[2])
			}
			loss, err := network.TrainOneBatch(allStates, expertTargets, learningRate)
			if err == nil {
				lastLoss = loss
				trainSteps++
			}
			if epsilon*epsilonDecay > epsilonMin {
				epsilon = epsilon * epsilonDecay
			}
		}

		// Episode metrics every 50 steps (~2.5s)
		if tickCount%50 == 0 {
			episode++
			avgReward := episodeReward / float32(numAgents)

			metrics := TrainingMetrics{
				Episode:   episode,
				AvgReward: avgReward,
				Loss:      float32(lastLoss),
				Epsilon:   epsilon,
				Timestep:  tickCount,
			}
			metricsHistory = append(metricsHistory, metrics)
			saveMetrics("swarm_training_log.csv", metricsHistory)

			fmt.Printf("📊 Episode %d - Avg Reward: %.3f - Train Loss: %.6f - Epsilon: %.3f - Steps: %d\n",
				episode, avgReward, lastLoss, epsilon, tickCount)
			episodeReward = 0
		} else if tickCount%25 == 0 {
			avgReward := episodeReward / float32(numAgents)
			fmt.Printf("🔄 Tick %d - Episode Reward: %.3f - Train Loss: %.6f - Epsilon: %.3f\n",
				tickCount, avgReward, lastLoss, epsilon)
		}

		// Save checkpoint every 500 steps
		if tickCount%500 == 0 {
			filename := fmt.Sprintf("swarm_model_checkpoint_%04d.bin", episode)
			modelID := fmt.Sprintf("swarm_ep_%d", episode)
			if err := network.SaveModel(filename, modelID); err == nil {
				fmt.Printf("💾 Saved checkpoint: %s\n", filename)
			} else {
				fmt.Printf("⚠️  Failed to save checkpoint: %v\n", err)
			}
		}
	}
}

func max(a, b float32) float32 {
	if a > b {
		return a
	}
	return b
}