Test 11: Walking Skeleton Reinforcement Learning

Overview

Test 11 implements a distributed multi-agent reinforcement learning system where 20 humanoid skeleton creatures learn to walk upright using deep neural networks and policy gradient methods. This builds on the swarm RL concepts from Test 10but applies them to bipedal locomotion.

Technical Classification

Machine Learning Paradigm

Reinforcement Learning (RL): Agents learn locomotion through trial-and-error
Deep Reinforcement Learning: Neural network policy for joint control
Multi-Agent Learning: 20 skeletons share a single policy network
Continuous Control: 8-dimensional continuous action space (joint torques)
Batched Processing: All skeletons processed simultaneously

Training Method

Policy Gradient: Direct optimization of walking policy
Experience Replay: Decorrelates temporal dependencies in walking data
Epsilon-Greedy Exploration: Balances random movements vs learned gaits
Supervised Pre-training: Initializes with cyclic walking patterns (CPG-like)

System Architecture

Neural Network

Type: Feedforward Deep Neural Network (Multi-Layer Perceptron)

Architecture:

Input Layer:    20 neurons (state features)
Hidden Layer 1: 128 neurons (ScaledReLU activation)
Hidden Layer 2: 64 neurons (ScaledReLU activation)
Output Layer:   8 neurons (Tanh activation - joint torques)

Batch Processing: 20 skeletons × 20 features = 400 inputs processed per forward pass

State Representation (20 features per skeleton)

Each skeleton observes:

Torso Position (3): x, y, z coordinates in world space
Torso Rotation (3): pitch, yaw, roll Euler angles
Linear Velocity (3): vx, vy, vz torso movement speed
Angular Velocity (3): ωx, ωy, ωz torso rotational speed
Joint Angles (4): Left leg, right leg, left arm, right arm angles
Forward Direction (2): 2D heading vector (cos/sin of yaw)
Height (1): Vertical distance from spawn point
Distance Traveled (1): Total forward progress

Action Space (8 outputs per skeleton)

Continuous Joint Control: Torque commands for 8 degrees of freedom

Output assignments:

left_hip_forward: Forward/backward leg swing
left_hip_side: Hip abduction/adduction
left_knee: Knee flexion/extension
right_hip_forward: Forward/backward leg swing
right_hip_side: Hip abduction/adduction
right_knee: Knee flexion/extension
left_arm: Arm swing (balance)
right_arm: Arm swing (balance)

Output Range: [-1, 1] (Tanh activation)
Physical Scaling: Multiplied by 30 for actual torque application

Skeleton Morphology

Body Structure (Articulated Rigid Body)

Parts (10 capsules):

Torso: Main body (0.5×1.2m)
Head: Connected via pin joint
Upper Arms (2): Shoulder to elbow
Forearms (2): Elbow to hand
Thighs (2): Hip to knee
Shins (2): Knee to foot

Joints (8 pin constraints):

Neck (torso-head)
Shoulders (2 × torso-upper arm)
Elbows (2 × upper arm-forearm)
Hips (2 × torso-thigh)
Knees (2 × thigh-shin)

Physics Properties

Material: Rigid capsules with collision
Constraints: Pin joints allow rotation but constrain position
Gravity: Pulls skeleton downward
Friction: Ground contact for foot traction

Training Process

Phase 1: Supervised Pre-training

Purpose: Initialize network with basic cyclic walking pattern

Method: Central Pattern Generator (CPG) simulation

Generate 2000 synthetic walking cycles
Simple sinusoidal leg patterns: left = sin(phase), right = sin(phase + π)
Arm swing opposite to legs for balance
Train for 15 epochs using MSE loss

Result: Network learns basic rhythmic coordination

Phase 2: Reinforcement Learning

Algorithm: Policy Gradient with Experience Replay

Training Loop (100 ticks per episode, 20ms per tick):

State Collection:
- Observe torso position, rotation, velocities
- Compute joint angles (if trackable)
- Calculate height and distance metrics
Action Selection: Epsilon-greedy (ε=0.5 → 0.1)
- Exploration: Random joint torques
- Exploitation: Network policy output
Physics Simulation: Apply torques via pin joints

Reward Calculation:

reward = forward_velocity_reward + height_reward + stability_bonus + distance_bonus

forward_velocity_reward = velocity_z × 5.0
height_reward = upright_bonus if |height - target| < 0.5 else -fall_penalty
stability_bonus = 0.5 if angular_velocity < 1.0
distance_bonus = total_distance × 0.1

Experience Storage: Store (state, action, reward) in replay buffer (capacity: 20,000)
Network Update (every 2 ticks if buffer ≥ 64 samples):
- Sample minibatch of 64 experiences
- Forward pass → predict joint torques
- Compute loss: MSE between predicted and advantage-adjusted actions
- Target = action + advantage × 0.05
- Backward pass → update weights
- Learning rate: 0.005
Epsilon Decay: ε = ε × 0.995 (until reaching 0.1 minimum)

Reward Function Breakdown

1. Forward Movement (Primary Objective)

forwardVel := velocity[2] // Z-axis
reward += forwardVel × 5.0

Encourages forward locomotion (main task).

2. Upright Posture

heightDiff := abs(torsoHeight - targetHeight)
if heightDiff < 0.5 {
    heightReward = 1.0 - heightDiff  // Bonus for staying upright
} else {
    heightReward = -heightDiff // Penalty for falling
}
reward += heightReward × 2.0

Prevents falling down (bipedal stability).

3. Stability Bonus

angularMag := sqrt(ωx² + ωy² + ωz²)
if angularMag < 1.0 {
    reward += 0.5
}

Rewards smooth, stable movement over erratic flailing.

4. Distance Bonus

reward += total_distance_traveled × 0.1

Long-term progress incentive.

Locomotion Challenges

Bipedal Walking is Hard!

Walking requires:

Balance: Maintaining upright posture against gravity
Coordination: Synchronizing leg and arm movements
Stability: Preventing falls during weight transfer
Efficiency: Minimizing energy (torque) expenditure
Rhythm: Discovering periodic gait patterns

Expected Learning Curve

Episodes 0-20: Random flailing, frequent falls
Episodes 20-50: Discovering balance, some forward drift
Episodes 50-100: Emerging gait patterns, consistent forward movement
Episodes 100+: Refined walking, possibly running gaits

Key Implementation Details

Velocity Calculation

dt = 0.02 // 20ms per tick
velocity = (currentTorsoPos - lastTorsoPos) / dt

Batched Forward Pass

allStates = [skeleton0_state(20), skeleton1_state(20), ..., skeleton19_state(20)]
allOutputs = network.Forward(allStates) // 400 in → 160 out

Joint Torque Application

for i, skeleton := range skeletons {
    outputOffset := i * 8
    leftHipTorque = Vector3{
        outputs[outputOffset + 0] * 30,  // Forward/back
        0,
        outputs[outputOffset + 1] * 30,  // Side
    }
    // Apply to l_thigh part via UpdateRequest
}

Differences from Test 10 (Swarm RL)

Aspect	Test 10 (Cubes)	Test 11 (Skeletons)
Agents	100 simple cubes	20 articulated skeletons
State Size	15 features	20 features
Action Size	3 torques	8 torques
Complexity	Rotation control only	Full locomotion
Task	Navigate to target	Walk forward upright
Reward	Alignment + exploration	Forward + balance
Episode	50 ticks (1s)	100 ticks (2s)
Difficulty	Moderate	High

This system combines:

Bipedal Locomotion Control
Multi-Agent RL (MARL)
Continuous Action Spaces
Physics-Based Animation
Central Pattern Generators (CPG)
Policy Gradient Methods

Similar To:

DeepMind's MuJoCo humanoid walker
OpenAI Gym Walker2D/BipedalWalker
Evolution Strategies for walking gaits
Reinforcement Learning for prosthetic control
Boston Dynamics robot learning

Future Enhancements

Potential improvements:

Curriculum Learning: Start with crawling, progress to walking/running
Terrain Adaptation: Hills, obstacles, varying surfaces
Multi-Objective: Walking + carrying objects
Inverse Kinematics: Target foot placement
Imitation Learning: Learn from motion capture data
Adversarial Training: Push-recovery, navigation under

disturbances 7. Hierarchical Control: High-level gait selection + low-level joint control 8. Evolution Strategies: Explore morphology variations

Technical Stack

Language: Go
ML Framework: Custom loom/nn neural network library
Physics: Server-side articulated rigid body dynamics via WebSocket
Training: CPU-based backpropagation
Constraints: Pin joints for skeletal structure
Serialization: Binary checkpoint format

Performance Metrics

Training Indicators

Average Reward: Mean reward across 20 skeletons per episode
Epsilon: Current exploration rate
Buffer Size: Experiences stored (max 20,000)
Forward Distance: How far skeleton traveled

Success Criteria

Walking: Consistent forward velocity > 0.5 m/s
Stability: Torso height maintained within 20% of target
Episodes Without Falls: Count before tumbling

Checkpointing

Model saved every 10 episodes
Format: walking_skeleton_checkpoint_XXXX.bin
Includes network weights and optimizer state

Author: Walking Skeleton RL System
Date: 2026-02-04
Version: 1.0 - Distributed Multi-Agent Bipedal Locomotion
Based On: Test 10 Swarm RL Architecture

Go source

test11.go — run with go run . 11 from the repo root

package main

import (
	"encoding/json"
	"fmt"
	"math"
	"math/rand"
	"net"
	"time"
)

// --- Test 11: Walking Skeleton RL ---

// SkeletonAgent represents one learning skeleton creature
type SkeletonAgent struct {
	ID              string
	SpawnPos        Vector3
	TorsoPos        Vector3
	LastTorsoPos    Vector3
	Velocity        Vector3
	TorsoRotation   Vector3
	AngularVelocity Vector3
	TargetPos       Vector3 // Target bubble position to navigate to

	// Joint states (for observation)
	LeftLegAngle  float32
	RightLegAngle float32
	LeftArmAngle  float32
	RightArmAngle float32

	// Learning metrics
	TotalReward      float32
	StepCount        int
	DistanceTraveled float32
}

func (s *SkeletonAgent) GetState() []float32 {
	// Rich 23-feature state representation
	state := []float32{
		// Torso state (9)
		s.TorsoPos[0], s.TorsoPos[1], s.TorsoPos[2],
		s.TorsoRotation[0], s.TorsoRotation[1], s.TorsoRotation[2],
		s.Velocity[0], s.Velocity[1], s.Velocity[2],

		// Angular state (3)
		s.AngularVelocity[0], s.AngularVelocity[1], s.AngularVelocity[2],

		// Joint angles (4)
		s.LeftLegAngle, s.RightLegAngle,
		s.LeftArmAngle, s.RightArmAngle,

		// Forward direction to goal (2D heading)
		float32(math.Cos(float64(s.TorsoRotation[1]) * math.Pi / 180.0)),
		float32(math.Sin(float64(s.TorsoRotation[1]) * math.Pi / 180.0)),

		// Height above ground
		s.TorsoPos[1],

		// Distance traveled so far		float32(s.DistanceTraveled),
		float32(s.DistanceTraveled),
	}

	// Add target position relative to current position (3 features)
	relativeTarget := VecSub(s.TargetPos, s.TorsoPos)
	state = append(state, relativeTarget[0], relativeTarget[1], relativeTarget[2])

	return state
}

func RunTest11() {
	fmt.Println("🦴 Starting Test 11: WALKING SKELETON RL 🦴")

	conn, err := net.Dial("tcp", "localhost:17000")
	if err != nil {
		fmt.Printf("❌ Failed to connect to Construct Server: %v\n", err)
		return
	}
	defer conn.Close()

	// Query planet state for bubbles
	fmt.Println("📡 Querying world state for bubbles...")
	writePacket(conn, []byte(`{"type":"query_state"}`))

	buf := make([]byte, 32768)
	n, _ := conn.Read(buf)
	var state StateResponse
	json.Unmarshal(buf[:n], &state)

	if len(state.Bubbles) == 0 {
		fmt.Println("❌ No bubbles found in world state")
		return
	}

	numBubbles := len(state.Bubbles)
	skeletonsPerBubble := 3
	numSkeletons := numBubbles * skeletonsPerBubble

	fmt.Printf("✅ Found %d bubbles. Initializing %d skeletons...\n", numBubbles, numSkeletons)

	// Configuration
	inputSize := 23 // Extended state: position, rotation, velocities, target, bubble info
	outputSize := 8 // 8 joint torques

	planetCenter := Vector3{state.PlanetCenter[0], state.PlanetCenter[1], state.PlanetCenter[2]}

	// Create neural network
	fmt.Println("🏗️  Building Walking Neural Network...")
	fmt.Printf("   - Input: %d features per skeleton\n", inputSize)
	fmt.Printf("   - Architecture: Dense(%d → 128 → 64 → %d)\n", inputSize, outputSize)
	fmt.Printf("   - Batch size: %d skeletons processed together\n", numSkeletons)

	network := NewNetwork(inputSize, 1, 3, 1)
	network.BatchSize = numSkeletons

	layer0 := InitDenseLayer(inputSize, 128, ActivationScaledReLU)
	layer1 := InitDenseLayer(128, 64, ActivationScaledReLU)
	layer2 := InitDenseLayer(64, outputSize, ActivationTanh)

	network.SetLayer(0, 0, 0, layer0)
	network.SetLayer(0, 1, 0, layer1)
	network.SetLayer(0, 2, 0, layer2)
	network.GPU = false

	fmt.Println("✅ Network initialized")

	// Pre-training
	preTrainWalkingNetwork(network, inputSize, outputSize, numSkeletons)

	// Spawn skeletons around bubbles (like test10)
	skeletons := make([]*SkeletonAgent, numSkeletons)
	spawnOffset := float32(3.0) // Offset above bubble surface

	fmt.Printf("🚀 Spawning %d skeletons across %d bubbles...\n", numSkeletons, numBubbles)

	skeletonIdx := 0
	for i := 0; i < numBubbles; i++ {
		b := state.Bubbles[i]
		bPos := Vector3{b.Pos[0], b.Pos[1], b.Pos[2]}
		up := VecNorm(VecSub(bPos, planetCenter))

		// Target: next bubble in sequence
		nextIdx := (i + 1) % numBubbles
		targetBubble := state.Bubbles[nextIdx]
		targetPos := Vector3{targetBubble.Pos[0], targetBubble.Pos[1], targetBubble.Pos[2]}

		// Spawn skeletons in ring around bubble
		for j := 0; j < skeletonsPerBubble; j++ {
			theta := (float64(j) / float64(skeletonsPerBubble)) * 2.0 * math.Pi
			ringDist := float32(8.0)
			right, _, forward := MakeBasis(up)

			localOffset := Vector3{
				float32(math.Cos(theta)) * ringDist,
				0,
				float32(math.Sin(theta)) * ringDist,
			}
			worldOffset := TransformPoint(Vector3{0, 0, 0}, right, up, forward, localOffset)

			spawnPos := Vector3{
				bPos[0] + up[0]*spawnOffset + worldOffset[0],
				bPos[1] + up[1]*spawnOffset + worldOffset[1],
				bPos[2] + up[2]*spawnOffset + worldOffset[2],
			}

			id := fmt.Sprintf("skeleton_%d", skeletonIdx)

			skeletons[skeletonIdx] = &SkeletonAgent{
				ID:           id,
				SpawnPos:     spawnPos,
				TorsoPos:     Vector3{spawnPos[0], spawnPos[1] + 1.2, spawnPos[2]},
				LastTorsoPos: Vector3{spawnPos[0], spawnPos[1] + 1.2, spawnPos[2]},
				TargetPos:    targetPos, // Navigate to next bubble!
			}

			// Spawn skeleton
			createSkeleton(conn, id, spawnPos)
			skeletonIdx++
		}
	}

	fmt.Println("🚀 Starting Swarm Walking Training Loop...")

	// Start background goroutine to poll skeleton states from server
	go func() {
		pollTicker := time.NewTicker(50 * time.Millisecond) // Poll every 50ms
		defer pollTicker.Stop()

		pollBuf := make([]byte, 65536) // Larger buffer for all construct data

		for range pollTicker.C {
			// Request full state with all constructs
			writePacket(conn, []byte(`{"type":"query_constructs"}`))

			// Read response (non-blocking with timeout would be better but keep it simple)
			conn.SetReadDeadline(time.Now().Add(30 * time.Millisecond))
			n, err := conn.Read(pollBuf)
			if err != nil {
				continue // Skip this poll if timeout/error
			}
			conn.SetReadDeadline(time.Time{}) // Clear deadline

			// Try to parse as a generic response with parts
			var response struct {
				Type       string `json:"type"`
				Constructs []struct {
					ID    string `json:"id"`
					Parts []struct {
						ID  string  `json:"id"`
						Pos Vector3 `json:"pos"`
						Rot Vector3 `json:"rot"`
					} `json:"parts"`
				} `json:"constructs"`
			}

			if json.Unmarshal(pollBuf[:n], &response) == nil && len(response.Constructs) > 0 {
				// Update skeleton positions
				for _, construct := range response.Constructs {
					// Find matching skeleton
					for _, s := range skeletons {
						if s.ID == construct.ID {
							// Find torso part
							for _, part := range construct.Parts {
								if part.ID == "torso" {
									s.TorsoPos = part.Pos
									s.TorsoRotation = part.Rot

									// Update distance traveled
									dx := s.TorsoPos[0] - s.SpawnPos[0]
									dy := s.TorsoPos[1] - s.SpawnPos[1]
									dz := s.TorsoPos[2] - s.SpawnPos[2]
									dist := float32(math.Sqrt(float64(dx*dx + dy*dy + dz*dz)))
									if dist > s.DistanceTraveled {
										s.DistanceTraveled = dist
									}
									break
								}
							}
							break
						}
					}
				}
			}
		}
	}()

	// RL Training parameters
	learningRate := float32(0.005)
	epsilon := float32(0.5) // Start with high exploration
	epsilonMin := float32(0.1)
	epsilonDecay := float32(0.995)

	// Experience replay
	expBuffer := NewExperienceBuffer(20000)

	// Training loop
	ticker := time.NewTicker(20 * time.Millisecond)
	defer ticker.Stop()

	tickCount := 0
	episode := 0
	trainSteps := 0

	for range ticker.C {
		tickCount++

		// Collect all skeleton states
		allStates := make([]float32, 0, numSkeletons*inputSize)
		for _, s := range skeletons {
			// Calculate velocity
			dt := float32(0.02) // 20ms
			s.Velocity = VecMul(VecSub(s.TorsoPos, s.LastTorsoPos), 1.0/dt)
			s.LastTorsoPos = s.TorsoPos

			state := s.GetState()
			allStates = append(allStates, state...)
		}

		// Batched forward pass
		allOutputs, _ := network.Forward(allStates)

		// Apply actions to each skeleton
		updates := []UpdateRequest{}
		totalReward := float32(0)

		for i, s := range skeletons {
			// Extract this skeleton's action (8 torques)
			outputOffset := i * outputSize
			action := make([]float32, outputSize)

			// Epsilon-greedy
			if rand.Float32() < epsilon {
				// Random exploration
				for j := 0; j < outputSize; j++ {
					action[j] = rand.Float32()*2 - 1
				}
			} else {
				// Network policy
				for j := 0; j < outputSize; j++ {
					action[j] = allOutputs[outputOffset+j]
				}
			}

			// Scale torques
			torqueScale := float32(30.0)

			// Map actions to joint torques
			leftHipTorque := Vector3{action[0] * torqueScale, 0, action[1] * torqueScale}
			leftKneeTorque := Vector3{action[2] * torqueScale, 0, 0}
			rightHipTorque := Vector3{action[3] * torqueScale, 0, action[4] * torqueScale}
			rightKneeTorque := Vector3{action[5] * torqueScale, 0, 0}
			leftArmTorque := Vector3{0, 0, action[6] * torqueScale}
			rightArmTorque := Vector3{0, 0, action[7] * torqueScale}

			updates = append(updates, UpdateRequest{
				Type:        "update_construct",
				ConstructID: s.ID,
				Updates: []PartUpdate{
					{PartID: "l_thigh", Torque: &leftHipTorque},
					{PartID: "l_shin", Torque: &leftKneeTorque},
					{PartID: "r_thigh", Torque: &rightHipTorque},
					{PartID: "r_shin", Torque: &rightKneeTorque},
					{PartID: "l_fore", Torque: &leftArmTorque},
					{PartID: "r_fore", Torque: &rightArmTorque},
				},
			})

			// Calculate reward
			reward := calculateWalkingReward(s)
			totalReward += reward

			// Store experience
			stateOffset := i * inputSize
			exp := Experience{
				State:  allStates[stateOffset : stateOffset+inputSize],
				Action: action,
				Reward: reward,
			}
			expBuffer.Add(exp)

			s.TotalReward += reward
			s.StepCount++
		}

		// Send all updates
		for _, u := range updates {
			d, _ := json.Marshal(u)
			writePacket(conn, d)
		}

		// Note: We can't easily read back positions from server without complex state tracking
		// Instead, we rely on velocity calculations and assume physics is working
		// The skeletons will move based on applied torques

		// Training step (every 2 ticks)
		if tickCount%2 == 0 && expBuffer.Size >= 64 {
			batchSize := 64
			batch := expBuffer.Sample(batchSize)

			trainBatchStates := make([]float32, 0, batchSize*inputSize)
			trainBatchTargets := make([]float32, 0, batchSize*outputSize)

			for _, exp := range batch {
				trainBatchStates = append(trainBatchStates, exp.State...)

				// Policy gradient target
				advantage := exp.Reward
				for j := 0; j < outputSize; j++ {
					trainBatchTargets = append(trainBatchTargets, exp.Action[j]+advantage*0.05)
				}
			}

			// Train
			oldBatchSize := network.BatchSize
			network.BatchSize = batchSize

			output, _ := network.Forward(trainBatchStates)

			grad := make([]float32, len(output))
			totalLoss := float32(0)
			for j := 0; j < len(output); j++ {
				err := output[j] - trainBatchTargets[j]
				totalLoss += err * err
				grad[j] = err
			}

			network.Backward(grad)
			network.ApplyGradients(learningRate)

			network.BatchSize = oldBatchSize
			trainSteps++

			// Decay epsilon
			if epsilon*epsilonDecay > epsilonMin {
				epsilon = epsilon * epsilonDecay
			}
		}

		// Episode reset every 100 ticks (2 seconds)
		if tickCount%100 == 0 {
			episode++
			avgReward := totalReward / float32(numSkeletons)

			// Status update
			fmt.Printf("📊 Episode %d - Avg Reward: %.3f - Epsilon: %.3f - Buffer: %d\\n",
				episode, avgReward, epsilon, expBuffer.Size)

			// Save checkpoint every 10 episodes
			if episode%10 == 0 {
				filename := fmt.Sprintf("walking_skeleton_checkpoint_%04d.bin", episode)
				modelID := fmt.Sprintf("walk_ep_%d", episode)
				if err := network.SaveModel(filename, modelID); err == nil {
					fmt.Printf("💾 Saved checkpoint: %s\\n", filename)
				}
			}
		}
	}
}

func createSkeleton(conn net.Conn, id string, basePos Vector3) {
	createReq := ConstructRequest{
		Type:        "create_construct",
		ConstructID: id,
		Parts: []Part{
			// Torso
			{ID: "torso", Type: "capsule", Size: Vector3{0.5, 1.2, 0}, Pos: Vector3{basePos[0], basePos[1] + 1.2, basePos[2]}, Color: Vector3{0.9, 0.7, 0.5}},

			// Head
			{ID: "head", Type: "capsule", Size: Vector3{0.45, 0.9, 0}, Pos: Vector3{basePos[0], basePos[1] + 2.3, basePos[2]}, Color: Vector3{0.98, 0.92, 0.84}},

			// Arms
			{ID: "l_upper", Type: "capsule", Size: Vector3{0.22, 0.8, 0}, Pos: Vector3{basePos[0] - 0.9, basePos[1] + 1.7, basePos[2]}, Color: Vector3{0.44, 0.5, 0.56}, IsHorizontal: true},
			{ID: "l_fore", Type: "capsule", Size: Vector3{0.18, 0.7, 0}, Pos: Vector3{basePos[0] - 1.8, basePos[1] + 1.7, basePos[2]}, Color: Vector3{0.3, 0.3, 0.3}, IsHorizontal: true},
			{ID: "r_upper", Type: "capsule", Size: Vector3{0.22, 0.8, 0}, Pos: Vector3{basePos[0] + 0.9, basePos[1] + 1.7, basePos[2]}, Color: Vector3{0.44, 0.5, 0.56}, IsHorizontal: true},
			{ID: "r_fore", Type: "capsule", Size: Vector3{0.18, 0.7, 0}, Pos: Vector3{basePos[0] + 1.8, basePos[1] + 1.7, basePos[2]}, Color: Vector3{0.3, 0.3, 0.3}, IsHorizontal: true},

			// Legs
			{ID: "l_thigh", Type: "capsule", Size: Vector3{0.28, 0.9, 0}, Pos: Vector3{basePos[0] - 0.45, basePos[1] + 0.6, basePos[2]}, Color: Vector3{0.44, 0.5, 0.56}},
			{ID: "l_shin", Type: "capsule", Size: Vector3{0.22, 0.9, 0}, Pos: Vector3{basePos[0] - 0.45, basePos[1] - 0.3, basePos[2]}, Color: Vector3{0.3, 0.3, 0.3}},
			{ID: "r_thigh", Type: "capsule", Size: Vector3{0.28, 0.9, 0}, Pos: Vector3{basePos[0] + 0.45, basePos[1] + 0.6, basePos[2]}, Color: Vector3{0.44, 0.5, 0.56}},
			{ID: "r_shin", Type: "capsule", Size: Vector3{0.22, 0.9, 0}, Pos: Vector3{basePos[0] + 0.45, basePos[1] - 0.3, basePos[2]}, Color: Vector3{0.3, 0.3, 0.3}},
		},
		Joints: []Joint{
			{Type: "pin", A: "torso", B: "head", Pos: Vector3{basePos[0], basePos[1] + 2.0, basePos[2]}},
			{Type: "pin", A: "torso", B: "l_upper", Pos: Vector3{basePos[0] - 0.55, basePos[1] + 1.7, basePos[2]}},
			{Type: "pin", A: "l_upper", B: "l_fore", Pos: Vector3{basePos[0] - 1.3, basePos[1] + 1.7, basePos[2]}},
			{Type: "pin", A: "torso", B: "r_upper", Pos: Vector3{basePos[0] + 0.55, basePos[1] + 1.7, basePos[2]}},
			{Type: "pin", A: "r_upper", B: "r_fore", Pos: Vector3{basePos[0] + 1.3, basePos[1] + 1.7, basePos[2]}},
			{Type: "pin", A: "torso", B: "l_thigh", Pos: Vector3{basePos[0] - 0.45, basePos[1] + 1.1, basePos[2]}},
			{Type: "pin", A: "l_thigh", B: "l_shin", Pos: Vector3{basePos[0] - 0.45, basePos[1] + 0.1, basePos[2]}},
			{Type: "pin", A: "torso", B: "r_thigh", Pos: Vector3{basePos[0] + 0.45, basePos[1] + 1.1, basePos[2]}},
			{Type: "pin", A: "r_thigh", B: "r_shin", Pos: Vector3{basePos[0] + 0.45, basePos[1] + 0.1, basePos[2]}},
		},
	}

	data, _ := json.Marshal(createReq)
	writePacket(conn, data)
}

func calculateWalkingReward(s *SkeletonAgent) float32 {
	reward := float32(0)

	// 1. Forward movement reward (main objective)
	forwardVel := s.Velocity[2] // Z-axis is forward
	reward += forwardVel * 5.0

	// 2. Upright posture reward
	heightReward := float32(0)
	targetHeight := s.SpawnPos[1] + 1.2
	heightDiff := float32(math.Abs(float64(s.TorsoPos[1] - targetHeight)))
	if heightDiff < 0.5 {
		heightReward = 1.0 - heightDiff
	} else {
		heightReward = -heightDiff // Penalty for falling
	}
	reward += heightReward * 2.0

	// 3. Stability bonus (low angular velocity)
	angularMag := float32(math.Sqrt(float64(
		s.AngularVelocity[0]*s.AngularVelocity[0] +
			s.AngularVelocity[1]*s.AngularVelocity[1] +
			s.AngularVelocity[2]*s.AngularVelocity[2])))
	if angularMag < 1.0 {
		reward += 0.5
	}

	// 4. Energy efficiency (penalize excessive torque)
	// This is implicitly handled by the network learning efficient gaits

	// 5. Distance traveled bonus
	distBonus := s.DistanceTraveled * 0.1
	reward += distBonus

	return reward
}

func preTrainWalkingNetwork(network *Network, inputSize, outputSize, numSkeletons int) {
	fmt.Println("🛰️  Pre-Training on Walking Gaits...")

	// Generate synthetic walking data with cyclic patterns
	numSamples := numSkeletons // Match the actual number of skeletons
	inputs := make([][]float32, numSamples)
	targets := make([][]float32, numSamples)

	for i := 0; i < numSamples; i++ {
		phase := float32(i) * 0.1

		// Random starting state (23 features)
		state := make([]float32, inputSize)
		for j := 0; j < inputSize; j++ {
			state[j] = rand.Float32()*2 - 1
		}
		inputs[i] = state

		// Cyclic walking pattern (simple CPG-like)
		leftLegPhase := float32(math.Sin(float64(phase)))
		rightLegPhase := float32(math.Sin(float64(phase + math.Pi)))

		targets[i] = []float32{
			leftLegPhase * 0.5,   // Left hip forward/back
			0,                    // Left hip side
			leftLegPhase * 0.3,   // Left knee
			rightLegPhase * 0.5,  // Right hip forward/back
			0,                    // Right hip side
			rightLegPhase * 0.3,  // Right knee
			-leftLegPhase * 0.2,  // Left arm swing (opposite)
			-rightLegPhase * 0.2, // Right arm swing (opposite)
		}
	}

	config := DefaultTrainingConfig()
	config.Epochs = 15
	config.LearningRate = 0.01
	config.UseGPU = false
	config.Verbose = true
	config.LossType = "mse"

	network.TrainStandard(inputs, targets, config)

	fmt.Println("✅ Pre-Training Complete")
}