SoulGlitch Tutorials Test 11: Walking Skeleton Reinforcement Learning

Test 11: Walking Skeleton Reinforcement Learning

Overview

Test 11 implements a distributed multi-agent reinforcement learning system where 20 humanoid skeleton creatures learn to walk upright using deep neural networks and policy gradient methods. This builds on the swarm RL concepts from Test 10but applies them to bipedal locomotion.

Technical Classification

Machine Learning Paradigm

  • Reinforcement Learning (RL): Agents learn locomotion through trial-and-error
  • Deep Reinforcement Learning: Neural network policy for joint control
  • Multi-Agent Learning: 20 skeletons share a single policy network
  • Continuous Control: 8-dimensional continuous action space (joint torques)
  • Batched Processing: All skeletons processed simultaneously

Training Method

  • Policy Gradient: Direct optimization of walking policy
  • Experience Replay: Decorrelates temporal dependencies in walking data
  • Epsilon-Greedy Exploration: Balances random movements vs learned gaits
  • Supervised Pre-training: Initializes with cyclic walking patterns (CPG-like)

System Architecture

Neural Network

Type: Feedforward Deep Neural Network (Multi-Layer Perceptron)

Architecture:

Input Layer:    20 neurons (state features)
Hidden Layer 1: 128 neurons (ScaledReLU activation)
Hidden Layer 2: 64 neurons (ScaledReLU activation)
Output Layer:   8 neurons (Tanh activation - joint torques)

Batch Processing: 20 skeletons × 20 features = 400 inputs processed per forward pass

State Representation (20 features per skeleton)

Each skeleton observes:

  1. Torso Position (3): x, y, z coordinates in world space
  2. Torso Rotation (3): pitch, yaw, roll Euler angles
  3. Linear Velocity (3): vx, vy, vz torso movement speed
  4. Angular Velocity (3): ωx, ωy, ωz torso rotational speed
  5. Joint Angles (4): Left leg, right leg, left arm, right arm angles
  6. Forward Direction (2): 2D heading vector (cos/sin of yaw)
  7. Height (1): Vertical distance from spawn point
  8. Distance Traveled (1): Total forward progress

Action Space (8 outputs per skeleton)

Continuous Joint Control: Torque commands for 8 degrees of freedom

Output assignments:

  1. left_hip_forward: Forward/backward leg swing
  2. left_hip_side: Hip abduction/adduction
  3. left_knee: Knee flexion/extension
  4. right_hip_forward: Forward/backward leg swing
  5. right_hip_side: Hip abduction/adduction
  6. right_knee: Knee flexion/extension
  7. left_arm: Arm swing (balance)
  8. right_arm: Arm swing (balance)

Output Range: [-1, 1] (Tanh activation)
Physical Scaling: Multiplied by 30 for actual torque application

Skeleton Morphology

Body Structure (Articulated Rigid Body)

Parts (10 capsules):

  • Torso: Main body (0.5×1.2m)
  • Head: Connected via pin joint
  • Upper Arms (2): Shoulder to elbow
  • Forearms (2): Elbow to hand
  • Thighs (2): Hip to knee
  • Shins (2): Knee to foot

Joints (8 pin constraints):

  • Neck (torso-head)
  • Shoulders (2 × torso-upper arm)
  • Elbows (2 × upper arm-forearm)
  • Hips (2 × torso-thigh)
  • Knees (2 × thigh-shin)

Physics Properties

  • Material: Rigid capsules with collision
  • Constraints: Pin joints allow rotation but constrain position
  • Gravity: Pulls skeleton downward
  • Friction: Ground contact for foot traction

Training Process

Phase 1: Supervised Pre-training

Purpose: Initialize network with basic cyclic walking pattern

Method: Central Pattern Generator (CPG) simulation

  • Generate 2000 synthetic walking cycles
  • Simple sinusoidal leg patterns: left = sin(phase), right = sin(phase + π)
  • Arm swing opposite to legs for balance
  • Train for 15 epochs using MSE loss

Result: Network learns basic rhythmic coordination

Phase 2: Reinforcement Learning

Algorithm: Policy Gradient with Experience Replay

Training Loop (100 ticks per episode, 20ms per tick):

  1. State Collection:

    • Observe torso position, rotation, velocities
    • Compute joint angles (if trackable)
    • Calculate height and distance metrics
  2. Action Selection: Epsilon-greedy (ε=0.5 → 0.1)

    • Exploration: Random joint torques
    • Exploitation: Network policy output
  3. Physics Simulation: Apply torques via pin joints

  4. Reward Calculation:

    reward = forward_velocity_reward + height_reward + stability_bonus + distance_bonus
    
    forward_velocity_reward = velocity_z × 5.0
    height_reward = upright_bonus if |height - target| < 0.5 else -fall_penalty
    stability_bonus = 0.5 if angular_velocity < 1.0
    distance_bonus = total_distance × 0.1
    
  5. Experience Storage: Store (state, action, reward) in replay buffer (capacity: 20,000)

  6. Network Update (every 2 ticks if buffer ≥ 64 samples):

    • Sample minibatch of 64 experiences
    • Forward pass → predict joint torques
    • Compute loss: MSE between predicted and advantage-adjusted actions
    • Target = action + advantage × 0.05
    • Backward pass → update weights
    • Learning rate: 0.005
  7. Epsilon Decay: ε = ε × 0.995 (until reaching 0.1 minimum)

Reward Function Breakdown

1. Forward Movement (Primary Objective)

forwardVel := velocity[2] // Z-axis
reward += forwardVel × 5.0

Encourages forward locomotion (main task).

2. Upright Posture

heightDiff := abs(torsoHeight - targetHeight)
if heightDiff < 0.5 {
    heightReward = 1.0 - heightDiff  // Bonus for staying upright
} else {
    heightReward = -heightDiff // Penalty for falling
}
reward += heightReward × 2.0

Prevents falling down (bipedal stability).

3. Stability Bonus

angularMag := sqrt(ωx² + ωy² + ωz²)
if angularMag < 1.0 {
    reward += 0.5
}

Rewards smooth, stable movement over erratic flailing.

4. Distance Bonus

reward += total_distance_traveled × 0.1

Long-term progress incentive.

Locomotion Challenges

Bipedal Walking is Hard!

Walking requires:

  1. Balance: Maintaining upright posture against gravity
  2. Coordination: Synchronizing leg and arm movements
  3. Stability: Preventing falls during weight transfer
  4. Efficiency: Minimizing energy (torque) expenditure
  5. Rhythm: Discovering periodic gait patterns

Expected Learning Curve

  • Episodes 0-20: Random flailing, frequent falls
  • Episodes 20-50: Discovering balance, some forward drift
  • Episodes 50-100: Emerging gait patterns, consistent forward movement
  • Episodes 100+: Refined walking, possibly running gaits

Key Implementation Details

Velocity Calculation

dt = 0.02 // 20ms per tick
velocity = (currentTorsoPos - lastTorsoPos) / dt

Batched Forward Pass

allStates = [skeleton0_state(20), skeleton1_state(20), ..., skeleton19_state(20)]
allOutputs = network.Forward(allStates) // 400 in → 160 out

Joint Torque Application

for i, skeleton := range skeletons {
    outputOffset := i * 8
    leftHipTorque = Vector3{
        outputs[outputOffset + 0] * 30,  // Forward/back
        0,
        outputs[outputOffset + 1] * 30,  // Side
    }
    // Apply to l_thigh part via UpdateRequest
}

Differences from Test 10 (Swarm RL)

Aspect Test 10 (Cubes) Test 11 (Skeletons)
Agents 100 simple cubes 20 articulated skeletons
State Size 15 features 20 features
Action Size 3 torques 8 torques
Complexity Rotation control only Full locomotion
Task Navigate to target Walk forward upright
Reward Alignment + exploration Forward + balance
Episode 50 ticks (1s) 100 ticks (2s)
Difficulty Moderate High

This system combines:

  • Bipedal Locomotion Control
  • Multi-Agent RL (MARL)
  • Continuous Action Spaces
  • Physics-Based Animation
  • Central Pattern Generators (CPG)
  • Policy Gradient Methods

Similar To:

  • DeepMind's MuJoCo humanoid walker
  • OpenAI Gym Walker2D/BipedalWalker
  • Evolution Strategies for walking gaits
  • Reinforcement Learning for prosthetic control
  • Boston Dynamics robot learning

Future Enhancements

Potential improvements:

  1. Curriculum Learning: Start with crawling, progress to walking/running
  2. Terrain Adaptation: Hills, obstacles, varying surfaces
  3. Multi-Objective: Walking + carrying objects
  4. Inverse Kinematics: Target foot placement
  5. Imitation Learning: Learn from motion capture data
  6. Adversarial Training: Push-recovery, navigation under

disturbances 7. Hierarchical Control: High-level gait selection + low-level joint control 8. Evolution Strategies: Explore morphology variations

Technical Stack

  • Language: Go
  • ML Framework: Custom loom/nn neural network library
  • Physics: Server-side articulated rigid body dynamics via WebSocket
  • Training: CPU-based backpropagation
  • Constraints: Pin joints for skeletal structure
  • Serialization: Binary checkpoint format

Performance Metrics

Training Indicators

  • Average Reward: Mean reward across 20 skeletons per episode
  • Epsilon: Current exploration rate
  • Buffer Size: Experiences stored (max 20,000)
  • Forward Distance: How far skeleton traveled

Success Criteria

  • Walking: Consistent forward velocity > 0.5 m/s
  • Stability: Torso height maintained within 20% of target
  • Episodes Without Falls: Count before tumbling

Checkpointing

  • Model saved every 10 episodes
  • Format: walking_skeleton_checkpoint_XXXX.bin
  • Includes network weights and optimizer state

Author: Walking Skeleton RL System
Date: 2026-02-04
Version: 1.0 - Distributed Multi-Agent Bipedal Locomotion
Based On: Test 10 Swarm RL Architecture

Go source

test11.go — run with go run . 11 from the repo root

package main

import (
	"encoding/json"
	"fmt"
	"math"
	"math/rand"
	"net"
	"time"
)

// --- Test 11: Walking Skeleton RL ---

// SkeletonAgent represents one learning skeleton creature
type SkeletonAgent struct {
	ID              string
	SpawnPos        Vector3
	TorsoPos        Vector3
	LastTorsoPos    Vector3
	Velocity        Vector3
	TorsoRotation   Vector3
	AngularVelocity Vector3
	TargetPos       Vector3 // Target bubble position to navigate to

	// Joint states (for observation)
	LeftLegAngle  float32
	RightLegAngle float32
	LeftArmAngle  float32
	RightArmAngle float32

	// Learning metrics
	TotalReward      float32
	StepCount        int
	DistanceTraveled float32
}

func (s *SkeletonAgent) GetState() []float32 {
	// Rich 23-feature state representation
	state := []float32{
		// Torso state (9)
		s.TorsoPos[0], s.TorsoPos[1], s.TorsoPos[2],
		s.TorsoRotation[0], s.TorsoRotation[1], s.TorsoRotation[2],
		s.Velocity[0], s.Velocity[1], s.Velocity[2],

		// Angular state (3)
		s.AngularVelocity[0], s.AngularVelocity[1], s.AngularVelocity[2],

		// Joint angles (4)
		s.LeftLegAngle, s.RightLegAngle,
		s.LeftArmAngle, s.RightArmAngle,

		// Forward direction to goal (2D heading)
		float32(math.Cos(float64(s.TorsoRotation[1]) * math.Pi / 180.0)),
		float32(math.Sin(float64(s.TorsoRotation[1]) * math.Pi / 180.0)),

		// Height above ground
		s.TorsoPos[1],

		// Distance traveled so far		float32(s.DistanceTraveled),
		float32(s.DistanceTraveled),
	}

	// Add target position relative to current position (3 features)
	relativeTarget := VecSub(s.TargetPos, s.TorsoPos)
	state = append(state, relativeTarget[0], relativeTarget[1], relativeTarget[2])

	return state
}

func RunTest11() {
	fmt.Println("🦴 Starting Test 11: WALKING SKELETON RL 🦴")

	conn, err := net.Dial("tcp", "localhost:17000")
	if err != nil {
		fmt.Printf("❌ Failed to connect to Construct Server: %v\n", err)
		return
	}
	defer conn.Close()

	// Query planet state for bubbles
	fmt.Println("📡 Querying world state for bubbles...")
	writePacket(conn, []byte(`{"type":"query_state"}`))

	buf := make([]byte, 32768)
	n, _ := conn.Read(buf)
	var state StateResponse
	json.Unmarshal(buf[:n], &state)

	if len(state.Bubbles) == 0 {
		fmt.Println("❌ No bubbles found in world state")
		return
	}

	numBubbles := len(state.Bubbles)
	skeletonsPerBubble := 3
	numSkeletons := numBubbles * skeletonsPerBubble

	fmt.Printf("✅ Found %d bubbles. Initializing %d skeletons...\n", numBubbles, numSkeletons)

	// Configuration
	inputSize := 23 // Extended state: position, rotation, velocities, target, bubble info
	outputSize := 8 // 8 joint torques

	planetCenter := Vector3{state.PlanetCenter[0], state.PlanetCenter[1], state.PlanetCenter[2]}

	// Create neural network
	fmt.Println("🏗️  Building Walking Neural Network...")
	fmt.Printf("   - Input: %d features per skeleton\n", inputSize)
	fmt.Printf("   - Architecture: Dense(%d → 128 → 64 → %d)\n", inputSize, outputSize)
	fmt.Printf("   - Batch size: %d skeletons processed together\n", numSkeletons)

	network := NewNetwork(inputSize, 1, 3, 1)
	network.BatchSize = numSkeletons

	layer0 := InitDenseLayer(inputSize, 128, ActivationScaledReLU)
	layer1 := InitDenseLayer(128, 64, ActivationScaledReLU)
	layer2 := InitDenseLayer(64, outputSize, ActivationTanh)

	network.SetLayer(0, 0, 0, layer0)
	network.SetLayer(0, 1, 0, layer1)
	network.SetLayer(0, 2, 0, layer2)
	network.GPU = false

	fmt.Println("✅ Network initialized")

	// Pre-training
	preTrainWalkingNetwork(network, inputSize, outputSize, numSkeletons)

	// Spawn skeletons around bubbles (like test10)
	skeletons := make([]*SkeletonAgent, numSkeletons)
	spawnOffset := float32(3.0) // Offset above bubble surface

	fmt.Printf("🚀 Spawning %d skeletons across %d bubbles...\n", numSkeletons, numBubbles)

	skeletonIdx := 0
	for i := 0; i < numBubbles; i++ {
		b := state.Bubbles[i]
		bPos := Vector3{b.Pos[0], b.Pos[1], b.Pos[2]}
		up := VecNorm(VecSub(bPos, planetCenter))

		// Target: next bubble in sequence
		nextIdx := (i + 1) % numBubbles
		targetBubble := state.Bubbles[nextIdx]
		targetPos := Vector3{targetBubble.Pos[0], targetBubble.Pos[1], targetBubble.Pos[2]}

		// Spawn skeletons in ring around bubble
		for j := 0; j < skeletonsPerBubble; j++ {
			theta := (float64(j) / float64(skeletonsPerBubble)) * 2.0 * math.Pi
			ringDist := float32(8.0)
			right, _, forward := MakeBasis(up)

			localOffset := Vector3{
				float32(math.Cos(theta)) * ringDist,
				0,
				float32(math.Sin(theta)) * ringDist,
			}
			worldOffset := TransformPoint(Vector3{0, 0, 0}, right, up, forward, localOffset)

			spawnPos := Vector3{
				bPos[0] + up[0]*spawnOffset + worldOffset[0],
				bPos[1] + up[1]*spawnOffset + worldOffset[1],
				bPos[2] + up[2]*spawnOffset + worldOffset[2],
			}

			id := fmt.Sprintf("skeleton_%d", skeletonIdx)

			skeletons[skeletonIdx] = &SkeletonAgent{
				ID:           id,
				SpawnPos:     spawnPos,
				TorsoPos:     Vector3{spawnPos[0], spawnPos[1] + 1.2, spawnPos[2]},
				LastTorsoPos: Vector3{spawnPos[0], spawnPos[1] + 1.2, spawnPos[2]},
				TargetPos:    targetPos, // Navigate to next bubble!
			}

			// Spawn skeleton
			createSkeleton(conn, id, spawnPos)
			skeletonIdx++
		}
	}

	fmt.Println("🚀 Starting Swarm Walking Training Loop...")

	// Start background goroutine to poll skeleton states from server
	go func() {
		pollTicker := time.NewTicker(50 * time.Millisecond) // Poll every 50ms
		defer pollTicker.Stop()

		pollBuf := make([]byte, 65536) // Larger buffer for all construct data

		for range pollTicker.C {
			// Request full state with all constructs
			writePacket(conn, []byte(`{"type":"query_constructs"}`))

			// Read response (non-blocking with timeout would be better but keep it simple)
			conn.SetReadDeadline(time.Now().Add(30 * time.Millisecond))
			n, err := conn.Read(pollBuf)
			if err != nil {
				continue // Skip this poll if timeout/error
			}
			conn.SetReadDeadline(time.Time{}) // Clear deadline

			// Try to parse as a generic response with parts
			var response struct {
				Type       string `json:"type"`
				Constructs []struct {
					ID    string `json:"id"`
					Parts []struct {
						ID  string  `json:"id"`
						Pos Vector3 `json:"pos"`
						Rot Vector3 `json:"rot"`
					} `json:"parts"`
				} `json:"constructs"`
			}

			if json.Unmarshal(pollBuf[:n], &response) == nil && len(response.Constructs) > 0 {
				// Update skeleton positions
				for _, construct := range response.Constructs {
					// Find matching skeleton
					for _, s := range skeletons {
						if s.ID == construct.ID {
							// Find torso part
							for _, part := range construct.Parts {
								if part.ID == "torso" {
									s.TorsoPos = part.Pos
									s.TorsoRotation = part.Rot

									// Update distance traveled
									dx := s.TorsoPos[0] - s.SpawnPos[0]
									dy := s.TorsoPos[1] - s.SpawnPos[1]
									dz := s.TorsoPos[2] - s.SpawnPos[2]
									dist := float32(math.Sqrt(float64(dx*dx + dy*dy + dz*dz)))
									if dist > s.DistanceTraveled {
										s.DistanceTraveled = dist
									}
									break
								}
							}
							break
						}
					}
				}
			}
		}
	}()

	// RL Training parameters
	learningRate := float32(0.005)
	epsilon := float32(0.5) // Start with high exploration
	epsilonMin := float32(0.1)
	epsilonDecay := float32(0.995)

	// Experience replay
	expBuffer := NewExperienceBuffer(20000)

	// Training loop
	ticker := time.NewTicker(20 * time.Millisecond)
	defer ticker.Stop()

	tickCount := 0
	episode := 0
	trainSteps := 0

	for range ticker.C {
		tickCount++

		// Collect all skeleton states
		allStates := make([]float32, 0, numSkeletons*inputSize)
		for _, s := range skeletons {
			// Calculate velocity
			dt := float32(0.02) // 20ms
			s.Velocity = VecMul(VecSub(s.TorsoPos, s.LastTorsoPos), 1.0/dt)
			s.LastTorsoPos = s.TorsoPos

			state := s.GetState()
			allStates = append(allStates, state...)
		}

		// Batched forward pass
		allOutputs, _ := network.Forward(allStates)

		// Apply actions to each skeleton
		updates := []UpdateRequest{}
		totalReward := float32(0)

		for i, s := range skeletons {
			// Extract this skeleton's action (8 torques)
			outputOffset := i * outputSize
			action := make([]float32, outputSize)

			// Epsilon-greedy
			if rand.Float32() < epsilon {
				// Random exploration
				for j := 0; j < outputSize; j++ {
					action[j] = rand.Float32()*2 - 1
				}
			} else {
				// Network policy
				for j := 0; j < outputSize; j++ {
					action[j] = allOutputs[outputOffset+j]
				}
			}

			// Scale torques
			torqueScale := float32(30.0)

			// Map actions to joint torques
			leftHipTorque := Vector3{action[0] * torqueScale, 0, action[1] * torqueScale}
			leftKneeTorque := Vector3{action[2] * torqueScale, 0, 0}
			rightHipTorque := Vector3{action[3] * torqueScale, 0, action[4] * torqueScale}
			rightKneeTorque := Vector3{action[5] * torqueScale, 0, 0}
			leftArmTorque := Vector3{0, 0, action[6] * torqueScale}
			rightArmTorque := Vector3{0, 0, action[7] * torqueScale}

			updates = append(updates, UpdateRequest{
				Type:        "update_construct",
				ConstructID: s.ID,
				Updates: []PartUpdate{
					{PartID: "l_thigh", Torque: &leftHipTorque},
					{PartID: "l_shin", Torque: &leftKneeTorque},
					{PartID: "r_thigh", Torque: &rightHipTorque},
					{PartID: "r_shin", Torque: &rightKneeTorque},
					{PartID: "l_fore", Torque: &leftArmTorque},
					{PartID: "r_fore", Torque: &rightArmTorque},
				},
			})

			// Calculate reward
			reward := calculateWalkingReward(s)
			totalReward += reward

			// Store experience
			stateOffset := i * inputSize
			exp := Experience{
				State:  allStates[stateOffset : stateOffset+inputSize],
				Action: action,
				Reward: reward,
			}
			expBuffer.Add(exp)

			s.TotalReward += reward
			s.StepCount++
		}

		// Send all updates
		for _, u := range updates {
			d, _ := json.Marshal(u)
			writePacket(conn, d)
		}

		// Note: We can't easily read back positions from server without complex state tracking
		// Instead, we rely on velocity calculations and assume physics is working
		// The skeletons will move based on applied torques

		// Training step (every 2 ticks)
		if tickCount%2 == 0 && expBuffer.Size >= 64 {
			batchSize := 64
			batch := expBuffer.Sample(batchSize)

			trainBatchStates := make([]float32, 0, batchSize*inputSize)
			trainBatchTargets := make([]float32, 0, batchSize*outputSize)

			for _, exp := range batch {
				trainBatchStates = append(trainBatchStates, exp.State...)

				// Policy gradient target
				advantage := exp.Reward
				for j := 0; j < outputSize; j++ {
					trainBatchTargets = append(trainBatchTargets, exp.Action[j]+advantage*0.05)
				}
			}

			// Train
			oldBatchSize := network.BatchSize
			network.BatchSize = batchSize

			output, _ := network.Forward(trainBatchStates)

			grad := make([]float32, len(output))
			totalLoss := float32(0)
			for j := 0; j < len(output); j++ {
				err := output[j] - trainBatchTargets[j]
				totalLoss += err * err
				grad[j] = err
			}

			network.Backward(grad)
			network.ApplyGradients(learningRate)

			network.BatchSize = oldBatchSize
			trainSteps++

			// Decay epsilon
			if epsilon*epsilonDecay > epsilonMin {
				epsilon = epsilon * epsilonDecay
			}
		}

		// Episode reset every 100 ticks (2 seconds)
		if tickCount%100 == 0 {
			episode++
			avgReward := totalReward / float32(numSkeletons)

			// Status update
			fmt.Printf("📊 Episode %d - Avg Reward: %.3f - Epsilon: %.3f - Buffer: %d\\n",
				episode, avgReward, epsilon, expBuffer.Size)

			// Save checkpoint every 10 episodes
			if episode%10 == 0 {
				filename := fmt.Sprintf("walking_skeleton_checkpoint_%04d.bin", episode)
				modelID := fmt.Sprintf("walk_ep_%d", episode)
				if err := network.SaveModel(filename, modelID); err == nil {
					fmt.Printf("💾 Saved checkpoint: %s\\n", filename)
				}
			}
		}
	}
}

func createSkeleton(conn net.Conn, id string, basePos Vector3) {
	createReq := ConstructRequest{
		Type:        "create_construct",
		ConstructID: id,
		Parts: []Part{
			// Torso
			{ID: "torso", Type: "capsule", Size: Vector3{0.5, 1.2, 0}, Pos: Vector3{basePos[0], basePos[1] + 1.2, basePos[2]}, Color: Vector3{0.9, 0.7, 0.5}},

			// Head
			{ID: "head", Type: "capsule", Size: Vector3{0.45, 0.9, 0}, Pos: Vector3{basePos[0], basePos[1] + 2.3, basePos[2]}, Color: Vector3{0.98, 0.92, 0.84}},

			// Arms
			{ID: "l_upper", Type: "capsule", Size: Vector3{0.22, 0.8, 0}, Pos: Vector3{basePos[0] - 0.9, basePos[1] + 1.7, basePos[2]}, Color: Vector3{0.44, 0.5, 0.56}, IsHorizontal: true},
			{ID: "l_fore", Type: "capsule", Size: Vector3{0.18, 0.7, 0}, Pos: Vector3{basePos[0] - 1.8, basePos[1] + 1.7, basePos[2]}, Color: Vector3{0.3, 0.3, 0.3}, IsHorizontal: true},
			{ID: "r_upper", Type: "capsule", Size: Vector3{0.22, 0.8, 0}, Pos: Vector3{basePos[0] + 0.9, basePos[1] + 1.7, basePos[2]}, Color: Vector3{0.44, 0.5, 0.56}, IsHorizontal: true},
			{ID: "r_fore", Type: "capsule", Size: Vector3{0.18, 0.7, 0}, Pos: Vector3{basePos[0] + 1.8, basePos[1] + 1.7, basePos[2]}, Color: Vector3{0.3, 0.3, 0.3}, IsHorizontal: true},

			// Legs
			{ID: "l_thigh", Type: "capsule", Size: Vector3{0.28, 0.9, 0}, Pos: Vector3{basePos[0] - 0.45, basePos[1] + 0.6, basePos[2]}, Color: Vector3{0.44, 0.5, 0.56}},
			{ID: "l_shin", Type: "capsule", Size: Vector3{0.22, 0.9, 0}, Pos: Vector3{basePos[0] - 0.45, basePos[1] - 0.3, basePos[2]}, Color: Vector3{0.3, 0.3, 0.3}},
			{ID: "r_thigh", Type: "capsule", Size: Vector3{0.28, 0.9, 0}, Pos: Vector3{basePos[0] + 0.45, basePos[1] + 0.6, basePos[2]}, Color: Vector3{0.44, 0.5, 0.56}},
			{ID: "r_shin", Type: "capsule", Size: Vector3{0.22, 0.9, 0}, Pos: Vector3{basePos[0] + 0.45, basePos[1] - 0.3, basePos[2]}, Color: Vector3{0.3, 0.3, 0.3}},
		},
		Joints: []Joint{
			{Type: "pin", A: "torso", B: "head", Pos: Vector3{basePos[0], basePos[1] + 2.0, basePos[2]}},
			{Type: "pin", A: "torso", B: "l_upper", Pos: Vector3{basePos[0] - 0.55, basePos[1] + 1.7, basePos[2]}},
			{Type: "pin", A: "l_upper", B: "l_fore", Pos: Vector3{basePos[0] - 1.3, basePos[1] + 1.7, basePos[2]}},
			{Type: "pin", A: "torso", B: "r_upper", Pos: Vector3{basePos[0] + 0.55, basePos[1] + 1.7, basePos[2]}},
			{Type: "pin", A: "r_upper", B: "r_fore", Pos: Vector3{basePos[0] + 1.3, basePos[1] + 1.7, basePos[2]}},
			{Type: "pin", A: "torso", B: "l_thigh", Pos: Vector3{basePos[0] - 0.45, basePos[1] + 1.1, basePos[2]}},
			{Type: "pin", A: "l_thigh", B: "l_shin", Pos: Vector3{basePos[0] - 0.45, basePos[1] + 0.1, basePos[2]}},
			{Type: "pin", A: "torso", B: "r_thigh", Pos: Vector3{basePos[0] + 0.45, basePos[1] + 1.1, basePos[2]}},
			{Type: "pin", A: "r_thigh", B: "r_shin", Pos: Vector3{basePos[0] + 0.45, basePos[1] + 0.1, basePos[2]}},
		},
	}

	data, _ := json.Marshal(createReq)
	writePacket(conn, data)
}

func calculateWalkingReward(s *SkeletonAgent) float32 {
	reward := float32(0)

	// 1. Forward movement reward (main objective)
	forwardVel := s.Velocity[2] // Z-axis is forward
	reward += forwardVel * 5.0

	// 2. Upright posture reward
	heightReward := float32(0)
	targetHeight := s.SpawnPos[1] + 1.2
	heightDiff := float32(math.Abs(float64(s.TorsoPos[1] - targetHeight)))
	if heightDiff < 0.5 {
		heightReward = 1.0 - heightDiff
	} else {
		heightReward = -heightDiff // Penalty for falling
	}
	reward += heightReward * 2.0

	// 3. Stability bonus (low angular velocity)
	angularMag := float32(math.Sqrt(float64(
		s.AngularVelocity[0]*s.AngularVelocity[0] +
			s.AngularVelocity[1]*s.AngularVelocity[1] +
			s.AngularVelocity[2]*s.AngularVelocity[2])))
	if angularMag < 1.0 {
		reward += 0.5
	}

	// 4. Energy efficiency (penalize excessive torque)
	// This is implicitly handled by the network learning efficient gaits

	// 5. Distance traveled bonus
	distBonus := s.DistanceTraveled * 0.1
	reward += distBonus

	return reward
}

func preTrainWalkingNetwork(network *Network, inputSize, outputSize, numSkeletons int) {
	fmt.Println("🛰️  Pre-Training on Walking Gaits...")

	// Generate synthetic walking data with cyclic patterns
	numSamples := numSkeletons // Match the actual number of skeletons
	inputs := make([][]float32, numSamples)
	targets := make([][]float32, numSamples)

	for i := 0; i < numSamples; i++ {
		phase := float32(i) * 0.1

		// Random starting state (23 features)
		state := make([]float32, inputSize)
		for j := 0; j < inputSize; j++ {
			state[j] = rand.Float32()*2 - 1
		}
		inputs[i] = state

		// Cyclic walking pattern (simple CPG-like)
		leftLegPhase := float32(math.Sin(float64(phase)))
		rightLegPhase := float32(math.Sin(float64(phase + math.Pi)))

		targets[i] = []float32{
			leftLegPhase * 0.5,   // Left hip forward/back
			0,                    // Left hip side
			leftLegPhase * 0.3,   // Left knee
			rightLegPhase * 0.5,  // Right hip forward/back
			0,                    // Right hip side
			rightLegPhase * 0.3,  // Right knee
			-leftLegPhase * 0.2,  // Left arm swing (opposite)
			-rightLegPhase * 0.2, // Right arm swing (opposite)
		}
	}

	config := DefaultTrainingConfig()
	config.Epochs = 15
	config.LearningRate = 0.01
	config.UseGPU = false
	config.Verbose = true
	config.LossType = "mse"

	network.TrainStandard(inputs, targets, config)

	fmt.Println("✅ Pre-Training Complete")
}