# OpenFluke — full corpus

> Complete OpenFluke website text and Loom documentation for LLM ingestion. Navigation index: https://openfluke.com/llms.txt

## Site pages (full text)

---

Source: https://openfluke.com/

# OpenFluke — Sovereign AI on Your Hardware (Golang AI Engine)

> OpenFluke builds Loom, a pure Golang AI engine (Apache 2.0, zero CGO) for CPU, GPU, and WebGPU on every OS; SoulGlitch, a private offline AI digital pet on Google Play; and Primecraft, a voxel simulation engine.

Canonical: https://openfluke.com/

---

Open-Source AI Infrastructure Lab
Sovereign AI.
On Your Hardware.
OpenFluke builds foundational tools so intelligence can run locally—private, portable, and free of cloud lock-in.
Loom (v0.79 Bedrock) is the M-POLY-VTD engine: 3D volumetric networks, 21 numeric types, and welvet bindings on every major OS.
SoulGlitch shows what that feels like: a living AI companion that never phones home.
Explore Loom
Why Loom?
Loom docs
SoulGlitch on Play
SoulGlitch for Linux
Source on GitHub
Who We Are
An open-source AI infrastructure lab
Most AI today lives in someone else's data center. OpenFluke is an independent R&D lab building
the opposite: edge-native, privacy-first infrastructure that puts training and inference on the devices people already own.
We ship real software—not slide decks. Loom is the engine. SoulGlitch is the proof.
Everything is open source so developers, researchers, and hobbyists can inspect, extend, and ship without permission.
100% offline capable
Loom Apache 2.0
Phone to server
Your data stays yours
Why Loom vs the industry
No cloud required
Run models on laptops, phones, and browsers. No API keys, no subscriptions, no upload pipeline.
One engine, many surfaces
Drop Loom into Python, JavaScript, Go, Flutter, or WASM—the same weights, the same behavior.
Built to be felt
SoulGlitch turns abstract ML into something expressive: chat, train emotions, and evolve personalities on-device.
The Engine
What is Loom?
Loom is our Apache 2.0 M-POLY-VTD engine—CPU and GPU capable, OS-agnostic,
and designed like SQLite for neural networks: one library you embed, not a cloud you rent.
v0.79 hardens the CPU bedrock: volumetric train → native save → reload → infer (Lucy seven-layer suite),
MHA decode, and C-ABI 461/461, on top of BitNet, asm Dense forward, step mesh, and welvet on every OS.
The Loom runtime — runs everywhere
Silicon & acceleration
x86_64, ARM64, ARMv7
CPU inference & training
GPU via WebGPU / native paths
WebAssembly in the browser
Operating systems
Windows
macOS & iOS
Linux & Android
Node.js, Bun, browsers
📦 Drop-in portability
Prebuilt native libraries (.dll, .so, .dylib) install beside your app. Train once, ship everywhere.
💾 Native precision on disk
21 DTypes from Float64 to 1-bit binary—checkpoints store packed weights per layer, not FP32-only JSON. BitNet and Qwen3 load from Hugging Face via Lucy.
🎯 Bit-exact reproducibility
Deterministic execution across CPU, GPU, and language bindings—same inputs, same outputs, every time.
🧬 Biological learning
3D volumetric networks with target propagation—layers learn locally without classic backprop lock-in.
Deterministic AI on any CPU or GPU
Loom is built as a Deterministic Neural Virtual Machine (DNVM): the same model weights, prompts, and settings
produce bit-identical behaviour whether you run on Apple Silicon, x86, WebGPU, or inside SoulGlitch via
Lucy and the welvet C-ABI.
Apache 2.0 License
Why Loom?
Loom overview
Read docs
Hugging Face · local inference
Models Lucy & SoulGlitch support
Download checkpoints into your local Hugging Face hub cache. Lucy (CLI) and SoulGlitch share the same
approved model list—load safetensors, run chat offline through Loom/welvet.
| Model family | Hugging Face repo | Typical use |
| --- | --- | --- |
| SmolLM2 Lite | HuggingFaceTB/SmolLM2-135M-Instruct | Phones · fast reactions · default SoulGlitch brain |
| SmolLM2 Balanced | HuggingFaceTB/SmolLM2-360M-Instruct | Desktop · everyday chat |
| SmolLM2 Deep | HuggingFaceTB/SmolLM2-1.7B-Instruct | Stronger private hardware · deeper replies |
| BitNet b1.58 | microsoft/bitnet-b1.58-2B-4T | Low-bit ternary weights · Loom v0.78+ infer · v0.79 native save/reload |
| Qwen3 Lite | Qwen/Qwen3-0.6B | GPU-friendly · strong quality per GB |
| Qwen3 Balanced | Qwen/Qwen3-1.7B | Desktop · sharded safetensors |
| Qwen3 Heavy | Qwen/Qwen3-4B | Large GPU / patient downloads |
Custom volumetric networks (XOR, NEAT, DNA splice) are built in Loom/poly—no HF download required.
What We Build
The OpenFluke ecosystem
Infrastructure, a flagship app, and community tools—one vision of local, sovereign AI.
Loom
v0.79 — Bedrock
M-POLY-VTD engine: 3D grids, CPU train/save/reload bedrock (v0.79), BitNet on CPU, transformers via WebGPU, NEAT/DNA evolution.
Apache 2.0 — train once, ship in Python, JS, Go, Flutter, or the browser.
Why Loom?
Documentation
Loom overview
SoulGlitch
Android & Linux
A private, on-device AI companion—on Android and Linux (x86_64). Reactive glitch face, swarm chat, emotion training, all offline.
iOS, macOS, and Windows coming soon.
Get on Google Play
Linux download
Scene Gallery
Live
Screenshots and voxel scenes from SoulGlitch and the Primecraft engine—see the worlds we're building.
Open gallery
For developers
Get started in minutes
Loom is a self-contained C-ABI library (welvet) you embed in any stack.
v0.79 validates CPU train/save/reload and transformer decode; rebuild welvet natives after upgrade. BitNet, Donate Compute, and TANHI from v0.78 still ship.
One model file, identical results on Windows, Linux, macOS, iOS, Android, and WASM.
Install via pip install welvet, npm install @openfluke/welvet, or embed natives in Flutter, as SoulGlitch does.
Full reference: openfluke.com/docs.
Python
JavaScript
Go
iOS / Android
WebAssembly
WebGPU
Loom documentation
Product page
Python

    # Install (shell)
    pip install welvet

    # XOR in 10 lines
    from welvet import Network, train

    net = Network({
        "id": "xor", "depth": 1, "rows": 1, "cols": 1,
        "layers_per_cell": 2,
        "layers": [
            {"l": 0, "type": "dense", "input_height": 2,
             "output_height": 8, "activation": "relu"},
            {"l": 1, "type": "dense", "input_height": 8,
             "output_height": 1, "activation": "sigmoid"}
        ]
    })
    losses = train(net,
                   [[[0, 0], [0, 1], [1, 0], [1, 1]]],
                   [[[0], [1], [1], [0]]],
                   epochs=100, learning_rate=0.1)
    print(f"Final loss: {losses[-1]:.4f}")

Node.js

    # Install (shell)
    npm install @openfluke/welvet

    // Usage
    const { Network } = require('@openfluke/welvet');
    const net = new Network({
      id: "demo", depth: 1, rows: 1, cols: 1,
      layers_per_cell: 1,
      layers: [{ l: 0, type: "dense",
                 input_height: 4, output_height: 2 }]
    });
    // Same model, same weights, identical output
    // whether running in Node or a browser.

Go

    // go get github.com/openfluke/loom/poly
    package main

    import (
    	"fmt"

    	"github.com/openfluke/loom/poly"
    )

    func main() {
    	net := poly.BuildNetwork(poly.Config{
    		ID: "demo", Depth: 1, Rows: 1, Cols: 1,
    		LayersPerCell: 1,
    		Layers: []poly.LayerDef{
    			{L: 0, Type: "dense",
    				InputHeight: 4, OutputHeight: 2},
    		},
    	})
    	state := net.NewState(poly.Float32)
    	state.SetInput([]float64{1, 0, 1, 0})
    	state.Step()
    	fmt.Println(state.Output(0))
    }
Deeper dive
How Loom differs architecturally
Independent analysis of Loom's 3D volumetric design, compression pipeline, and target-propagation learning—
for readers who want the technical story behind the marketing.
Architecture reference: docs overview · comparative analysis on the research page.
Technical research
Loom: 3D grids & target propagation
Comparative analysis vs PyTorch, JAX, and Go ML stacks—architecture, DNVM determinism, and edge deployment.
Read full analysis
🧊 Thinks in 3D
Signals move across a volumetric grid—not only through a rigid layer stack—closer to spatial brain topology than a factory line.
💾 Up to 98.4% compression
Bit-packed serialization from Float64 down to 1-bit binary—gigabyte-class models can shrink enough to run on a phone, offline.
🧬 Target propagation
Layers learn independently via localized target signals—more biologically plausible, and viable on non-differentiable low-bit models.
⚡ BitNet on device
v0.79 fixes native ternary checkpoints end-to-end—Lucy, SoulGlitch, and welvet infer BitNet-class models on packed CPU paths without a cloud API.
Available on Android
SoulGlitch — chat, train, evolve
SoulGlitch is what local AI feels like when it has a face. Ask a swarm, train emotions on your photos, and watch a
glitchy entity react—all powered by Loom on your phone with no cloud.
Download now on Google Play or Linux (x86_64). Coming soon: App Store (iOS), Mac App Store, and Microsoft Store.
Get on Google Play
Linux
Learn more
Google Play · now
Linux · now
iOS · soon
macOS · soon
Windows · soon
Loom × SoulGlitch — models run on your PC; TANHI streams execution live to your phone so you watch mixed layers and remote links in 3D, in real time.
SoulGlitch trailer — private AI companion on your hardware.
More open source
Built alongside the ecosystem
Utilities and experiments from the same lab.
Open source · NLP tool
TokenTrove
Find recurring text patterns across millions of documents—n-gram chains, file-level tracking, and parallel processing.
Shown here on 5,000+ FCC filings to surface common boilerplate. Built in Go + Fiber.
openfluke/tokentrove
Pattern mining at scale
Linked n-gram chains across thousands of files—not just word counts, but multi-sentence recurring structures.
Built for real corpora
Web UI, parallel processing, numeric filtering—legal docs, filings, research sets, plagiarism workflows.
Pure Go
Same stack as the rest of OpenFluke. Drop it on any server.
Join the project
Loom is Apache 2.0 and fully open source. Stars help others discover it;
issues and PRs shape what ships next.
Contribute on GitHub
Read the docs →
API reference →

---

Source: https://openfluke.com/about

# About Samuel Watson — Golang AI Engineer & OpenFluke Founder

> Samuel Watson is a Golang AI systems engineer and founder of OpenFluke. Programming since 2006, Master of Applied AI (Deakin). Building Loom (pure Go AI), Primecraft, and SoulGlitch.

Canonical: https://openfluke.com/about

---

Samuel Watson
AI Systems Engineer & Founder of OpenFluke — Melbourne, Australia
Programming since 2006
Master of Applied AI
AI Runtime Engineering
openfluke
planetbridging
LinkedIn
About Me
The short version
I've been programming since 2006 — starting with IT support and small automation scripts before working
my way through web development, data engineering, systems programming, and eventually AI runtime research.
Over the years I've written production code in Go, Python, Java, JavaScript, TypeScript, C#, VBA, R, PHP, and Shell.
In my spare time I build OpenFluke — a passion project I work on for fun.
It's centred around Loom, a portable AI engine that runs neural networks natively
across every major platform and language without vendor lock-in. Alongside it I'm building
Primecraft, a simulation engine, and SoulGlitch,
an AI creature evolution game powered by both. All of it built in my own time, just because I enjoy it.
I've designed and verified cross-language, cross-vendor AI runtimes — achieving bit-level determinism
across 7+ architectures including Apple M4, AMD Ryzen, Intel Arc, NVIDIA, and Qualcomm Adreno —
using WebGPU/Vulkan compute with unified C-ABI bindings for Go, Python, C#, C, and WebAssembly.
Languages & Technologies
Accumulated across ~20 years
Go
Python
JavaScript
TypeScript
C#
Java
C / C-ABI
HTML / CSS
SQL
VBA / Excel
R
PHP
Education
Formal qualifications
Master of Applied Artificial Intelligence
Deakin University — 2023 to 2025
AQF Level 9
ACS Accredited
Seoul Accord
Deep Learning
Reinforcement Learning
Computer Vision
Bachelor of Information Technology
Griffith University — 2019 to 2021
AQF Level 7
Systems Development
IT Project Management
Security Policy
Certifications
Industry credentials
Microsoft Certified: Azure AI Fundamentals
Microsoft
Current Projects
What I'm building at OpenFluke
Loom
A portable, cross-language AI engine that runs neural networks natively across Go, Python, C#, TypeScript, WebAssembly, and more — without vendor lock-in.
Learn more →
Primecraft
A distributed simulation engine with procedural world generation, physics, and embedded neural AI. Available on Android, Windows, Linux, and Steam.
Learn more →
SoulGlitch
An AI creature evolution game built on Primecraft and powered by Loom. Train neural networks through gameplay. In active development.
Learn more →
Portfolio Demos
Selected project video demonstrations
Flamekeeper · RAG / LLM
Local Offline ChatGPT-like System
Built from scratch: offline conversations with RAG-based AI, voice input, TTS, natural language recommendations, and local vector search. Dockerized microservices with React UI.
React · GoFiber · FastAPI · Docker · MongoDB · Ollama · Tacotron2
Bampro · MARL / Distributed AI
WebGPU-Agnostic AI Framework
End-to-end simulation of a WebGPU-agnostic AI framework designed for horizontal scaling in distributed environments. Multi-agent RL with evolutionary neural architecture selection and real-time dashboards.
Go · Fiber · Docker Compose · WebSockets · MARL · React
Deakin University · Team Lead
Game Dev — Neurodiversity & Accessibility
Served as team leader for a game development project at Deakin University, focusing on neurodiversity, accessibility, and innovative thinking in an inclusive educational platform.
Geolocation · OpenStreetMap
Australia Open-Source Lots Map
Integration of OpenStreetMap to display geolocation data for all open-source lots across Australia on an interactive web interface.
GitHub Portfolio
Selected open-source projects — github.com/planetbridging & github.com/openfluke
🔥
Flamekeeper
Multimodal RAG system: speech recognition, TTS, embeddings, and local LLM inference. Dockerized microservices with React UI for document ingestion and vector search.
React · GoFiber · FastAPI · Ollama · Tacotron2
🤖
Bampro
Multi-agent reinforcement learning experiments in a 3D simulation environment. Evolutionary neural architecture selection, real-time dashboards, low-spec cloud orchestration.
Go · Fiber · Docker · WebSockets · MARL · React
🌐
Biocraft
Isomorphic physics + AI sandbox running both natively and in browser. JSON-driven scene import/export, player-to-policy training, GPU-accelerated inference, multi-server monitoring.
Go · WebGPU · Jolt Physics · Three.js · WebAssembly
🕸️
3D Permission Dendrogram
Interactive 3D visualization of hierarchical permission trees — streamed live from a Go backend and rendered in React Three Fiber with WebGL.
Go · WebSockets · React Three Fiber · WebGL · Docker
🔐
CyberSentry Series
Cybersecurity dashboard suite for CVE/CPE lookups and vulnerability enumeration. Real-time APIs with caching layers across MySQL and MongoDB.
TypeScript · React · Bun · Docker · MySQL · MongoDB
🌉
Bridgeware
Real-time microservice framework for secure CPE lookup and encrypted client-server messaging with a React dashboard.
Node.js · React · Express · Socket.IO · Docker · Chakra UI
🎵
Audio Labeling Pipeline
Full-stack ML pipeline for audio data: hierarchical labeling, spectrogram generation, neural architecture search, model training, and secure auth.
React · Node.js · Flask · TensorFlow · MongoDB · Docker
🚁
DJI Tello Autonomous Flight
Computer vision-guided robotics: CNNs trained to recognize individuals and trigger autonomous drone flight sequences via the DJI Tello SDK.
Python · TensorFlow · OpenCV · Keras · ffmpeg
🐾
Paws
Network packet capture and analysis tool in Go — goroutine-based sniffing, REST endpoints, and a responsive web dashboard for traffic inspection.
Go · gopacket · pcap · Bootstrap
🔍
TokenTrove
N-gram chain discovery across millions of documents. Parallel processing, file-level pattern tracking, web UI with real-time stats. Tested on 5,000+ FCC legal filings.
GitHub →
Additional Projects
🏈 AFL Match Prediction (Flask · TensorFlow · Pandas)
📊 Steam vs Android Trends Dashboard (Python · Chart.js)
⚙️ Ansible VMware/vSphere Examples
🗄️ CSV-to-SQL Converter (C#/.NET WPF)
🧟 Zombie Apocalypse Simulation (Node.js · Socket.IO)

---

Source: https://openfluke.com/loom

# Loom — Golang AI Engine, Portable & Zero CGO (v0.79)

> Loom is a pure Golang AI engine: neural networks in Go with zero CGO on Windows, Linux, macOS, Android, iOS, and WASM. v0.79 Bedrock Validation: CPU train/save/reload, MHA decode, seven-layer Lucy suite, C-ABI 461/461. BitNet, WebGPU, 21 dtypes, welvet bindings.

Canonical: https://openfluke.com/loom

---

Open Source · github.com/openfluke/loom
The Universal
AI Engine
M-POLY-VTD — a ground-up neural engine in Go: 3D volumetric grids, 21 numeric types,
and polyglot bindings (welvet) for Python, TypeScript, Dart, and WASM.
Train once, run with bit-identical results on CPU, WebGPU, and every major OS.
v0.79.0 — Bedrock
Seven-layer CPU suite
21 DTypes · DNVM
Read the docs
GitHub
Releases
v0.79.0
Bedrock validation · 111/142 checklist
v0.79 — trustworthy CPU train → save → reload → infer
Lucy [7] — 10 layer types × 21 dtypes × 1³/2³/3³ grids · SC/MC · train · native save/reload
MHA layout + KV — [B,S,D] training · autoregressive decode · Poly Talk fixed
Native persistence — BitNet ternary + signed low-bit round-trip · LoomSyncInferenceWeights
Welvet C-ABI — 461/461 export parity (rebuild libwelvet after upgrade)
Still from v0.78: Dense asm forward · BitNet CPU · WebGPU · Donate Compute · TANHI
AI Deep Research
Independent AI Analysis of Loom
Comparative research on M-POLY-VTD vs PyTorch and JAX — plus the full engine reference on this site,
synced from loom/docs.
Architecture & research
3D grids, target propagation, DNVM
Start with the overview: volumetric dispatch, WeightStore morphing, step mesh, transformers, and v0.79 bedrock validation.
Why Loom?
All Loom docs
Research write-up
🧊
AI that thinks in 3D
Most AI frameworks process data in a straight line, like an assembly line. Loom uses a
three-dimensional grid — more like how your brain's neurons actually connect, jumping across regions rather than always going layer by layer.
💾
Fits AI on a USB stick
Loom can compress AI models by up to 98.4%. A model that normally takes gigabytes of storage
can shrink to a fraction — small enough to run on a phone or an old laptop with no internet required.
🧬
Learns like biology, not math
Traditional AI learning requires freezing everything to calculate one massive equation.
Loom's Target Propagation lets each part of the network learn independently — more like
how neurons fire and strengthen in a real brain.
Read the Full Technical Breakdown
For Non-Technical People
What is Loom, exactly?
"Think of Loom like SQLite — but for AI."
SQLite is a tiny database that runs inside your app with no server needed.
Loom is the same idea for neural networks: a self-contained engine you can
drop into any project, on any device, with no cloud account, no GPU server,
no complicated setup.
🧠
Train it like a brain
A neural network learns by seeing examples — like showing a child thousands of
pictures of cats until they know what a cat is. Loom provides all the tools
to build and teach these networks.
📦
Pack it anywhere
Once trained, your model is a tiny file. Drop it into your Python script,
your phone app, your website, or a game engine. Loom runs it everywhere
with the exact same output.
🔒
No cloud needed
Unlike ChatGPT or other AI services, Loom runs 100% locally on your device.
Your data never leaves your machine. Perfect for privacy-sensitive apps
or offline use.
⚡
WebGPU acceleration
On supported devices, Loom uses your GPU through WebGPU — achieving
17× to 65× faster training than CPU. Works in browsers too.
🌍
Every language
Python developer? pip install welvet. JavaScript? npm install @openfluke/welvet.
Go, C, C#, Rust? There are bindings for all of them. One model, every language.
🎯
Deterministic on CPU & GPU
Loom's Deterministic Neural Virtual Machine (DNVM) delivers bit-identical behaviour across
Apple Silicon, x86, WebGPU, and language bindings. Lucy and SoulGlitch depend on this for reproducible local inference.
🧬
Evolution built in
Loom includes a full NEAT evolution engine — models can mutate and breed
like living organisms. This powers SoulGlitch's creature evolution system.
Lucy & SoulGlitch
Supported Hugging Face models
Approved checkpoints share the same list in loom/lucy and SoulGlitch—download once, run offline via welvet.
SmolLM2
135M · 360M · 1.7B Instruct — mobile to server brains
Qwen3
0.6B · 1.7B · 4B — GPU-friendly chat models
BitNet b1.58
microsoft/bitnet-b1.58-2B-4T — packed ternary CPU path (v0.78+, native save/reload in v0.79)
Plus custom Loom/poly networks (training, NEAT, DNA) with no HF download.
Get Started
Install in 30 seconds
Pick your language and paste the command. No account required.
Python
$ pip install welvet
Ships with precompiled native libraries for Windows, Linux, macOS, iOS, and Android.
Zero Python dependencies.
PyPI page →
Node.js
$ npm install @openfluke/welvet
Works in Node.js and browsers via WebAssembly.
npm page →
Go
$ go get github.com/openfluke/loom/poly
Pure Go module. No CGO. Works with standard go build.
Quick reference → · Source →
WebAssembly
Download main.wasm from the releases page.
6.9 MB WASM bundle. Drop into any web page and run Loom in the browser.
All releases →
Platform Support
Runs everywhere
Prebuilt native libraries for every major platform — just download and go.
Windows
x86-64, ARM64
Linux
x86-64, ARM64, ARM v7, x86
macOS
x86-64, ARM64 (M-series), Universal
Android
ARM64, x86-64
iOS
ARM64, Simulator, XCFramework
WebAssembly
Browser + Node.js
WebGPU
Forward + Backward pass, 17×–65× speedup
PyPI
welvet — zero dependencies
For Developers
What's under the hood
Loom isn't just a wrapper around PyTorch. It's a ground-up engine built for portability and precision.
All major layer types
Dense, MHA, SwiGLU, RMSNorm, LayerNorm, CNN 1D/2D/3D, Transposed Conv, RNN, LSTM, Embedding, KMeans, Softmax, Parallel, Sequential, Residual.
21 numeric types
float64 all the way down to binary (1-bit), including fp8, fp4, int4, and ternary. Choose precision vs. model size at runtime.
NEAT evolution + DNA
A full neuroevolution engine with mutation, crossover, and fitness selection. Models have a "DNA" signature for reproducible evolution.
98.4% compression
Native bit-packed serialization shrinks model files by 98.4% compared to raw float storage. Plus SafeTensors support for HuggingFace compatibility.
Target propagation
An alternative to backpropagation where each layer is given a direct target. More biologically plausible and works for non-differentiable layers.
Step mesh engine
Clock-cycle 3D grid with double-buffered layers, spatial remote links, BPTT, and neural target propagation — online learning without a rigid layer stack.
BitNet & low-bit CPU
BitNet b1.58–style checkpoints with packed ternary linear layers. Lucy pulls from Hugging Face; welvet C-ABI exposes CPU inference paths.
Operation mesh
Donate Compute (LAN TCP model sharing), TANHI UDP layer telemetry for SoulGlitch HUD, tiled forward/backward, and Qwen3-family HF ingest.
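The 98.4% compression figure quoted in the cards above is consistent with simple bit-width arithmetic: going from 64 bits per weight to 1 bit per weight removes 1 - 1/64 ≈ 98.4% of the payload. The page does not say how the number was measured, so the Go sketch below is only one plausible reading. The helper names `savings` and `packBinary` are hypothetical and are not part of Loom; headers and per-layer metadata are ignored.

```go
package main

import "fmt"

// savings returns the fractional size reduction when each weight shrinks
// from srcBits to dstBits. Headers and per-layer metadata are ignored.
func savings(srcBits, dstBits int) float64 {
	return 1 - float64(dstBits)/float64(srcBits)
}

// packBinary stores only the sign of each weight (1 = non-negative),
// eight weights per byte, least significant bit first.
// This is an illustrative layout, not Loom's on-disk format.
func packBinary(weights []float64) []byte {
	out := make([]byte, (len(weights)+7)/8)
	for i, w := range weights {
		if w >= 0 {
			out[i/8] |= 1 << uint(i%8)
		}
	}
	return out
}

func main() {
	fmt.Printf("float64 -> 1-bit: %.1f%% smaller\n", 100*savings(64, 1))
	fmt.Printf("float32 -> 1-bit: %.1f%% smaller\n", 100*savings(32, 1))

	packed := packBinary(make([]float64, 1024))
	fmt.Println(len(packed), "bytes for 1024 weights (8192 bytes as float64)")
}
```

Under the same arithmetic, a Float32 starting point would give roughly 96.9%, so the headline number appears to assume a Float64 baseline.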
Full documentation
Deployment guide
BitNet CPU
Watch It Work
See Loom In Action
Real demos — Loom models running in real time, live TANHI telemetry to SoulGlitch on your phone, benchmarks, and 3D visualization.
Loom × SoulGlitch · live
TANHI × Regional Mix — models on your PC, view on your phone
Watch Loom AI models run in real time on a regional_mix harness (Dense, MHA, SwiGLU, RNN, LSTM with remote links across 3D topologies).
Execution streams over UDP as TANHI telemetry into SoulGlitch on your local phone — a spatial, time-scrubbable trace instead of numbers in a terminal.
TANHI docs →
·
YouTube →
Performance Benchmark
Forget Llama.cpp: WebGPU Inference in Pure Go
SmolLM2-135M benchmarks: 68 tok/s on GTX 1650 Super, 143 tok/s on Linux i5, 229 tok/s on Mac M4.
Zero CGO. FlashPoly Tiling. Bit-level deterministic across OS boundaries.
Visualization
Loom: Visualizing 3D Neural Networks in Real-Time
Watch the AI "think" in real-time. Stepping mode, 3D grid topology, Zig-Zag and Starburst routing patterns — the black box, opened.
Android · Airplane Mode
Offline LLM Inference on Android via Loom AI
Loom v0.0.8 running 100% locally on Android — device locked in Airplane Mode throughout. Zero cloud dependency. Pure on-device compute from first principles in Go.
Open Source Tool
NeuralWave: 3D Neural Network Visualization & Weight Analysis
Real-time model discovery from HuggingFace, interactive 3D layer inspection, attention head visualization.
Built on Loom + Go backend + Three.js.
Star Loom on GitHub
Loom is free, open-source, and built in the open. Stars help others find it and fuel continued development.
Star openfluke/loom
Report an Issue

---

Source: https://openfluke.com/why-loom

# Why Loom? Golang AI vs PyTorch, llama.cpp & Cloud AI — OpenFluke

> Why choose Loom: pure Golang AI engine (Apache 2.0, zero CGO), offline DNVM, 21 dtypes, WebGPU, vs PyTorch, JAX, llama.cpp, GoMLX, and cloud chatbots. Open source engine + SoulGlitch proof.

Canonical: https://openfluke.com/why-loom

---

Golang AI · Edge · Open Source
Why Loom vs the rest of AI
Cloud chatbots rent intelligence. PyTorch rents a Python runtime. Loom is a
pure Go AI engine you embed (offline, deterministic, Apache 2.0), with a shipped app
(SoulGlitch) that proves it on real phones.
Interactive 3D
Loom overview
Documentation
GitHub
Skip to comparisons ↓
What you get
The OpenFluke stack
Not a single API but an open-source AI infrastructure lab: engine, bindings, docs, and products built on the same runtime.
Loom (Apache 2.0)
M-POLY-VTD engine: train + infer, 21 dtypes, WebGPU, BitNet CPU, C-ABI welvet, native release binaries.
Polyglot bindings
Python, TypeScript/npm, Go, Dart, C#, Java, WASM—one engine, same weights, embed like SQLite for neural nets.
SoulGlitch (product)
Offline AI companion on Google Play—swarm Q&A, emotion training, reactive face. Living proof of on-device Loom.
Primecraft + lab tools
Voxel simulation with embedded AI, scene gallery, Lucy CLI for local HF models—same sovereignty story.
Open source means Loom: source, license, and rebuildable natives on GitHub.
Releases ship prebuilt .so / .dylib / wheels so you don't have to compile Go—same pattern as PyTorch pip wheels or llama.cpp binaries.
SoulGlitch is a product on Google Play (app code not necessarily OSS).
Model weights come from Hugging Face under their own licenses.
Advantages
What Loom does differently
Compared to cloud AI, Python frameworks, LLM-only runners, and other Go ML libraries.
Sovereign & offline
No API keys. Prompts and training stay on your hardware—privacy by architecture, not policy PDFs.
Pure Go, zero CGO
Golang AI without a Python runtime or CUDA-only trap. Single-binary deployment story for edge and servers.
3D volumetric mesh
Networks as spatial grids, not only nn.Sequential. Native target propagation and step mesh learning.
21 dtypes + BitNet
Float64 down to 1-bit binary per layer. Native packed checkpoints with verified save/reload (v0.79). BitNet b1.58 on CPU since v0.78.
DNVM determinism
Bit-identical behaviour across CPU, WebGPU, and bindings—reproducible research and embedded systems.
WebGPU everywhere
Cross-vendor GPU: Windows, Linux, macOS, Android, browser—without shipping CUDA toolchains per platform.
DNA & NEAT built-in
Topological comparison of whole networks, evolution in-engine—not just weight checkpoint diffing.
Shipped proof
SoulGlitch on Play Store today. Not slides—a consumer app running local LLMs via Loom/welvet.
Vs the industry
Quick comparisons
Cloud AI (ChatGPT, etc.)
Them: Intelligence in their datacenter
Loom: Engine in your process
Them: No embeddable runtime
Loom: C-ABI for your app
PyTorch / JAX
Them: Python + huge CUDA stack
Loom: Go binary, edge-first
Them: 1D autograd DAG
Loom: 3D mesh + target propagation
llama.cpp / Ollama
Them: LLM inference focus
Loom: Train + small nets + NEAT + DNA
Them: GGUF decode excellence
Loom: Full engine for products
GoMLX / Born ML
Them: 1D stacks or OpenXLA/CGO
Loom: Zero CGO + WebGPU
Them: Narrower scope
Loom: DNVM, BitNet, DNA, shipped app
Feature matrix
Loom vs PyTorch & Go ML (summary)
| Capability | Loom | PyTorch / JAX | llama.cpp |
| --- | --- | --- | --- |
| Core language | Pure Go (golang AI) | Python + C++/CUDA | C/C++ |
| Offline / embed | First-class (C-ABI, WASM) | Possible, heavy | Inference-focused |
| Training + custom nets | 3D mesh, NEAT, DNA | Autograd ecosystem | Mostly inference |
| Quantization | 21 native dtypes + BitNet CPU | TorchAO add-ons | GGUF quants |
| GPU path | WebGPU (cross-platform) | CUDA / ROCm / TPU | CPU/GPU backends |
| Determinism (DNVM) | Bit-exact claim | Not guaranteed | Varies |
| Open source | Apache 2.0 engine + binaries | Framework OSS | OSS inference |
Deep dive: M-POLY-VTD architecture research ·
docs overview
Fit
When to choose Loom
Choose Loom if you need…
Offline AI inside your app (Flutter, Go, WASM)
A golang AI / Go ML stack without Python
Bit-exact, auditable local inference
BitNet or sub-byte models on CPU
3D / NEAT / DNA research in one engine
Apache 2.0 you can fork and ship
Use something else if you need…
Largest cloud models with zero setup (use hosted APIs)
Massive PyTorch ecosystem & HF fine-tune recipes day one
Fastest GGUF Llama on Mac CPU only (benchmark llama.cpp)
Enterprise MLOps (Kubeflow, etc.) out of the box
Ready to try the golang AI engine?
Star the repo, read the docs, or install SoulGlitch and run models offline today.
openfluke/loom
Deploy with welvet
SoulGlitch

---

Source: https://openfluke.com/loom/research

# Loom M-POLY-VTD — Golang AI Architecture Deep Research

> Technical analysis of Loom's pure Go AI stack (M-POLY-VTD): volumetric tensor dispatch, 21-type polymorphism, step mesh, neural target propagation, topological DNA vs PyTorch, JAX, Born ML, GoMLX.

Canonical: https://openfluke.com/loom/research

---

AI Deep Research · Technical Analysis
M-POLY-VTD: The Loom Architecture
An exhaustive technical analysis of the Loom framework — covering Volumetric Tensor Dispatch,
Multi-Numerical Polymorphism, Systolic Grid Propagation, Neural Target Propagation,
the Topological DNA Engine, and a rigorous comparison against PyTorch, JAX, and the Go ML ecosystem.
3D Volumetric Grid
21 Numeric Types
Neural Target Propagation
WebGPU Native
Pure Go · Zero CGO
AI-Generated Deep Research · Podcasts & PDFs
Three release-era briefings from the Loom lab. Listen in-browser or download the matching PDF report from files.openfluke.com.
v0.78 · Bedrock
Loom Poly AI Engine Research
Flagship M-POLY-VTD deep dive — volumetric dispatch, 21 dtypes, target propagation, and Go ML comparisons.
PDF
MP3
v0.76
Operation Mesh Shrinks Local AI
How Loom’s operation mesh and release trajectory tighten the local-AI deployment story on consumer hardware.
PDF
MP3
v0.75
Mac Mini Beats RTX 4090 with Loom
AI engine tiling update analysis — cache-aware dispatch and why Apple Silicon + Loom can outrun big discrete GPUs on the right workloads.
PDF
MP3
View Source
Back to Loom
Section 1
The Paradigm Shift: Volumetric Tensor Dispatch (VTD)
Traditional deep learning frameworks, including PyTorch and TensorFlow, construct neural networks as
directed acyclic graphs (DAGs) or sequential layer lists. While mathematically sound,
this one-dimensional abstraction creates rigid execution pipelines that struggle to implement complex,
biologically inspired routing.
The Loom architecture removes this constraint by introducing a
3D Volumetric Coordinate System. Every layer is assigned a geometric address (z, y, x, l)
within a pre-allocated spatial grid. A flattening algorithm maps these coordinates to contiguous 1D memory,
maintaining hardware cache locality despite the logical 3D abstraction.
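To make the flattening step concrete, here is a minimal Go sketch of a row-major mapping from (z, y, x, l) to a 1D offset and back. The text does not specify the ordering Loom uses, so the row-major layout is an assumption, and the `Grid` type with its `Flatten` and `Unflatten` methods is hypothetical rather than taken from the poly package.

```go
package main

import "fmt"

// Grid holds the dimensions of a hypothetical volumetric grid.
// The field names mirror the depth/rows/cols/layers_per_cell config keys.
type Grid struct {
	Depth, Rows, Cols, LayersPerCell int
}

// Flatten maps (z, y, x, l) to a row-major offset, so the layers of one
// cell are adjacent in memory and cells along x follow each other.
func (g Grid) Flatten(z, y, x, l int) int {
	return ((z*g.Rows+y)*g.Cols+x)*g.LayersPerCell + l
}

// Unflatten inverts Flatten by peeling off the fastest-varying index first.
func (g Grid) Unflatten(i int) (z, y, x, l int) {
	l = i % g.LayersPerCell
	i /= g.LayersPerCell
	x = i % g.Cols
	i /= g.Cols
	y = i % g.Rows
	z = i / g.Rows
	return
}

func main() {
	g := Grid{Depth: 2, Rows: 3, Cols: 4, LayersPerCell: 2}
	i := g.Flatten(1, 2, 3, 1)
	fmt.Println(i) // last slot of a 48-slot grid: 47
	fmt.Println(g.Unflatten(i))
}
```

With this ordering, iterating l fastest, then x, y, and z walks memory sequentially, which is the cache-locality property the paragraph describes.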
Spatial Hopping
In standard sequential models, data must flow strictly from layer N to layer N+1. In the Loom volumetric grid,
data signals can bypass adjacent layers and jump across geometric coordinates — mimicking
biological cortical columns. If a layer has an IsRemoteLink flag, the dispatcher fetches the remote
layer dynamically via TargetZ, TargetY, TargetX, TargetL and injects it into the local execution
path without graph recompilation.
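A rough sketch of how a dispatcher might follow such a link is shown here. Only the `IsRemoteLink` and `Target*` field names come from the text; the `Grid` map, the `Resolve` method, and the hop limit are hypothetical, and the source does not say how link cycles are handled.

```go
package main

import "fmt"

// Coord is a (z, y, x, l) grid address.
type Coord struct{ Z, Y, X, L int }

// Layer is a hypothetical stand-in for a grid layer. The IsRemoteLink and
// Target* field names follow the text; everything else is illustrative.
type Layer struct {
	Name         string
	IsRemoteLink bool
	TargetZ      int
	TargetY      int
	TargetX      int
	TargetL      int
}

// Grid stores layers by coordinate. A map keeps the sketch short; the text
// describes a flattened contiguous array instead.
type Grid map[Coord]*Layer

// Resolve follows remote links until it reaches a concrete layer. The hop
// limit guards against link cycles (an assumption of this sketch).
func (g Grid) Resolve(c Coord, maxHops int) (*Layer, error) {
	for hop := 0; hop <= maxHops; hop++ {
		layer, ok := g[c]
		if !ok {
			return nil, fmt.Errorf("no layer at %+v", c)
		}
		if !layer.IsRemoteLink {
			return layer, nil
		}
		c = Coord{layer.TargetZ, layer.TargetY, layer.TargetX, layer.TargetL}
	}
	return nil, fmt.Errorf("more than %d hops; possible link cycle", maxHops)
}

func main() {
	g := Grid{
		{0, 0, 0, 0}: {Name: "embed"},
		{0, 0, 1, 0}: {Name: "link", IsRemoteLink: true, TargetZ: 2, TargetY: 1},
		{2, 1, 0, 0}: {Name: "far-dense"},
	}
	layer, err := g.Resolve(Coord{0, 0, 1, 0}, 4)
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	fmt.Println("dispatching to", layer.Name)
}
```

The lookup happens at dispatch time, so retargeting a link only means changing four integers; nothing resembling a graph has to be rebuilt.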
Dynamic Branching via Polymorphic Routing: The LayerParallel and
LayerSequential container types aggregate sub-branches within the coordinate space.
When ParallelForwardPolymorphic executes, the dispatcher routes input to multiple
coordinate-mapped branches simultaneously, then merges using configurable topological modes:
🔗
concat
Standard tensor concatenation across parallel branches.
➕
add
Residual aggregation — sum branch outputs for skip connections.
〰️
avg
Ensemble smoothing via averaged output tensors.
🔀
grid_scatter
Spatial distribution of tensors across the volumetric grid.
🎛️
filter (MoE)
Mixture-of-Experts gating: a FilterGateConfig layer generates Softmax coefficients to compute a dynamically weighted sum.
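The merge modes listed above can be sketched on flat slices. This is an illustrative reduction, not Loom's implementation: the `Merge` and `Softmax` functions are hypothetical, the gate logits are passed in directly rather than produced by a FilterGateConfig layer, and grid_scatter is left out because the text does not define its semantics.

```go
package main

import (
	"fmt"
	"math"
)

// Softmax turns gate logits into weights that sum to 1.
func Softmax(logits []float32) []float32 {
	maxv := logits[0]
	for _, v := range logits {
		if v > maxv {
			maxv = v
		}
	}
	exps := make([]float64, len(logits))
	var sum float64
	for i, v := range logits {
		exps[i] = math.Exp(float64(v - maxv)) // subtract max for stability
		sum += exps[i]
	}
	out := make([]float32, len(logits))
	for i := range exps {
		out[i] = float32(exps[i] / sum)
	}
	return out
}

// Merge combines branch outputs. Mode names follow the text; the function
// itself is hypothetical. gate is only read by the "filter" mode.
func Merge(mode string, branches [][]float32, gate []float32) ([]float32, error) {
	if len(branches) == 0 {
		return nil, fmt.Errorf("no branches")
	}
	if mode == "concat" {
		var out []float32
		for _, b := range branches {
			out = append(out, b...)
		}
		return out, nil
	}
	// The remaining modes are weighted sums over equally sized branches.
	n := len(branches[0])
	for _, b := range branches {
		if len(b) != n {
			return nil, fmt.Errorf("branch sizes differ")
		}
	}
	weights := make([]float32, len(branches))
	switch mode {
	case "add":
		for i := range weights {
			weights[i] = 1
		}
	case "avg":
		for i := range weights {
			weights[i] = 1 / float32(len(branches))
		}
	case "filter":
		if len(gate) != len(branches) {
			return nil, fmt.Errorf("need one gate logit per branch")
		}
		weights = Softmax(gate)
	default:
		return nil, fmt.Errorf("unsupported mode %q", mode)
	}
	out := make([]float32, n)
	for i, b := range branches {
		for j, v := range b {
			out[j] += weights[i] * v
		}
	}
	return out, nil
}

func main() {
	branches := [][]float32{{1, 2}, {3, 4}}
	for _, mode := range []string{"concat", "add", "avg", "filter"} {
		out, err := Merge(mode, branches, []float32{2, 0})
		fmt.Println(mode, out, err)
	}
}
```

Seen this way, add, avg, and filter differ only in how the per-branch weights are chosen, which is why a learned gate can slot in as one more merge mode.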
Section 2
Multi-Numerical Polymorphism (M-POLY)
A critical bottleneck in edge-device inference is memory bandwidth — streaming weight matrices from global
VRAM to compute units. The Loom engine addresses this through native multi-numerical polymorphism.
Unlike standard frameworks that require exporting to a fixed lower precision, Loom layers operate as
fluid polymorphic units.
The WeightStore struct maintains a master Float32 representation as the absolute
source of truth, alongside a localized cache of actively morphed target precisions keyed by DType.
Loom supports 21 distinct numerical types:
Float64
Float32
BFloat16
Float16
FP8 E4M3
FP8 E5M2
FP4
Int64
Int32
Int16
Int8
Int4
Int2
UInt8
UInt4
UInt2
Ternary
Binary (1-bit)
NF4
E2M1
E3M0
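The master-plus-cache arrangement described for WeightStore might look roughly like the sketch below. Only the names `WeightStore` and `DType` come from the text; the fields, the `Morph` and `Update` methods, and the use of a caller-supplied conversion function are assumptions made for illustration, and only three of the 21 types are declared.

```go
package main

import "fmt"

// DType tags a numeric format. Only three of the 21 types are listed here.
type DType int

const (
	Float32 DType = iota
	Int8
	Binary
)

// WeightStore keeps Float32 master weights plus lazily built versions per
// DType. The fields and methods are hypothetical.
type WeightStore struct {
	Master   []float32
	versions map[DType][]float32
}

// Morph returns the weights as seen at the requested precision, building and
// caching that version on first use.
func (w *WeightStore) Morph(dt DType, convert func(float32) float32) []float32 {
	if v, ok := w.versions[dt]; ok {
		return v
	}
	v := make([]float32, len(w.Master))
	for i, m := range w.Master {
		v[i] = convert(m)
	}
	if w.versions == nil {
		w.versions = make(map[DType][]float32)
	}
	w.versions[dt] = v
	return v
}

// Update writes to the master copy and drops every cached version, so the
// master stays the single source of truth.
func (w *WeightStore) Update(i int, value float32) {
	w.Master[i] = value
	w.versions = nil
}

func main() {
	sign := func(v float32) float32 {
		if v < 0 {
			return -1
		}
		return 1
	}
	ws := &WeightStore{Master: []float32{0.7, -0.2, 0.1}}
	fmt.Println(ws.Morph(Binary, sign))
	ws.Update(1, 0.4)
	fmt.Println(ws.Morph(Binary, sign))
}
```

Invalidating the whole cache on every update is the simplest correct policy; a real engine would more likely refresh only the precisions that are in active use.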
Hardware Emulation via SimulatePrecision: For extreme low-bit types lacking native CPU/GPU register
support (FP4, 2-bit quantization), Loom employs a universal fallback that mathematically forces the Float32 master
weight to behave exactly as its lower-bit counterpart — simulating exponent/mantissa bounds for FP8E4M3,
restricting to four discrete scaling levels for Int2, and clamping to ±1 for Binary.
This enables Quantization-Aware Training (QAT) without complex fake-quantization node
injections (as required by PyTorch). Different spatial coordinates can operate at different precisions
simultaneously — a reasoning node in Float16 while an embedding lookup runs in 2-bit.
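To make the idea of forcing a Float32 value to behave like a low-bit one concrete, here is a small sketch of two such projections. The function names are hypothetical, and the Int2 level set (the two's-complement range -2..1 times a scale) is an assumption; the text only says Int2 is restricted to four discrete levels and Binary is clamped to ±1.

```go
package main

import (
	"fmt"
	"math"
)

// SimulateBinary maps a weight to -1 or +1, as the text describes for Binary.
func SimulateBinary(w float32) float32 {
	if w < 0 {
		return -1
	}
	return 1
}

// SimulateInt2 snaps a weight to one of four levels: {-2, -1, 0, 1} * scale.
// The exact level set is an assumption. scale must be positive.
func SimulateInt2(w, scale float32) float32 {
	q := math.Round(float64(w / scale))
	q = math.Max(-2, math.Min(1, q))
	return float32(q) * scale
}

func main() {
	for _, w := range []float32{0.9, 0.1, -0.3, -0.9} {
		fmt.Println(w, SimulateBinary(w), SimulateInt2(w, 0.5))
	}
}
```

In a QAT loop built this way, the forward pass would read the simulated value while the update is applied to the Float32 master, so small updates can accumulate until the weight crosses into the next level.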
98.4% On-Disk Compression
By packing low-bit representations, the Loom architecture achieves up to 98.4% on-disk compression
for localized model deployment. Smaller weights also mean fewer bytes streamed per token, which eases
the 192 GB/s memory-bandwidth ceiling that limits traditional inference on consumer graphics cards
such as Turing-class GPUs.
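The arithmetic behind a figure of this size is easy to check with 1-bit packing. The sketch below is illustrative only: the source does not state which format or baseline produces 98.4%, and `PackSigns`, `UnpackSign`, and `Saving` are hypothetical helpers. One bit per weight saves 1 − 1/32 = 96.875% against Float32 and 1 − 1/64 = 98.4375% against Float64.

```go
package main

import "fmt"

// PackSigns stores one bit per weight: 1 for non-negative, 0 for negative.
func PackSigns(weights []float32) []byte {
	packed := make([]byte, (len(weights)+7)/8)
	for i, w := range weights {
		if w >= 0 {
			packed[i/8] |= 1 << uint(i%8)
		}
	}
	return packed
}

// UnpackSign reads weight i back as -1 or +1.
func UnpackSign(packed []byte, i int) float32 {
	if packed[i/8]&(1<<uint(i%8)) != 0 {
		return 1
	}
	return -1
}

// Saving returns the fraction of storage saved by packedBytes relative to
// n weights stored at baselineBits each.
func Saving(n, packedBytes, baselineBits int) float64 {
	return 1 - float64(packedBytes*8)/float64(n*baselineBits)
}

func main() {
	weights := make([]float32, 1024)
	packed := PackSigns(weights)
	fmt.Printf("vs Float32: %.4f%%\n", 100*Saving(len(weights), len(packed), 32))
	fmt.Printf("vs Float64: %.4f%%\n", 100*Saving(len(weights), len(packed), 64))
}
```

Any real file also carries scales, metadata, and layers kept at higher precision, so a whole-model figure depends on the mix of types rather than on one ratio.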
Section 3
Systolic Grid Propagation: The Discrete-Time Neural Mesh
Standard deep learning inference operates in a continuously flowing waterfall pattern — layer 1 finishes,
passes memory to layer 2, and so on. Loom introduces Systolic Grid Propagation, modelled
after the hardware systolic arrays used in Google's TPUs.
Under this model, the 3D Volumetric Grid is a discrete-time neural mesh. The
SystolicForward function advances the entire 3D grid by a single temporal "tick" — every
coordinate calculates its output simultaneously based solely on input states from the previous tick.
🔁
Double Buffering
The network maintains ReadBuffer and WriteBuffer per tensor state. During dispatch,
every layer reads from ReadBuffer and writes results exclusively to WriteBuffer. CommitSystolicState
then atomically swaps buffers — preventing race conditions in concurrent environments.
⏱️
Temporal Pattern Learning
Information takes time to propagate geometrically across the network. This fundamentally alters how
sequence data is processed — enabling true temporal learning that standard feedforward networks cannot achieve.
🔀
Asynchronous Layers
Because layers operate asynchronously relative to continuous data flow, the systolic mesh supports
online learning patterns that are impossible in standard sequential epoch-based training.
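The double-buffered tick described above can be shown on a toy mesh. In this sketch a 1D chain of cells stands in for the 3D grid and each cell simply copies its left neighbour's previous state; the `Mesh` type and that update rule are assumptions, while the `ReadBuffer`, `WriteBuffer`, `SystolicForward`, and `CommitSystolicState` names follow the text. The swap here is single-threaded, so it says nothing about how Loom makes the commit atomic.

```go
package main

import "fmt"

// Mesh is a hypothetical 1D chain of cells standing in for the 3D grid.
type Mesh struct {
	ReadBuffer  []float32
	WriteBuffer []float32
}

// NewMesh allocates both buffers once, up front.
func NewMesh(n int) *Mesh {
	return &Mesh{ReadBuffer: make([]float32, n), WriteBuffer: make([]float32, n)}
}

// SystolicForward computes one tick. Every cell reads only the previous
// tick's state (ReadBuffer) and writes only to WriteBuffer.
func (m *Mesh) SystolicForward(input float32) {
	m.WriteBuffer[0] = input
	for i := 1; i < len(m.ReadBuffer); i++ {
		m.WriteBuffer[i] = m.ReadBuffer[i-1]
	}
}

// CommitSystolicState swaps the buffers so the next tick reads fresh state.
func (m *Mesh) CommitSystolicState() {
	m.ReadBuffer, m.WriteBuffer = m.WriteBuffer, m.ReadBuffer
}

func main() {
	m := NewMesh(4)
	for tick, in := range []float32{1, 0, 0, 0} {
		m.SystolicForward(in)
		m.CommitSystolicState()
		fmt.Println("tick", tick+1, m.ReadBuffer)
	}
}
```

A pulse injected at the first tick moves exactly one cell per tick. With a single in-place buffer the same loop would smear it across the whole chain in one pass, which is the ordering dependence the two buffers remove.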
Section 4
Neural Target Propagation (TargetProp)
Backpropagation is widely criticized for its biological implausibility: it requires global error computation,
exact weight symmetry, and freezing the forward activity while gradients are sequentially
calculated backward through the chain rule. Loom implements an advanced alternative:
Neural Target Propagation.
Instead of computing continuous derivatives, TargetProp computes a proposed "target" state for each
hidden layer. Each layer's objective is no longer to minimize the global loss via partial derivatives, but
simply to map its forward activation to the proposed backward target.
How TargetProp Works in Loom
During the forward pass, actual activations are captured in ForwardActs. During optimization,
CalculateTargetPropGaps executes an inverse estimation: for Dense layers, estimated targets are
generated via weighted importance of downstream targets relative to master weights. For LSTM layers, the engine
aggregates backward through input, forget, cell, and output gates simultaneously, creating a synthesized
target for the previous recurrent time step.
Gap-Based Hebbian Optimization: Once targets are generated, ApplyTargetPropGaps
applies a local Hebbian-style learning rule. The weight update follows:
ΔW = η · input · (target − actual)
Loom introduces an advanced stability mechanism via LinkBudget — dynamically calculated
from the cosine similarity between the forward activation vector and the backward target vector.
If the target signal is highly misaligned (cosine similarity below 0.2), the layer simply
ignores the update. This prevents catastrophic forgetting and exploding signals.
Crucially, because TargetProp does not require differentiable functions, Loom can natively optimize
extreme architectures like binary (1-bit) or ternary networks where standard gradients
would vanish or shatter.
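The update rule and the alignment check can be sketched for a single dense layer. This is a simplified reading of the text, not Loom's code: `ApplyGap` and `CosineSimilarity` are hypothetical names, and LinkBudget is modelled as a hard 0.2 cut-off on cosine similarity, whereas the source describes it only as being calculated from that similarity.

```go
package main

import (
	"fmt"
	"math"
)

// CosineSimilarity returns a·b / (|a||b|), or 0 if either vector is zero.
func CosineSimilarity(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / math.Sqrt(na*nb)
}

// ApplyGap performs weights[o][i] += lr * input[i] * (target[o] - actual[o]).
// It returns false and leaves the weights untouched when the target is
// poorly aligned with the activation (cosine similarity below minAlign).
func ApplyGap(weights [][]float32, input, actual, target []float32, lr float32, minAlign float64) bool {
	if CosineSimilarity(actual, target) < minAlign {
		return false
	}
	for o := range weights {
		gap := target[o] - actual[o]
		for i := range weights[o] {
			weights[o][i] += lr * input[i] * gap
		}
	}
	return true
}

func main() {
	weights := [][]float32{{0, 0}, {0, 0}}
	input := []float32{1, 2}
	actual := []float32{1, 0}

	fmt.Println(ApplyGap(weights, input, actual, []float32{1, 0.5}, 0.5, 0.2), weights)
	fmt.Println(ApplyGap(weights, input, actual, []float32{-1, 0}, 0.5, 0.2), weights)
}
```

Each update uses only the layer's own input, activation, and target, so no derivative of the activation function appears anywhere, which is the property the text relies on for binary and ternary weights.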
Section 5
The Topological DNA Engine
Because layers can dynamically hop across a 3D coordinate space and shift their numerical precision,
traditional cryptographic hashing or PyTorch state-dict comparisons would instantly register a complete
mismatch even when underlying logic is intact. Loom integrates a native DNA Engine
based on principles from Topological Data Analysis (TDA).
ExtractDNA converts every layer into a LayerSignature capturing spatial coordinates,
layer type, DType, and a dimensionally normalized weight representation. The
SimulatePrecision function expands all active WeightStore versions back to unified Float32 before
unit vector normalization — ensuring the geometric "direction" of weights is captured independently of bit-depth magnitude.
Logic Shift Detection
CompareNetworks identifies Logic Shifts — when a layer signature in Model A aligns
with high cosine similarity (>0.8) to a layer in Model B, but at a different spatial coordinate.
This allows researchers to observe how architectural search algorithms or systolic propagation patterns
naturally migrate logic pathways to more efficient regions of the 3D grid over time.
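A minimal sketch of the comparison: weights are reduced to unit vectors, and a Logic Shift is reported when two layers point the same way but live at different coordinates. `LayerSignature` is the name used in the text; its fields, and the `NewSignature`, `Cosine`, and `FindLogicShifts` helpers, are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// Coord is a (z, y, x, l) grid address.
type Coord struct{ Z, Y, X, L int }

// LayerSignature follows the name used in the text; its fields are assumed.
type LayerSignature struct {
	At   Coord
	Kind string
	Unit []float64 // weights rescaled to unit L2 norm
}

// NewSignature normalizes the weights so only their direction is kept.
func NewSignature(at Coord, kind string, weights []float32) LayerSignature {
	var sumSq float64
	for _, w := range weights {
		sumSq += float64(w) * float64(w)
	}
	unit := make([]float64, len(weights))
	if norm := math.Sqrt(sumSq); norm > 0 {
		for i, w := range weights {
			unit[i] = float64(w) / norm
		}
	}
	return LayerSignature{At: at, Kind: kind, Unit: unit}
}

// Cosine is the dot product of two unit vectors; mismatched layers score 0.
func Cosine(a, b LayerSignature) float64 {
	if a.Kind != b.Kind || len(a.Unit) != len(b.Unit) {
		return 0
	}
	var dot float64
	for i := range a.Unit {
		dot += a.Unit[i] * b.Unit[i]
	}
	return dot
}

// LogicShift records a layer whose logic reappears at another coordinate.
type LogicShift struct {
	From, To   Coord
	Similarity float64
}

// FindLogicShifts pairs layers of model A and model B that are similar
// (above threshold) but sit at different coordinates.
func FindLogicShifts(a, b []LayerSignature, threshold float64) []LogicShift {
	var shifts []LogicShift
	for _, la := range a {
		for _, lb := range b {
			if la.At == lb.At {
				continue
			}
			if s := Cosine(la, lb); s > threshold {
				shifts = append(shifts, LogicShift{From: la.At, To: lb.At, Similarity: s})
			}
		}
	}
	return shifts
}

func main() {
	a := []LayerSignature{NewSignature(Coord{0, 0, 0, 0}, "dense", []float32{1, 2, 3})}
	// Same direction, different magnitude, different coordinate.
	b := []LayerSignature{NewSignature(Coord{1, 0, 0, 0}, "dense", []float32{2, 4, 6})}
	fmt.Println(FindLogicShifts(a, b, 0.8))
}
```

Scaling every weight by two leaves the signature unchanged, which mirrors the text's point that direction is captured independently of bit-depth magnitude.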
Section 6
Native WebGPU Acceleration & Hardware-Aware Tiling
Loom achieves 70+ tokens/second on consumer hardware through low-level optimization. The
hardware.go module queries the operating system (sysctl on Darwin,
/sys/devices/system/cpu/cpu0/cache/ on Linux) to determine exact L1/L2 cache byte sizes.
Dynamic L1/L2 Cache Tiling: CalculateOptimalTileSize restricts matrix multiplication
blocks so that the entire sub-block remains resident in L1 cache — significantly reducing global memory fetch
latency. This delivers major speedups for operations like swigluTiledProjectGateUp.
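Cache-aware tiling of this kind can be sketched with a blocked matrix multiply. The sizing rule used here (three square float32 tiles must fit in L1) is a common textbook heuristic and an assumption; the source does not give the formula behind CalculateOptimalTileSize, and `TileSize` and `MatMulTiled` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// TileSize picks the largest tile edge T such that three T x T float32
// tiles (one each from A, B, and C) fit in l1Bytes. Assumed heuristic.
func TileSize(l1Bytes int) int {
	t := int(math.Sqrt(float64(l1Bytes) / (3 * 4)))
	if t < 1 {
		t = 1
	}
	return t
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// MatMulTiled computes C = A x B for row-major n x n matrices, one tile at
// a time so the working set of the inner loops stays small.
func MatMulTiled(a, b []float32, n, tile int) []float32 {
	c := make([]float32, n*n)
	for i0 := 0; i0 < n; i0 += tile {
		for k0 := 0; k0 < n; k0 += tile {
			for j0 := 0; j0 < n; j0 += tile {
				for i := i0; i < minInt(i0+tile, n); i++ {
					for k := k0; k < minInt(k0+tile, n); k++ {
						aik := a[i*n+k]
						for j := j0; j < minInt(j0+tile, n); j++ {
							c[i*n+j] += aik * b[k*n+j]
						}
					}
				}
			}
		}
	}
	return c
}

func main() {
	fmt.Println("tile edge for a 32 KiB L1:", TileSize(32*1024))
	a := []float32{1, 2, 3, 4}
	b := []float32{5, 6, 7, 8}
	fmt.Println(MatMulTiled(a, b, 2, 1))
}
```

The result is identical for every tile size; only the memory access pattern changes, which is why the tile edge can be chosen per machine at startup.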
WGSL Shader Workgroup Optimization: For WebGPU execution, Loom queries
MaxComputeWorkgroupStorageSize and MaxComputeInvocationsPerWorkgroup directly from the
WebGPU adapter. MHA shaders allocate shared arrays for Keys and Values, using workgroupBarrier()
synchronization, sized to consume exactly half of available workgroup storage — achieving optimal execution
across Apple Silicon, NVIDIA GPUs, and integrated mobile GPUs.
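The sizing rule for the shared Key and Value arrays comes down to a short calculation. The sketch below assumes float32 elements and one row per invocation, neither of which the source states, and `KVTileRows` is a hypothetical helper; the two limits it takes correspond to the adapter values named above.

```go
package main

import "fmt"

// KVTileRows returns how many key/value rows fit when the K and V tiles
// together may use at most half of the workgroup storage. Capping the result
// at the invocation limit (one row per invocation) is an assumption.
func KVTileRows(maxStorageBytes, maxInvocations, headDim int) int {
	const bytesPerF32 = 4
	if headDim <= 0 {
		return 0
	}
	budget := maxStorageBytes / 2
	bytesPerRow := 2 * headDim * bytesPerF32 // one K row plus one V row
	rows := budget / bytesPerRow
	if rows > maxInvocations {
		rows = maxInvocations
	}
	return rows
}

func main() {
	// 16384 bytes and 256 invocations are the WebGPU default limits.
	rows := KVTileRows(16384, 256, 64)
	fmt.Printf("var<workgroup> k_tile: array<f32, %d>;\n", rows*64)
	fmt.Printf("var<workgroup> v_tile: array<f32, %d>;\n", rows*64)
}
```

Deriving the array lengths from the adapter's reported limits, rather than hard-coding them, is what lets one shader template adapt to GPUs with very different amounts of workgroup memory.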
Section 7
Sub-System Autonomy: Tokenization, Ensembling & Telemetry
🔤
Native BPE Tokenizer
A full Byte-Pair Encoding tokenizer written in Go, natively parsing HuggingFace tokenizer.json
schemas. Includes a byte-fallback mechanism (gpt2ByteEncode/Decode) for unknown Unicode
characters — enabling completely standalone, offline string-to-tensor processing.
🧮
Mathematical Ensembling
FindComplementaryMatches assesses binary correctness masks of multiple models, calculating
combined coverage ratio and cosine similarity of success rates — enabling optimized "Mixture of Models"
pipelines that complement each other's weaknesses.
📊
Differentiable K-Means
KMeansForwardPolymorphic transforms standard K-Means into an end-to-end differentiable operation
using temperature-scaled distance metrics and Softmax gating, allowing classification topologies anywhere
in the volumetric grid.
📡
Microsecond Telemetry
The PolyObserver interface enables real-time tensor interception during forward/backward passes.
AdaptationTracker monitors degradation and recovery via moving windows of outputs, accuracy,
and throughput (OutputsPerSec).
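The soft assignment behind the differentiable K-Means card above can be sketched as a softmax over negative, temperature-scaled distances. Squared Euclidean distance and the `SoftAssign` name are assumptions; the source does not specify the metric or signature used by KMeansForwardPolymorphic.

```go
package main

import (
	"fmt"
	"math"
)

// SoftAssign returns softmax(-d_k / temperature), where d_k is the squared
// Euclidean distance from x to centroid k. Unlike a hard argmin, every
// output varies smoothly with x and with the centroids.
func SoftAssign(x []float32, centroids [][]float32, temperature float64) []float64 {
	logits := make([]float64, len(centroids))
	maxLogit := math.Inf(-1)
	for k, c := range centroids {
		var d float64
		for i := range x {
			diff := float64(x[i]) - float64(c[i])
			d += diff * diff
		}
		logits[k] = -d / temperature
		if logits[k] > maxLogit {
			maxLogit = logits[k]
		}
	}
	var sum float64
	for k := range logits {
		logits[k] = math.Exp(logits[k] - maxLogit) // subtract max for stability
		sum += logits[k]
	}
	for k := range logits {
		logits[k] /= sum
	}
	return logits
}

func main() {
	x := []float32{0.9, 1.1}
	centroids := [][]float32{{1, 1}, {5, 5}}
	fmt.Println("sharp:", SoftAssign(x, centroids, 1))
	fmt.Println("soft: ", SoftAssign(x, centroids, 100))
}
```

A low temperature approaches the usual hard cluster assignment, while a high temperature spreads responsibility across centroids.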
Section 8
Comparative Analysis: Loom vs Python Ecosystem (2026)
The global deep learning industry has historically been dominated by Python-based frameworks.
Comparing Loom to these heavyweights highlights distinct philosophical and technical divergences.
| Feature | Loom (M-POLY-VTD) | PyTorch (+ TorchAO) | JAX (+ Flax/Optax) |
| --- | --- | --- | --- |
| Execution Paradigm | 3D Volumetric Mesh / Spatial Routing | 1D Sequential / Dynamic DAG | Functional / Compiled Static Graph (XLA) |
| Language | Pure Go (Compiled Native Binary) | Python (C++ / CUDA backend) | Python (C++ / XLA backend) |
| Quantization | 21 types native (FP64 down to Binary 1-bit) | Native FP8, INT4, INT2, 1-bit via TorchAO | Native FP8, INT8; sub-byte via custom libs |
| QAT (Hardware Emulation) | Built-in polymorphic SimulatePrecision | FakeQuantize modules (complex node injection) | Custom JAX primitives |
| Optimization Engine | Polymorphic BPTT + Native Target Propagation | Native Autograd (reverse-mode AD) | Functional forward & reverse AD |
| Target Propagation | First-class native | Requires extensive custom class overrides | High research support via custom logic flows |
| GPU Acceleration | WebGPU (cross-platform, edge & browser) | CUDA, ROCm, Metal (vendor-specific) | TPU, CUDA, ROCm (heavy compiler reliance) |
| Structural Analysis | Topological DNA Engine + Logic Shifts | Standard dict/parameter hashing | Standard dict/parameter hashing |
| Deployment Footprint | Single binary, zero dependencies | Large runtime (PyTorch + CUDA variables) | Large runtime (JAX + XLA toolchains) |
Section 9
Comparative Analysis: Loom vs Go ML Ecosystem (2026)
| Feature | Loom | Born ML | GoMLX | Gorgonia (Legacy) |
| --- | --- | --- | --- | --- |
| Core Architecture | 3D Spatial Grid (Volumetric routing) | 1D Sequential module stacks | 1D Sequential computation graphs | Static graph (Theano/TF1 style) |
| Compute Backend | Pure Go + WebGPU (Zero CGO) | Pure Go + WebGPU (Zero CGO) | OpenXLA (Heavy C++ bindings) | CGO / CUDA (C++ bindings) |
| Modern LLM Topology | MHA, SwiGLU, RMSNorm, RoPE | MHA, GQA, SwiGLU, KV-Cache, RMSNorm | Gemma support / ONNX translation | None (basic perceptrons/CNNs only) |
| Quantization Spectrum | 21 types (FP64 down to Binary 1-bit) | Standard (FP32/FP16) | Standard (dictated by XLA compiler) | FP32/FP64 only |
| Optimization Engine | Backprop (BPTT) + Native Target Propagation | Automatic Differentiation (Autograd) | Automatic Differentiation via XLA | Symbolic & Automatic Differentiation |
| Non-Standard Layers | Native Differentiable K-Means Clustering | Requires external implementation | Requires external implementation | Requires external implementation |
| System Telemetry | Advanced window-based Adaptation Tracking | Standard terminal logging | Standard terminal logging | Standard terminal logging |
Conclusions
Strategic Outlook
The Loom M-POLY-VTD architecture represents a radical divergence from established norms of deep learning
engineering in 2026. By replacing the 1D computational graph with a cycle-accurate 3D Volumetric Grid,
the framework physically maps neural structures in a manner that accommodates advanced biological routing —
spatial hopping, systolic parallelism, and polymorphic precision.
Its exhaustive 21-type polymorphism and simulated precision mechanisms directly confront the hardware memory
bandwidth crisis, enabling dynamic on-the-fly quantization to 1-bit precision without structural memory
reallocation. Neural Target Propagation provides a mathematically viable path for continuous, asynchronous
training on power-constrained edge hardware.
Complemented by the DNA Engine's topological signature matching, native BPE tokenization, and pure-Go
WebGPU acceleration, Loom provides a self-contained, enterprise-grade ecosystem — vastly
surpassing legacy Go frameworks, matching Born ML's deployment efficiency, and introducing architectural
innovations previously reserved for experimental Python and JAX research environments.
View Loom on GitHub
Loom Documentation
Back to Loom Overview

---

Source: https://openfluke.com/soulglitch

# SoulGlitch — Offline AI Digital Pet (Google Play)

> SoulGlitch is a private on-device AI companion on Google Play (Android) and Linux x86_64. Reactive glitch face, swarm Q&A, emotion training, 100% offline—powered by Loom. iOS, macOS, Windows coming soon.

Canonical: https://openfluke.com/soulglitch

---

Now on Google Play · offline AI pet
SoulGlitch
A private AI digital pet that lives entirely on your device. Chat with it, train its emotional reactions on your own data,
share moments with it, or ask a swarm of personalities and watch them vote on what you should do.
Get it on Google Play
Linux
Founder unlock
Roadmap
Available now
Android — download from Google Play.
Linux — x86_64 zip below. No cloud, no accounts, no tracking.
Coming soon: Apple App Store (iOS), Mac App Store, and Microsoft Store (Windows).
Google Play
Linux
iOS
macOS
Windows
Ask a swarm, not just one bot
Build your own panel of personalities, ask them anything, and watch them respond with different tones,
opinions,
and chaos levels — all locally on your phone.
“Should I fart on a crowded train?”
Gremlin Mode YES
Absolutely. This is a once-in-a-lifetime social experiment.
Polite One NO
Please do not weaponise public transport.
Chaos Analyst YES
Data suggests the emotional outcome will be unforgettable.
Swarm Result: 2 / 3 personalities say yes
Loom live on your phone
Watch models run in real time — via TANHI
Loom trains and executes on your computer while SoulGlitch on your local phone receives live UDP telemetry.
See mixed layer types, remote links, and volumetric topology as a spatial trace you can scrub through — not just log lines.
TANHI × Regional Mix — Loom’s regional_mix harness pairs Dense, MHA, SwiGLU, RNN, and LSTM branches with regional remote links.
TANHI (Tensor Activation Network Holographic Interface) streams layer activity to SoulGlitch over your LAN.
Read the TANHI docs
Screenshots
See SoulGlitch in action
Tap any screenshot to view it larger and explore how the app feels and behaves.
What it does
More expressive than a chatbot. More private than the cloud.
SoulGlitch is designed as a playful, reactive local AI experience. It is meant to feel alive, weird,
emotional,
and personal — not like another generic wrapper around a text box.
👁️
Reactive glitch face
A living full-screen face that blinks, shifts, glitches, and morphs through emoji-driven expression changes
in real time.
🧠
Train your entity
Use text, images, and eventually video to shape how your AI interprets emotion and reacts to your world.
🗳️
Swarm personalities
Create multiple agents with different prompts and let them respond, disagree, and vote on questions
together.
📸
Share experiences
Share a photo or video with your pet and let it react emotionally based on the worldview you trained into
it.
📱
Runs on-device
Chats stay local. Training stays local. Reactions stay local. Your AI lives on your hardware, not someone
else’s server.
🌀
Built to expand
SoulGlitch starts as a digital pet and grows toward encrypted sessions, swarm networking, future dimensions,
and richer local agent behaviour.
Free vs Founder
Start free. Unlock the inner layer.
The free tier is the expressive local AI pet. Founder unlock opens the deeper customization and multi-entity
system behind the hidden door.
Free tier
Local AI pet
The playful, reactive outer layer — designed to be approachable, expressive, and fun from the first launch.
Talk to a full-screen glitch face with emoji-driven reactions
Train text sentiment and emotional response behaviour
Train on images and shape your entity’s reactions
Share photos and let the AI react to them
Share videos and build toward video-based emotional reactions
Export and share moments to social media
Founder unlock • $8.88
Inner layer access
For people who want the deeper system: more control, more entities, more weirdness, more experimentation.
Create more than one entity with different instructions and personalities
Ask a question to a swarm of agents and collect a group vote
Choose from a large entity pool to build your own custom groups
Customize entity colour, shape, and visual style
Access prototype features and in-progress experiments behind the hidden layer
Support development while locking in the early founder tier
Why SoulGlitch
Not another serious AI assistant.
SoulGlitch is intentionally playful. It is built for people who want to experiment, laugh, share, and see how
local AI can feel expressive and alive
without sending their whole life into the cloud.
😂
Chaotic by design
Ask dumb questions, get strange answers, and treat it like a digital creature rather than a corporate
productivity dashboard.
🔒
Privacy-first
The fun part is the personality. The serious part is that your chats and training stay on your own device.
✨
Expressive visuals
Face morphs, emojis, glitch states, future dimensions, and more — the interface is part of the personality,
not just a skin around a text box.
Roadmap
What’s coming next
SoulGlitch is live on Google Play (Android) and Linux (x86_64).
iOS, macOS, and Windows Store builds are coming soon.
01
Polish the core pet
Sharpen the face, refine emotion training, expand sharing, improve reaction quality, and make the free
tier instantly fun.
02
Deepen the founder layer
Multi-entity setup, better swarm workflows, richer customization, and more experiments hidden behind the
inner door.
03
Expand the system
Encrypted chats, swarm networking, better face expressions, device-to-device syncing, cross-platform
builds, and future dimension spaces.
04
Enter its dimension
Audio and video training, optional distributed compute, 3D environments, and a much more embodied version
of the entity.
SoulGlitch
Train your entity. Ask a swarm. Don’t take it too seriously.
The outer layer is a strange little offline AI pet. The inner layer is where the system starts mutating into
something bigger.
Download on Google Play
Linux
Dev notes

---

Source: https://openfluke.com/soulglitch/notes

# SoulGlitch Design Notes — Vision & Mechanics

> Design document for SoulGlitch: creature evolution, Hall of Fame, swarm networking, emotion training, privacy-first AI companion—not a cloud chatbot wrapper.

Canonical: https://openfluke.com/soulglitch/notes

---

Developer notes • current direction • public build plan
SoulGlitch Notes
These are the live notes behind SoulGlitch’s current direction: a local offline AI pet with emotional reactions,
swarm personalities, founder-only inner layers, and a longer-term path toward richer on-device agent behaviour.
Back to SoulGlitch
Pricing & Tiers
Coming Features
🧬
What SoulGlitch is now
The current app direction
SoulGlitch is no longer framed as the old creature-simulation placeholder. The current product is an
offline AI digital pet with a glitch face, emotional reactions, local training, and a swarm layer for asking
multiple personalities the same question and collecting their answers.
The goal is to make local AI feel expressive, playful, reactive, and personal — something that feels alive
on your phone,
not another sterile cloud chatbot. Privacy matters, but the front-facing experience is still meant to be
fun.
📱
Core app loop
How the main experience is supposed to feel
Open the app and interact with a living full-screen face rather than a plain message list.
Talk to the entity and watch it react visually with glitches, emoji-driven moods, and changing
expression.
Train its sentiment and emotional behaviour using your own data.
Share a photo or video and let it react based on what you previously taught it.
Move into the founder layer and ask a swarm of personalities the same question to get a group vote.
💸
Free, Founder, and later pricing
The monetization direction as it stands right now
Free tier
The expressive outer layer
Talk to the glitch face and train text sentiment reactions
Train on images and influence emotional response
Share a photo and get a mood-based reaction
Share video and move toward video-aware emotional reactions
Export and share moments to social media
Founder unlock • $8.88
The hidden inner layer
Customize entity colour, shape, and other visual traits
Create more than one entity with different prompts and instructions
Ask a question to a swarm of agents and collect their votes
Build your own custom groups from a larger entity pool
Access deeper prototype features behind the hidden door
🌀
Coming features
The longer-term buildout pushing toward the higher tier
Later roadmap • $18.88 direction
System expansion
Encrypted chats
Network swarm hosting
Windows, Linux, and Mac releases
Device-to-device encrypted syncing
Better face expressions and richer emotional display
Audio and video training
Optional distributed compute / donate device power
3D environments to engage with the entity in its own dimension
Kernel fusion and other inference speed improvements
🎯
Targeting notes
Who this appears to be for right now
Current likely audience
AI-curious adults who like playful tech, weird chat experiences, expressive interfaces, privacy-first
software,
and experimental apps that do not feel overly corporate or serious.
Why the tone matters
SoulGlitch works better when framed as fun, strange, reactive, and slightly chaotic. The privacy story
is important,
but “don’t take AI too seriously” is likely the sharper outer hook.
🧠
Model layer notes
How the local brain tiers are being thought about
135M — light brain: fast reactions, playful tone, lower load, ideal for lively mobile
interaction.
360M — balanced brain: stronger all-round local personality, better for fuller
interaction on more capable devices.
1.7B — deep brain: heavier reasoning tier intended more for server or stronger
hardware contexts.
🚀
Launch notes
How rollout is currently framed
The rollout path starts on Android, where the current build and local model story make the most sense.
The founder direction then expands slowly into iOS, desktop storefronts, and eventually Steam.
The app that launches first is not the final universe. It is the first stable layer: local pet, emotional
reactions,
founder swarm, and enough weirdness to prove the concept.
SoulGlitch is being built in layers: first the expressive local pet, then the founder inner system, then the
broader dimensional expansion.

---

Source: https://openfluke.com/primecraft

# Primecraft — Voxel Simulation Engine with Embedded AI

> Primecraft is OpenFluke's distributed simulation engine: procedural worlds, physics, voxel scenes, and embedded neural AI. Android, Windows, Linux, Steam.

Canonical: https://openfluke.com/primecraft

---

Early Access Available Now
Primecraft
A distributed simulation engine for procedural worlds and embedded neural agents —
delivered as a playable game across mobile, desktop, and the web.
Android
Windows
Linux
Wishlist on Steam
Alpha v0.30.0 — Model Sharing Patch
8B+
Planets
100%
Offline
Real
Neural AI
Overview
What Is Primecraft?
More than a game — it's an experimental engine exploring AI, physics, procedural generation, and distributed gameplay.
Primecraft is an experimental, physics-driven sandbox built on top of a custom simulation engine.
The game blends procedural world generation, real neural-network AI, player-driven construction, and multi-device gameplay.
It is also the foundation the creature game SoulGlitch is built on.
In simple terms:
Procedural Worlds + Real AI + Multiplayer Sandbox
Explore billions of procedural planets — each deterministically generated from its coordinates
Train on-device neural networks — companions that learn and evolve with you
Drop into bubble scenes — mini 3D levels loaded directly from the web
Build or import levels — using a simple JSON-based scene format
Couch co-op & LAN sync — play together on one device or across your network
Physics-driven gameplay — destructible objects, planetary gravity, and dynamic abilities
Fully offline — including AI training, no cloud required
Features
Core Gameplay Features
Experience a unique blend of exploration, creation, and AI-driven gameplay.
Procedural Universe
Navigate through billions of unique planets, each with distinct terrain, resources, and environmental conditions. Every world is algorithmically crafted for endless exploration.
Neural AI Companions
Train real neural networks directly on your device. Your companions learn from gameplay, developing unique behaviors and abilities through the Loom AI framework.
Physics Sandbox
Experience realistic planetary gravity, destructible environments, and physics-based abilities. Every object in the world responds to forces and collisions.
Bubble Scenes
Discover and enter bubble scenes — self-contained 3D levels created by players and loaded from the web. Play puzzles, challenges, and custom worlds.
Multiplayer Experiences
Play couch co-op on a single device or sync across multiple devices over LAN. Online multiplayer and server-hosted worlds are in active development.
Level Creation
Build your own worlds using the web-based editor. Export as JSON and share with the community, or import others' creations into your game.
Videos
See Primecraft In Action
Real footage of the engine being stress-tested and explored — from physics simulations to procedural planetary constructs.
Preview · v0.20.0
AI Returns Home, Planet Travel & Couch Co-Op
Fly across planetary space, train an AI to fly itself home, jump into bubble scenes, and play local co-op with synced AI movement — all in one unscripted preview.
Engine Test · Physics
Low-Fidelity Simulation Stress Testing
Discrete Element stress test: thousands of rigid bodies to find the saturation point of the physics solver. Tracking frame latency, CPU physics process time, and static RAM.
Full Suite · Tests 1–9
Procedural Planetary Constructs
9 tests in one: SnakeBots, animated skeletons, procedural bestiary (Walkers, Worms, Star-creatures), planetary skyscrapers aligned to spherical surfaces,
discovery satellites, defensive grids, and the Great Transfiguration — 150+ magical objects spawned onto a single planet.
Watch on YouTube
Technical
Under the Hood
Primecraft is built as a cross-platform simulation engine, not just a game.
Godot / C# Runtime
Native performance
High-performance physics, rendering, and input handling for mobile and desktop builds. Optimized for real-time simulation with thousands of entities.
TypeScript / Web Runtime
Browser-based tooling
Scene editing, constraint systems, and AI tooling that runs directly in your browser. Shares the same scene format for seamless interoperability.
Engine Capabilities
Deterministic Planet Generator — 8–15 billion reachable locations with consistent generation
Embedded Neural Runtime (Loom) — native inference for on-device AI
AI Training Layer — movement, control, and companion behaviour learning
Authoritative Multiplayer — server architecture in development
Web-Based Scene Editor — create puzzles, levels, and simulation experiments
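Deterministic generation from coordinates, as listed under Engine Capabilities, usually means deriving a seed from the grid position alone and computing everything else from that seed. The sketch below illustrates the idea and is not Primecraft's generator: the SplitMix64 finalizer is a swapped-in, well-known mixing function, and `PlanetSeed`, `Planet`, and `Generate` are hypothetical names with made-up attributes.

```go
package main

import "fmt"

// mix is the SplitMix64 finalizer, a fast bijective 64-bit scrambler.
func mix(z uint64) uint64 {
	z += 0x9e3779b97f4a7c15
	z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9
	z = (z ^ (z >> 27)) * 0x94d049bb133111eb
	return z ^ (z >> 31)
}

// PlanetSeed derives a seed from grid coordinates only, so the same
// coordinates give the same seed on every device and in every session.
func PlanetSeed(x, y, z int32) uint64 {
	s := mix(uint64(uint32(x)))
	s = mix(s ^ uint64(uint32(y)))
	s = mix(s ^ uint64(uint32(z)))
	return s
}

// Planet holds attributes derived from the seed. The fields are illustrative.
type Planet struct {
	Seed     uint64
	RadiusKm int
	HasWater bool
}

// Generate builds a planet from its coordinates with no stored state.
func Generate(x, y, z int32) Planet {
	seed := PlanetSeed(x, y, z)
	return Planet{
		Seed:     seed,
		RadiusKm: 1000 + int(seed%9000),
		HasWater: mix(seed)&1 == 1,
	}
}

func main() {
	fmt.Printf("%+v\n", Generate(3, -7, 42))
	fmt.Printf("%+v\n", Generate(3, -7, 42)) // identical to the line above
	fmt.Printf("%+v\n", Generate(3, -7, 43))
}
```

Nothing about a planet has to be saved or synchronized with this approach: two devices that agree on the coordinates and the algorithm reconstruct the same world independently.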
FAQ
Frequently Asked Questions
Is this a game or a research project?
Both. Primecraft is a fully playable game, but it's also an experimental engine exploring AI, physics, procedural generation, and distributed input systems.
How does the AI work?
Each companion uses a real neural network running natively on your device — no cloud, no external servers. You train them through gameplay inside bubble scenes using the Loom framework.
Does Primecraft collect my data?
No. All AI models and training stay entirely on your device unless you explicitly choose to export or publish your scenes and models.
How many planets are there?
The coordinate system supports over 8 billion unique planets. Each one is deterministically generated from its grid location, ensuring consistency across sessions.
Can I make my own levels?
Absolutely! Use the web-based editor to create scenes, then load them directly into the game. Levels are stored as human-readable JSON files.
Is multiplayer supported?
Couch co-op and LAN play are available now. Online multiplayer with server-hosted bubble scenes is actively in development.
Why does Primecraft look chaotic?
By design! It's a physics-first sandbox with experimental AI. The emergent chaos is part of the experience — players are encouraged to break things creatively.
Audience
Who Is Primecraft For?
Built for curious minds who love experimentation and creative chaos.
Sandbox Enthusiasts
Love physics sandboxes and chaotic emergent gameplay
AI Hobbyists
Train neural networks in a real-time environment
Level Creators
Build 3D worlds without complex tools
Researchers
Explore embodied AI and procedural ecosystems
Explorers
Discover a weird, beautiful universe to mess around in
Join the Universe
Start building, training, and exploring today. The cosmos awaits.
Android
Windows
Linux
Wishlist on Steam

---

Source: https://openfluke.com/gallery

# Primecraft Scene Gallery — Voxel Worlds by OpenFluke

> Browse voxel scenes built in Primecraft while testing the simulation engine—3D worlds, reflex automation, and neural networks from the OpenFluke lab.

Canonical: https://openfluke.com/gallery

---

Scene Gallery

---

Source: https://openfluke.com/privacy

# Privacy Policy — OpenFluke

> How OpenFluke, Loom, SoulGlitch, and Primecraft handle your data. SoulGlitch runs offline on-device; the website uses minimal analytics via Cloudflare.

Canonical: https://openfluke.com/privacy

---

Legal
Privacy Policy
Last Updated: November 28, 2025
1 Introduction
OpenFluke ("we", "our", or "us") is committed to protecting your privacy. This Privacy Policy explains how your information is handled when you use our services, including Primecraft, Biocraft, and the OpenFluke website.
2 Information Collection
We prioritize data minimization. We do not store your personal data on our own servers beyond what is necessary for core functionality. Our services use the following third-party providers:
Google Sign-In: Used for authentication. We only receive basic profile information (name, email, profile picture) to identify your account.
Google Play Services: Used for achievements, leaderboards, and cloud saves in Primecraft.
Google Play Billing: Handles in-app purchases securely. We never see your payment information.
3 How Information is Used
Account Management: To identify you and provide access to your Lab and saved content.
Game Progress: To save your game state, scenes, and unlocks locally or via cloud sync.
Diagnostics: We collect anonymous crash data to improve stability and fix bugs.
4 Data Security
We rely on the robust security measures provided by Google Cloud, the Android operating system, and industry-standard encryption to protect your data. While no method of transmission is 100% secure, we use commercially acceptable means to protect your information.
5 Your Rights
You have the right to access, correct, or delete your personal information. You can revoke Google sign-in permissions at any time through your Google account settings. Deleting your OpenFluke account will remove all associated data from our systems.
6 Children's Privacy
Our services are not directed at children under 13. We do not knowingly collect personally identifiable information from children under 13. If you believe we have collected such information, please contact us immediately.
7 Changes to This Policy
We may update this Privacy Policy from time to time. Changes will be posted on this page with an updated revision date. Continued use of our services constitutes acceptance of the updated policy.
8 Contact Us
If you have any questions about this Privacy Policy, please reach out:
support@openfluke.com

---

Source: https://openfluke.com/terms

# Terms of Service — OpenFluke

> Terms of service for OpenFluke websites, Loom open-source software, SoulGlitch, and Primecraft.

Canonical: https://openfluke.com/terms

---

Legal
Terms of Service
Last Updated: December 10, 2025
1 Acceptance of Terms
By accessing or using OpenFluke services, including Primecraft, SoulGlitch,
Biocraft, the OpenFluke website, and the LOOM AI framework,
you agree to be bound by these Terms of Service. If you do not agree to these terms, please do not use our
services.
2 Description of Services
OpenFluke provides a platform for creating, sharing, and running physics simulations and AI-driven
experiences. Our services include:
Primecraft: A native game application for exploring procedural worlds and training AI
companions.
Biocraft: A browser-based studio for creating and testing physics scenes.
Your Lab: A personal workspace for managing your scenes and AI models.
LOOM: An open-source neural network framework for AI training.
3 User Accounts
To access certain features, you must create an account using Google Sign-In. You are responsible for
maintaining the confidentiality of your account and for all activities that occur under your account.
Your account is free and gives you access to your personal Lab, scene publishing, and AI
training features.
4 User Content
You retain ownership of any content you create using our services ("User Content"). By publishing User
Content, you grant OpenFluke a non-exclusive, worldwide, royalty-free license to display and distribute your
content through our platforms.
You agree not to publish content that:
Is illegal, harmful, threatening, abusive, or violates any laws
Infringes on intellectual property rights of others
Contains malware, viruses, or malicious code
Is designed to harm, exploit, or mislead other users
5 Intellectual Property
The OpenFluke name, logo, Primecraft, Biocraft, and associated branding are trademarks of OpenFluke. The
LOOM AI framework and IsoCard scene format are released under open-source
licenses and may be used according to their respective terms.
6 Disclaimer of Warranties
Our services are provided "as is" and "as available" without warranties of any kind. We do not guarantee that
our services will be uninterrupted, secure, or error-free. Your use of our services is at your own risk.
7 Limitation of Liability
To the maximum extent permitted by law, OpenFluke shall not be liable for any indirect, incidental, special,
consequential, or punitive damages arising from your use of our services.
8 Termination
We reserve the right to suspend or terminate your account at any time for violations of these terms or for any
other reason at our discretion. You may delete your account at any time through your account settings.
9 Changes to Terms
We may update these Terms of Service from time to time. Material changes will be communicated through our
services. Continued use after changes constitutes acceptance of the updated terms.
10 Contact Us
If you have any questions about these Terms of Service, please reach out:
support@openfluke.com

---

Source: https://openfluke.com/docs

# Loom Docs — Golang AI Engine Reference

> Official Loom Golang AI docs: M-POLY-VTD overview, v0.79 bedrock validation, Go layers & dispatch, GPU/WebGPU, quantization, Lucy seven-layer suite, BitNet CPU, deployment, TANHI telemetry.

Canonical: https://openfluke.com/docs

---

Loom Documentation
M-POLY-VTD Engine Docs
Complete reference for Loom's poly package — the Multi-numerical POLYmorphic
Volumetric Tiled-tensor Dispatcher that powers every Loom integration.
Architecture Overview
Quick Reference
GitHub
Where to start?
New? Read Overview → Layers → Training. Deploying to web or mobile? Go to Deployment. Need a snippet? Quick Reference has everything copy-paste ready.
Architecture
Layers
Training
Deployment
GPU
Quick Reference
Read docs

## Loom documentation (full text)

---

## M-POLY-VTD: Architecture Overview

Source: https://openfluke.com/docs/overview
Markdown: https://openfluke.com/docs/overview.md

# M-POLY-VTD: Architecture Overview

**Multi-numerical POLYmorphic Volumetric Tiled-tensor Dispatcher**

M-POLY-VTD is a neural inference and training engine built from first principles in Go. It treats a neural network not as a sequential stack of layers, but as a **spatial 3D grid** where each cell can hold any layer type, and every layer can morph its numerical precision on demand.

> [!NOTE]
> Current version: **0.79.0 (Bedrock Validation)**. Previous: **0.78.0 (ASM CPU)**. The Loom stack is **Go + `poly/asm` + WebGPU** only. **Numerical Tiling (SC/MC)** is live across all 21 DTypes; **Dense forward** can use Plan 9 assembly via `UseAsmForward`. **v0.79** hardens CPU train/save/reload, MHA layout + KV decode, and C-ABI parity (see [`bedrock_validation.md`](bedrock_validation.md)). Checkpoints save **native packed weights per layer dtype** (not FP32-only JSON). See [`poly/README.md`](../poly/README.md) for the live checklist and [`testing_and_validation.md`](testing_and_validation.md) for Lucy log interpretation.

---

## The Full Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        M-POLY-VTD ARCHITECTURE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │               POLYGLOT BINDINGS (C-ABI FFI Layer)                    │   │
│  │  Python │ TS (@openfluke/welvet) │ C# │ Java │ Dart │ WASM Browser    │   │
│  └─────────────────────────────┬────────────────────────────────────────┘   │
│                                │                                            │
│                                ▼                                            │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                  VolumetricNetwork (3D Grid)                          │   │
│  │                                                                      │   │
│  │   Depth × Rows × Cols × LayersPerCell                                │   │
│  │                                                                      │   │
│  │   ┌───────────┐  ┌───────────┐  ┌───────────┐                       │   │
│  │   │ (0,0,0,0) │  │ (0,0,1,0) │  │ (0,0,2,0) │   ← Depth=0, Row=0  │   │
│  │   │VolumetricL│  │VolumetricL│  │VolumetricL│                       │   │
│  │   │ayer       │  │ayer       │  │ayer       │                       │   │
│  │   └─────┬─────┘  └─────┬─────┘  └─────┬─────┘                       │   │
│  │         │               │               │                            │   │
│  │   ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐                       │   │
│  │   │ (0,1,0,0) │  │ (0,1,1,0) │  │ (0,1,2,0) │   ← Depth=0, Row=1  │   │
│  │   └───────────┘  └───────────┘  └───────────┘                       │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                │                                            │
│              ┌─────────────────┼──────────────────────┐                     │
│              ▼                 ▼                      ▼                     │
│  ┌───────────────┐  ┌──────────────────┐  ┌───────────────────────────┐    │
│  │  CPU Backend  │  │ Step mesh engine │  │  WebGPU Backend (WGPU)    │    │
│  │               │  │                  │  │                           │    │
│  │ ForwardPoly-  │  │ StepForward      │  │ BeginFrame / FlushFrame   │    │
│  │ morphic[T]    │  │ StepBackward     │  │ DispatchForwardLayer      │    │
│  │               │  │ Tween (NTP)      │  │ DispatchBackwardLayer     │    │
│  │ All 21 DTypes │  │                  │  │ WGSL compute shaders      │    │
│  └───────────────┘  └──────────────────┘  └───────────────────────────┘    │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │              WeightStore (Morphic Precision Engine)                   │   │
│  │                                                                      │   │
│  │  Master []float32  ──┬──▶  Versions[DTypeFP4]  []int8               │   │
│  │  (Source of Truth)   ├──▶  Versions[DTypeInt8] []int8               │   │
│  │                      ├──▶  Versions[DTypeBinary] []int8             │   │
│  │                      └──▶  GPUWeights[DTypeFloat32] *wgpu.Buffer    │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                     DNA Engine                                        │   │
│  │  ExtractDNA ──▶ LayerSignature[] ──▶ CompareNetworks ──▶ SI Score    │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## The Six Core Pillars

### I. Multi-Numerical Architecture (the "M")

The engine natively dispatches forward and backward passes across **21 distinct numerical types** (DTypes), from `float64` all the way down to 1-bit `binary`. Each layer stores its weights in a `WeightStore` that holds a `float32` master copy plus optional converted versions for inference.

```
DType Hierarchy:
┌────────────────────────────────────────────────────────┐
│ High-Precision  │ Float64, Int64, Uint64               │
│ Standard        │ Float32, Int32, Uint32, Int16, Uint16│
│ Optimized       │ Float16, BFloat16, Int8, Uint8       │
│ Low-Bit         │ FP8E4M3, FP8E5M2, Int4, Uint4, FP4  │
│ Extreme         │ Int2, Uint2, Ternary, Binary         │
└────────────────────────────────────────────────────────┘
```

Layers are not restricted to a single precision. The dispatcher reads `layer.DType`, fetches the right version from the `WeightStore`, and falls back to the master FP32 weights if no converted version exists. See [numerical_types.md](./numerical_types.md) for the full breakdown.

### II. Polymorphic Layer-Morphing (the "POLY")

Every layer is a **polymorphic processing unit**. Its numerical representation can be changed at any time via `WeightStore.Morph(dtype)` without reallocating the layer structure. The master FP32 weights are never destroyed—they remain the source of truth.

```
Metamorphosis sequence:
  FP32 (training) ──▶ Morph(INT8) ──▶ Morph(FP4) ──▶ Morph(Binary)
       ▲                                                     │
       └──── Unpack(dtype) ──── always recoverable ─────────┘
```

After gradients are applied via `WeightStore.ApplyGradients`, all cached low-bit versions are **automatically cleared**, forcing re-quantization on the next forward pass.
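
The invalidation rule can be illustrated with a toy store. This is a minimal sketch, not Loom's `WeightStore`: the `miniStore` type, its string-keyed cache, the `builds` counter, and the choice to map zero to `+1` are all assumptions made for the example.

```go
package main

import "fmt"

// miniStore is a simplified, hypothetical stand-in for a weight store:
// an FP32 master plus a cache of derived low-bit versions.
type miniStore struct {
	Master   []float32
	Versions map[string][]int8 // keyed by a dtype name in this sketch
	builds   int               // how many times a version was rebuilt
}

// MorphBinary returns the cached sign-only version, building it on demand.
func (s *miniStore) MorphBinary() []int8 {
	if v, ok := s.Versions["binary"]; ok {
		return v
	}
	v := make([]int8, len(s.Master))
	for i, w := range s.Master {
		if w >= 0 {
			v[i] = 1
		} else {
			v[i] = -1
		}
	}
	s.Versions["binary"] = v
	s.builds++
	return v
}

// ApplyGradients updates the master weights and drops every cached version,
// so the next MorphBinary call has to re-quantize.
func (s *miniStore) ApplyGradients(grads []float32, lr float32) {
	for i := range s.Master {
		s.Master[i] -= lr * grads[i]
	}
	s.Versions = map[string][]int8{}
}

func main() {
	s := &miniStore{Master: []float32{0.5, -0.2}, Versions: map[string][]int8{}}
	fmt.Println(s.MorphBinary())            // [1 -1]
	s.MorphBinary()                         // served from the cache
	s.ApplyGradients([]float32{1, -1}, 1.0) // master becomes [-0.5, 0.8]
	fmt.Println(s.MorphBinary(), s.builds)  // [-1 1] 2
}
```

The master slice is the only state that gradients touch; every low-bit view is derived data that can be thrown away and rebuilt.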

### III. Volumetric Tensor Dispatch (the "VTD")

The network is a **4D array** of `VolumetricLayer` values indexed by `(Depth, Row, Col, LayerIndex)`. The flattened index is:

```
idx = z * Rows * Cols * LayersPerCell
    + y * Cols * LayersPerCell
    + x * LayersPerCell
    + l
```

Data flows through the grid in reading order: Z outer loop, then Y, then X, then L. This gives the programmer a spatial metaphor to compose complex non-linear topologies.
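
As a quick check of the formula, the sketch below maps a coordinate to its flat index and back. The `Grid` type and its method names are illustrative assumptions, not part of the `poly` package.

```go
package main

import "fmt"

// Grid holds the four dimensions of a volumetric network.
// The type and its methods are hypothetical helpers for this example.
type Grid struct {
	Depth, Rows, Cols, LayersPerCell int
}

// FlatIndex maps (z, y, x, l) to a position in a flat slice,
// following the formula given above.
func (g Grid) FlatIndex(z, y, x, l int) int {
	return z*g.Rows*g.Cols*g.LayersPerCell +
		y*g.Cols*g.LayersPerCell +
		x*g.LayersPerCell +
		l
}

// Coords inverts FlatIndex by peeling off the fastest-varying axis first.
func (g Grid) Coords(idx int) (z, y, x, l int) {
	l = idx % g.LayersPerCell
	idx /= g.LayersPerCell
	x = idx % g.Cols
	idx /= g.Cols
	y = idx % g.Rows
	z = idx / g.Rows
	return z, y, x, l
}

func main() {
	g := Grid{Depth: 2, Rows: 3, Cols: 4, LayersPerCell: 2}
	idx := g.FlatIndex(1, 2, 3, 1)
	fmt.Println(idx) // 47, the last slot of a 2*3*4*2 = 48 element grid
	z, y, x, l := g.Coords(idx)
	fmt.Println(z, y, x, l) // 1 2 3 1
}
```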

#### Remote Links (Spatial Hopping)

Any layer can set `IsRemoteLink = true` and point to any other coordinate via `TargetZ / TargetY / TargetX / TargetL`. When the step mesh engine fires that layer, it reads input from the *target* coordinate's output buffer instead of the preceding layer. This enables biological-style feedback loops anywhere in the grid.

```
Normal flow:          Remote link (skip connection):
 (0,0,0)               (0,0,0)
    │                     │    ◄────────────────────────┐
    ▼                     ▼                             │
 (0,0,1)               (0,0,1)  ─ IsRemoteLink ──▶ (0,2,3)
    │                     │
    ▼                     ▼
 (0,0,2)               (0,0,2)
```

### IV. The Dispatcher Pattern

`DispatchLayer[T]` and `DispatchLayerBackward[T]` are **generic runtime jump tables**. They inspect `layer.Type` and call the correct polymorphic function, returning `(preAct, postAct)` tensors of the same type `T`. The separation from the grid traversal loop makes GPU kernel fusion possible—the driver can look ahead and pre-load the next tile's weights while the current tile computes.

```go
func DispatchLayer[T Numeric](layer *VolumetricLayer, input, skip *Tensor[T]) (preAct, postAct *Tensor[T])
```

There are 19 `LayerType` values routed here. An unknown type falls through to `DenseForwardPolymorphic`.

**Numerical tiling** is orthogonal to volumetric traversal: `ForwardPolymorphic` can walk the grid in spatial tiles or sequentially (`network.UseTiling`). **CPU** layers use a **single** tile map (`CPUTileSizes`). **GPU** layers carry **two** maps (`GPUSCTileSizes`, `GPUMCTileSizes`); **`EnableMultiCoreTiling`** on `VolumetricNetwork` selects MC vs SC dispatch (see [dispatch.md](./dispatch.md) and [gpu.md](./gpu.md)).

### V. The Step Mesh Engine

Unlike `ForwardPolymorphic`, which executes the entire network per input in one pass, `StepForward` fires **all layers simultaneously** every clock cycle. Each layer reads from the previous cycle's output buffer (`LayerData`) and writes to `NextBuffer`. After all layers have fired, the buffers are swapped. This double-buffering pattern is race-condition-free and supports parallel tile dispatch via goroutines.
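
The double-buffering idea can be shown with a toy chain of cells. This is a sketch under simplifying assumptions: `stepMesh` is a hypothetical type, each "layer" just adds 1 to its left neighbour's previous value, and the loop runs serially rather than across goroutines.

```go
package main

import "fmt"

// stepMesh is a toy stand-in for a step engine. Each cell reads the
// previous tick's value of its left neighbour and adds 1.
type stepMesh struct {
	cur, next []float32
}

func newStepMesh(n int) *stepMesh {
	return &stepMesh{cur: make([]float32, n), next: make([]float32, n)}
}

// Step fires every cell against the previous tick's buffer, then swaps
// the two buffers. No cell ever reads a value written during this tick.
func (m *stepMesh) Step(input float32) {
	for i := range m.cur {
		in := input
		if i > 0 {
			in = m.cur[i-1] // previous tick only
		}
		m.next[i] = in + 1
	}
	m.cur, m.next = m.next, m.cur
}

func main() {
	m := newStepMesh(3)
	for tick := 1; tick <= 3; tick++ {
		m.Step(0)
		fmt.Println(tick, m.cur)
	}
	// 1 [1 1 1]
	// 2 [1 2 2]
	// 3 [1 2 3]
}
```

Because writes go only to `next`, the per-cell work inside `Step` could be split across goroutines without locks; the input needs one tick per cell to reach the end of the chain.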

### VI. The DNA Engine

`ExtractDNA` converts a network into a slice of `LayerSignature` values. Each signature contains the layer's 3D coordinates, type, DType, and a **normalized** (unit-vector) representation of its weights after precision simulation. `CompareNetworks(dna1, dna2)` then uses cosine similarity to produce an `OverallOverlap` score and identifies `LogicShift` events where a functional pattern has migrated to a different spatial coordinate.
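
The similarity measure itself is easy to reproduce. The sketch below normalizes two weight vectors and takes their dot product, which is the cosine similarity for unit vectors. The function names `unit` and `cosine` are assumptions for this example and the code does not reproduce how `CompareNetworks` aggregates per-layer scores.

```go
package main

import (
	"fmt"
	"math"
)

// unit returns w scaled to unit length. An all-zero input yields zeros.
func unit(w []float32) []float64 {
	var sumSq float64
	for _, v := range w {
		sumSq += float64(v) * float64(v)
	}
	out := make([]float64, len(w))
	if sumSq == 0 {
		return out
	}
	norm := math.Sqrt(sumSq)
	for i, v := range w {
		out[i] = float64(v) / norm
	}
	return out
}

// cosine is the dot product of two unit vectors of equal length.
// Mismatched lengths are treated as "no overlap" in this sketch.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot float64
	for i := range a {
		dot += a[i] * b[i]
	}
	return dot
}

func main() {
	a := unit([]float32{1, 2, 3})
	b := unit([]float32{2, 4, 6})   // same direction, different scale
	c := unit([]float32{-1, -2, -3}) // opposite direction
	fmt.Printf("%.3f %.3f\n", cosine(a, b), cosine(a, c)) // 1.000 -1.000
}
```

Normalizing first makes the score insensitive to overall weight magnitude, so two layers that differ only by a scale factor compare as identical.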

---

## Key Types at a Glance

| Type | File | Role |
|:-----|:-----|:-----|
| `VolumetricNetwork` | `poly.go` | The 3D grid container |
| `VolumetricLayer` | `poly.go` | A single processing unit with coordinates |
| `WeightStore` | `weights.go` | Master FP32 + versioned low-bit storage |
| `Tensor[T Numeric]` | `poly.go` | Generic data container with `Shape` and `Nested` |
| `DType` | `poly.go` | 21-value enum for numerical types |
| `LayerType` | `poly.go` | 19-value enum for layer kinds |
| `WGPUContext` | `wgpu_context.go` | GPU device, queue, pipeline cache |
| `StepState[T]` | `step.go` | Double-buffered temporal mesh state |
| `NetworkDNA` | `dna.go` | `[]LayerSignature` topological blueprint |
| `TrainingConfig` | `training.go` | Epochs, LR, loss type, GPU flag |

---

## The `Tensor[T]` Type

```go
type Tensor[T Numeric] struct {
    Data   []T
    DType  DType
    Shape  []int
    Nested []*Tensor[T]  // activation tree for Parallel/Sequential layers
}
```

`Nested` is the key structural innovation. During a `ParallelForward` pass, each branch produces its own `preAct` tensor, and these are stored in `Nested` on the returned preAct. The backward pass reads them back, routing gradients to the correct branch without any external bookkeeping. This recursive tree property makes arbitrary nesting of `Parallel` and `Sequential` layers fully differentiable.

---

## Performance Snapshot

From the README benchmark table, measured on a GTX 1650 Super (Vulkan/WebGPU):

| Layer type | CPU Tiled | GPU | Speedup |
|:-----------|:----------|:----|:--------|
| Dense | 5.42ms | 400µs | 13.6x |
| CNN 1D | 4.34ms | 195µs | 22.3x |
| CNN 2D | 182ms | 100µs | 1826x |
| CNN 3D | 1522ms | 200µs | 7602x |
| RMSNorm | 1.16ms | 103µs | 11.3x |

End-to-end GPU training (20 epochs):

| Architecture | CPU | GPU | Speedup |
|:-------------|:----|:----|:--------|
| Dense MLP (128→512→512→8) | 12.1s | 693ms | 17.5x |
| CNN 2D (3ch×32×32 → 16f→32f→8) | 1m57s | 1.81s | 64.8x |
| Deep Dense (128→512×4→8) | 31.7s | 1.23s | 25.7x |

---

## Next Steps

- [numerical_types.md](./numerical_types.md) — DType system, WeightStore, Metamorphosis
- [layers.md](./layers.md) — Every layer type in detail
- [dispatch.md](./dispatch.md) — The dispatcher pattern and 3D coordinates
- [training.md](./training.md) — Forward/backward, optimizers, Tween
- [gpu.md](./gpu.md) — WebGPU backend and BeginFrame/FlushFrame pattern
- [step.md](./step.md) — The step mesh engine
- [quick_reference.md](./quick_reference.md) — Common code snippets

---

## Deployment: TypeScript, WASM, and NPM

Source: https://openfluke.com/docs/deployment
Markdown: https://openfluke.com/docs/deployment.md

# Deployment: TypeScript, WASM, and NPM

Loom is designed to be **isomorphic**, meaning the exact same mathematical engine runs in both Node.js (backend) and the Browser (frontend) via a bit-perfect WebAssembly (WASM) bridge.

---

## 📦 The NPM Package: `@openfluke/welvet`

The primary way to use Loom in the JavaScript ecosystem is through the **Welvet** SDK.

### Installation
```bash
npm install @openfluke/welvet
```

### Quick Start (Node.js)
```typescript
import { init, createNetwork } from '@openfluke/welvet';

// Initialize the WASM runtime
await init();

// Build a network from a JSON specification
const net = await createNetwork({
    id: "demo-net",
    depth: 1, rows: 2, cols: 1, layers_per_cell: 1,
    layers: [
        { z: 0, y: 0, x: 0, l: 0, type: "Dense", input_height: 128, output_height: 64, activation: "ReLU" },
        { z: 0, y: 1, x: 0, l: 0, type: "Dense", input_height: 64, output_height: 10, activation: "Linear" }
    ]
});

// Run a forward pass
const input = new Float32Array(128).fill(0.5);
const output = await net.sequentialForward(input);
console.log("Network output:", output);
```

---

## 🌐 WASM & FFI Bridge

The TypeScript SDK communicates with the Go-compiled core via the **Universal C-ABI**. This ensures that complex logic (like NEAT evolution or DNA extraction) remains fast while providing a high-level, idiomatic JS interface.

### Verified Capabilities (v0.74.0)
The isomorphic bridge has been verified through a 36-count diagnostic suite:
- **Core Exports**: 8/8 internal WASM symbols verified.
- **Network Methods**: 16/16 functional wrappers (Forward, DNA, Morph, etc.) passed.
- **NEAT Population**: 8/8 evolutionary logic methods verified.
- **Bit-Perfect Parity**: 0.000000% divergence vs the Go native reference.

---

## 🖼️ Browser Deployment (WebGPU)

When running in the browser, the WASM runtime can automatically detect and utilize **WebGPU** for massive parallel speedups.

```typescript
import { setupWebGPU } from '@openfluke/welvet';

// Initialize WebGPU context
await setupWebGPU();

// Networks created after this point will utilize GPU kernels
// for forward and backward passes.
```

### Performance Tiers
| Environment | Backend | Best For |
| :--- | :--- | :--- |
| **Node.js** | WASM (SIMD) | Backend inference, server-side DNA comparison. |
| **Browser** | WASM + WebGPU | High-performance interactive AI, on-device training. |
| **Mobile Web** | WASM | Lightweight edge execution. |

---

## 🧬 DNA & Evolution in JS

The TypeScript SDK provides full access to the DNA logic:
- **`net.extractDNA()`**: Generates a topological fingerprint.
- **`compareLoomDNA(dnaA, dnaB)`**: Cross-platform similarity score.
- **`createLoomNEATPopulation(id, size, cfg)`**: High-speed evolutionary architecture search.

For more details on the underlying DNA math, see [dna.md](dna.md).

---

## Donate compute (TCP)

Source: https://openfluke.com/docs/donate-compute
Markdown: https://openfluke.com/docs/donate-compute.md

# Donate compute (TCP)

The **`donate_compute_*.go`** files in `poly/` implement an optional **TCP protocol** so a **donor** machine can accept inference-style work from clients on the same network (or loopback). Work is exchanged as **length-prefixed JSON** frames over a single connection — there is no HTTP server inside `poly` for this path.

**Status:** The server’s inference and prompt paths are **stubs** (`stubInfer` / `stubPrompt`) until wired to real model loading, `poly` execution, or subprocess hooks.

---

## Why it exists

- **LAN-friendly**: bind to `0.0.0.0` (or a specific interface) and let another host submit jobs without bundling a separate HTTP stack in `poly`.
- **Two modes** (see below): push weights + token **`infer`**, or **`prompt`**-only against a local LM path advertised in the hello.

---

## Wire format (`donate_compute_framing.go`)

Each message is:

1. **`uint32` length**, little-endian (4 bytes).
2. **UTF-8 JSON object** of exactly that length.

Constants:

| Constant | Value | Meaning |
| :--- | :--- | :--- |
| `DonateComputeDefaultPort` | **17001** | Default listen/dial port (adjacent to construct TCP dev on **17000**). |
| `MaxDonateFrameBytes` | 64 MiB | Maximum single-frame payload; large models use **many** weight chunks, not one giant frame. |

Helpers: **`WriteDonateFrame`**, **`ReadDonateFrame`**.
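
The framing described above can be sketched with the standard library. The helper names below (`writeFrame`, `readFrame`, `encodeHello`, `decodeHello`) and the `hello` struct are hypothetical; the real signatures of `WriteDonateFrame` and `ReadDonateFrame` are not shown in this document and may differ.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// maxFrameBytes mirrors the 64 MiB payload limit described above.
const maxFrameBytes = 64 << 20

// writeFrame emits a 4-byte little-endian length followed by the JSON payload.
func writeFrame(w io.Writer, v any) error {
	payload, err := json.Marshal(v)
	if err != nil {
		return err
	}
	if len(payload) > maxFrameBytes {
		return errors.New("frame too large")
	}
	var hdr [4]byte
	binary.LittleEndian.PutUint32(hdr[:], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err = w.Write(payload)
	return err
}

// readFrame reads one frame and decodes its JSON payload into v.
func readFrame(r io.Reader, v any) error {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return err
	}
	n := binary.LittleEndian.Uint32(hdr[:])
	if n > maxFrameBytes {
		return errors.New("frame too large")
	}
	payload := make([]byte, n)
	if _, err := io.ReadFull(r, payload); err != nil {
		return err
	}
	return json.Unmarshal(payload, v)
}

// hello is a made-up message for the demo; only the "type" field name
// comes from the protocol description.
type hello struct {
	Type string `json:"type"`
}

func encodeHello(h hello) ([]byte, error) {
	var buf bytes.Buffer
	err := writeFrame(&buf, h)
	return buf.Bytes(), err
}

func decodeHello(frame []byte) (hello, error) {
	var h hello
	err := readFrame(bytes.NewReader(frame), &h)
	return h, err
}

func main() {
	frame, err := encodeHello(hello{Type: "hello"})
	if err != nil {
		panic(err)
	}
	fmt.Println(frame[:4]) // [16 0 0 0]: the payload is 16 bytes long
	got, err := decodeHello(frame)
	fmt.Println(got.Type, err) // hello <nil>
}
```

Checking the length against the limit before allocating keeps a malformed or hostile header from forcing a multi-gigabyte allocation.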

---

## Message types (`donate_compute_types.go`)

Version **v1** uses a `"type"` string discriminator. Constants include:

- **`hello`** — first frame from server after connect; client may echo hello.
- **`model_begin`**, **`weights_chunk`**, **`model_commit`**, **`model_status`** — **model_push** upload lifecycle.
- **`infer`**, **`infer_result`** — token-ID jobs against a **mounted** pushed model.
- **`prompt`**, **`prompt_result`** — text jobs for **local_lm** nodes.
- **`queue_status`**, **`error`** — optional / error paths.

Structs (`DonateHello`, `DonateModelBegin`, `DonateWeightsChunk`, `DonateInfer`, `DonatePrompt`, …) mirror the JSON fields.

---

## Server (`donate_compute_server.go`)

**`ServeDonateComputeTCP(opts DonateComputeServerOptions)`** returns a **`net.Listener`**. It:

- Sends a **`DonateHello`** immediately after each accept (mode, role `server`, optional `LocalLmPath`, queue capacity hint).
- Parses frames in a loop per connection.
- Enqueues **`infer`** / **`prompt`** jobs on a **global FIFO channel**; **one worker** drains the queue (serial execution — not N parallel model mounts).
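
The queueing behavior in the last bullet is the standard single-consumer channel pattern in Go. The sketch below is an assumption about the shape of such a queue, not the server's code: `job`, `startWorker`, and `submit` are hypothetical names, and the work function is a placeholder.

```go
package main

import "fmt"

// job carries a payload name and a channel for its result.
type job struct {
	name  string
	reply chan string
}

// startWorker drains the queue with a single goroutine, so jobs run
// one at a time in the order they were enqueued.
func startWorker(queue <-chan job, run func(string) string) {
	go func() {
		for j := range queue {
			j.reply <- run(j.name)
		}
	}()
}

// submit enqueues a job and returns the channel its result arrives on.
func submit(queue chan<- job, name string) <-chan string {
	reply := make(chan string, 1)
	queue <- job{name: name, reply: reply}
	return reply
}

func main() {
	queue := make(chan job, 8) // shared FIFO for all connections
	var order []string
	startWorker(queue, func(name string) string {
		order = append(order, name) // touched only by the worker goroutine
		return "done:" + name
	})

	a := submit(queue, "infer-1")
	b := submit(queue, "prompt-2")
	fmt.Println(<-a, <-b) // done:infer-1 done:prompt-2
	fmt.Println(order)    // [infer-1 prompt-2]
}
```

A single worker trades throughput for simplicity: only one model needs to be resident at a time, and results are produced in submission order.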

**`DonateComputeServerMode`:**

| Mode | Behavior |
| :--- | :--- |
| **`model_push`** | Client sends **`model_begin`** (config JSON + expected weight length), **`weights_chunk`** (base64 slices), **`model_commit`**. Server acknowledges with **`model_status`**. Then client may send **`infer`** with `input_ids` / `max_tokens`. |
| **`local_lm`** | **`infer`** is rejected; client uses **`prompt`** with full text. Server may advertise **`LocalLmPath`** in hello (informational). |

**`CloseDonateListener`** closes the listener.

---

## Client (`donate_compute_client.go`)

- **`DialDonateCompute(addr)`** — TCP dial (default `127.0.0.1:17001` if empty), read server hello, send client hello; returns **`DonateClient`** + **`DonateHello`**.
- **`PutModel(configJSON, weights)`** — stream model for **model_push** nodes.
- **`EnqueueInfer`**, **`EnqueuePrompt`** — send one job and wait for the matching result frame.

---

## Tests

**`donate_compute_test.go`** covers framing and client/server interaction.

---

## Security

**v1 has no TLS and no authentication.** It is intended for **trusted networks** (e.g. same Wi‑Fi lab). Do not expose the raw port to the public Internet without a VPN, SSH tunnel, or application-layer gateway.

---

## File map

| File | Role |
| :--- | :--- |
| `donate_compute_types.go` | v1 constants and JSON structs |
| `donate_compute_framing.go` | Frame encode/decode, default port, size limits |
| `donate_compute_server.go` | TCP server, modes, queue, stubs |
| `donate_compute_client.go` | Dial, `PutModel`, `EnqueueInfer`, `EnqueuePrompt` |
| `donate_compute_test.go` | Tests |

---

## TANHI — UDP layer telemetry

Source: https://openfluke.com/docs/tanhi
Markdown: https://openfluke.com/docs/tanhi.md

# TANHI — UDP layer telemetry

**TANHI** streams **sparse, non-blocking JSON-line events** over **UDP** so external tools (notably the **SoulGlitch → TANHI** HUD) can visualize **per-layer forward/backward** activity, timing, dtypes, and **routing links** (parallel branches / sequential substeps).

Implementation: **`poly/tanhi.go`**. Integration hooks live in **`poly/forward.go`**, **`poly/backward.go`**, and **`poly/wgpu_forward.go`** (GPU transformer path). Optional **Welvet C-ABI** exports: **`welvet/cabi/tanhi_ext.go`**.

---

## Defaults

| Constant / env | Default | Meaning |
| :--- | :--- | :--- |
| **`poly.DefaultTanhiUDPPort`** | **17481** | UDP destination port (IANA unassigned range). |
| Host | `127.0.0.1` | When `TanhiUDPConfig.Host` is empty. |
| Disabled | `nil` / off | `VolumetricNetwork.Tanhi == nil` or `Enabled == false` → no UDP. |

---

## Configuration (`TanhiUDPConfig`)

Set on **`VolumetricNetwork.Tanhi`**:

- **`Enabled`** — master switch.
- **`Host`**, **`Port`** — UDP listener address (engine **sends** to this address).
- **`SendShape`** — include approximate tensor **`shape`** in each event (CPU path uses activations when available; GPU path uses **`TanhiGPULayerShapeHint`** — no readback).

Telemetry is **best-effort**: a buffered queue (**1024** packets); **overflow drops** silently so training/inference never blocks on HUD lag.

---

## Wire format

Each datagram payload is **one JSON object per line** (newline-terminated). Schema version **`v`: `"tanhi1"`**.

Typical fields:

| Field | Meaning |
| :--- | :--- |
| `seq` | Monotonic sequence number |
| `phase` | `"fwd"` or `"bwd"` |
| `idx` | Layer index in traversal (-1 or special indices possible on GPU fused paths) |
| `z`, `y`, `x`, `l` | Volumetric coordinate |
| `layer` | Layer type string |
| `dtype` | Integer dtype code |
| `connections` | Fan-out hint from weight masters (or override for GPU LM head) |
| `t0_ns`, `t1_ns` | Wall-clock nanoseconds around the layer |
| `shape` | Optional shape slice when `SendShape` is true |
| `links` | Optional routing targets for **LayerParallel** / **LayerSequential** (capped, for arc drawing) |
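
A receiver can decode these events with `encoding/json`. The struct below is a sketch built from the field table: the Go types chosen for each field (for example `int64` for `connections`) are assumptions, `links` is kept as raw JSON because its inner structure is not specified here, and the sample line in `main` is invented for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// tanhiEvent mirrors the documented field names. Field types are guesses.
type tanhiEvent struct {
	V           string          `json:"v"`
	Seq         uint64          `json:"seq"`
	Phase       string          `json:"phase"`
	Idx         int             `json:"idx"`
	Z           int             `json:"z"`
	Y           int             `json:"y"`
	X           int             `json:"x"`
	L           int             `json:"l"`
	Layer       string          `json:"layer"`
	DType       int             `json:"dtype"`
	Connections int64           `json:"connections"`
	T0          int64           `json:"t0_ns"`
	T1          int64           `json:"t1_ns"`
	Shape       []int           `json:"shape,omitempty"`
	Links       json.RawMessage `json:"links,omitempty"`
}

// parseEvent decodes one newline-terminated JSON line and checks the
// schema version.
func parseEvent(line []byte) (tanhiEvent, error) {
	var ev tanhiEvent
	if err := json.Unmarshal(line, &ev); err != nil {
		return ev, err
	}
	if ev.V != "tanhi1" {
		return ev, errors.New("unexpected schema version: " + ev.V)
	}
	return ev, nil
}

func main() {
	// Made-up sample datagram payload.
	line := []byte(`{"v":"tanhi1","seq":7,"phase":"fwd","idx":2,"z":0,"y":1,"x":0,"l":0,"layer":"Dense","dtype":1,"t0_ns":1000,"t1_ns":4500}` + "\n")
	ev, err := parseEvent(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(ev.Layer, ev.Phase, ev.T1-ev.T0) // Dense fwd 3500
}
```

Since delivery is best-effort, a consumer can use gaps in `seq` to estimate how many events were dropped.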

---

## SoulGlitch / Glitch CLI

- **`GLITCH_TANHI=1`** — enable when running **`loom/glitch`** interactively (or answer the prompt).
- **`TANHI_HOST`**, **`TANHI_PORT`**, **`TANHI_SHAPE=1`** — override host, port, and shape inclusion (same conventions in **`glitch/measure/*`** harnesses).

Open SoulGlitch **first**; set the listener **port** to match **`TANHI_PORT`** / **17481**.

---

## C-ABI (Welvet)

- **`LoomNetworkTanhiConfigure`** — enable/disable, host C string, port (0 → default **17481**), send-shape flag.
- **`LoomNetworkTanhiDisable`** — clear `Tanhi` on the network handle.
- **`LoomTanhiDefaultPort`** — returns **`DefaultTanhiUDPPort`**.

---

## Security note

UDP telemetry is **localhost-oriented** by default. Pointing **`Host`** at a remote machine sends layer metadata and timing to that address — use only on **trusted networks** when not using **`127.0.0.1`**.

---

## Numerical Types, DType System, and WeightStore

Source: https://openfluke.com/docs/numerical-types
Markdown: https://openfluke.com/docs/numerical-types.md

# Numerical Types, DType System, and WeightStore

This document covers all 21 `DType` values, the `Numeric` generic constraint, the `WeightStore` master/versioned architecture, and the Metamorphosis mechanism that lets a layer switch precision on the fly.

---

## The 21 DTypes

```go
type DType int
```

Every `VolumetricLayer` carries a `DType` field that controls which numerical format its weights are active in. The full set:

```
┌─────┬───────────────┬──────────────────────────────────────────────┐
│ ID  │ Name          │ Description                                  │
├─────┼───────────────┼──────────────────────────────────────────────┤
│  0  │ DTypeFloat64  │ IEEE 754 double (8 bytes per weight)         │
│  1  │ DTypeFloat32  │ Standard single (4 bytes) — Master baseline  │
│  2  │ DTypeFloat16  │ 16-bit float (simulated, stored as f32)      │
│  3  │ DTypeBFloat16 │ Brain Float: 8 exp bits, 7 mantissa          │
│  4  │ DTypeFP8E4M3  │ 8-bit FP, 4-exponent 3-mantissa             │
│  5  │ DTypeFP8E5M2  │ 8-bit FP, 5-exponent 2-mantissa             │
│  6  │ DTypeInt64    │ 64-bit signed integer                        │
│  7  │ DTypeInt32    │ 32-bit signed integer                        │
│  8  │ DTypeInt16    │ 16-bit signed integer                        │
│  9  │ DTypeInt8     │ 8-bit signed integer (0.625–1.0 B/weight)   │
│ 10  │ DTypeUint64   │ 64-bit unsigned integer                      │
│ 11  │ DTypeUint32   │ 32-bit unsigned integer                      │
│ 12  │ DTypeUint16   │ 16-bit unsigned integer                      │
│ 13  │ DTypeUint8    │ 8-bit unsigned integer                       │
│ 14  │ DTypeInt4     │ 4-bit signed (2 weights per byte)           │
│ 15  │ DTypeUint4    │ 4-bit unsigned (2 weights per byte)         │
│ 16  │ DTypeFP4      │ 4-bit floating point E2M1 (2 per byte)     │
│ 17  │ DTypeInt2     │ 2-bit signed (4 weights per byte)           │
│ 18  │ DTypeUint2    │ 2-bit unsigned (4 weights per byte)         │
│ 19  │ DTypeTernary  │ 2-bit ternary: -1, 0, +1                    │
│ 20  │ DTypeBinary   │ 1-bit XNOR-Net (8 weights per byte)        │
└─────┴───────────────┴──────────────────────────────────────────────┘
```

### Storage Size per Weight

```
┌────────────────────────────────────────────────────────┐
│  DType        Bits/weight   Bytes/1024 weights          │
├────────────────────────────────────────────────────────┤
│  Float64      64            8192                        │
│  Float32      32            4096                        │
│  Float16      16            2048                        │
│  BFloat16     16            2048                        │
│  FP8E4M3      8             1024                        │
│  FP8E5M2      8             1024                        │
│  Int8/Uint8   8             1024                        │
│  Int4/Uint4   4              512   (2 per byte)         │
│  FP4          4              512   (2 per byte)         │
│  Int2/Uint2   2              256   (4 per byte)         │
│  Ternary      2              256   (4 per byte)         │
│  Binary       1              128   (8 per byte) ← 96.9% │
│                                    compression vs FP32  │
└────────────────────────────────────────────────────────┘
```

### Parsing DTypes from Strings

`ParseDType(s string) DType` accepts aliases:

| Input strings | Result |
|:-------------|:-------|
| `"float32"`, `"fp32"`, `"f32"` | `DTypeFloat32` |
| `"bfloat16"`, `"bf16"` | `DTypeBFloat16` |
| `"fp8e4m3"`, `"fp8"` | `DTypeFP8E4M3` |
| `"int4"` | `DTypeInt4` |
| `"fp4"`, `"f4"` | `DTypeFP4` |
| `"ternary"` | `DTypeTernary` |
| `"binary"` | `DTypeBinary` |

---

## The `Numeric` Constraint

```go
type Numeric interface {
    ~int | ~int8 | ~int16 | ~int32 | ~int64 |
        ~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 |
        ~float32 | ~float64
}
```

This constraint makes `Tensor[T]`, `DispatchLayer[T]`, `ForwardPolymorphic[T]`, and all other generic functions work across any of Go's numeric primitives. The constraint is deliberately limited to types the compiler can generate native arithmetic for—no reflection, no `interface{}` boxing at the hot path.
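
To show what the constraint buys, here is a small generic function that compiles to native arithmetic for each element type. `Dot` is an illustrative helper written for this example, not a function from the `poly` package.

```go
package main

import "fmt"

// Numeric is the constraint shown above.
type Numeric interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 |
		~float32 | ~float64
}

// Dot computes the dot product over the shorter of the two slices.
// The accumulator has type T, so narrow integer types can overflow.
func Dot[T Numeric](a, b []T) T {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	var sum T
	for i := 0; i < n; i++ {
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	fmt.Println(Dot([]int8{1, 2, 3}, []int8{4, 5, 6}))    // 32
	fmt.Println(Dot([]float32{0.5, 0.5}, []float32{2, 4})) // 3
}
```

The same source serves `int8` and `float32` callers without reflection or interface boxing, which is the property the constraint is meant to guarantee.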

> [!NOTE]
> FP4, FP8, BFloat16, and other non-native types are **simulated** via PTQ. Weights are stored as `float32` masters and quantized to the target dtype at GPU upload time via `MorphToFloat32ForGPU` (quantize → dequantize round-trip). On GPU, Dense/SwiGLU/MHA use native packed payloads in WGSL shaders; other layer types receive the pre-simulated float32 buffer.
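
To illustrate what the constraint buys, here is a small sketch of one generic function compiled for several primitive types. The `Dot` helper is hypothetical and only stands in for the kind of hot loop the engine's generic functions contain:

```go
package main

import "fmt"

// Numeric mirrors the constraint shown above.
type Numeric interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 |
		~float32 | ~float64
}

// Dot is a hypothetical helper (not part of Loom). The compiler generates
// native multiply/add code for each T, with no interface boxing.
func Dot[T Numeric](a, b []T) T {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	var sum T
	for i := 0; i < n; i++ {
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	fmt.Println(Dot([]int8{1, 2, 3}, []int8{4, 5, 6}))
	fmt.Println(Dot([]float32{0.5, 2}, []float32{4, 0.25}))
}
```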

---

## The WeightStore

```go
type WeightStore struct {
    Master     []float32          // Source of truth — always FP32
    Versions   map[DType]any      // Cached conversions (e.g., []int8 for INT8)
    GPUWeights map[DType]any      // VRAM-resident wgpu.Buffer references
    GPUScales  map[DType]*wgpu.Buffer  // Per-block scale buffers for quantized types
    Scale      float32            // Global quantization scale factor
}
```

The `Master` slice is allocated with `AlignedFloat32(n)`, which aligns it to a 64-byte boundary (one cache line on most CPUs), so SIMD loads up to 512 bits wide can be issued on aligned addresses.

### Creating and Initializing

```go
ws := NewWeightStore(inputSize * outputSize)
ws.Scale = 1.0
ws.Randomize(seed, 0.1)  // fills Master with uniform [-0.1, 0.1]
```

After `Randomize`, both the `Versions` and `GPUWeights` maps are cleared, so no stale low-bit version survives.

### The Morphic Version System

```
WeightStore.Morph(dtype DType):

  Master (FP32)
       │
       ▼
  DTypeFloat64  ──▶  []float64  (direct cast)
  DTypeBFloat16 ──▶  []float32  (bits masked to 16-bit BF16)
  DTypeInt8     ──▶  []int8     (quantized: int8(v / Scale))
  DTypeInt4     ──▶  []int8     (quantized, stored 1-per-int8)
  DTypeBinary   ──▶  []int8     (sign bit only: +1 or -1)
```

The BFloat16 path uses a bit-masking trick:

```go
u32 := math.Float32bits(wVal)
u32 &= 0xFFFF0000   // zero the lower 16 mantissa bits
return math.Float32frombits(u32)
```

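This keeps the sign bit, the 8-bit exponent, and the top 7 mantissa bits, which is exactly the BFloat16 bit layout. Note that the mask truncates (rounds toward zero) rather than rounding to nearest.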
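
A self-contained version of the masking step makes the precision loss easy to observe. The function name `TruncateToBF16` is hypothetical; the body is the same three lines as the snippet above:

```go
package main

import (
	"fmt"
	"math"
)

// TruncateToBF16 (hypothetical name) keeps the top 16 bits of a float32
// and zeroes the low 16 bits, as in the snippet above.
func TruncateToBF16(v float32) float32 {
	u32 := math.Float32bits(v)
	u32 &= 0xFFFF0000
	return math.Float32frombits(u32)
}

func main() {
	for _, v := range []float32{1.0, 3.14159, -2.5} {
		fmt.Printf("%v -> %v\n", v, TruncateToBF16(v))
	}
}
```

Values that already fit in 7 mantissa bits (such as `1.0` or `-2.5`) pass through unchanged, while `3.14159` drops to `3.140625`, the nearest BFloat16 value toward zero.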

### Metamorphosis: Switching Precision On the Fly

A layer starts life as FP32. Before inference you can call:

```go
layer.WeightStore.Morph(DTypeInt8)
layer.DType = DTypeInt8
```

Now `DenseForwardPolymorphic` will find the `[]int8` version in `Versions[DTypeInt8]` and use the native INT8 fast-path loop. The FP32 master is untouched.

After training (`ApplyGradients`), the master is updated and **all cached versions are automatically purged**:

```go
func (ws *WeightStore) ApplyGradients(gradWeights *Tensor[float32], lr float32) {
    // limit: number of weights to update (its computation is elided in this excerpt)
    for i := 0; i < limit; i++ {
        ws.Master[i] -= lr * gradWeights.Data[i]
    }
    // Stale — force re-quantize on next forward:
    ws.Versions = make(map[DType]any)
    ws.GPUWeights = make(map[DType]any)
}
```

This guarantees the layer never silently uses outdated quantized weights.

```
┌──────────────────────────────────────────────────────────────┐
│                   Metamorphosis Lifecycle                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  NewWeightStore(n)                                           │
│       │                                                      │
│       ▼                                                      │
│  Randomize(seed, scale) ──▶ Master filled, Versions={}      │
│       │                                                      │
│       ▼                                                      │
│  layer.DType = DTypeInt8                                     │
│       │                                                      │
│       ▼                                                      │
│  Forward() ──▶ Morph(DTypeInt8) if Versions[INT8]==nil      │
│       │                 │                                    │
│       │          Versions[DTypeInt8] = []int8{...}          │
│       │                                                      │
│       ▼                                                      │
│  INT8 fast-path arithmetic executes                         │
│       │                                                      │
│       ▼                                                      │
│  ApplyGradients(gW, lr) ──▶ Master updated                  │
│                         ──▶ Versions = {} (cleared)         │
│                                                              │
│  Next Forward() ──▶ Morph(DTypeInt8) again from new Master  │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
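
The lifecycle above can be sketched with a toy store. This is a simplified stand-in, not Loom's `WeightStore`: the `toyStore` type, its string-keyed cache, and the `morphs` counter are all assumptions made for illustration.

```go
package main

import "fmt"

// toyStore is a simplified, hypothetical stand-in for WeightStore
// (INT8 only, no GPU maps).
type toyStore struct {
	Master   []float32
	Versions map[string][]int8
	Scale    float32
	morphs   int // counts how often quantization actually ran
}

// Morph quantizes Master to int8 once and caches the result.
func (s *toyStore) Morph() []int8 {
	if v, ok := s.Versions["int8"]; ok {
		return v // cache hit: no re-quantization
	}
	q := make([]int8, len(s.Master))
	for i, w := range s.Master {
		q[i] = int8(w / s.Scale)
	}
	s.Versions["int8"] = q
	s.morphs++
	return q
}

// ApplyGradients updates Master and drops every cached version.
func (s *toyStore) ApplyGradients(grad []float32, lr float32) {
	for i := range s.Master {
		s.Master[i] -= lr * grad[i]
	}
	s.Versions = make(map[string][]int8)
}

func main() {
	s := &toyStore{Master: []float32{0.5, -0.25}, Versions: map[string][]int8{}, Scale: 0.125}
	fmt.Println(s.Morph(), s.morphs) // first forward: quantizes
	fmt.Println(s.Morph(), s.morphs) // second forward: served from cache
	s.ApplyGradients([]float32{1, 1}, 0.25)
	fmt.Println(s.Morph(), s.morphs) // after training: re-quantized from new Master
}
```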

### Unpacking for Deserialization

When loading a model saved in a low-bit format:

```go
ws.Versions[dtype] = decoded  // e.g., []int8 from bit-packed JSON
ws.Unpack(dtype)              // reconstructs Master: Master[i] = packed[i] * Scale
```

This ensures the FP32 master is always available for gradient-based fine-tuning, even on a model that was serialized in INT4.

---

## MorphToFloat32ForGPU

This is the PTQ simulation path used when uploading weights to the GPU for layers without a dedicated packed shader (CNN1-3, RNN, LSTM, Embedding):

```go
func (ws *WeightStore) MorphToFloat32ForGPU(dtype DType) []float32
```

It calls `ws.Morph(dtype)` to produce the quantized version, then dequantizes back to float32 by multiplying by `ws.Scale`. The GPU shader sees float32 weights that already reflect quantization rounding loss — no new shader needed.

| DType | Round-trip behaviour |
|:------|:---------------------|
| Float32, Float64 | Master returned as-is (no loss) |
| BFloat16 | Upper 16 bits of the float32 word preserved (sign, exponent, top 7 mantissa bits); lower 16 mantissa bits zeroed |
| FP8, Int8, Uint8 | `round(w/scale) * scale` |
| Int4, Uint4, FP4 | `trunc(w/scale) * scale`, range ±7 |
| Int2, Uint2 | 4-level round-trip |
| Ternary | Threshold snap to `{-scale, 0, +scale}` |
| Binary | Sign only: `±scale` |

The `scale` comes from `WeightStore.Scale`, set during `Morph` from the max absolute value of the master weights.
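
As an illustration of the `round(w/scale) * scale` row, here is a minimal sketch of an INT8 round-trip. The function name `RoundTripInt8`, the `maxAbs / 127` scale derivation, and the clamp to ±127 are assumptions of this sketch; the text only states that the scale is derived from the maximum absolute master weight.

```go
package main

import (
	"fmt"
	"math"
)

// RoundTripInt8 (hypothetical) simulates INT8 storage in float32:
// quantize with round(w/scale), clamp, then multiply back by scale.
// Deriving scale as maxAbs/127 is an assumption of this sketch.
func RoundTripInt8(master []float32) (out []float32, scale float32) {
	var maxAbs float32
	for _, w := range master {
		if a := float32(math.Abs(float64(w))); a > maxAbs {
			maxAbs = a
		}
	}
	out = make([]float32, len(master))
	if maxAbs == 0 {
		return out, 0 // all-zero weights: nothing to quantize
	}
	scale = maxAbs / 127
	for i, w := range master {
		q := math.Round(float64(w / scale))
		q = math.Max(-127, math.Min(127, q))
		out[i] = float32(q) * scale
	}
	return out, scale
}

func main() {
	out, scale := RoundTripInt8([]float32{1.0, -0.5, 0.003})
	fmt.Println(out, scale)
}
```

Each output differs from its input by at most half a quantization step, and weights smaller than half a step (such as `0.003` here) snap to zero. That is the rounding loss the GPU shader sees.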

---

## The Q4_0 Block Format (GPU Quantization)

For GPU inference, the engine uses the Q4_0 block format, following llama.cpp's Q4_0 scheme of 32-weight blocks with one scale per block:

```
Q4_0Block:
┌────────────────────────────────────────────────────────┐
│  Scale: float32  (4 bytes)                             │
│  Weights: [16]byte  (32 nibbles = 32 × 4-bit weights)  │
│                                                        │
│  Total: 20 bytes for 32 weights = 0.625 bytes/weight   │
└────────────────────────────────────────────────────────┘
```

`QuantizeQ4_0(weights []float32) []Q4_0Block` finds the max absolute value in each block of 32, sets `scale = maxAbs / 7.0`, then quantizes each weight to a signed 4-bit integer (`-8` to `7`) packed two-per-byte.

On the GPU, the WGSL shader receives the packed uint32 array plus the float32 scales array, and dequantizes on the fly inside the shader without a CPU roundtrip.
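
The quantization step can be sketched on the CPU as follows. The struct and the `maxAbs / 7.0` scale come from the description above; the nibble order (even index in the low nibble), the two's-complement nibble encoding, zero-padding of a short final block, and the `DequantizeQ4_0` helper are assumptions of this sketch rather than confirmed details of Loom's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// Q4_0Block holds 32 four-bit weights plus one float32 scale (20 bytes).
type Q4_0Block struct {
	Scale   float32
	Weights [16]byte
}

// QuantizeQ4_0 sketches the per-block quantization described above.
// Nibble order and padding of a short last block are assumptions.
func QuantizeQ4_0(weights []float32) []Q4_0Block {
	blocks := make([]Q4_0Block, (len(weights)+31)/32)
	for b := range blocks {
		end := (b + 1) * 32
		if end > len(weights) {
			end = len(weights)
		}
		chunk := weights[b*32 : end]
		var maxAbs float32
		for _, w := range chunk {
			if a := float32(math.Abs(float64(w))); a > maxAbs {
				maxAbs = a
			}
		}
		if maxAbs == 0 {
			continue // all-zero block stays zero
		}
		scale := maxAbs / 7.0
		blocks[b].Scale = scale
		for i, w := range chunk {
			q := int(math.Round(float64(w / scale)))
			if q < -8 {
				q = -8
			} else if q > 7 {
				q = 7
			}
			nibble := byte(q) & 0x0F // two's-complement 4-bit value
			blocks[b].Weights[i/2] |= nibble << (4 * uint(i%2))
		}
	}
	return blocks
}

// DequantizeQ4_0 (hypothetical helper) recovers weight i as q * scale.
func DequantizeQ4_0(blocks []Q4_0Block, i int) float32 {
	blk := blocks[i/32]
	j := i % 32
	q := int((blk.Weights[j/2] >> (4 * uint(j%2))) & 0x0F)
	if q >= 8 {
		q -= 16 // sign-extend the nibble
	}
	return float32(q) * blk.Scale
}

func main() {
	w := make([]float32, 32)
	for i := range w {
		w[i] = float32(i-16) / 16
	}
	blocks := QuantizeQ4_0(w)
	fmt.Println(len(blocks), blocks[0].Scale, DequantizeQ4_0(blocks, 0), DequantizeQ4_0(blocks, 31))
}
```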

---

## CastWeights

`CastWeights[T Numeric](weights any) []T` is the universal extraction helper. It type-switches on all 10 concrete slice types and uses `ConvertSlice[In, Out]` to re-cast the values into the requested type `T`. When `DispatchLayer` cannot find a dedicated fast-path for the layer's DType, it falls through to `CastWeights` on the pre-quantized `Versions` data.

---

## Bit-Packed Serialization Ratios

From the README, verified across 378 model permutations:

| DType | Bytes/weight (serialized) | vs FP32 |
|:------|:--------------------------|:--------|
| Float32 | 4 | 1.0x |
| Float16 | 2 | 0.5x |
| Int8 | 1 | 0.25x |
| Int4/FP4 | 0.5 | 0.125x |
| Int2/Ternary | 0.25 | 0.0625x |
| Binary | 0.125 | 0.0313x ← **96.9% reduction** |

The packing/unpacking logic lives in `encodeNativeWeights` and `decodeNativeWeights` in `persistence.go`. Binary packs 8 weights per byte using bit shifts; Ternary packs 4 per byte using 2-bit fields; FP4 packs 2 per byte using nibbles.

---

## Layer Reference

Source: https://openfluke.com/docs/layers
Markdown: https://openfluke.com/docs/layers.md

# Layer Reference

This document describes every `LayerType` in `poly/`. For each layer: what it computes, which fields of `VolumetricLayer` configure it, weight layout in the `WeightStore`, and an ASCII data-flow diagram.

---

## LayerType Constants

```go
const (
    LayerDense              LayerType = 0
    LayerMultiHeadAttention LayerType = 1
    LayerSwiGLU             LayerType = 2
    LayerRMSNorm            LayerType = 3
    LayerCNN1               LayerType = 4
    LayerCNN2               LayerType = 5
    LayerCNN3               LayerType = 6
    LayerRNN                LayerType = 7
    LayerLSTM               LayerType = 8
    LayerLayerNorm          LayerType = 9
    LayerConvTransposed1D   LayerType = 10
    LayerConvTransposed2D   LayerType = 11
    LayerConvTransposed3D   LayerType = 12
    LayerEmbedding          LayerType = 13
    LayerKMeans             LayerType = 14
    LayerSoftmax            LayerType = 15
    LayerParallel           LayerType = 16
    LayerSequential         LayerType = 17
    LayerResidual           LayerType = 18
)
```

> [!NOTE]
> There is no explicit `LayerGRU` constant; GRU is implemented in `rnn.go` as a variant of the RNN pattern and is reached through the same dispatcher slot.

---

## Dense (LayerDense = 0)

**What it does:** Fully-connected linear transformation: `output = input × W^T + b`, followed by an activation function. Every input connects to every output.

**Key fields:**

| Field | Meaning |
|:------|:--------|
| `InputHeight` | Number of input features |
| `OutputHeight` | Number of output features |
| `Activation` | One of ReLU, SiLU, GELU, Tanh, Sigmoid, Linear |
| `DType` | Active numerical type |
| `UseTiling` | Enables tiled fast paths where implemented (CPU block tiling, sequential propagation to sub-layers, etc.) |
| `TileSize` | Legacy scalar fallback when per-dtype maps are empty; prefer **`CPUTileSizes`** on CPU and **`GPUSCTileSizes` / `GPUMCTileSizes`** on GPU after `RefreshRuntimeTileSizes()` |
| `EnableMultiCoreTiling` | **GPU:** aligned with `VolumetricNetwork.EnableMultiCoreTiling`; transformer forwards use the network flag to choose **`GetGPUMCTileSize`** vs **`GetGPUSCTileSize`**. **CPU:** often set `true` with training loaders for parity; **does not** switch between two CPU tile maps (only `CPUTileSizes` exists) |

**Weight layout:** `WeightStore.Master` is a flat `[OutputHeight × InputHeight]` row-major matrix. No bias is stored in the Master by default (the polymorphic engine absorbs bias via zero-biased initialization).

```
Input [batch, inputSize]
      │
      ▼
┌─────────────────────────────────────────────┐
│  preAct[b, o] = Σᵢ  input[b, i] × W[o, i]  │
│                                             │
│  W shape: [OutputHeight, InputHeight]       │
└─────────────────────────────────────────────┘
      │
      ▼
   Activation(preAct)
      │
      ▼
Output [batch, outputSize]
```

The tiled variant (`DenseForwardTiled`) loads input tiles into a local buffer and unrolls the dot product 4× to help the compiler auto-vectorize. The INT8 and Binary tiled paths each have their own hot loops in `denseForwardTiledInt8` and `denseForwardTiledBinary`.
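
For reference, the untiled computation in the diagram can be sketched as a plain triple loop. This is an illustrative `float32`-only version with a fixed ReLU activation; the `DenseForward` name and signature are hypothetical and do not reflect Loom's generic `DenseForwardPolymorphic`.

```go
package main

import "fmt"

// DenseForward (hypothetical signature) computes
// preAct[b,o] = sum_i input[b,i] * W[o,i] with W stored row-major as
// [out x in], then applies ReLU to produce postAct.
func DenseForward(input, w []float32, batch, in, out int) (preAct, postAct []float32) {
	preAct = make([]float32, batch*out)
	postAct = make([]float32, batch*out)
	for b := 0; b < batch; b++ {
		for o := 0; o < out; o++ {
			var sum float32
			for i := 0; i < in; i++ {
				sum += input[b*in+i] * w[o*in+i]
			}
			preAct[b*out+o] = sum
			if sum > 0 {
				postAct[b*out+o] = sum // ReLU; negatives stay 0
			}
		}
	}
	return preAct, postAct
}

func main() {
	input := []float32{1, 2}     // batch=1, in=2
	w := []float32{1, 1, -1, -1} // out=2 rows of length in=2
	pre, post := DenseForward(input, w, 1, 2, 2)
	fmt.Println(pre, post)
}
```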

---

## CNN1 / CNN2 / CNN3 (LayerCNN1–3 = 4–6)

**What they do:** Convolutional layers across 1D sequences, 2D images, and 3D volumes respectively. A learnable kernel is slid across the spatial dimensions and a dot product is computed at each position.

**Key fields:**

| Field | Meaning |
|:------|:--------|
| `InputChannels` | Channels in the input |
| `Filters` | Number of output channels (kernels) |
| `KernelSize` | Spatial size (k for CNN1, k×k for CNN2, k×k×k for CNN3) |
| `Stride` | Step between kernel positions |
| `Padding` | Zero-padding added on each side |
| `InputHeight` / `InputWidth` / `InputDepth` | Input spatial dimensions |
| `OutputHeight` / `OutputWidth` / `OutputDepth` | Output spatial dimensions |

**Weight layout:** `Filters × InputChannels × KernelSize^N`

```
CNN2 Data Flow:

Input [batch, inChannels, H, W]
      │
      ▼  slide kernel [f, c, kH, kW] over H, W
┌─────────────────────────────────────────────────────────────┐
│  for each filter f:                                         │
│    for each (oh, ow):                                       │
│      out[b,f,oh,ow] = Σ_c Σ_kh Σ_kw  in[b,c,oh+kh,ow+kw]  │
│                                       × W[f,c,kh,kw]       │
└─────────────────────────────────────────────────────────────┘
      │
      ▼  Activation
Output [batch, Filters, outH, outW]
```

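Output size formula (same for each spatial dimension, using integer division). The data-flow diagram above shows the `Stride = 1`, `Padding = 0` case; in general the kernel reads input position `oh*Stride + kh - Padding` (and likewise for the other axes):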

```
outDim = (inDim + 2*Padding - KernelSize) / Stride + 1
```
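
A quick sketch of the formula as code, useful for checking layer configurations by hand (the `ConvOutDim` helper is hypothetical):

```go
package main

import "fmt"

// ConvOutDim (hypothetical helper) applies the output size formula above
// to one spatial dimension. Go's integer division rounds toward zero.
func ConvOutDim(inDim, kernelSize, stride, padding int) int {
	return (inDim+2*padding-kernelSize)/stride + 1
}

func main() {
	fmt.Println(ConvOutDim(28, 3, 1, 0)) // 3x3 kernel, no padding
	fmt.Println(ConvOutDim(32, 3, 2, 1)) // strided downsample
	fmt.Println(ConvOutDim(5, 3, 1, 1))  // "same" padding keeps the size
}
```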

> [!TIP]
> CNN3 on GPU achieves over 7600x speedup versus CPU tiling because the 3D spatial loop maps perfectly to 3D WebGPU workgroups. Always prefer GPU for CNN3.

---

## ConvTransposed1D / 2D / 3D (LayerConvTransposed1D–3D = 10–12)

**What they do:** Transposed convolution (also called "deconvolution"). It inverts the spatial compression of a regular convolution — used in decoder networks and generative models to upsample feature maps.

**Key fields:** Same as CNN variants plus `OutputPadding` for controlling output dimensions.

**Weight layout:** `InputChannels × Filters × KernelSize^N`

```
ConvTransposed2D conceptual reverse:

  CNN2:  [H, W] ──kernel──▶  [H', W']     (downsample)
  ConvT: [H', W'] ──kernel──▶ [H, W]      (upsample)

  Internal mechanism: insert (Stride-1) zeros between input elements,
  then apply regular convolution with kernel flipped.
```
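
The text does not give an output-size formula for the transposed layers. The sketch below uses the formula common to most deep learning frameworks, `(inDim - 1)*Stride - 2*Padding + KernelSize + OutputPadding`; whether Loom computes it exactly this way is an assumption, and the helper name is hypothetical.

```go
package main

import "fmt"

// ConvTransposeOutDim (hypothetical helper) uses the conventional
// transposed-convolution size formula. That Loom follows this exact
// formula is an assumption of this sketch.
func ConvTransposeOutDim(inDim, kernelSize, stride, padding, outputPadding int) int {
	return (inDim-1)*stride - 2*padding + kernelSize + outputPadding
}

func main() {
	// A stride-2 upsample that doubles the spatial size.
	fmt.Println(ConvTransposeOutDim(16, 3, 2, 1, 1))
}
```

Under this formula, `OutputPadding` resolves the ambiguity that arises because several input sizes map to the same output size in a strided forward convolution.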

---

## RNN (LayerRNN = 7)

**What it does:** Vanilla recurrent network. Processes a sequence step-by-step, feeding the hidden state forward through time.

```
h_t = tanh(x_t × W_ih^T + h_{t-1} × W_hh^T + b_h)
```

**Key fields:**

| Field | Meaning |
|:------|:--------|
| `InputHeight` | Input feature size |
| `OutputHeight` | Hidden state size |
| `SeqLength` | Number of time steps |

**Weight layout in Master:**

```
[  W_ih  |  W_hh  |  b_h  ]
   ihSize   hhSize   hSize
```

Where `ihSize = hiddenSize × inputSize`, `hhSize = hiddenSize × hiddenSize`, `hSize = hiddenSize`.

```
Step 0:          Step 1:          Step t:
 x₀   h₋₁=0     x₁    h₀         xₜ    h_{t-1}
  │      │        │     │           │      │
  └──┬───┘        └──┬──┘           └──┬───┘
     ▼               ▼                 ▼
   [RNN cell]     [RNN cell]        [RNN cell]
     │               │                 │
     ▼               ▼                 ▼
    h₀              h₁               hₜ
```
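
One recurrent step over the flat `Master` layout can be sketched as below. The slicing follows the `[W_ih | W_hh | b_h]` layout above; row-major storage within each matrix and the `RNNStep` signature are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"math"
)

// RNNStep (hypothetical signature) computes
// h_t = tanh(x_t × W_ih^T + h_{t-1} × W_hh^T + b_h)
// from a flat master slice laid out as [W_ih | W_hh | b_h].
// Row-major [hidden x input] and [hidden x hidden] storage is assumed.
func RNNStep(master, x, hPrev []float32, inputSize, hiddenSize int) []float32 {
	ihSize := hiddenSize * inputSize
	hhSize := hiddenSize * hiddenSize
	wih := master[:ihSize]
	whh := master[ihSize : ihSize+hhSize]
	bh := master[ihSize+hhSize : ihSize+hhSize+hiddenSize]

	h := make([]float32, hiddenSize)
	for j := 0; j < hiddenSize; j++ {
		sum := bh[j]
		for i := 0; i < inputSize; i++ {
			sum += x[i] * wih[j*inputSize+i]
		}
		for k := 0; k < hiddenSize; k++ {
			sum += hPrev[k] * whh[j*hiddenSize+k]
		}
		h[j] = float32(math.Tanh(float64(sum)))
	}
	return h
}

func main() {
	master := []float32{0.5, 0.25, 0.1} // W_ih, W_hh, b_h for inputSize=1, hiddenSize=1
	h := []float32{0}                   // h_{-1} = 0
	for t, x := range []float32{2, 0, -1} {
		h = RNNStep(master, []float32{x}, h, 1, 1)
		fmt.Printf("h_%d = %.4f\n", t, h[0])
	}
}
```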

---

## LSTM (LayerLSTM = 8)

**What it does:** Long Short-Term Memory. Adds a cell state `c_t` and three gates (forget, input, output), plus a candidate cell update `g_t`, to control information flow through time. This mitigates the vanishing-gradient problem on long sequences.

**Gate equations:**

```
i_t = σ(x_t × W_i^T + h_{t-1} × U_i^T + b_i)   ← input gate
f_t = σ(x_t × W_f^T + h_{t-1} × U_f^T + b_f)   ← forget gate
g_t = tanh(x_t × W_g^T + h_{t-1} × U_g^T + b_g) ← cell gate
o_t = σ(x_t × W_o^T + h_{t-1} × U_o^T + b_o)   ← output gate
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
```

**Weight layout:** Four gate blocks concatenated:

```
[ W_i | U_i | b_i | W_f | U_f | b_f | W_g | U_g | b_g | W_o | U_o | b_o ]
  ←── gate i ──────────▶ ←── gate f ──────────▶ ...
  gateWeightCount = ihSize + hhSize + hiddenSize
  Total = 4 × gateWeightCount
```

```
                   ┌─────────────────────────────────────┐
    c_{t-1} ──────▶│                                     │──▶ c_t
                   │   Forget ×  +  Input × Cell         │
    h_{t-1} ──────▶│                                     │──▶ h_t
                   │       Output gate × tanh(c_t)       │
    x_t     ──────▶│                                     │
                   └─────────────────────────────────────┘
```

---

## GRU

GRU (Gated Recurrent Unit) is implemented in `rnn.go` alongside the vanilla RNN. It uses two gates (reset and update) and eliminates the separate cell state.

```
z_t = σ(x_t × W_z + h_{t-1} × U_z + b_z)   ← update gate
r_t = σ(x_t × W_r + h_{t-1} × U_r + b_r)   ← reset gate
n_t = tanh(x_t × W_n + (r_t ⊙ h_{t-1}) × U_n + b_n)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ n_t
```

---

## MultiHeadAttention (LayerMultiHeadAttention = 1)

**What it does:** Standard multi-head scaled dot-product attention with optional RoPE positional encoding, Grouped Query Attention (GQA), and a KV cache for autoregressive decoding.

**Key fields:**

| Field | Meaning |
|:------|:--------|
| `DModel` | Model dimension (total embedding size) |
| `NumHeads` | Number of query heads |
| `NumKVHeads` | Number of key/value heads (< NumHeads for GQA/MQA) |
| `HeadDim` | Dimension per head (usually DModel / NumHeads) |
| `SeqLength` | Current sequence length |
| `RoPEFreqBase` | RoPE frequency base (default 10000.0) |
| `MaxSeqLen` | KV cache capacity |
| `KVCacheK` / `KVCacheV` | CPU-side KV cache tensors |
| `KVOffset` | Current filled position in the KV cache |

**Weight layout:**

```
Master = [ Q_W | K_W | V_W | O_W | Q_b | K_b | V_b | O_b ]

  Q_W: [DModel × DModel]
  K_W: [DModel × kvDim]        (kvDim = NumKVHeads × HeadDim)
  V_W: [DModel × kvDim]
  O_W: [DModel × DModel]
  biases follow
```

**Attention computation:**

```
Q = input × Q_W^T + Q_b     [seqLen, DModel]
K = input × K_W^T + K_b     [seqLen, kvDim]
V = input × V_W^T + V_b     [seqLen, kvDim]

Apply RoPE to Q, K (rotate pairs by position-dependent angle)

For each head h:
  q_h = Q[:, h*headDim:(h+1)*headDim]     [seqLen, headDim]
  k_h = K[:, kv_head_idx*headDim:...]     [seqLen, headDim]
  v_h = V[:, kv_head_idx*headDim:...]

  scores = q_h × k_h^T / sqrt(headDim)   [seqLen, seqLen]
  weights = softmax(scores, causal_mask)
  out_h = weights × v_h                   [seqLen, headDim]

output = concat(out_0..out_{numHeads-1}) × O_W^T + O_b
```
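
The pseudocode leaves `kv_head_idx` undefined. Under the usual GQA convention, consecutive query heads share one key/value head; the sketch below assumes that convention and that `NumHeads` is a multiple of `NumKVHeads`. Both helper names are hypothetical.

```go
package main

import "fmt"

// KVHeadIndex (hypothetical helper) maps query head h to the key/value
// head it shares, assuming the common GQA grouping in which
// numHeads/numKVHeads consecutive query heads use the same KV head.
func KVHeadIndex(h, numHeads, numKVHeads int) int {
	groupSize := numHeads / numKVHeads
	return h / groupSize
}

// KVDim returns the width of the K and V projections (kvDim above).
func KVDim(numKVHeads, headDim int) int {
	return numKVHeads * headDim
}

func main() {
	numHeads, numKVHeads, headDim := 8, 2, 64
	for h := 0; h < numHeads; h++ {
		fmt.Printf("query head %d -> kv head %d\n", h, KVHeadIndex(h, numHeads, numKVHeads))
	}
	fmt.Println("kvDim =", KVDim(numKVHeads, headDim))
}
```

With `NumKVHeads == NumHeads` this reduces to standard multi-head attention, and with `NumKVHeads == 1` to multi-query attention.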

---

## SwiGLU (LayerSwiGLU = 2)

**What it does:** Gated feedforward block used in modern LLMs. Two parallel linear projections, one acting as a gate through SiLU activation, combined element-wise before a down projection.

```
gate = SiLU(x × W_gate^T + b_gate)
up   = x × W_up^T + b_up
hidden = gate ⊙ up
output = hidden × W_down^T + b_down
```

**Key fields:** `InputHeight` (in), `OutputHeight` (intermediate/hidden size). The actual output to the next layer is back to `InputHeight` via the down projection.

**Weight layout:**

```
Master = [ W_gate | W_up | W_down | b_gate | b_up | b_down ]
          in×int    in×int  int×in    int      int    in
```

Where `int = OutputHeight` (intermediate size).

```
Input [seqLen, in]
  │
  ├──────────────────────────────────┐
  │                                  │
  ▼                                  ▼
W_gate (in → int)              W_up (in → int)
  │                                  │
SiLU                                 │
  │                                  │
  └──────────── ⊙ (element multiply) ┘
                    │
                    ▼
               W_down (int → in)
                    │
                    ▼
              Output [seqLen, in]
```
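
The block can be sketched for a single input vector as follows. Slicing follows the `Master` layout above; row-major storage of each matrix as `[outputs x inputs]` and the `SwiGLU` signature are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"math"
)

// silu computes x * sigmoid(x).
func silu(v float32) float32 {
	return v / (1 + float32(math.Exp(float64(-v))))
}

// SwiGLU (hypothetical signature) evaluates the block for one vector x.
// master is laid out [W_gate | W_up | W_down | b_gate | b_up | b_down];
// each matrix is assumed row-major as [outputs x inputs].
func SwiGLU(master, x []float32, in, inter int) []float32 {
	n := in * inter
	wGate, wUp, wDown := master[:n], master[n:2*n], master[2*n:3*n]
	bGate := master[3*n : 3*n+inter]
	bUp := master[3*n+inter : 3*n+2*inter]
	bDown := master[3*n+2*inter : 3*n+2*inter+in]

	hidden := make([]float32, inter)
	for j := 0; j < inter; j++ {
		g, u := bGate[j], bUp[j]
		for i := 0; i < in; i++ {
			g += x[i] * wGate[j*in+i]
			u += x[i] * wUp[j*in+i]
		}
		hidden[j] = silu(g) * u // gate ⊙ up
	}
	out := make([]float32, in)
	for i := 0; i < in; i++ {
		sum := bDown[i]
		for j := 0; j < inter; j++ {
			sum += hidden[j] * wDown[i*inter+j]
		}
		out[i] = sum
	}
	return out
}

func main() {
	// in=1, inter=1: W_gate=1, W_up=2, W_down=3, b_gate=0, b_up=0, b_down=0.5
	master := []float32{1, 2, 3, 0, 0, 0.5}
	fmt.Println(SwiGLU(master, []float32{1}, 1, 1))
}
```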

---

## RMSNorm (LayerRMSNorm = 3)

**What it does:** Root Mean Square normalization. Divides each element by the RMS of the vector, then scales by a learned gamma parameter.

```
rms = sqrt( mean(x²) + ε )
output = (x / rms) × γ
```

**Key fields:** `InputHeight` (size), `DType`. **Always kept in FP32 on GPU** — the `SyncToGPU` code explicitly refuses to quantize RMSNorm weights.

**Weight layout:** `Master` is a flat `[InputHeight]` gamma vector (no beta/bias term).
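
A direct transcription of the two formulas above, as a minimal `float32` sketch (the `RMSNorm` function signature is hypothetical):

```go
package main

import (
	"fmt"
	"math"
)

// RMSNorm (hypothetical signature) computes (x / rms) * gamma with
// rms = sqrt(mean(x^2) + eps). There is no mean subtraction and no bias.
func RMSNorm(x, gamma []float32, eps float32) []float32 {
	var sumSq float32
	for _, v := range x {
		sumSq += v * v
	}
	rms := float32(math.Sqrt(float64(sumSq/float32(len(x)) + eps)))
	out := make([]float32, len(x))
	for i, v := range x {
		out[i] = v / rms * gamma[i]
	}
	return out
}

func main() {
	fmt.Println(RMSNorm([]float32{3, 4}, []float32{1, 1}, 1e-6))
}
```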

---

## LayerNorm (LayerLayerNorm = 9)

**What it does:** Layer normalization. Computes mean and variance across the feature dimension, normalizes, then applies learnable gamma and beta.

```
μ = mean(x),  σ² = var(x)
x_hat = (x - μ) / sqrt(σ² + ε)
output = γ ⊙ x_hat + β
```

**Weight layout:** `Master` is `[2 × InputHeight]`: first half is gamma, second half is beta.
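
The sketch below reads gamma and beta from a flat slice in the layout just described. It uses the population variance (divide by N), which is the usual LayerNorm convention and an assumption here; the `LayerNorm` signature is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// LayerNorm (hypothetical signature) normalizes x and applies gamma/beta
// read from master = [gamma | beta]. Population variance is assumed.
func LayerNorm(x, master []float32, eps float32) []float32 {
	n := len(x)
	gamma, beta := master[:n], master[n:2*n]

	var mean float32
	for _, v := range x {
		mean += v
	}
	mean /= float32(n)

	var variance float32
	for _, v := range x {
		d := v - mean
		variance += d * d
	}
	variance /= float32(n)

	inv := 1 / float32(math.Sqrt(float64(variance+eps)))
	out := make([]float32, n)
	for i, v := range x {
		out[i] = gamma[i]*(v-mean)*inv + beta[i]
	}
	return out
}

func main() {
	master := []float32{2, 2, 0.5, 0.5} // gamma = [2, 2], beta = [0.5, 0.5]
	fmt.Println(LayerNorm([]float32{1, 3}, master, 1e-5))
}
```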

---

## Embedding (LayerEmbedding = 13)

**What it does:** Token lookup table. Given a vector of integer token IDs, returns the corresponding rows from the embedding matrix.

**Key fields:** `VocabSize`, `EmbeddingDim`.

**Weight layout:** `[VocabSize × EmbeddingDim]` row-major matrix.

```
Token IDs: [42, 7, 115]
                │
                ▼  lookup rows 42, 7, 115
┌──────────────────────────────────────────────────┐
│  Embedding Table [VocabSize × EmbeddingDim]      │
│                                                  │
│  Row 7:   [0.12, -0.33, 0.87, ...]              │
│  Row 42:  [0.55,  0.11, -0.22, ...]             │
│  Row 115: [-0.01, 0.77, 0.44, ...]              │
└──────────────────────────────────────────────────┘
                │
                ▼
Output [3, EmbeddingDim]  (gradient only applied to used rows)
```

---

## KMeans (LayerKMeans = 14)

**What it does:** Differentiable clustering. Computes soft assignment probabilities (or raw feature distances) between the input and a set of learnable cluster centroids.

**Key fields:**

| Field | Meaning |
|:------|:--------|
| `NumClusters` | K — number of cluster centers |
| `InputHeight` | Feature vector size |
| `KMeansTemperature` | Controls sharpness of soft assignment |
| `KMeansOutputMode` | `"probabilities"` or `"features"` |

**Weight layout:** `[NumClusters × InputHeight]` centroid matrix.

```
Input [batch, featureDim]
      │
      ▼  compute squared distance to each centroid
  dist[b, k] = ||input[b] - centroid[k]||²
      │
      ▼  softmax over negated distances, scaled by temperature
  p[b, k] = softmax(-dist / temperature)
      │
      ▼
Output [batch, NumClusters]  (if mode="probabilities")
    or [batch, featureDim]   (if mode="features")
```
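
The `"probabilities"` path can be sketched for a single input vector as below. The `SoftAssign` name and signature are hypothetical, and subtracting the maximum logit before exponentiating is a standard numerical-stability step that the text does not mention.

```go
package main

import (
	"fmt"
	"math"
)

// SoftAssign (hypothetical signature) returns p[k] = softmax(-dist_k / T)
// where dist_k is the squared Euclidean distance to centroid k.
func SoftAssign(input []float32, centroids [][]float32, temperature float32) []float32 {
	logits := make([]float64, len(centroids))
	maxLogit := math.Inf(-1)
	for k, c := range centroids {
		var dist float64
		for i := range input {
			d := float64(input[i] - c[i])
			dist += d * d
		}
		logits[k] = -dist / float64(temperature)
		if logits[k] > maxLogit {
			maxLogit = logits[k]
		}
	}
	var sum float64
	for k := range logits {
		logits[k] = math.Exp(logits[k] - maxLogit) // stable softmax
		sum += logits[k]
	}
	p := make([]float32, len(logits))
	for k := range logits {
		p[k] = float32(logits[k] / sum)
	}
	return p
}

func main() {
	centroids := [][]float32{{0, 0}, {10, 10}}
	fmt.Println(SoftAssign([]float32{1, 1}, centroids, 1))
	fmt.Println(SoftAssign([]float32{1, 1}, centroids, 100))
}
```

A low temperature pushes the assignment toward a hard one-hot choice of the nearest centroid; a high temperature flattens it toward uniform.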

---

## Softmax (LayerSoftmax = 15)

**What it does:** Normalizes a vector (or matrix rows) into a probability distribution. Has 10 variants controlled by `SoftmaxType`. See [softmax.md](./softmax.md) for the full variant reference.

**Key fields:** `SoftmaxType`, `Temperature`, `SoftmaxRows`, `SoftmaxCols`, `HierarchyLevels`, `EntmaxAlpha`, `Mask`, `GumbelNoise`.

No weights — `WeightStore` is nil for Softmax layers.

---

## Parallel (LayerParallel = 16)

**What it does:** Fans the input to N sub-layers simultaneously and combines their outputs. Supports five combination modes.

**Key fields:**

| Field | Meaning |
|:------|:--------|
| `ParallelBranches` | `[]VolumetricLayer` — the sub-layer definitions |
| `CombineMode` | `"add"`, `"avg"`, `"concat"`, `"filter"`, `"grid_scatter"` |
| `FilterGateConfig` | Optional gate network for MoE routing (filter mode) |

```
            Input
              │
   ┌──────────┼──────────┐
   ▼          ▼          ▼
Branch 0   Branch 1   Branch 2
   │          │          │
   └──────────┼──────────┘
              │
        CombineMode:
        ┌─────────────────────────────────────────┐
        │ "add"         element-wise sum          │
        │ "avg"         element-wise average      │
        │ "concat"      [b0, b1, b2] concatenated │
        │ "filter"      g0 × b0 + g1 × b1 + ...   │
        │ "grid_scatter" same as concat           │
        └─────────────────────────────────────────┘
              │
           Output
```

The `preAct` tensor returned by `ParallelForwardPolymorphic` stores the branch preActs in `preAct.Nested`, enabling correct recursive backpropagation. See [parallel_sequential.md](./parallel_sequential.md).
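
The three simplest combination modes can be sketched on plain slices. This `Combine` function is hypothetical; it omits `"filter"` (which needs the gate network) and `"grid_scatter"`, and assumes equal-length branch outputs for `"add"` and `"avg"`.

```go
package main

import "fmt"

// Combine (hypothetical) merges branch outputs for the "add", "avg" and
// "concat" modes. Other modes return nil in this sketch.
func Combine(mode string, branches [][]float32) []float32 {
	switch mode {
	case "add", "avg":
		out := make([]float32, len(branches[0]))
		for _, b := range branches {
			for i, v := range b {
				out[i] += v
			}
		}
		if mode == "avg" {
			for i := range out {
				out[i] /= float32(len(branches))
			}
		}
		return out
	case "concat":
		var out []float32
		for _, b := range branches {
			out = append(out, b...)
		}
		return out
	}
	return nil
}

func main() {
	branches := [][]float32{{1, 2}, {3, 4}}
	fmt.Println(Combine("add", branches))
	fmt.Println(Combine("avg", branches))
	fmt.Println(Combine("concat", branches))
}
```

Note that `"add"` and `"avg"` preserve the branch output width, while `"concat"` multiplies it by the number of branches, which affects the input size of the next layer.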

---

## Sequential (LayerSequential = 17)

**What it does:** Chains N sub-layers in series. Each sub-layer receives the output of the previous one. The sub-layers can be of any type — this enables composing mini-architectures inside a single grid cell.

**Key fields:** `SequentialLayers []VolumetricLayer`

```
Input
  │
  ▼
Sub-layer 0 ──▶ Sub-layer 1 ──▶ Sub-layer 2
                                      │
                                   Output
```

Each step container stores `[bPre, bInput, bSkip]` in the nested tensor for accurate backward computation through skip connections within the sequence.

---

## Residual (LayerResidual = 18)

**What it does:** Skip connection — adds the input directly to the output of its sub-network.

```
        Input
          │
     ┌────┴────┐
     │         │ skip
     ▼         │
 Sub-layers    │
     │         │
     ▼         │
   ┌───┐       │
   │ + │◀──────┘
   └─┬─┘
     │
   Output = SubLayers(Input) + Input
```

The skip tensor is passed as the second argument to `DispatchLayer` and is added inside `ResidualForwardPolymorphic`. Gradients flow back both through the sub-layers and directly through the skip branch.

---

## Activation Functions

All layers that produce a `preAct` / `postAct` pair apply an activation via `Activate[T](v T, act ActivationType)`:

| Constant | Formula |
|:---------|:--------|
| `ActivationReLU` (0) | `max(0, x)` |
| `ActivationSilu` (1) | `x × σ(x)` |
| `ActivationGELU` (2) | `0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))` |
| `ActivationTanh` (3) | `tanh(x)` |
| `ActivationSigmoid` (4) | `1/(1+e^−x)` |
| `ActivationLinear` (-1) | `x` (identity — no nonlinearity) |

`ActivateDerivative[T]` returns the analytic derivative for backpropagation.
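
A sketch of how such a generic activation could look, using the constant values from the table. Routing every type through `float64` is a simplification of this sketch and not necessarily how Loom's `Activate` is written; for integer `T` the result is truncated on conversion back.

```go
package main

import (
	"fmt"
	"math"
)

type Numeric interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 |
		~float32 | ~float64
}

type ActivationType int

const (
	ActivationReLU    ActivationType = 0
	ActivationSilu    ActivationType = 1
	ActivationGELU    ActivationType = 2
	ActivationTanh    ActivationType = 3
	ActivationSigmoid ActivationType = 4
	ActivationLinear  ActivationType = -1
)

// Activate is an illustrative sketch: it evaluates the formula in float64
// and converts back to T. Linear (and any unknown value) is the identity.
func Activate[T Numeric](v T, act ActivationType) T {
	x := float64(v)
	switch act {
	case ActivationReLU:
		return T(math.Max(0, x))
	case ActivationSilu:
		return T(x / (1 + math.Exp(-x)))
	case ActivationGELU:
		return T(0.5 * x * (1 + math.Tanh(math.Sqrt(2/math.Pi)*(x+0.044715*x*x*x))))
	case ActivationTanh:
		return T(math.Tanh(x))
	case ActivationSigmoid:
		return T(1 / (1 + math.Exp(-x)))
	default:
		return v
	}
}

func main() {
	for _, act := range []ActivationType{ActivationReLU, ActivationSilu, ActivationGELU, ActivationTanh, ActivationSigmoid, ActivationLinear} {
		fmt.Printf("act %2d: f(-1) = %.4f, f(1) = %.4f\n", act, Activate(float32(-1), act), Activate(float32(1), act))
	}
}
```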

---

## Layer Summary Table

| Layer | Parameters | GPU Forward | GPU Backward |
|:------|:-----------|:-----------|:------------|
| Dense | in×out | EXACT | EXACT |
| CNN1 | f×c×k | EXACT | EXACT |
| CNN2 | f×c×k² | EXACT | EXACT |
| CNN3 | f×c×k³ | EXACT | EXACT |
| RNN | ih+hh+b | EXACT | — |
| LSTM | 4×(ih+hh+b) | EXACT | — |
| MHA | 4×d² + biases | BROKEN (dets) | pending |
| SwiGLU | 3×in×int | BROKEN (dets) | not wired |
| RMSNorm | hidden | EXACT | EXACT |
| LayerNorm | 2×hidden | — | — |
| Embedding | vocab×dim | EXACT (DW) | — |
| KMeans | k×dim | — | — |
| Softmax | none | — | — |
| Parallel | per-branch | — | — |
| Sequential | per-layer | — | — |
| Residual | per-sub | — | — |

---

## The Dispatcher Pattern and 3D Coordinate System

Source: https://openfluke.com/docs/dispatch
Markdown: https://openfluke.com/docs/dispatch.md

# The Dispatcher Pattern and 3D Coordinate System

This document explains how `DispatchLayer` and `DispatchLayerBackward` work as runtime jump tables, how the 3D coordinate system maps to `VolumetricLayer` positions, and how `IsRemoteLink` enables spatial hopping across the grid.

---

## Why a Dispatcher?

A naive implementation of a polymorphic neural network would embed a large `switch` inside the forward loop:

```go
// Naive — thread-divergence on GPU, hard to fuse
for _, layer := range layers {
    switch layer.Type {
    case LayerDense:   output = denseForward(layer, input)
    case LayerCNN2:    output = cnn2Forward(layer, input)
    // ...
    }
}
```

M-POLY-VTD separates concerns: the **traversal loop** iterates coordinates, and the **dispatcher** makes the type-specific call. This decoupling is what makes GPU kernel fusion possible in the future — the driver can inspect a group of same-type layers and launch a single batched shader rather than 19 separate ones.

---

## DispatchLayer

```go
func DispatchLayer[T Numeric](
    layer *VolumetricLayer,
    input, skip *Tensor[T],
) (preAct, postAct *Tensor[T])
```

This is a generic function. The type parameter `T` is inferred from `input`. Every call returns two tensors:

- `preAct` — the layer's internal state before the final activation. For Parallel/Sequential layers this carries the nested activation tree in `preAct.Nested`.
- `postAct` — the result of applying the activation function to `preAct`. This is what flows to the next layer.

The full routing table:

```
layer.Type ──switch──▶ function called
───────────────────────────────────────────────────────────────
LayerResidual          ResidualForwardPolymorphic(layer, input, skip)
LayerDense             DenseForwardPolymorphic(layer, input)
LayerCNN1              CNN1ForwardPolymorphic(layer, input)
LayerCNN2              CNN2ForwardPolymorphic(layer, input)
LayerCNN3              CNN3ForwardPolymorphic(layer, input)
LayerRNN               RNNForwardPolymorphic(layer, input)
LayerLSTM              LSTMForwardPolymorphic(layer, input)
LayerMultiHeadAttention MHAForwardPolymorphic(layer, input)
LayerSwiGLU            SwiGLUForwardPolymorphic(layer, input)
LayerRMSNorm           RMSNormForwardPolymorphic(layer, input)
LayerLayerNorm         LayerNormForwardPolymorphic(layer, input)
LayerConvTransposed1D  ConvTransposed1DForwardPolymorphic(layer, input)
LayerConvTransposed2D  ConvTransposed2DForwardPolymorphic(layer, input)
LayerConvTransposed3D  ConvTransposed3DForwardPolymorphic(layer, input)
LayerEmbedding         EmbeddingForwardPolymorphic(layer, input)
LayerKMeans            KMeansForwardPolymorphic(layer, input)
LayerSoftmax           SoftmaxForwardPolymorphic(layer, input)
LayerParallel          ParallelForwardPolymorphic(layer, input)
LayerSequential        SequentialForwardPolymorphic(layer, input)
default                DenseForwardPolymorphic(layer, input)
───────────────────────────────────────────────────────────────
```

---

## DispatchLayerBackward

```go
func DispatchLayerBackward[T Numeric](
    layer *VolumetricLayer,
    gradOutput, input, skip, preAct *Tensor[T],
) (gradInput, gradWeights *Tensor[T])
```

The mirror of `DispatchLayer`. Returns:

- `gradInput` — the gradient to pass to the layer that produced `input` (propagates error upstream)
- `gradWeights` — the gradient for this layer's own weights (used to update `WeightStore.Master`)

The routing table is symmetric to the forward pass. The `skip` argument is used only by `ResidualBackwardPolymorphic`.

---

## The 3D Grid Traversal

`ForwardPolymorphic[T]` iterates the grid in reading order:

```go
for z := 0; z < n.Depth; z++ {
    for y := 0; y < n.Rows; y++ {
        for x := 0; x < n.Cols; x++ {
            for l := 0; l < n.LayersPerCell; l++ {
                idx := n.GetIndex(z, y, x, l)
                layer := &n.Layers[idx]
                // ...
                _, post := DispatchLayer(layer, currentTensor, nil)
                currentTensor = post
            }
        }
    }
}
```

The flattened index formula:

```
idx = z * (Rows * Cols * LayersPerCell)
    + y * (Cols * LayersPerCell)
    + x * (LayersPerCell)
    + l
```

Visually, for a (Depth=1, Rows=2, Cols=3, LayersPerCell=1) network:

```
z=0:
  ┌─────────────┬─────────────┬─────────────┐
  │ (0, 0, 0,0) │ (0, 0, 1,0) │ (0, 0, 2,0) │  ← idx 0,1,2
  │   idx=0     │   idx=1     │   idx=2     │
  ├─────────────┼─────────────┼─────────────┤
  │ (0, 1, 0,0) │ (0, 1, 1,0) │ (0, 1, 2,0) │  ← idx 3,4,5
  │   idx=3     │   idx=4     │   idx=5     │
  └─────────────┴─────────────┴─────────────┘

Data flows: idx=0 ──▶ idx=1 ──▶ idx=2 ──▶ idx=3 ──▶ idx=4 ──▶ idx=5
```
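
The index formula and its inverse can be sketched as below. `GetIndex` mirrors the formula above; the `Grid` type and the `Coords` inverse are hypothetical helpers added for illustration.

```go
package main

import "fmt"

// Grid is a hypothetical stand-in carrying only the four dimensions.
type Grid struct {
	Depth, Rows, Cols, LayersPerCell int
}

// GetIndex flattens (z, y, x, l) using the formula above.
func (g Grid) GetIndex(z, y, x, l int) int {
	return z*(g.Rows*g.Cols*g.LayersPerCell) +
		y*(g.Cols*g.LayersPerCell) +
		x*g.LayersPerCell +
		l
}

// Coords (hypothetical) inverts GetIndex.
func (g Grid) Coords(idx int) (z, y, x, l int) {
	l = idx % g.LayersPerCell
	idx /= g.LayersPerCell
	x = idx % g.Cols
	idx /= g.Cols
	y = idx % g.Rows
	z = idx / g.Rows
	return z, y, x, l
}

func main() {
	g := Grid{Depth: 1, Rows: 2, Cols: 3, LayersPerCell: 1}
	for y := 0; y < g.Rows; y++ {
		for x := 0; x < g.Cols; x++ {
			fmt.Printf("(0,%d,%d,0) -> idx %d\n", y, x, g.GetIndex(0, y, x, 0))
		}
	}
}
```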

`BackwardPolymorphic` walks in reverse (z, y, x, l all reversed), using cached `inputs[idx]` and `preActs[idx]` from the forward pass.

---

## Tiled Traversal

When `n.UseTiling = true`, `ForwardPolymorphic` uses a blocked spatial traversal with tile size 4:

```
for zTile := 0; zTile < Depth; zTile += 4 {
  for yTile := 0; yTile < Rows; yTile += 4 {
    for xTile := 0; xTile < Cols; xTile += 4 {
      // Process 4×4×4 tile of cells
    }
  }
}
```

This is the CPU-side analogue of the GPU workgroup tile strategy. The intent is to improve data locality: all layers in a 4×4×4 spatial neighborhood execute together, keeping their weight data warm in L2/L3 cache.
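
A sketch of the blocked visiting order, with tiles clamped at the grid edges so that every cell is visited exactly once even when a dimension is not a multiple of the tile size. The `TiledOrder` function, the z/y/x order inside a tile, and the omission of the per-cell layer loop are assumptions of this sketch.

```go
package main

import "fmt"

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// TiledOrder (hypothetical) returns the (z, y, x) cell visiting order of
// a blocked traversal. The order inside a tile is an assumption.
func TiledOrder(depth, rows, cols, tile int) [][3]int {
	var order [][3]int
	for zT := 0; zT < depth; zT += tile {
		for yT := 0; yT < rows; yT += tile {
			for xT := 0; xT < cols; xT += tile {
				// Visit one tile, clamped to the grid bounds.
				for z := zT; z < minInt(zT+tile, depth); z++ {
					for y := yT; y < minInt(yT+tile, rows); y++ {
						for x := xT; x < minInt(xT+tile, cols); x++ {
							order = append(order, [3]int{z, y, x})
						}
					}
				}
			}
		}
	}
	return order
}

func main() {
	order := TiledOrder(1, 5, 5, 4)
	fmt.Println(len(order), order[:6])
}
```

In a 5×5 plane the fifth cell visited is `(0, 1, 0)` rather than `(0, 0, 4)`, which shows how the blocked order differs from the plain reading order.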

### SC (single-workgroup) vs MC (multi-workgroup) tiling

There are **two different “tiling” knobs** in `poly`:

1. **`VolumetricNetwork.UseTiling`** (see [Tiled Traversal](#tiled-traversal) above) — spatial blocking of the **3D grid** in `ForwardPolymorphic` (4×4×4 cells). Unrelated to transformer matmul tiles.
2. **Per-layer matmul / GPU workgroup tiling** — `RefreshRuntimeTileSizes()` fills per-dtype maps from layer geometry and (for GPU) `WGPUContext` limits.

#### GPU: two tile maps, configurable SC vs MC

On **GPU**, each layer gets **`GPUSCTileSizes`** and **`GPUMCTileSizes`** (see `refreshRuntimeGPUTileSizes` in `tile_detection.go`). At dispatch, **`VolumetricNetwork.EnableMultiCoreTiling`** chooses which map to use: `GetGPUMCTileSize(dtype)` when `true` (larger, higher-throughput tiles where limits allow) and `GetGPUSCTileSize(dtype)` when `false` (smaller workgroups, friendlier to tight limits). **MC vs SC on GPU is therefore a real switch**: set `EnableMultiCoreTiling` directly, or use **`TrainingModeGPUSC` / `TrainingModeGPUMC`** in `trainBatchWGPU`, which pick tile sizes the same way.

Transformer-style forwards in `wgpu_forward.go` read **`network.EnableMultiCoreTiling`** (not per-layer) for that choice. `WGPUContext.GPUTileSize` is the device-tuned baseline that feeds how those SC/MC maps are built, not the only number used at dispatch.

#### CPU: one tile map (not an SC/MC pair on the layer)

On **CPU**, each layer has a **single** per-dtype map, **`CPUTileSizes`**, via `GetCPUTileSize` — there is **no** `CPUSCTileSizes` / `CPUMCTileSizes` pair. Tiled matmul-style loops (Dense, SwiGLU, CNN, etc.) all use that one size.

`TrainingModeCPUSC` and `TrainingModeCPUMC` exist in the enum (and show up in benchmarks), but **`ConfigureNetworkForMode` applies the same wiring to all CPU modes** (`UseTiling`, `EnableMultiCoreTiling`, `RefreshRuntimeTileSizes`), and **`executeBatchCPU` does not receive the mode** — so there is **no** separate “CPU SC tile path” vs “CPU MC tile path” in the layer maps today. **`EnableMultiCoreTiling` on CPU** is set for consistency with GPU-bound nets and training tooling; it does **not** flip between two CPU tile sizes because only one map exists.

`WGPUContext.GPUTileSize` is the auto-detected base hint (from limits); concrete SC/MC sizes per layer type on GPU live in the two GPU maps, not in that single int alone.
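The selection logic described in this section can be pictured with a minimal sketch. The type `layerTiles`, its methods, and the tile values are hypothetical stand-ins; only the idea (two GPU maps switched by one flag, one CPU map that ignores the flag) comes from the text above.

```go
package main

import "fmt"

// DType is a stand-in for the engine's numeric type enum.
type DType int

const (
    Float32 DType = iota
    Int4
)

// layerTiles mirrors the two GPU maps and the single CPU map described above.
type layerTiles struct {
    GPUSC map[DType]int
    GPUMC map[DType]int
    CPU   map[DType]int
}

// gpuTile picks the MC map when enableMultiCore is true, otherwise the SC map.
func (l layerTiles) gpuTile(d DType, enableMultiCore bool) int {
    if enableMultiCore {
        return l.GPUMC[d]
    }
    return l.GPUSC[d]
}

// cpuTile ignores the flag because only one CPU map exists.
func (l layerTiles) cpuTile(d DType, enableMultiCore bool) int {
    return l.CPU[d]
}

func main() {
    // Tile sizes here are made up for illustration.
    l := layerTiles{
        GPUSC: map[DType]int{Float32: 8},
        GPUMC: map[DType]int{Float32: 16},
        CPU:   map[DType]int{Float32: 32},
    }
    fmt.Println(l.gpuTile(Float32, false), l.gpuTile(Float32, true))
    fmt.Println(l.cpuTile(Float32, false), l.cpuTile(Float32, true))
}
```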

---

## VolumetricLayer: The Coordinate Record

Every `VolumetricLayer` contains its own position:

```go
type VolumetricLayer struct {
    Network     *VolumetricNetwork  // back-pointer
    Type        LayerType
    Activation  ActivationType
    DType       DType
    WeightStore *WeightStore

    Z int  // Depth coordinate
    Y int  // Row coordinate
    X int  // Col coordinate
    L int  // Layer index within cell

    // Spatial Routing
    IsRemoteLink bool
    TargetZ, TargetY, TargetX, TargetL int

    // ... configuration fields
}
```

The `(Z, Y, X, L)` fields are set during `NewVolumetricNetwork` and are the canonical address. `GetLayer(z, y, x, l)` returns a pointer into the flat `Layers` slice using `GetIndex`.
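As a rough illustration of how a four-part address can map to a flat slice index, the sketch below assumes a row-major layout with `l` varying fastest. That layout matches the diagram where `(0, 1, 0, 0)` is `idx=3` in a three-column grid, but the `Grid` type and its methods are hypothetical, not the real `GetIndex`.

```go
package main

import "fmt"

// Grid holds the dimensions of a volumetric grid (illustrative field names).
type Grid struct {
    Depth, Rows, Cols, LayersPerCell int
}

// Index flattens (z, y, x, l), assuming row-major order with l varying fastest.
func (g Grid) Index(z, y, x, l int) int {
    return ((z*g.Rows+y)*g.Cols+x)*g.LayersPerCell + l
}

// Coords inverts Index.
func (g Grid) Coords(idx int) (z, y, x, l int) {
    l = idx % g.LayersPerCell
    idx /= g.LayersPerCell
    x = idx % g.Cols
    idx /= g.Cols
    y = idx % g.Rows
    z = idx / g.Rows
    return
}

func main() {
    g := Grid{Depth: 1, Rows: 2, Cols: 3, LayersPerCell: 1}
    fmt.Println(g.Index(0, 1, 0, 0))
    fmt.Println(g.Coords(5))
}
```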

---

## IsRemoteLink: Spatial Hopping

A layer with `IsRemoteLink = true` does not receive its input from the previous layer in reading order. Instead, it reads from the output of whatever layer lives at `(TargetZ, TargetY, TargetX, TargetL)`.

This enables:

1. **Skip connections** — hop over several layers in the grid
2. **Feedback loops** — target a layer at an *earlier* coordinate (biological recurrence)
3. **Parallel expert routing** — multiple layers at different positions all reading the same source
4. **Cross-depth signals** — connect depth=0 outputs to depth=2 inputs

```
Standard flow:             Remote link (skip):

 (0,0,0) → (0,0,1)          (0,0,0) ────────────────────┐
              │               (0,0,1) → (0,0,2) → ...    │
           (0,0,2)                                        │
              │               (0,2,0) ←── IsRemoteLink ──┘
           (0,0,3)             └── reads output of (0,0,0)

Feedback loop:

 (0,0,0)
    │
 (0,0,1)
    │
 (0,0,2) ─── IsRemoteLink ──▶ TargetZ=0, TargetY=0, TargetX=0
                                (reads from cycle N-1's output
                                 of layer (0,0,0) — step mesh only)
```

In `ForwardPolymorphic`, a remote-linked layer simply receives `currentTensor` like any other layer; remote-link semantics are fully honored only by `StepForward`, which maintains per-layer output buffers across time steps.

In `ParallelForwardPolymorphic` and `SequentialForwardPolymorphic`, remote links are resolved by calling `layer.Network.GetLayer(branch.TargetZ, ...)` and dispatching the resolved layer pointer.
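The input-selection rule behind remote links can be sketched on flat indices. This is a toy model: `Layer.Target` stands in for the `(TargetZ, TargetY, TargetX, TargetL)` tuple, `inputFor` is a hypothetical helper, and scalar outputs replace tensors.

```go
package main

import "fmt"

// Layer is a reduced stand-in for VolumetricLayer.
type Layer struct {
    IsRemoteLink bool
    Target       int // flat index of the source layer (stand-in for TargetZ/Y/X/L)
}

// inputFor returns the value layer idx should consume, given the outputs
// every layer produced on the previous tick.
func inputFor(layers []Layer, prevOut []float64, idx int, external float64) float64 {
    l := layers[idx]
    if l.IsRemoteLink {
        return prevOut[l.Target] // hop to the linked layer's output
    }
    if idx == 0 {
        return external // first layer reads the network input
    }
    return prevOut[idx-1] // default: previous layer in reading order
}

func main() {
    layers := []Layer{{}, {}, {IsRemoteLink: true, Target: 0}}
    prevOut := []float64{10, 20, 30}
    fmt.Println(inputFor(layers, prevOut, 1, 99), inputFor(layers, prevOut, 2, 99))
}
```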

---

## The GPU Dispatch Path

When `n.UseGPU = true`, the training loop calls `ctx.DispatchForwardLayer(l, batchSize, curBuf, preBuf)` instead of `DispatchLayer`. This function is in `wgpu_forward.go` and routes to the appropriate WGSL compute shader based on `l.Type`.

The same dispatcher philosophy applies: one function, one switch, explicit routing. The difference is that inputs and outputs are `*wgpu.Buffer` handles in VRAM rather than `*Tensor[T]` in RAM.

```
trainBatchWGPU:

  BeginFrame()  ← create shared CommandEncoder
     │
     ├── for each layer forward:
     │   └── ctx.DispatchForwardLayer(l, ...) ← records into encoder
     │
     ├── DispatchMSEGradPartialLoss(...)       ← records into encoder
     │
     ├── for each layer backward (reverse):
     │   ├── ctx.DispatchActivationBackward(...)
     │   ├── ctx.DispatchBackwardLayer(l, ...)
     │   └── ctx.DispatchApplyGradients(...)
     │
  FlushFrame()  ← ONE submit for entire forward + backward + weight update
     │
  ReadBuffer(partialsBuf) ← only reads back tiny loss scalars
```

This single-submission design cuts Go-to-GPU driver overhead from 150 or more round trips per batch to a single queue submission.

---

## Disabled Layers

Setting `layer.IsDisabled = true` causes both `ForwardPolymorphic` and `StepForward` to skip the layer entirely. In `StepForward`, a disabled layer passes its input buffer through to `NextBuffer` unchanged. This is the mechanism for implementing sparse MoE expert activation — gate layers can conditionally disable branches.

---

## Training: Forward Pass, Backward Pass, Optimizers, and Learning

Source: https://openfluke.com/docs/training
Markdown: https://openfluke.com/docs/training.md

# Training: Forward Pass, Backward Pass, Optimizers, and Learning

This document covers the full training pipeline: the forward and backward pass mechanics, loss computation, weight update strategies, gradient clipping, Tween, and the `VGStepBP` adaptive rate.

---

## The Training Loop

```go
result, err := poly.Train[float32](network, batches, config)
```

`Train[T Numeric]` is the high-level entry point. It wraps `trainBatchCPU` or `trainBatchWGPU` depending on `config.UseGPU`.

```go
type TrainingConfig struct {
    Epochs       int
    LearningRate float32
    LossType     string   // "mse" ("cross_entropy" is not yet in training.go)
    GradientClip float32  // 0 = no clipping
    Verbose      bool
    UseGPU       bool
    DeviceID     int
    TrackPerf    bool
}
```

A `TrainingBatch[T]` pairs `Input *Tensor[T]` with `Target *Tensor[T]`. Multiple batches are provided as a slice — the loop iterates over batches for each epoch, averages the loss, and prints progress if `Verbose = true`.

---

## Runtime tiling (`ConfigureNetworkForMode`)

Before the training loop runs, `Train` wires the network through `ConfigureNetworkForMode` (`training.go`), which aligns tiling flags with the selected `TrainingMode`:

- **CPU modes** (`TrainingModeCPUNormal`, `TrainingModeCPUSC`, `TrainingModeCPUMC`): **all three are configured the same way** — `EnableMultiCoreTiling = true`, `RefreshRuntimeTileSizes()`, then `UseTiling` and `EnableMultiCoreTiling` on **every** layer. The CPU forward (`executeBatchCPU`) does **not** branch on `TrainingMode`; poly has **one** CPU tile map per layer (`CPUTileSizes`), not separate SC/MC maps. The SC/MC names in the enum are for labeling and benchmarks, not a second tile-size profile on CPU today.
- **GPU modes** (`TrainingModeGPUNormal`, `TrainingModeGPUSC`, `TrainingModeGPUMC`): initializes WebGPU if needed, `RefreshRuntimeTileSizes()`, resets the bind-group cache, `SyncToGPU()`, and ensures FP32 master buffers exist for backward. **`trainBatchWGPU`** uses **`TrainingModeGPUSC`** vs **`TrainingModeGPUMC`** to select **`GetGPUSCTileSize`** vs **`GetGPUMCTileSize`** per layer; **`GPUNormal`** uses untiled or generic dispatch per layer type.

For **interactive inference** (no explicit training mode), toggling **`VolumetricNetwork.EnableMultiCoreTiling`** chooses GPU SC vs MC tile maps (`wgpu_forward.go`), the same underlying maps training uses.

---

## CPU Training: Step by Step

```go
func trainBatchCPU[T Numeric](n *VolumetricNetwork, batch TrainingBatch[T], config *TrainingConfig) float64
```

### 1. Forward Pass with History Capture

```
histIn  [numLayers]*Tensor[T]  ← input to each layer
histPre [numLayers]*Tensor[T]  ← preAct from each layer

curr = batch.Input
for each layer idx:
    histIn[idx] = curr
    pre, post = DispatchLayer(layer, curr, nil)
    histPre[idx] = pre
    curr = post
```

The history arrays are what make backpropagation possible without a tape. Every layer caches what it received and what it produced before activation.
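To see why the two history arrays are enough, here is a toy chain of scalar layers (`pre = w*x`, `post = ReLU(pre)`) with a forward pass that records history and a reverse walk that consumes it. All names are illustrative; the real code operates on tensors through `DispatchLayer` and `DispatchLayerBackward`.

```go
package main

import "fmt"

// scalarLayer is a toy layer: pre = w*x, post = ReLU(pre).
type scalarLayer struct{ w float64 }

func relu(x float64) float64 {
    if x > 0 {
        return x
    }
    return 0
}

// forward runs the chain and records each layer's input and pre-activation.
func forward(layers []scalarLayer, x float64) (out float64, histIn, histPre []float64) {
    histIn = make([]float64, len(layers))
    histPre = make([]float64, len(layers))
    curr := x
    for i, l := range layers {
        histIn[i] = curr
        pre := l.w * curr
        histPre[i] = pre
        curr = relu(pre)
    }
    return curr, histIn, histPre
}

// backward walks the layers in reverse and returns the gradient with respect
// to the network input plus one weight gradient per layer.
func backward(layers []scalarLayer, gradOut float64, histIn, histPre []float64) (gradIn float64, gradW []float64) {
    gradW = make([]float64, len(layers))
    g := gradOut
    for i := len(layers) - 1; i >= 0; i-- {
        if histPre[i] <= 0 {
            g = 0 // ReLU derivative is 0 for non-positive pre-activations
        }
        gradW[i] = g * histIn[i] // needs the cached input
        g *= layers[i].w         // gradient flowing to the previous layer
    }
    return g, gradW
}

func main() {
    layers := []scalarLayer{{w: 2}, {w: 3}}
    out, histIn, histPre := forward(layers, 1)
    gradIn, gradW := backward(layers, 1, histIn, histPre)
    fmt.Println(out, gradIn, gradW)
}
```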

### 2. Loss and Gradient Computation

```
gradOut = ComputeLossGradient(curr, batch.Target, "mse")
lossVal = CalculateLoss(curr, batch.Target, "mse")
```

**MSE loss:**
```
L = (1/N) Σᵢ (output[i] - target[i])²

gradOut[i] = (2/N) × (output[i] - target[i])
```
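A direct transcription of the two formulas, as a sketch on plain slices rather than `Tensor[T]` (the function name `mseLossAndGrad` is hypothetical):

```go
package main

import "fmt"

// mseLossAndGrad returns L = (1/N) * sum((out-target)^2) and
// grad[i] = (2/N) * (out[i]-target[i]).
func mseLossAndGrad(output, target []float64) (loss float64, grad []float64) {
    n := float64(len(output))
    grad = make([]float64, len(output))
    for i := range output {
        d := output[i] - target[i]
        loss += d * d
        grad[i] = 2 * d / n
    }
    return loss / n, grad
}

func main() {
    loss, grad := mseLossAndGrad([]float64{1, 3}, []float64{0, 1})
    fmt.Println(loss, grad)
}
```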

### 3. Backward Pass

```go
_, layerGradients, _ := BackwardPolymorphic(n, gradOut, histIn, histPre)
```

`BackwardPolymorphic` walks the grid in **reverse** order (Z high to low, Y high to low, X high to low, L high to low). At each step:

```
gIn, gW = DispatchLayerBackward(layer, currentGrad, histIn[idx], nil, histPre[idx])
currentGrad = gIn                   ← flows back to previous layer
layerGradients[idx] = {gIn, gW}    ← stored for weight update
```

The backward pass for Dense computes:

```
gradPre[b,o] = gradOutput[b,o] × activation'(preAct[b,o])

gradWeights[o,i] += input[b,i] × gradPre[b,o]   (accumulated over batch)
gradInput[b,i]   += W[o,i] × gradPre[b,o]
```
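The three Dense formulas can be checked with a small sketch on nested slices. The function `denseBackward` and its signature are assumptions made for illustration; the engine's version works on tensors and its own activation enum.

```go
package main

import "fmt"

// linearDeriv is the derivative of the identity activation.
func linearDeriv(float64) float64 { return 1 }

// denseBackward computes weight and input gradients for y = act(W x).
// Shapes: w [out][in], input [batch][in], gradOut and preAct [batch][out].
func denseBackward(w, input, gradOut, preAct [][]float64, actDeriv func(float64) float64) (gradW, gradIn [][]float64) {
    batch, out, in := len(input), len(w), len(w[0])
    gradW = make([][]float64, out)
    for o := range gradW {
        gradW[o] = make([]float64, in)
    }
    gradIn = make([][]float64, batch)
    for b := 0; b < batch; b++ {
        gradIn[b] = make([]float64, in)
        for o := 0; o < out; o++ {
            gPre := gradOut[b][o] * actDeriv(preAct[b][o])
            for i := 0; i < in; i++ {
                gradW[o][i] += input[b][i] * gPre // accumulated over the batch
                gradIn[b][i] += w[o][i] * gPre
            }
        }
    }
    return gradW, gradIn
}

func main() {
    w := [][]float64{{2, 3}}
    input := [][]float64{{1, 4}, {2, 0}}
    gradOut := [][]float64{{1}, {0.5}}
    preAct := [][]float64{{14}, {4}}
    gradW, gradIn := denseBackward(w, input, gradOut, preAct, linearDeriv)
    fmt.Println(gradW, gradIn)
}
```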

### 4. Weight Update

```go
for idx := range n.Layers {
    if layerGradients[idx][1] != nil {
        l := &n.Layers[idx] // layer that owns these gradients (assumes Layers holds values, not pointers)
        gW := ConvertTensor[T, float32](layerGradients[idx][1])
        ApplyRecursiveGradients(l, gW, config.LearningRate)
    }
}
```

`ApplyRecursiveGradients` calls `WeightStore.ApplyGradients(gW, lr)`:

```
Master[i] -= lr × gradWeights[i]
```

After this, all cached `Versions` and `GPUWeights` are cleared, forcing re-quantization on the next forward pass.

`ApplyRecursiveGradients` also recurses into `ParallelBranches` and `SequentialLayers`, using the `Nested` structure of the returned `gradWeights` tensor to route updates to the correct sub-layer.
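The update-then-invalidate behaviour can be pictured with a reduced weight store. The struct below is a hypothetical stand-in: the real `WeightStore` has more fields, and the `Versions` key and value types here are invented for the sketch.

```go
package main

import "fmt"

// weightStore is a reduced stand-in for WeightStore.
type weightStore struct {
    Master   []float32
    Versions map[string][]int8 // cached quantized copies (illustrative types)
}

// applyGradients performs a plain SGD step on the master weights and drops
// every cached version so the next forward pass re-quantizes.
func (ws *weightStore) applyGradients(grad []float32, lr float32) error {
    if len(grad) != len(ws.Master) {
        return fmt.Errorf("gradient length %d != weight length %d", len(grad), len(ws.Master))
    }
    for i := range ws.Master {
        ws.Master[i] -= lr * grad[i]
    }
    ws.Versions = map[string][]int8{}
    return nil
}

func main() {
    ws := &weightStore{
        Master:   []float32{1, 2},
        Versions: map[string][]int8{"int8": {1, 2}},
    }
    if err := ws.applyGradients([]float32{0.5, -1}, 0.5); err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println(ws.Master, len(ws.Versions))
}
```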

---

## GPU Training: BeginFrame / FlushFrame

The GPU training path batches the entire forward + backward + weight-update into **one command buffer**:

```
ctx.BeginFrame()         ← create shared CommandEncoder
  │
  ├── forward pass: DispatchForwardLayer per layer
  ├── loss grad: DispatchMSEGradPartialLoss
  ├── backward: DispatchActivationBackward + DispatchBackwardLayer per layer
  └── update: DispatchApplyGradients per layer

ctx.FlushFrame()         ← ONE submit + destroy temp uniform bufs
  │
ReadBuffer(partialsBuf) ← only reads back numWG × float32 scalars
```

The loss value is computed from partial sums: the shader runs `numWG = (totalOutput + 255) / 256` workgroups, each of which sums up to 256 elements. The Go side reads back only `numWG` floats rather than the full output tensor.
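A CPU-side sketch of the same reduction shows how the workgroup count and partial sums relate. The function names are hypothetical, and dividing the summed partials by `N` on the host is an assumption about how the final loss is formed.

```go
package main

import "fmt"

const wgSize = 256 // elements summed per workgroup

// partialSquaredErrors emulates the per-workgroup partial sums on the CPU.
func partialSquaredErrors(pred, target []float32) []float32 {
    numWG := (len(pred) + wgSize - 1) / wgSize
    partials := make([]float32, numWG)
    for wg := 0; wg < numWG; wg++ {
        start := wg * wgSize
        end := start + wgSize
        if end > len(pred) {
            end = len(pred) // last workgroup may be partial
        }
        for i := start; i < end; i++ {
            d := pred[i] - target[i]
            partials[wg] += d * d
        }
    }
    return partials
}

// lossFromPartials finishes the reduction on the host (assumed 1/N scaling).
func lossFromPartials(partials []float32, n int) float32 {
    var sum float32
    for _, p := range partials {
        sum += p
    }
    return sum / float32(n)
}

func main() {
    pred := make([]float32, 600)
    target := make([]float32, 600)
    for i := range pred {
        pred[i] = 1
    }
    partials := partialSquaredErrors(pred, target)
    fmt.Println(len(partials), partials, lossFromPartials(partials, len(pred)))
}
```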

GPU weight updates are applied directly in VRAM via `DispatchApplyGradients`, which runs a WGSL shader:

```wgsl
weights[i] -= lr * gradients[i]
```

This means the CPU master weights become stale after GPU training. A `ReadBuffer` + `Unpack` cycle is required if you want to access updated weights on the CPU.

---

## Loss Functions

| `LossType` | Formula | Gradient |
|:-----------|:--------|:---------|
| `"mse"` | `(1/N) Σ (out-target)²` | `(2/N)(out-target)` |
| `"cross_entropy"` | (not yet in `training.go`) | — |

The GPU MSE gradient shader (`DispatchMSEGradPartialLoss`) computes both the gradient tensor and partial sums in a single pass.

---

## Tween (neural target propagation)

**Tween** is the name used in this codebase for layer-local target propagation. In papers it often appears as *target propagation*, *difference target propagation*, or similar. Implementation: `tween.go`.

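In its pure mode, Tween is a gradient-free alternative to backpropagation: it estimates what each layer *should* have produced rather than computing exact chain-rule gradients. A chain-rule mode that reuses backprop gradients is also available.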

### Two Modes

**Chain Rule mode** (`UseChainRule = true`):

```
target = actual + gradient × GradientScale
```

This uses backpropagation to compute gradients, then shifts the target in the gradient direction. It is standard backprop dressed in Tween clothing.

**Pure Tween mode** (`UseChainRule = false`):

```
target[i] = Σⱼ w[i,j] × currentTarget[j] / totalWeight[j]
```

Estimates input targets using weighted importance from the layer's own weights, without computing derivatives. This is the biologically-motivated "local learning" variant. Supported for Dense, RNN, LSTM, MHA, and SwiGLU.

### The TweenState

```go
type TweenState[T Numeric] struct {
    ForwardActs     []*Tensor[T]    // what layers produced
    BackwardTargets []*Tensor[T]    // what they should have produced
    Gradients       []*Tensor[float32]
    LinkBudgets     []float32       // cosine similarity: actual vs target
    Gaps            []float32       // RMS distance: actual vs target
    Config          *TweenConfig
}
```

### Usage Pattern

```go
state := poly.NewTweenState[float32](network, poly.DefaultTweenConfig())
output := poly.TweenForward(network, state, input)
poly.TweenBackward(network, state, target)
state.CalculateLinkBudgets()
poly.ApplyTweenGaps(network, state, lr)
```

### Link Budget Gating

Before applying any weight update, the engine checks the layer's `LinkBudget` (cosine similarity between actual output and backward target, normalized to [0,1]):

```
if budget < 0.2 {
    skip update  // prevent corrupting "dead" layers
}
layerRate = lr × (0.5 + budget × 0.5)  // good signal = higher rate
```

This prevents gradient corruption in layers where the signal has been destroyed.
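The gating rule can be sketched as two small functions. The mapping of cosine similarity to [0,1] via `(cos + 1) / 2` is an assumption (the text only says "normalized"), and `linkBudget` and `gatedRate` are hypothetical names.

```go
package main

import (
    "fmt"
    "math"
)

// linkBudget returns the cosine similarity of actual vs target, mapped from
// [-1,1] to [0,1]. The (cos+1)/2 mapping is an assumption of this sketch.
func linkBudget(actual, target []float64) float64 {
    var dot, na, nt float64
    for i := range actual {
        dot += actual[i] * target[i]
        na += actual[i] * actual[i]
        nt += target[i] * target[i]
    }
    if na == 0 || nt == 0 {
        return 0 // no signal at all
    }
    cos := dot / (math.Sqrt(na) * math.Sqrt(nt))
    return (cos + 1) / 2
}

// gatedRate applies the 0.2 threshold and the 0.5..1.0 rate scaling.
func gatedRate(lr, budget float64) (rate float64, apply bool) {
    if budget < 0.2 {
        return 0, false // skip the update for this layer
    }
    return lr * (0.5 + budget*0.5), true
}

func main() {
    b := linkBudget([]float64{1, 0}, []float64{0, 1})
    rate, ok := gatedRate(1, b)
    fmt.Println(b, rate, ok)
}
```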

---

## VGStepBP Adaptive Rate

The README mentions `VGStepBP` (Variable Gradient Step Backpropagation) as an adaptive rate calculation. This integrates with the Tween `DepthScaleFactor` field:

```go
DepthScaleFactor: 1.1   // each deeper layer gets 1.1× the base rate
```

Deeper layers receive slightly higher learning rates to compensate for gradient attenuation through the network depth. This is a simple heuristic that avoids the full computation of per-layer adaptive optimizers.
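One possible reading of this heuristic is that the factor compounds per layer, so layer `d` trains at `base × 1.1^d`. The sketch below encodes that reading only; whether the engine compounds the factor or applies it once is not stated here, and `depthScaledRate` is a hypothetical helper.

```go
package main

import "fmt"

// depthScaledRate returns base * factor^depth, assuming the scale compounds
// once per layer of depth (an assumption of this sketch).
func depthScaledRate(base, factor float64, depth int) float64 {
    rate := base
    for i := 0; i < depth; i++ {
        rate *= factor
    }
    return rate
}

func main() {
    for d := 0; d < 4; d++ {
        fmt.Printf("depth %d: %.4f\n", d, depthScaledRate(0.01, 1.1, d))
    }
}
```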

---

## Gradient Explosion Detection

The `GradientClip` field in `TrainingConfig` (when non-zero) clips gradient norms. Additionally, the Tween gap system implicitly detects explosion: if `Gaps[i]` grows very large, the gap-based update `delta = lr × input × gap` will also be large, but the Link Budget gating prevents this from firing if the cosine similarity is low.
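For reference, clipping by L2 norm typically looks like the sketch below. Whether the engine clips per layer or over all gradients together is not specified here, so treat `clipByNorm` as a generic illustration of the technique, with `clip = 0` meaning no clipping as in `TrainingConfig`.

```go
package main

import (
    "fmt"
    "math"
)

// clipByNorm rescales grad in place so its L2 norm does not exceed clip.
// It returns the norm measured before clipping. clip <= 0 disables clipping.
func clipByNorm(grad []float32, clip float32) float32 {
    var sumSq float64
    for _, g := range grad {
        sumSq += float64(g) * float64(g)
    }
    norm := float32(math.Sqrt(sumSq))
    if clip <= 0 || norm <= clip {
        return norm
    }
    scale := clip / norm
    for i := range grad {
        grad[i] *= scale
    }
    return norm
}

func main() {
    grad := []float32{6, 8}
    norm := clipByNorm(grad, 5)
    fmt.Println(norm, grad)
}
```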

The README references "Gradient Explosion Detection & Damping" as a completed feature in the training automation section.

---

## Activation Functions (Forward and Backward)

All activation derivatives are computed analytically in `ActivateDerivative[T]`:

```
ReLU:    dA/dx = 1 if x > 0, else 0
SiLU:    dA/dx = σ(x)(1 + x(1-σ(x)))
GELU:    dA/dx ≈ CDF(x) + x × PDF(x)
Tanh:    dA/dx = 1 - tanh(x)²
Sigmoid: dA/dx = σ(x)(1 - σ(x))
Linear:  dA/dx = 1
```

In the backward pass, `gradOutput` is multiplied elementwise by the activation derivative evaluated at `preAct` before `gradWeights` and `gradInput` are accumulated.
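The derivative table translates into a handful of scalar functions. This sketch uses the exact (erf-based) GELU; the engine may use a tanh approximation instead, and the function names are illustrative rather than the `ActivateDerivative[T]` API.

```go
package main

import (
    "fmt"
    "math"
)

func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

func reluDeriv(x float64) float64 {
    if x > 0 {
        return 1
    }
    return 0
}

// siluDeriv: d/dx [x*sigmoid(x)] = s * (1 + x*(1-s)).
func siluDeriv(x float64) float64 {
    s := sigmoid(x)
    return s * (1 + x*(1-s))
}

// geluDeriv: d/dx [x*CDF(x)] = CDF(x) + x*PDF(x) for the standard normal.
func geluDeriv(x float64) float64 {
    cdf := 0.5 * (1 + math.Erf(x/math.Sqrt2))
    pdf := math.Exp(-x*x/2) / math.Sqrt(2*math.Pi)
    return cdf + x*pdf
}

func tanhDeriv(x float64) float64 {
    t := math.Tanh(x)
    return 1 - t*t
}

func sigmoidDeriv(x float64) float64 {
    s := sigmoid(x)
    return s * (1 - s)
}

func main() {
    for _, x := range []float64{-1, 0, 1} {
        fmt.Printf("x=%v relu'=%v silu'=%.4f gelu'=%.4f tanh'=%.4f sigmoid'=%.4f\n",
            x, reluDeriv(x), siluDeriv(x), geluDeriv(x), tanhDeriv(x), sigmoidDeriv(x))
    }
}
```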

---

## The Full Training Data Flow

```
┌─────────────────────────────────────────────────────────────────┐
│  EPOCH LOOP                                                     │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  BATCH                                                   │   │
│  │                                                         │   │
│  │  batch.Input                                            │   │
│  │       │                                                 │   │
│  │       ▼                                                 │   │
│  │  [Forward Pass]  ──▶  histIn, histPre captured          │   │
│  │       │                                                 │   │
│  │       ▼                                                 │   │
│  │  prediction                                             │   │
│  │       │                                                 │   │
│  │       ▼                                                 │   │
│  │  [Loss + gradOut]  ◀── batch.Target                     │   │
│  │       │                                                 │   │
│  │       ▼                                                 │   │
│  │  [Backward Pass]  ──▶  layerGradients                   │   │
│  │       │                                                 │   │
│  │       ▼                                                 │   │
│  │  [ApplyRecursiveGradients]  ──▶  Master updated         │   │
│  │                                  Versions cleared       │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  LossHistory appended, EpochTimes recorded                     │
└─────────────────────────────────────────────────────────────────┘
```

---

## TrainingResult

```go
type TrainingResult struct {
    FinalLoss   float64
    TotalTime   time.Duration
    LossHistory []float64          // one entry per epoch
    EpochTimes  []time.Duration
}
```

`Train` returns this struct regardless of CPU or GPU path, making it easy to log or compare runs.

---

## GPU Backend: WebGPU (WGPU)

Source: https://openfluke.com/docs/gpu
Markdown: https://openfluke.com/docs/gpu.md

# GPU Backend: WebGPU (WGPU)

This document covers the WebGPU backend: initialization, the `BeginFrame`/`FlushFrame` command batching pattern, the buffer pool and pipeline cache, which layers have GPU support, and the tiling strategy.

---

## Why WebGPU

M-POLY-VTD uses the `github.com/openfluke/webgpu/wgpu` Go bindings for hardware acceleration. WebGPU targets one of these backends depending on the platform:

- **Vulkan** on Windows/Linux
- **Metal** on macOS/iOS
- **DX12** on Windows
- **WebGPU** in browser via WASM

No CUDA, no CGO beyond the wgpu bindings. All shaders are WGSL (WebGPU Shading Language) strings generated at runtime by Go functions in `wgpu_shaders.go`, `wgpu_kernels.go`, and `wgpu_backward_shaders.go`.

---

## WGPUContext

```go
type WGPUContext struct {
    Instance       *wgpu.Instance
    Adapter        *wgpu.Adapter
    Device         *wgpu.Device
    Queue          *wgpu.Queue

    PipelineCache  map[string]*wgpu.ComputePipeline   // keyed by shader source hash
    ActivationPool map[string]*wgpu.Buffer             // named activation buffers
    LayoutCache    map[string]*wgpu.BindGroupLayout
    BindGroupCache map[uint64]*wgpu.BindGroup          // keyed by buffer-set hash

    UniformPool    []*wgpu.Buffer   // pre-allocated uniform buffer pool
    UniformIdx     int

    ActiveEncoder  *wgpu.CommandEncoder   // non-nil during BeginFrame/FlushFrame
    PendingDestroys []*wgpu.Buffer        // temp bufs destroyed after FlushFrame

    GPUTileSize    int    // auto-detected optimal tile size
    Limits         wgpu.Limits
}
```

### Initialization

```go
err := network.InitWGPU()
```

`InitWGPU` performs three WebGPU steps:

1. Create an `Instance` and request a `HighPerformance` `Adapter`
2. Query the default device for its limits, then boost `MaxStorageBufferBindingSize` to 1 GB and `MaxBufferSize` to 2 GB for large embedding tables
3. Request the final `Device` with boosted limits, then auto-detect the optimal `GPUTileSize` from the workgroup storage and invocation limits

```
CalculateOptimalGPUTileSizeFromLimits(
    MaxComputeWorkgroupStorageSize,
    MaxComputeInvocationsPerWorkgroup,
    headDim=64,
) → GPUTileSize (e.g., 8 or 16)
```

After init, call `network.SyncAllToGPU()` to upload all layer weights to VRAM. This also creates GPU KV cache buffers for MHA layers and pre-allocates named activation buffers (`hidden_A`, `hidden_B`, `norm_out`, etc.).

---

## BeginFrame / FlushFrame Pattern

This is the most important design decision in the GPU backend. Instead of submitting a command buffer per layer (which would mean 100+ GPU driver calls per token), all operations are recorded into a single shared encoder:

```
ctx.BeginFrame()
    ← creates ctx.ActiveEncoder
    ← resets ctx.PendingDestroys

    // All Dispatch* calls record into ActiveEncoder:
    ctx.DispatchForwardLayer(...)
    ctx.DispatchActivation(...)
    ctx.DispatchMSEGradPartialLoss(...)
    ctx.DispatchBackwardLayer(...)
    ctx.DispatchApplyGradients(...)

ctx.FlushFrame()
    ← enc.Finish() + Queue.Submit(cmd)
    ← destroys PendingDestroys buffers
    ← resets UniformIdx
```

Temporary uniform buffers (holding layer parameters like `batchSize`, `inputSize`, etc.) must stay alive until `FlushFrame` because the GPU reads them asynchronously. They are collected in `PendingDestroys` and destroyed only after the submit.

`Queue.WriteBuffer` calls (to upload inputs, targets, and zero DW buffers) are **queue-level operations** — they are safe to call between `BeginFrame` and `FlushFrame` because the WebGPU spec guarantees they complete before the encoder submit executes.
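The record-then-submit lifecycle can be modelled without a GPU. The `frame` type below is a mock written for this sketch: it records command names instead of encoding real passes, and it is not the `WGPUContext` API.

```go
package main

import "fmt"

// frame is a mock of the BeginFrame/FlushFrame lifecycle.
type frame struct {
    Commands  []string // recorded dispatches, in order
    pending   []string // temp uniform buffers that must outlive recording
    Destroyed []string // buffers released after submit
    Submits   int
}

// Begin starts a new recording and resets per-frame state.
func (f *frame) Begin() {
    f.Commands = f.Commands[:0]
    f.pending = f.pending[:0]
}

// Dispatch records a command and defers destruction of its uniform buffer.
func (f *frame) Dispatch(name, tempUniform string) {
    f.Commands = append(f.Commands, name)
    f.pending = append(f.pending, tempUniform)
}

// Flush submits everything once, then releases the deferred buffers.
func (f *frame) Flush() {
    f.Submits++
    f.Destroyed = append(f.Destroyed, f.pending...)
    f.pending = f.pending[:0]
}

func main() {
    f := &frame{}
    f.Begin()
    f.Dispatch("forward", "u0")
    f.Dispatch("backward", "u1")
    f.Dispatch("apply_gradients", "u2")
    f.Flush()
    fmt.Println(f.Submits, f.Commands, f.Destroyed)
}
```

The point of the mock is the ordering: no temporary buffer is released until after the single submit.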

---

## Buffer Management

### ActivationPool

Named persistent buffers that survive across frames:

```go
buf := ctx.GetActivationBuffer("hidden_A", size, wgpu.BufferUsageStorage)
```

If a buffer with this name already exists and is large enough, it is reused. Otherwise a new one is created and cached. This avoids per-step allocations during inference.
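The reuse rule can be sketched with a plain map. The `buffer` and `pool` types are mocks for illustration; a real implementation would also release the old GPU buffer when it is replaced by a larger one.

```go
package main

import "fmt"

// buffer is a mock of a GPU buffer; only its size matters here.
type buffer struct{ Size int }

// pool caches buffers by name and counts allocations.
type pool struct {
    bufs   map[string]*buffer
    Allocs int
}

func newPool() *pool { return &pool{bufs: map[string]*buffer{}} }

// Get returns the cached buffer when it is large enough, otherwise it
// allocates a new one and replaces the cache entry.
func (p *pool) Get(name string, size int) *buffer {
    if b, ok := p.bufs[name]; ok && b.Size >= size {
        return b
    }
    b := &buffer{Size: size}
    p.bufs[name] = b
    p.Allocs++
    return b
}

func main() {
    p := newPool()
    a := p.Get("hidden_A", 1024)
    b := p.Get("hidden_A", 1024)
    c := p.Get("hidden_A", 4096)
    fmt.Println(a == b, a == c, p.Allocs)
}
```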

### CreatePersistentBuffer

```go
buf, err := ctx.CreatePersistentBuffer(data []float32, label string)
```

Uploads a `[]float32` to a VRAM storage buffer with `Storage | CopySrc | CopyDst` usage. Used for weight buffers that stay resident across many forward passes.

### ReadBuffer

```go
values, err := ctx.ReadBuffer(buf *wgpu.Buffer)
```

Copies a GPU buffer to a CPU staging buffer, maps it, and returns `[]float32`. This is the only synchronous GPU→CPU roundtrip in the training path; it is called once per batch to read back the partial loss sums.

### BindGroup Cache

`GetBindGroup(pipeline, buffers...)` hashes the pipeline pointer and buffer pointers into a `uint64` key. If a matching `BindGroup` already exists, it is returned without re-creating it. This avoids rebuilding the descriptor set on every frame for stable weight+activation buffer pairs.
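A cache keyed this way might look like the following sketch. It hashes numeric IDs with FNV-1a instead of real pointers, and the `bindGroup` and `cache` types are mocks; the hash function the engine actually uses is not stated here.

```go
package main

import (
    "encoding/binary"
    "fmt"
    "hash/fnv"
)

// bindGroup is a mock of a GPU bind group.
type bindGroup struct{ key uint64 }

// bindGroupKey hashes a pipeline ID and an ordered list of buffer IDs
// into one uint64 (FNV-1a is a choice made for this sketch).
func bindGroupKey(pipelineID uint64, bufferIDs ...uint64) uint64 {
    h := fnv.New64a()
    var b [8]byte
    binary.LittleEndian.PutUint64(b[:], pipelineID)
    h.Write(b[:])
    for _, id := range bufferIDs {
        binary.LittleEndian.PutUint64(b[:], id)
        h.Write(b[:])
    }
    return h.Sum64()
}

// cache returns an existing bind group for a key or creates one.
type cache struct {
    groups  map[uint64]*bindGroup
    Created int
}

func newCache() *cache { return &cache{groups: map[uint64]*bindGroup{}} }

func (c *cache) Get(pipelineID uint64, bufferIDs ...uint64) *bindGroup {
    key := bindGroupKey(pipelineID, bufferIDs...)
    if g, ok := c.groups[key]; ok {
        return g
    }
    g := &bindGroup{key: key}
    c.groups[key] = g
    c.Created++
    return g
}

func main() {
    c := newCache()
    a := c.Get(7, 1, 2)
    b := c.Get(7, 1, 2)
    d := c.Get(7, 1, 3)
    fmt.Println(a == b, a == d, c.Created)
}
```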

---

## Weight Sync Strategies

`SyncToGPU()` on a `VolumetricLayer` uses different strategies depending on layer type and DType:

```
RMSNorm:
    Always uploads FP32 master. Quantization destroys normalization precision.

SwiGLU (FP32):
    Splits Master into Gate, Up, Down slices.
    Uploads three separate persistent buffers.

SwiGLU (INT4 / Q4_0):
    Calls syncQuantizedSwiGLU which quantizes each slice independently.
    Each component gets a scales buffer + packed uint32 buffer.

Dense (INT4 / Q4_0):
    syncQuantizedDense: 32-weight blocks, scale per block, packed nibbles.

MHA (FP32):
    Splits into Q/K/V/O weight buffers at internal DType codes 200/201/202/203.
    Also uploads optional q_norm/k_norm buffers at 204/205 when present.

MHA (INT4):
    syncQuantizedMHA: quantizes each of Q/K/V/O separately.
```

The internal DType codes (100–102 for SwiGLU components, 200–203 for MHA projections, 204–205 for the optional q_norm/k_norm buffers) are a namespacing trick to store multiple named GPU buffers in the single `GPUWeights map[DType]any` without adding new struct fields.
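To make the "32-weight blocks, scale per block, packed nibbles" layout concrete, here is a sketch of a symmetric 4-bit block quantizer. The exact encoding (scale = max|w| / 7, values stored as `q + 8`, eight nibbles per `uint32`) is an assumption of this example and may differ from `syncQuantizedDense`.

```go
package main

import (
    "fmt"
    "math"
)

const blockSize = 32 // weights per quantization block

// quantizeQ4 returns one scale per block and the weights packed as 4-bit
// values, eight per uint32. Missing tail weights are treated as zero.
func quantizeQ4(w []float32) (scales []float32, packed []uint32) {
    blocks := (len(w) + blockSize - 1) / blockSize
    scales = make([]float32, blocks)
    packed = make([]uint32, blocks*blockSize/8)
    for b := 0; b < blocks; b++ {
        start := b * blockSize
        var maxAbs float32
        for i := start; i < start+blockSize && i < len(w); i++ {
            if a := float32(math.Abs(float64(w[i]))); a > maxAbs {
                maxAbs = a
            }
        }
        scale := maxAbs / 7
        scales[b] = scale
        for i := start; i < start+blockSize; i++ {
            q := 0
            if i < len(w) && scale > 0 {
                q = int(math.Round(float64(w[i] / scale)))
                if q > 7 {
                    q = 7
                } else if q < -8 {
                    q = -8
                }
            }
            // Store q+8 so the nibble is unsigned (0..15).
            packed[i/8] |= uint32(q+8) << (4 * uint(i%8))
        }
    }
    return scales, packed
}

// dequantizeQ4 reverses quantizeQ4 for the first n weights.
func dequantizeQ4(scales []float32, packed []uint32, n int) []float32 {
    out := make([]float32, n)
    for i := range out {
        nibble := (packed[i/8] >> (4 * uint(i%8))) & 0xF
        out[i] = float32(int(nibble)-8) * scales[i/blockSize]
    }
    return out
}

func main() {
    w := make([]float32, 32)
    for i := range w {
        w[i] = float32(i%15 - 7)
    }
    scales, packed := quantizeQ4(w)
    fmt.Println(scales, len(packed), dequantizeQ4(scales, packed, 4))
}
```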

---

## Forward Dispatch (wgpu_forward.go)

`ctx.DispatchForwardLayer(l, batchSize, inBuf, outBuf)` routes to the correct WGSL shader. Key functions:

| Function | WGSL kernel | Notes |
|:---------|:------------|:------|
| `DispatchDenseForward` | matmul shader | register-tiled |
| `DispatchRMSNorm` | RMSNorm shader | always FP32 weights |
| `DispatchCNN1Forward` | 1D conv shader | |
| `DispatchCNN2Forward` | 2D conv shader | reported 1826x speedup vs CPU |
| `DispatchCNN3Forward` | 3D conv shader | reported 7602x speedup vs CPU |
| `DispatchRNNForward` | RNN cell shader | |
| `DispatchLSTMForward` | LSTM cell shader | |
| `DispatchEmbedding` | gather shader | |
| `DispatchMHAForward` | Q/K/V + attention | separate kernels |
| `DispatchSwiGLUForward` | gate+up+down | known issue: non-deterministic output |

`DispatchActivation(n, act, inBuf, outBuf)` dispatches a shader that applies ReLU, SiLU, GELU, Tanh, or Sigmoid elementwise over `n` elements.

---

## Backward Dispatch (wgpu_backward_shaders.go)

WGSL shaders for gradient computation:

**Dense DX shader** (`ShaderDenseBackwardDX`):
```wgsl
dx[b, i] = Σ_o  dy[b, o] × W[o, i]

// Implemented as tiled matmul using shared memory tiles:
var<workgroup> dyTile: array<f32, tileSize*tileSize>;
var<workgroup> wTile:  array<f32, tileSize*tileSize>;
```

**Dense DW shader** (`ShaderDenseBackwardDW`):
```wgsl
dW[o, i] = Σ_b  dy[b, o] × x[b, i]
// Uses atomic add for race-free accumulation across batch
```

**CNN DX/DW shaders**: Implement the "strided convolution" backward pass — the input gradient is the transposed convolution of the output gradient with the kernel, and the weight gradient is the correlation of the input with the output gradient.

**Activation backward**: `DispatchActivationBackward` applies the activation derivative elementwise: `gradPre[i] = gradOut[i] × act'(preAct[i])`.

**MSE gradient + partial loss** (`DispatchMSEGradPartialLoss`):
```wgsl
grad[i] = (2.0 / N) × (pred[i] - target[i])
partial[wg] = Σ_{i in group}  (pred[i] - target[i])²
```

**Apply gradients** (`DispatchApplyGradients`):
```wgsl
weights[i] -= lr × dw[i]
```

---

## GPU support: layer × `DType` (one table)

Scope: **`VolumetricLayer.SyncToGPU`** + **`(*WGPUContext).DispatchForwardLayer`** in `poly.go` / `wgpu_kernels.go`. Symbol **`T`** means **`Transformer.ForwardTokenIDsWGPU`** / **`wgpu_forward.go`** (LLM inference) for that layer+dtype, not generic batch dispatch. Activations are **`f32`** WGSL; **`DTypeFloat64`** is coerced to the **`Float32`** weight-buffer path in the `hasSpecialPath` / morph block (see `SyncToGPU`).

| Symbol | Meaning |
|:------:|---------|
| **Y** | **Generic GPU forward OK**: `SyncToGPU` does not skip the `MorphToFloat32ForGPU` upload **or** uses a matching native path (`DispatchDenseQ4` for **Dense+Int4** only; **CNN1** packed when `isCNN1NativeGPUQuantDType`). |
| **T** | **Transformer path only** (`wgpu_forward.go`): QKV/O use **`DispatchDenseQ4`** / **`DispatchDenseI8`**; SwiGLU gate/up may use **`DispatchSwiGLUQ4`**. **Not** correct for generic **`DispatchForwardLayer`** on that dtype (quantized buffers + **`DispatchDense`** / **`DispatchSwiGLUWithActCache`** mismatch). |
| **–** | **Not supported** after vanilla `SyncToGPU` + generic `DispatchForwardLayer` (skipped morph with no valid weight buffer, or packed weights fed to an **`f32`** matmul / SwiGLU shader). |
| **·** | **DType N/A** (no weight tensor for that layer). |

**Dense:** only **`DTypeInt4`** (ID **14**) selects **`DispatchDenseQ4`**. The remaining non-FP64/FP32 dtypes (IDs **2–13** and **15–20**) hit **`hasSpecialPath`** with no quant branch → morph skipped → **–**. Eight-bit dtypes on Dense get **`syncQuantizedDenseI8`** but **`DispatchDenseTiled`** expects **`f32`** layout → **–**. **`ensureGPUFloat32Weights`** (training) can still attach **`GPUWeights[Float32]`** so matmul runs on the **FP32 master** regardless of `l.DType` (not reflected as **Y** here).

| ID | `DType` | Dense | RMSNorm | CNN1 | CNN2 | CNN3 | RNN | LSTM | Embedding | Softmax | MHA | SwiGLU | Residual |
|---:|---------|:-----:|:-------:|:----:|:----:|:----:|:---:|:----:|:---------:|:-------:|:---:|:------:|:--------:|
| 0 | Float64 | Y | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · |
| 1 | Float32 | Y | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · |
| 2 | Float16 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · |
| 3 | BFloat16 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · |
| 4 | FP8 E4M3 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · |
| 5 | FP8 E5M2 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · |
| 6 | Int64 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · |
| 7 | Int32 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · |
| 8 | Int16 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · |
| 9 | Int8 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · |
| 10 | Uint64 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · |
| 11 | Uint32 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · |
| 12 | Uint16 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · |
| 13 | Uint8 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · |
| 14 | Int4 | Y | Y | Y | Y | Y | Y | Y | Y | · | T | T | · |
| 15 | Uint4 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · |
| 16 | FP4 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · |
| 17 | Int2 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · |
| 18 | Uint2 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · |
| 19 | Ternary | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · |
| 20 | Binary | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · |

**CNN1 column:** **Y** = either **`DispatchCNN1Packed`** (dtype in `isCNN1NativeGPUQuantDType`: Int8, Int4, Int2, FP4, Ternary, Binary, FP8×2, Uint8, Uint4, Uint2, Float16, BFloat16, Int16) or **`DispatchCNN1`** on **`MorphToFloat32ForGPU`** otherwise.

**Not in this table:** `LayerLayerNorm`, `LayerConvTransposed*`, `LayerKMeans`, `LayerParallel`, `LayerSequential`, `LayerMetacognition` (no `DispatchForwardLayer` arm). See [numerical_types.md](numerical_types.md) for the **`DType`** enum and **`WeightStore`**.

**GPU training:** `gpuTrainingNeedsCPUFallback` in `training.go` forces a **CPU** optimizer step when the net includes **MHA**, **SwiGLU**, **Dense+Int4**, or **RNN/LSTM** with **Int8/Int4**.

---

## Tiling Strategy

The project uses **Numerical Tiling** to map 3D volumetric layers to GPU workgroups.

### SC (single-workgroup) vs MC (multi-workgroup) profiles

Loom differentiates two dispatch profiles for GPU kernels (attention, dense, SwiGLU, CNN, etc.):

- **SC**: Smaller workgroups / tiles — lower register pressure, friendlier to tight limits (edge GPUs, WASM).
- **MC**: Larger tiles where limits allow — higher throughput on desktop-class GPUs.

At **inference**, transformer-style forwards (`wgpu_forward.go`) choose per-layer tile sizes with `layer.GetGPUSCTileSize(dtype)` vs `layer.GetGPUMCTileSize(dtype)` according to **`VolumetricNetwork.EnableMultiCoreTiling`** (with the same field mirrored on layers when set). That is the primary switch — not `GPUTileSize` alone.

`WGPUContext.GPUTileSize` is still the device-tuned baseline derived from `CalculateOptimalGPUTileSizeFromLimits` and feeds into how SC/MC maps are built in `refreshRuntimeGPUTileSizes`. **GPU training** may ignore the network flag and pick SC vs MC directly via `TrainingModeGPUSC` / `TrainingModeGPUMC` (`training.go`).

**CPU:** poly does **not** expose SC vs MC as two tile maps on the CPU side — layers use **`CPUTileSizes` / `GetCPUTileSize` only**. See the **“GPU: two tile maps…”** and **“CPU: one tile map…”** subsections in [dispatch.md](dispatch.md).

---

## Transformer GPU Forward (wgpu_forward.go)

`Transformer.ForwardTokenIDsWGPU` is the optimized path for LLM inference:

1. If `tokens != nil` and GPU embeddings are loaded, dispatch a gather shader to convert token IDs → hidden states entirely on-GPU
2. `BeginFrame()` — all subsequent ops recorded into one encoder
3. For each transformer block (4 layers: RMSNorm → MHA → RMSNorm → SwiGLU):
   - Dispatch `DispatchRMSNorm`
   - Dispatch Q/K/V projections separately (supports expanded QueryDim)
   - Optional Q/K RMSNorm using q_norm/k_norm buffers
   - Dispatch RoPE rotation
   - Dispatch attention score + softmax
   - Dispatch output projection
   - Add residual
4. Final norm + LM head if on GPU
5. `FlushFrame()` — single submit
6. Read back only the logits (one small buffer)

This path achieves the "260+ tokens/s prefill on M4" figure mentioned in the README.

### Qwen / Expanded-Query Notes

Loom's GPU path now supports architectures where `query_dim != d_model` (for example Qwen3-0.6B with `head_dim=128`, `num_heads=16`, `query_dim=2048`, `d_model=1024`).

Key implementation details:
- MHA shader workgroup width scales with `head_dim` (not hardcoded to 64).
- Q projection and attention output buffers use `query_dim`.
- O projection uses `input=query_dim`, `output=d_model`.
- RMSNorm epsilon is propagated from checkpoint config (`rms_norm_eps`) for parity with CPU.
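The shape arithmetic for the Qwen3-0.6B example can be checked with a few lines. The `attnConfig` type and its methods are hypothetical helpers; K/V projection shapes are left out because they depend on the number of KV heads, which is not given here.

```go
package main

import "fmt"

// attnConfig holds the attention dimensions quoted above.
type attnConfig struct {
    DModel, NumHeads, HeadDim int
}

// shape is an (input, output) pair for a projection matrix.
type shape struct{ In, Out int }

// QueryDim is num_heads * head_dim, which may differ from d_model.
func (c attnConfig) QueryDim() int { return c.NumHeads * c.HeadDim }

// QProj maps d_model to query_dim.
func (c attnConfig) QProj() shape { return shape{c.DModel, c.QueryDim()} }

// OProj maps query_dim back to d_model.
func (c attnConfig) OProj() shape { return shape{c.QueryDim(), c.DModel} }

func main() {
    qwen := attnConfig{DModel: 1024, NumHeads: 16, HeadDim: 128}
    fmt.Println(qwen.QueryDim(), qwen.QProj(), qwen.OProj())
}
```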

---

## The step mesh engine

Source: https://openfluke.com/docs/step
Markdown: https://openfluke.com/docs/step.md

# The step mesh engine

This document covers the `StepState`, `StepForward`, `StepBackward`, and `StepApplyTween` functions that implement a clock-cycle-accurate discrete-time neural mesh.

---

## What is the step mesh?

Standard `ForwardPolymorphic` runs the entire network in one sequential sweep — input enters at coordinate (0,0,0,0) and the final output exits at the last coordinate. This is a **one-shot** pass.

The **Step mesh engine** treats the 3D grid as a living mesh. Each "tick" of the neural clock fires every layer simultaneously. Each layer reads from the previous tick's output buffers and writes to a new set of output buffers. After all layers have fired, the buffers swap. This is classical **double buffering** applied to neural computation.

```
Standard ForwardPolymorphic:

  Input ──▶ L0 ──▶ L1 ──▶ L2 ──▶ L3 ──▶ Output
  (one complete pass per call)


Step mesh (one clock cycle):

  Tick N:                        Tick N+1:
  ┌──────┬──────┬──────┐          ┌──────┬──────┬──────┐
  │ L0   │ L1   │ L2   │          │ L0   │ L1   │ L2   │
  │fires │fires │fires │ ──swap──▶│fires │fires │fires │
  │      │      │      │ buffers  │      │      │      │
  └──────┴──────┴──────┘          └──────┴──────┴──────┘
  All layers process simultaneously    Same pattern
```

The key insight: **every layer in the grid has the opportunity to update its output every clock cycle**, not just when an input happens to flow through it sequentially.
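
A minimal sketch of the double-buffered tick, assuming a toy chain of integer "layers" that each add 1 to what their predecessor held on the previous tick. The `tick` function is hypothetical and is not how `StepForward` is implemented; it only shows the read-old, write-new pattern.

```go
package main

import "fmt"

// tick advances a toy chain of layers by one clock cycle.
// Every layer reads only from cur (the previous tick) and writes only to next.
func tick(cur []int, input int) []int {
	next := make([]int, len(cur))
	for i := range cur {
		src := input // layer 0 reads the injected input
		if i > 0 {
			src = cur[i-1] // previous tick's value, never next[i-1]
		}
		next[i] = src + 1 // stand-in for the layer's computation
	}
	return next // the caller's reassignment plays the role of the buffer swap
}

func main() {
	state := make([]int, 3)
	for t := 1; t <= 3; t++ {
		state = tick(state, 10)
		fmt.Println("tick", t, state)
	}
}
```

In this toy, the injected value needs three ticks to reach the third layer, because each layer only ever sees what its neighbour produced one tick earlier.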

---

## StepState

```go
type StepState[T Numeric] struct {
    LayerData  []*Tensor[T]     // current output of every layer
    NextBuffer []*Tensor[T]     // write target for the current tick

    HistoryIn  [][]*Tensor[T]   // [step][layerIdx] → input to that layer at that step
    HistoryPre [][]*Tensor[T]   // [step][layerIdx] → preAct at that step

    StepCount uint64
    mu        sync.RWMutex

    TweenState *TweenState[T] // optional tween bridge (neural target propagation)
    lastInput *Tensor[T]
}
```

`LayerData[idx]` is what layer `idx` produced in the **previous** clock cycle. `NextBuffer[idx]` is what layer `idx` will produce in the **current** cycle. After the cycle, they swap.

Create with:

```go
state := poly.NewStepState[float32](network)
state.SetInput(inputTensor)  // loads input into LayerData[0]
```

---

## StepForward: One Clock Cycle

```go
elapsed := poly.StepForward(network, state, captureHistory) // captureHistory is a bool
```

Each call advances the mesh by exactly one discrete time step. All layers execute during this one call.

### Sequential Mode (UseTiling = false)

```go
for idx := range n.Layers {
    l := &n.Layers[idx]
    if l.IsDisabled { pass through; continue }

    // Resolve input source
    var input *Tensor[T]
    if l.IsRemoteLink {
        tIdx := n.GetIndex(l.TargetZ, l.TargetY, l.TargetX, l.TargetL)
        input = s.LayerData[tIdx]          // reads from REMOTE layer's output
    } else if idx > 0 {
        input = s.LayerData[idx-1]         // reads from preceding layer
    } else {
        input = s.LayerData[0]             // reads injection point
    }

    pre, post := DispatchLayer(l, input, nil)
    s.NextBuffer[idx] = post
}

// Swap double buffers
copy(s.LayerData, s.NextBuffer)
s.StepCount++
```

### Parallel Tiled Mode (UseTiling = true)

When `n.UseTiling = true`, goroutines process 4×4×4 spatial tiles concurrently:

```go
var wg sync.WaitGroup
for zTile ...:
  for yTile ...:
    for xTile ...:
      wg.Add(1)
      go func(zT, zE, yT, yE, xT, xE int) {
          defer wg.Done()
          for z := zT; z < zE; z++ {
              for y := yT; y < yE; y++ {
                  for x := xT; x < xE; x++ {
                      // dispatch layers in this tile
                  }
              }
          }
      }(...)
wg.Wait()
```

The mutex (`s.mu`) is held for the duration of the sequential path, and for individual history writes in the parallel path. The `NextBuffer` slice is pre-allocated so concurrent writes to different indices are safe.
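
The "different indices are safe" claim can be illustrated with a flat, one-dimensional stand-in for the 3D tile loop. `fireTiles` is a hypothetical name, and doubling a value stands in for dispatching a layer; the point is that the output slice is allocated once and each goroutine owns a disjoint index range.

```go
package main

import (
	"fmt"
	"sync"
)

// fireTiles runs one tick over cur, giving each tile of tileSize indices
// its own goroutine. cur is only read; next is only written at disjoint indices.
func fireTiles(cur []float32, tileSize int) []float32 {
	if tileSize < 1 {
		tileSize = 1
	}
	next := make([]float32, len(cur)) // pre-allocated: no append, no resize
	var wg sync.WaitGroup
	for start := 0; start < len(cur); start += tileSize {
		end := start + tileSize
		if end > len(cur) {
			end = len(cur)
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				next[i] = cur[i] * 2 // stand-in for dispatching a layer
			}
		}(start, end)
	}
	wg.Wait() // all tiles finish before the caller swaps buffers
	return next
}

func main() {
	fmt.Println(fireTiles([]float32{1, 2, 3, 4, 5}, 2))
}
```

No mutex is needed here because no two goroutines touch the same element; shared, growing structures such as the history slices are a different matter, which is why the text above locks around history writes.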

### History Capture

If `captureHistory = true`, each tick appends to `HistoryIn` and `HistoryPre`:

```
After tick N:
  HistoryIn[N][idx]  = what layer idx received
  HistoryPre[N][idx] = preAct that layer idx produced
```

This history is the foundation for `StepBackward` (BPTT) and is required before calling `StepBackward`. It consumes memory proportional to `Steps × Layers × FeatureSize` — use only when training.
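
As a rough back-of-the-envelope aid, the proportionality above can be turned into a byte estimate. `historyBytes` is a hypothetical helper, and it assumes every layer stores the same `featureSize` in both `HistoryIn` and `HistoryPre`, which real networks need not do.

```go
package main

import "fmt"

// historyBytes estimates the memory held by HistoryIn plus HistoryPre.
// Assumptions (not from Loom): every layer stores featureSize elements per
// tick in each of the two histories, and bytesPerElem is the element width.
func historyBytes(steps, layers, featureSize, bytesPerElem int64) int64 {
	const histories = 2 // HistoryIn and HistoryPre
	return histories * steps * layers * featureSize * bytesPerElem
}

func main() {
	b := historyBytes(1000, 64, 1024, 4) // 4 bytes per float32 element
	fmt.Printf("%d bytes (%.0f MiB)\n", b, float64(b)/(1<<20))
}
```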

---

## Spatial Feedback (Remote Links in step mesh mode)

The step mesh engine is where `IsRemoteLink` reaches its full potential. Because `s.LayerData[tIdx]` is always the **previous tick's** output (not the current tick's), a remote link to an earlier coordinate creates genuine recurrence:

```
Tick N-1:
  Layer A (0,0,0) produces output → stored in LayerData[0]

Tick N:
  Layer B (0,2,0) has IsRemoteLink pointing to (0,0,0)
  → Layer B reads LayerData[0]  (from tick N-1, not current tick)
  → Layer B effectively "remembers" what A produced one cycle ago

This is the discrete-time equivalent of an RNN hidden state.
```

```
┌────────────────────────────────────────────────────────────────┐
│  SPATIAL FEEDBACK DIAGRAM                                       │
│                                                                │
│  Tick N-1:    A ──output──▶ LayerData[A]                       │
│                                                                │
│  Tick N:      B ──IsRemoteLink──▶ reads LayerData[A] from N-1  │
│               B produces new output → LayerData[B]             │
│                                                                │
│  Tick N+1:    A reads updated B output if A is also remote     │
│               → Full spatial RNN at mesh scale                 │
└────────────────────────────────────────────────────────────────┘
```
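
The one-tick delay can be seen in a toy two-layer loop. This sketch is hypothetical and uses plain integers instead of tensors: layer 0 adds the external input to what layer 1 held on the previous tick (the remote link), and layer 1 copies what layer 0 held on the previous tick.

```go
package main

import "fmt"

// tick advances a toy two-layer mesh by one clock cycle.
// Both layers read only from cur, the previous tick's outputs.
func tick(cur [2]int, input int) [2]int {
	var next [2]int
	next[0] = input + cur[1] // remote link: reads layer 1 from the previous tick
	next[1] = cur[0]         // ordinary link: reads layer 0 from the previous tick
	return next
}

func main() {
	var state [2]int
	for t := 1; t <= 4; t++ {
		state = tick(state, 1)
		fmt.Println("tick", t, state)
	}
}
```

Even with a constant input, the state keeps growing because each value travels around the two-layer loop and is fed back in, which is the hidden-state behaviour described above.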

---

## StepBackward: BPTT Through the Mesh

```go
gradIn, layerGradients, err := poly.StepBackward(network, state, gradOutput)
```

This implements **Backpropagation Through Time (BPTT)** across the step mesh history. It walks backwards through both time steps and spatial coordinates.

### Algorithm

```
gradBuffers[numLayers-1] = gradOutput   // seed with final error

for step from (numSteps-1) downto 0:
    nextGradBuffers = new zero buffers

    for idx from (numLayers-1) downto 0:
        input = HistoryIn[step][idx]
        pre   = HistoryPre[step][idx]
        grad  = gradBuffers[idx]

        gIn, gW = DispatchLayerBackward(l, grad, input, nil, pre)

        // Accumulate weight gradients across all time steps
        layerGradients[idx][1] += gW   (if exists)

        // Route gIn back to the source of input for this layer
        accumulateMeshGrad(network, nextGradBuffers, idx, gIn)

    gradBuffers = nextGradBuffers

return gradBuffers[0]   // gradient with respect to the initial input
```

`accumulateMeshGrad` determines where to send `gIn`, mirroring how `StepForward` resolves each layer's input:

- If `IsRemoteLink`: send to the `TargetZ/Y/X/L` coordinates
- Else if `idx == 0`: send to the input site
- Otherwise: send to `idx - 1`

This routes gradients through the spatial topology: the target of a remote link accumulates its share of the gradient from every layer that consumed its output.
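
The routing rule can be sketched with scalar gradients and integer indices. `routeGrads` and `remoteTarget` are hypothetical; the real function works on tensors and 4-component coordinates, and the input site is modelled here as a separate accumulator purely for clarity.

```go
package main

import "fmt"

// routeGrads performs one backward sweep of gradient routing.
// remoteTarget[idx] >= 0 marks layer idx as a remote link reading from that
// index; -1 means it reads from its predecessor (or the input site if idx == 0).
func routeGrads(gIn []float32, remoteTarget []int) (next []float32, inputGrad float32) {
	next = make([]float32, len(gIn))
	for idx := len(gIn) - 1; idx >= 0; idx-- {
		switch {
		case remoteTarget[idx] >= 0:
			next[remoteTarget[idx]] += gIn[idx] // back to the remote source
		case idx == 0:
			inputGrad += gIn[idx] // back to the input site
		default:
			next[idx-1] += gIn[idx] // back to the preceding layer
		}
	}
	return next, inputGrad
}

func main() {
	// Layer 2 is a remote link that read from layer 0; layer 1 read from layer 0 too.
	next, in := routeGrads([]float32{1, 2, 3}, []int{-1, -1, 0})
	fmt.Println(next, in)
}
```

Layer 0 was consumed by both layer 1 and the remote link at layer 2, so its buffer accumulates both contributions (2 + 3).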

---

## StepApplyTween

```go
poly.StepApplyTween(network, state, globalTarget, lr)
```

Bridges the step mesh with the `Tween` machinery. At each call:

1. If `state.TweenState == nil`, create a new `TweenState` with `UseChainRule = false` (gap-based learning, which suits a mesh that runs continuously in discrete ticks)
2. Copy current `LayerData` into `tpState.ForwardActs` (the mesh's current "what is" state)
3. Call `TweenBackward(n, tpState, globalTarget)` to compute what each layer *should* produce
4. `CalculateLinkBudgets()` — measure cosine similarity between actual and target at each node
5. `ApplyTweenGaps(n, tpState, lr)` — update weights using the gap signal, gated by link budgets

This enables **online, asynchronous learning** on a live mesh — you can inject a global target at any time and the weights update locally at each node based on their current output gap.

---

## Double Buffer Guarantees

The double buffer swap (`copy(s.LayerData, s.NextBuffer)`) happens after all layers have written to `NextBuffer`. This guarantees:

1. A layer at coordinate (0,0,2) cannot see the output of (0,0,1) from the *current* tick, only from the previous tick
2. Concurrent goroutines in tiled mode write to different indices of `NextBuffer` without conflict
3. Remote links always see stable, previous-tick values regardless of which goroutine happens to fire first

This is the "clock cycle accuracy" mentioned in the README.

## v0.75.0 Stability & Guarding

The Step mesh engine was stabilized in v0.75.0 to support sparse volumetric grids without runtime panics.

### 1. Volumetric Coordinate Guarding

In previous versions, a misconfigured grid cell could lead to a `nil pointer dereference`. In v0.75.0, the dispatcher implements strict guarding:
- **`IsDisabled` Flag**: Every grid cell now defaults to "Disabled" and must be explicitly enabled during network construction via the `poly.VolumetricLayer` configuration.
- **Nil-Safety**: The `DispatchLayer` and `StepForward` loops check this flag before execution, so uninitialized memory in sparse 3D regions does not cause a crash.

### 2. Explicit Coordinate Hopping

Stability is further guaranteed by the enforcement of volumetric coordinates (`z, y, x, l`).
- **Deterministic Routing**: Every connection, whether a standard sequence or a remote `IsRemoteLink`, is resolved to a specific grid coordinate.
- **Grid Consistency**: This ensures that even in massively parallel tiled modes, the signal wavefront remains spatially consistent and bit-perfect across all 21 numerical types.

---

## When to Use the Step mesh engine

Use `StepForward` / `StepApplyTween` when you need:

- **Continuous operation**: the network runs indefinitely, processing new inputs each tick
- **Spatial feedback**: remote links that create mesh-level recurrence
- **Online learning**: weight updates interleaved with forward passes
- **Parallel processing**: the tiled mode can saturate multi-core CPUs

Use `ForwardPolymorphic` / `BackwardPolymorphic` when you need:

- **Batch training**: multiple training examples per weight update
- **GPU acceleration**: the GPU path uses `trainBatchWGPU`, not the step mesh engine
- **Deterministic single-pass inference**: no history overhead

> [!TIP]
> The README's phrase "use `StepForward` and `StepApplyTween` when you need a living network that evolves and learns over time rather than a static pipeline" captures this distinction perfectly.

---

## The DNA Engine: Topological Network Fingerprinting

Source: https://openfluke.com/docs/dna
Markdown: https://openfluke.com/docs/dna.md

# The DNA Engine: Topological Network Fingerprinting

This document covers `ExtractDNA`, `CosineSimilarity`, `CompareNetworks`, `LogicShift` detection, and the recursive signature extraction for all 19 layer types in `dna.go`.

For the **Evolution Engine** (DNA Splice + NEAT mutations), see [evolution.md](evolution.md).

---

## Why DNA?

Standard weight comparison breaks across precisions — you can't directly compare an INT8 weight against an FP32 weight. The DNA engine solves this by converting every layer's weights to a **unit direction vector** after simulating precision loss. Comparing direction vectors (cosine similarity) instead of raw values means:

- FP32 and INT8 representations of the same model look nearly identical
- Two networks trained on the same task converge toward the same DNA
- Structural changes (different layer order, different grid positions) are detectable as **logic shifts**

```
Raw FP32 weights  ──►  scale (× ws.Scale)  ──►  Normalize  ──►  unit vector
       │                                         (L2 norm)         "DNA strand"
       │
       └── FP4 weights ──►  scale (× ws.Scale)  ──►  Normalize  ──►  same direction ≈ 1.0 similarity
```

---

## Core Types

```go
// The "DNA strand" of a single layer
type LayerSignature struct {
    Z, Y, X, L int       // 3D grid coordinates
    Type        LayerType
    DType       DType
    Weights     []float32 // L2-normalized, scale-applied master weights
}

// The complete genetic blueprint of a network
type NetworkDNA []LayerSignature
```

---

## ExtractDNA — all 19 layer types

```go
func ExtractDNA(n *VolumetricNetwork) NetworkDNA
```

Iterates every layer in the network, calls `extractLayerSignature(l)`, and wraps the result with position and type metadata. The signature extraction logic handles all 19 layer types:

```
                    VolumetricNetwork
                           │
              ┌────────────┼────────────┐
              │            │            │
         LayerDense   LayerParallel  LayerSoftmax
         LayerRNN     LayerSequential LayerResidual
         LayerLSTM    (recursive)    (weightless)
         LayerMHA
         LayerSwiGLU
         LayerRMSNorm
         LayerLayerNorm
         LayerCNN1/2/3
         LayerConvT1/2/3D
         LayerEmbedding
         LayerKMeans
              │
              ▼
    extractLayerSignature(l)
              │
    ┌─────────┼──────────────┐
    │         │              │
    ▼         ▼              ▼
 weighted  recursive     weightless
 layers    containers    layers
    │         │              │
    ▼         ▼              ▼
 Master   flatten all    []float32{1.0}
 weights  branches
    │         │
    ▼         ▼
scale(×ws.Scale)   Normalize(concat)
    │
    ▼
 Normalize
    │
    ▼
 []float32 unit vector
```

### Weighted layers (Dense, RNN, LSTM, MHA, CNN*, ConvTransposed*, SwiGLU, RMSNorm, LayerNorm, Embedding, KMeans)

```go
// All weighted layers follow this path:
scale := l.WeightStore.Scale
if scale == 0 { scale = 1.0 }
simulated := make([]float32, len(l.WeightStore.Master))
for i, w := range l.WeightStore.Master {
    if scale != 1.0 {
        simulated[i] = w * scale
    } else {
        simulated[i] = w
    }
}
return Normalize(simulated)
```

Applying the layer's scale factor before normalizing means the DNA of an INT8 Dense layer and an FP32 Dense layer with the same trained weights will be nearly identical — both normalize to the same unit direction.
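
To make the "nearly identical" claim concrete, here is a small experiment under stated assumptions: symmetric per-tensor INT8 quantization with `scale = max|w| / 127`. The helpers `quantizeInt8`, `dequantize`, and `cosine` are hypothetical and do not reflect how Loom stores or morphs weights.

```go
package main

import (
	"fmt"
	"math"
)

// quantizeInt8 applies symmetric per-tensor quantization: q = round(w / scale).
func quantizeInt8(w []float32) (q []int8, scale float32) {
	var maxAbs float64
	for _, x := range w {
		maxAbs = math.Max(maxAbs, math.Abs(float64(x)))
	}
	q = make([]int8, len(w))
	if maxAbs == 0 {
		return q, 1
	}
	scale = float32(maxAbs / 127)
	for i, x := range w {
		q[i] = int8(math.Round(float64(x) / float64(scale)))
	}
	return q, scale
}

// dequantize mirrors the "w * scale" step from the snippet above.
func dequantize(q []int8, scale float32) []float32 {
	out := make([]float32, len(q))
	for i, v := range q {
		out[i] = float32(v) * scale
	}
	return out
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / math.Sqrt(na*nb)
}

func main() {
	w := []float32{0.5, -0.25, 0.125, 1.0, -0.75}
	q, scale := quantizeInt8(w)
	fmt.Printf("cosine(FP32, dequantized INT8) = %.5f\n", cosine(w, dequantize(q, scale)))
}
```

Rounding moves each weight by at most half a quantization step, so the direction of the whole vector barely changes even though individual values do.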

### Structural containers (Parallel, Sequential) — recursive extraction

Parallel and Sequential layers contain nested layers (`ParallelBranches`, `SequentialLayers`). A naive approach that returned `{1.0}` for both would make any two parallel layers look identical regardless of what's inside them. Instead, the engine recurses:

```
LayerParallel
├── Branch 0 (Dense 32×32)   ──► extractLayerSignature ──► unit vec A  ─┐
├── Branch 1 (RMSNorm 32)    ──► extractLayerSignature ──► unit vec B  ─┤ concat
└── FilterGateConfig (Dense) ──► extractLayerSignature ──► unit vec C  ─┘
                                                                         │
                                                                    Normalize(flat)
                                                                         │
                                                                    single unit vec
                                                                    representing ALL
                                                                    nested weights
```

```go
case LayerParallel:
    var flat []float32
    for _, branch := range l.ParallelBranches {
        if branch.IsRemoteLink { continue }   // remote links have no local weights
        flat = append(flat, extractLayerSignature(branch)...)
    }
    if l.FilterGateConfig != nil {
        flat = append(flat, extractLayerSignature(*l.FilterGateConfig)...)
    }
    if len(flat) == 0 { return []float32{1.0} }
    return Normalize(flat)

case LayerSequential:
    var flat []float32
    for _, sub := range l.SequentialLayers {
        flat = append(flat, extractLayerSignature(sub)...)
    }
    if len(flat) == 0 { return []float32{1.0} }
    return Normalize(flat)
```

Remote links (`IsRemoteLink = true`) are spatial hops with no local weights — they are skipped during extraction.
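
The recursion can be sketched on a simplified tree. The `node` type and `signature` function below are hypothetical stand-ins for Loom's layer struct and `extractLayerSignature`; the scale step is omitted to keep the sketch short.

```go
package main

import (
	"fmt"
	"math"
)

// node is a hypothetical stand-in for a Loom layer.
type node struct {
	Weights  []float32 // leaf weights; empty for containers and weightless layers
	Children []node    // nested branches or sub-layers
	Remote   bool      // true for a remote link (no local weights)
}

// normalize scales v to unit L2 length; an all-zero input stays all zero.
func normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	out := make([]float32, len(v))
	if sum == 0 {
		return out
	}
	mag := math.Sqrt(sum)
	for i, x := range v {
		out[i] = float32(float64(x) / mag)
	}
	return out
}

// signature flattens a container's children recursively, skipping remote links.
func signature(n node) []float32 {
	if len(n.Children) > 0 {
		var flat []float32
		for _, c := range n.Children {
			if c.Remote {
				continue
			}
			flat = append(flat, signature(c)...)
		}
		if len(flat) == 0 {
			return []float32{1}
		}
		return normalize(flat)
	}
	if len(n.Weights) == 0 {
		return []float32{1} // weightless marker
	}
	return normalize(n.Weights)
}

func main() {
	par := node{Children: []node{
		{Weights: []float32{3, 4}},
		{Weights: []float32{0, 5}},
		{Remote: true},
	}}
	fmt.Println(signature(par))
}
```

Concatenating two unit vectors gives a vector of length √2, which is why the container normalizes again after flattening.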

### Weightless layers (Softmax, Residual)

```go
case LayerSoftmax, LayerResidual:
    return []float32{1.0}
```

A `{1.0}` vector is a neutral presence marker. Two Softmax layers at the same position will score `1.0` similarity (identical), which is correct — they are architecturally identical by definition.

---

## Normalize

```go
func Normalize(v []float32) []float32
```

Converts a weight vector to a unit vector:

```
mag = sqrt(v[0]² + v[1]² + ... + v[n-1]²)
output[i] = v[i] / mag
```

- If `mag == 0` (all-zero weights), returns a zero vector
- Two zero vectors score `1.0` similarity (both represent an untrained/zeroed layer)
- One zero + one nonzero scores `0.0` (orthogonal by convention)
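
A minimal sketch of the normalization described above, including the zero-magnitude case. The lowercase `normalize` is an illustrative reimplementation, not the library's source.

```go
package main

import (
	"fmt"
	"math"
)

// normalize returns v scaled to unit L2 length.
// An all-zero input yields an all-zero output instead of dividing by zero.
func normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	out := make([]float32, len(v))
	if sum == 0 {
		return out
	}
	mag := math.Sqrt(sum)
	for i, x := range v {
		out[i] = float32(float64(x) / mag)
	}
	return out
}

func main() {
	fmt.Println(normalize([]float32{3, 4}))
	fmt.Println(normalize([]float32{30, 40})) // same direction, larger magnitude
	fmt.Println(normalize([]float32{0, 0}))
}
```

The first two calls return the same unit vector, which is the property the DNA engine relies on when it compares weights stored at different scales.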

---

## CosineSimilarity

```go
func CosineSimilarity(s1, s2 LayerSignature) float32
```

Returns a score in `[-1.0, 1.0]` comparing two layer signatures:

```
         s1.Weights · s2.Weights
sim  =  ─────────────────────────   =  dot product  (since |s1| = |s2| = 1)
              |s1| × |s2|
```

Guard rails:

| Condition | Returns |
|:----------|:--------|
| `s1.Type != s2.Type` | `0.0` — architectural mismatch |
| `s1.DType != s2.DType` | `0.0` — precision mismatch |
| `len(s1.Weights) != len(s2.Weights)` | `0.0` — dimension mismatch |
| Both zero vectors | `1.0` — identical untrained layers |
| One zero, one nonzero | `0.0` — no similarity |

Similarity values to interpret:

```
-1.0  ────────────  0.0  ────────────  +1.0
  │                  │                   │
opposite          no match          identical
direction                           direction
(learned to      (different         (same functional
do opposite)      purpose)           role)
```
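
The guard rails in the table can be sketched as a short function. The `signature` struct here is a simplified, hypothetical stand-in for `LayerSignature` (string fields instead of `LayerType` and `DType`), and the weights are assumed to be unit length or all zero already.

```go
package main

import "fmt"

// signature is a simplified stand-in for LayerSignature.
type signature struct {
	Type    string
	DType   string
	Weights []float32 // assumed unit length, or all zero
}

// isZero reports whether every element of v is zero.
func isZero(v []float32) bool {
	for _, x := range v {
		if x != 0 {
			return false
		}
	}
	return true
}

// cosineSimilarity applies the guard rails, then falls back to a dot product.
func cosineSimilarity(a, b signature) float32 {
	if a.Type != b.Type || a.DType != b.DType || len(a.Weights) != len(b.Weights) {
		return 0 // architectural, precision, or dimension mismatch
	}
	za, zb := isZero(a.Weights), isZero(b.Weights)
	switch {
	case za && zb:
		return 1 // identical untrained layers
	case za || zb:
		return 0 // no similarity
	}
	var dot float32
	for i := range a.Weights {
		dot += a.Weights[i] * b.Weights[i]
	}
	return dot // equals the cosine because both inputs are unit vectors
}

func main() {
	a := signature{Type: "Dense", DType: "float32", Weights: []float32{0.6, 0.8}}
	b := signature{Type: "RNN", DType: "float32", Weights: []float32{0.6, 0.8}}
	fmt.Println(cosineSimilarity(a, a), cosineSimilarity(a, b))
}
```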

---

## CompareNetworks

```go
func CompareNetworks(dna1, dna2 NetworkDNA) NetworkComparisonResult

type NetworkComparisonResult struct {
    OverallOverlap float32
    LayerOverlaps  map[string]float32   // "z,y,x,l" → score
    LogicShifts    []LogicShift
}
```

Two-phase comparison:

### Phase 1 — Direct Position Matching

Match each layer in `dna1` with the layer at the same `(Z, Y, X, L)` position in `dna2`:

```
dna1:   [L0: Dense]  [L1: RNN]  [L2: Dense]
              │              │          │
              │ same pos     │          │ same pos
              ▼              ▼          ▼
dna2:   [L0: Dense]  [L1: Dense]  [L2: Dense]
              │              │          │
          sim=0.94       sim=0.0    sim=0.87   (0.0 because type mismatch)
              │              │          │
              └──────────────┴──────────┘
                              │
                         avg = 0.60
                    OverallOverlap = 0.60
```

### Phase 2 — Logic Drift Detection

For each layer in `dna1`, search **all** positions in `dna2` for the best cosine match — not just the same position:

```
dna1 L0 (Dense, sim vector A)
    │
    ├──► compare vs dna2 L0 → sim=0.72
    ├──► compare vs dna2 L1 → sim=0.31
    └──► compare vs dna2 L2 → sim=0.91  ← best match!

Best match (0.91) is at position L2, not L0.
Since 0.91 > 0.8 threshold AND positions differ:
→ LogicShift { SourcePos:"0,0,0,0", TargetPos:"0,0,0,2", Overlap:0.91 }
```

```go
type LogicShift struct {
    SourcePos string   // "z,y,x,l" in dna1
    TargetPos string   // "z,y,x,l" in dna2
    Overlap   float32  // cosine score > 0.8
}
```

Logic shifts appear when:
- A network was restructured and layers were reordered
- A NEAT mutation moved a functional pattern to a different grid position
- Two networks converged to the same function at different coordinates

---

## Full DNA Pipeline

```
  Network A (trained)                   Network B (trained)
       │                                      │
       ▼                                      ▼
 ExtractDNA(A)                          ExtractDNA(B)
       │                                      │
  for each layer:                        for each layer:
  ┌──────────────────────────────┐       ┌──────────────────────────────┐
  │ Parallel/Sequential:         │       │ Parallel/Sequential:         │
  │   recurse into branches      │       │   recurse into branches      │
  │   concat + Normalize         │       │   concat + Normalize         │
  │ Weighted:                    │       │ Weighted:                    │
  │   scale(w × ws.Scale)        │       │   scale(w × ws.Scale)        │
  │   Normalize(simulated)       │       │   Normalize(simulated)       │
  │ Weightless:                  │       │ Weightless:                  │
  │   {1.0}                      │       │   {1.0}                      │
  └──────────────────────────────┘       └──────────────────────────────┘
       │                                      │
       │  NetworkDNA ([]LayerSignature)        │  NetworkDNA
       └─────────────────┬────────────────────┘
                         │
                         ▼
               CompareNetworks(dnaA, dnaB)
                         │
            ┌────────────┴────────────┐
            │                         │
            ▼                         ▼
    Phase 1: Direct            Phase 2: Cross-pos
    position matching          best-match search
            │                         │
    LayerOverlaps              LogicShifts
    OverallOverlap             (migrations)
            │
            └────────────────────────▶ NetworkComparisonResult
```

---

## Use Cases

### Measuring Quantization Fidelity

```go
dnaFP32 := poly.ExtractDNA(net)
// morph all layers to INT8...
poly.MorphAllLayers(net, poly.DTypeInt8)
dnaINT8 := poly.ExtractDNA(net)
result := poly.CompareNetworks(dnaFP32, dnaINT8)
// result.OverallOverlap near 1.0 → quantization preserved behavior
// result.OverallOverlap near 0.0 → quantization destroyed the model
```

### Detecting Training Convergence

Sample DNA every N epochs. When `OverallOverlap` between consecutive snapshots stabilizes above 0.99, the network has converged.

```
Epoch 0   → Epoch 10  : overlap = 0.12  (learning fast)
Epoch 10  → Epoch 50  : overlap = 0.61  (settling)
Epoch 50  → Epoch 100 : overlap = 0.94  (nearly converged)
Epoch 100 → Epoch 150 : overlap = 0.99  (converged)
```
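
One way to turn "stabilizes above 0.99" into a check is to require the last few consecutive-snapshot overlaps to clear the threshold. The `converged` helper and its `patience` parameter are assumptions made for this sketch, not part of Loom.

```go
package main

import "fmt"

// converged reports whether the last `patience` overlaps are all >= threshold.
func converged(overlaps []float32, threshold float32, patience int) bool {
	if patience < 1 || len(overlaps) < patience {
		return false
	}
	for _, o := range overlaps[len(overlaps)-patience:] {
		if o < threshold {
			return false
		}
	}
	return true
}

func main() {
	history := []float32{0.12, 0.61, 0.94, 0.99} // overlaps from the table above
	fmt.Println(converged(history, 0.99, 1), converged(history, 0.99, 2))
}
```

A `patience` greater than 1 guards against declaring convergence on a single lucky snapshot.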

### Cross-Architecture Similarity

Two networks with different layer counts overlap only at the `(Z, Y, X, L)` positions that exist in both. `CompareNetworks` matches only those shared positions, and `OverallOverlap` is averaged over the matched layers only.

### Logic Drift After NEAT Mutations

After a NEAT topology mutation moves a Dense layer from position `0,0,0,0` to `0,0,0,2`, the logic shift detector will report:

```
LogicShift {
    SourcePos: "0,0,0,0",
    TargetPos: "0,0,0,2",
    Overlap:   0.93,
}
```

This is how you track functional identity across structural mutations.

---

## DNA Signature Sizes by Layer Type

| Layer Type | Signature Length | Notes |
|:-----------|:-----------------|:------|
| Dense (32) | 1024 | inputH × outputH |
| MHA (32, 4 heads) | 4224 | Q+K+V+O projections + biases |
| SwiGLU (32) | 6144 | gate + up + down (3 projections) |
| RMSNorm (32) | 32 | scale vector only |
| LayerNorm (32) | 64 | gamma + beta |
| CNN1/2 (8f, 1c, k3) | 72 | filters × channels × k² |
| CNN3 (8f, 1c, k3) | 216 | filters × channels × k³ |
| RNN (32) | 2080 | Wx + Wh + bias |
| LSTM (32) | 8320 | 4 gates × (Wx + Wh + bias) |
| Embedding (256 vocab, 32 dim) | 8192 | vocab × dim |
| KMeans (8 clusters, 32 dim) | 256 | clusters × dim |
| Softmax | 1 | neutral marker |
| Residual | 1 | neutral marker |
| Parallel (Dense 32 + RMSNorm 32) | 1056 | concat of branches, renormalized |
| Sequential (2× Dense 32) | 2048 | concat of sub-layers, renormalized |
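
Several rows of the table can be reproduced from the formulas in the Notes column. The helper names below are hypothetical, and the MHA formula (four `d × d` projections plus four bias vectors) is inferred from the "Q+K+V+O projections + biases" note rather than taken from the source code.

```go
package main

import "fmt"

// denseSize is inputH × outputH.
func denseSize(in, out int) int { return in * out }

// mhaSize assumes four d×d projections (Q, K, V, O) plus four bias vectors.
func mhaSize(d int) int { return 4*d*d + 4*d }

// rnnSize is Wx + Wh + bias.
func rnnSize(d int) int { return d*d + d*d + d }

// lstmSize is four gates, each shaped like one RNN cell.
func lstmSize(d int) int { return 4 * rnnSize(d) }

// convSize is filters × channels × k^dims.
func convSize(filters, channels, k, dims int) int {
	n := filters * channels
	for i := 0; i < dims; i++ {
		n *= k
	}
	return n
}

// tableSize covers Embedding (vocab × dim) and KMeans (clusters × dim).
func tableSize(rows, dim int) int { return rows * dim }

func main() {
	fmt.Println("Dense:", denseSize(32, 32))
	fmt.Println("MHA:", mhaSize(32))
	fmt.Println("RNN:", rnnSize(32), "LSTM:", lstmSize(32))
	fmt.Println("2D conv:", convSize(8, 1, 3, 2), "3D conv:", convSize(8, 1, 3, 3))
	fmt.Println("Embedding:", tableSize(256, 32), "KMeans:", tableSize(8, 32))
}
```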

---

## The Evolution Engine: DNA Splice & NEAT Topology Evolution

Source: https://openfluke.com/docs/evolution
Markdown: https://openfluke.com/docs/evolution.md

# The Evolution Engine: DNA Splice & NEAT Topology Evolution

This document covers `SpliceDNA`, `SpliceDNAWithReport`, `NEATMutate`, and `NEATPopulation` from `evolution.go`. The evolution engine builds on the DNA fingerprinting system described in [dna.md](dna.md).

---

## Two Evolutionary Mechanisms

```
  ┌─────────────────────────────────────────────────────────────┐
  │                    Evolution Engine                         │
  │                                                             │
  │   ┌────────────────────┐    ┌──────────────────────────┐   │
  │   │   DNA Splice       │    │   NEAT-style Mutation    │   │
  │   │  (Crossover)       │    │  (Topology Evolution)    │   │
  │   │                    │    │                          │   │
  │   │  ParentA + ParentB │    │  Network ──► mutated     │   │
  │   │      ──►  Child    │    │             clone        │   │
  │   │                    │    │                          │   │
  │   │  merges weights    │    │  changes layer types,    │   │
  │   │  guided by DNA     │    │  activations, topology   │   │
  │   │  similarity        │    │  weights                 │   │
  │   └────────────────────┘    └──────────────────────────┘   │
  │              │                          │                   │
  │              └──────────┬───────────────┘                   │
  │                         ▼                                   │
  │              NEATPopulation.Evolve()                        │
  │         (combines both in a generation loop)                │
  └─────────────────────────────────────────────────────────────┘
```

---

## Part 1 — DNA Splice / Genetic Crossover

### Concept

Given two trained parent networks `A` and `B`, produce a child network whose weights are a blend of both. The blend is **guided by DNA similarity**: layers that are similar between parents are blended at the full fitness-weighted ratio, while layers that diverged keep more of parent A's weights.

```
  ParentA (trained)        ParentB (trained)
       │                        │
  ExtractDNA(A)            ExtractDNA(B)
       │                        │
  sigA per layer          sigB per layer
       │                        │
       └────────┬───────────────┘
                │
       for each layer position (z,y,x,l):
                │
         CosineSimilarity(sigA, sigB)
                │
         ┌──────┴──────┐
         │             │
      blend         skip
    weights       (keep A's
    from A+B       weights)
         │
         ▼
     Child network
```

### SpliceConfig

```go
type SpliceConfig struct {
    CrossoverMode string   // "blend", "point", or "uniform"
    BlendAlpha    float32  // interpolation factor (blend mode): 0=all A, 1=all B
    SplitRatio    float64  // fraction from A in point mode (e.g. 0.5)
    FitnessA      float64  // optional: used to bias toward fitter parent
    FitnessB      float64
}

func DefaultSpliceConfig() SpliceConfig {
    return SpliceConfig{CrossoverMode: "blend", BlendAlpha: 0.5, SplitRatio: 0.5}
}
```

### Three Crossover Modes

#### Mode: "blend" (default)

Interpolates weights per element. Alpha is modulated by the layer's cosine similarity and relative fitness:

```
alpha = FitnessB / (FitnessA + FitnessB)   ← bias toward fitter parent
alpha = alpha × (0.5 + 0.5 × similarity)   ← scale by how similar layers are

child[i] = wA[i] × (1 - alpha) + wB[i] × alpha
```

When similarity is high (layers learned the same thing), the fitness-weighted alpha is applied almost unchanged. As similarity drops, alpha shrinks toward zero, so the child stays closer to parent A, the parent it is cloned from.

```
similarity = 1.0  ──►  alpha unchanged (full fitness-weighted blend)
similarity = 0.0  ──►  alpha halved (child leans toward A)
similarity = -1.0 ──►  alpha = 0 (child keeps A's weights)
```
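
The alpha formula and the per-element interpolation can be sketched as two small functions. `blendAlpha` and `blend` are hypothetical names; the `fallback` parameter (standing in for `BlendAlpha` when no fitness is supplied) is an assumption added here to avoid dividing by zero.

```go
package main

import "fmt"

// blendAlpha follows the two-line alpha formula above.
// Assumption: when both fitness values are zero (unset), use fallback
// instead of dividing by zero.
func blendAlpha(fitA, fitB float64, similarity, fallback float32) float32 {
	alpha := fallback
	if fitA+fitB != 0 {
		alpha = float32(fitB / (fitA + fitB))
	}
	return alpha * (0.5 + 0.5*similarity)
}

// blend interpolates per element; wA and wB must have equal length.
func blend(wA, wB []float32, alpha float32) []float32 {
	child := make([]float32, len(wA))
	for i := range wA {
		child[i] = wA[i]*(1-alpha) + wB[i]*alpha
	}
	return child
}

func main() {
	for _, sim := range []float32{1, 0, -1} {
		a := blendAlpha(1, 1, sim, 0.5)
		fmt.Println("similarity", sim, "alpha", a, "child", blend([]float32{0}, []float32{2}, a))
	}
}
```

With equal fitness, alpha is 0.5 at similarity 1, 0.25 at similarity 0, and 0 at similarity -1, so the child moves from an even mix toward parent A as the layers diverge.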

#### Mode: "point"

Splits weights at a single cut point. First `SplitRatio` fraction from A, rest from B:

```
wA: [a0 a1 a2 a3 a4 a5 a6 a7]
wB: [b0 b1 b2 b3 b4 b5 b6 b7]
                │
           SplitRatio=0.5
                │
child: [a0 a1 a2 a3 b4 b5 b6 b7]
        ─── from A ──── from B ──
```

#### Mode: "uniform"

Each weight is randomly drawn from A or B, with probability biased toward the fitter parent:

```
threshold = FitnessA / (FitnessA + FitnessB)

for each weight i:
    if rand < threshold → child[i] = wA[i]
    else               → child[i] = wB[i]
```
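
The point and uniform modes can be sketched side by side. The function names are hypothetical, both assume the parents' weight slices have equal length, and the even coin flip used when no fitness is supplied is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newRNG returns a seeded generator so runs are reproducible.
func newRNG(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// pointCrossover takes the first splitRatio fraction from A and the rest from B.
func pointCrossover(wA, wB []float32, splitRatio float64) []float32 {
	cut := int(splitRatio * float64(len(wA)))
	if cut < 0 {
		cut = 0
	}
	if cut > len(wA) {
		cut = len(wA)
	}
	child := make([]float32, len(wA))
	copy(child, wA[:cut])
	copy(child[cut:], wB[cut:])
	return child
}

// uniformCrossover picks each weight from A with probability
// FitnessA / (FitnessA + FitnessB), otherwise from B.
func uniformCrossover(wA, wB []float32, fitA, fitB float64, rng *rand.Rand) []float32 {
	threshold := 0.5 // assumption: no fitness information means an even coin flip
	if fitA+fitB != 0 {
		threshold = fitA / (fitA + fitB)
	}
	child := make([]float32, len(wA))
	for i := range wA {
		if rng.Float64() < threshold {
			child[i] = wA[i]
		} else {
			child[i] = wB[i]
		}
	}
	return child
}

func main() {
	a := []float32{1, 1, 1, 1, 1, 1, 1, 1}
	b := []float32{2, 2, 2, 2, 2, 2, 2, 2}
	fmt.Println(pointCrossover(a, b, 0.5))
	fmt.Println(uniformCrossover(a, b, 3, 1, newRNG(1)))
}
```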

### SpliceDNA

```go
func SpliceDNA(parentA, parentB *VolumetricNetwork, cfg SpliceConfig) *VolumetricNetwork
```

- The child is always a **deep clone of parentA** (architecture inherited from A)
- Only layers where both parents have matching positions **and matching weight dimensions** are blended
- If `parentB` has no layer at that position, or the weight counts differ, A's weights are kept unchanged

```go
// Guard: skip if dimensions don't match
if wB == nil || len(wB) != len(wA) {
    continue // keep A's weights
}
```

### SpliceDNAWithReport

```go
func SpliceDNAWithReport(parentA, parentB *VolumetricNetwork, cfg SpliceConfig) SpliceResult

type SpliceResult struct {
    Child        *VolumetricNetwork
    ParentADNA   NetworkDNA
    ParentBDNA   NetworkDNA
    ChildDNA     NetworkDNA
    Similarities map[string]float32  // "z,y,x,l" → cosine score used for blending
    BlendedCount int                  // how many layers were actually blended
}
```

Returns the same child as `SpliceDNA` plus a full diagnostic report. Use this when debugging crossover behavior or logging ancestry.

---

## Part 2 — NEAT-style Topology Evolution

### Concept

NEAT (NeuroEvolution of Augmenting Topologies) mutates both weights and structure. The implementation here applies six mutation types to a cloned network, leaving the original untouched.

```
  Original Network (immutable)
       │
  cloneNetwork()
       │
  mutated clone
       │
  ┌────┴────────────────────────────────────────────┐
  │  Per-layer mutations (applied sequentially):    │
  │                                                 │
  │  1. Weight perturbation  ── add random noise    │
  │  2. Activation mutation  ── swap act function   │
  │  3. Node mutation        ── change layer type   │
  │  4. Layer toggle         ── enable/disable      │
  │                                                 │
  │  Network-level mutations (applied once):        │
  │                                                 │
  │  5. Connection add  ── insert remote link       │
  │  6. Connection drop ── remove remote link       │
  └─────────────────────────────────────────────────┘
       │
  returns mutated clone
```

### NEATConfig

```go
type NEATConfig struct {
    WeightPerturbRate  float64  // prob of perturbing a layer's weights (default 0.8)
    WeightPerturbScale float32  // noise magnitude (default 0.05)
    NodeMutateRate     float64  // prob of changing a layer's type (default 0.1)
    ConnectionAddRate  float64  // prob of adding a remote link (default 0.05)
    ConnectionDropRate float64  // prob of removing a remote link (default 0.02)
    ActivationMutRate  float64  // prob of changing activation function (default 0.1)
    LayerToggleRate    float64  // prob of toggling IsDisabled (default 0.02)
    DModel             int      // reference dimension for weight reinitialization
    AllowedLayerTypes  []LayerType // types a node can mutate to
    // Type-specific defaults used by neatReinitLayer:
    DefaultNumHeads    int
    DefaultInChannels  int
    DefaultFilters     int
    DefaultKernelSize  int
    DefaultVocabSize   int
    DefaultNumClusters int
    Seed               int64
}
```

`DefaultNEATConfig(dModel)` returns conservative rates with all 17 mutable layer types in `AllowedLayerTypes`.

### NEATMutate

```go
func NEATMutate(n *VolumetricNetwork, cfg NEATConfig) *VolumetricNetwork
```

The original network `n` is **never modified**. The function clones it and applies mutations:

```
For each layer i:

  Step 1 — Weight Perturbation (WeightPerturbRate = 0.8)
  ┌─────────────────────────────────────────────────────┐
  │ master[i] += rand(-1, 1) × WeightPerturbScale       │
  │ (clears cached DType versions as weights changed)   │
  └─────────────────────────────────────────────────────┘

  Step 2 — Activation Mutation (ActivationMutRate = 0.1)
  ┌──────────────────────────────────────────────────────┐
  │ layer.Activation = random from {ReLU, SiLU, GELU,   │
  │                                 Tanh, Sigmoid, Linear}│
  └──────────────────────────────────────────────────────┘

  Step 3 — Node Mutation (NodeMutateRate = 0.1)
  ┌──────────────────────────────────────────────────────┐
  │ newType = random from AllowedLayerTypes (≠ current)  │
  │ neatReinitLayer(child, i, newType, cfg)              │
  │   → sets new Type, InputHeight, OutputHeight         │
  │   → creates fresh WeightStore with correct wCount    │
  └──────────────────────────────────────────────────────┘

  Step 4 — Layer Toggle (LayerToggleRate = 0.02)
  ┌──────────────────────────────────────────────────────┐
  │ layer.IsDisabled = !layer.IsDisabled                 │
  │ (disabled layers are skipped during forward pass)    │
  └──────────────────────────────────────────────────────┘

After all layers:

  Step 5 — Connection Add (ConnectionAddRate = 0.05)
  ┌──────────────────────────────────────────────────────┐
  │ Pick two random layers src and dst (src ≠ dst)       │
  │ Append IsRemoteLink branch to src.ParallelBranches   │
  │   TargetZ/Y/X/L point to dst                        │
  │ Creates a spatial "skip connection" in the 3D grid   │
  └──────────────────────────────────────────────────────┘

  Step 6 — Connection Drop (ConnectionDropRate = 0.02)
  ┌──────────────────────────────────────────────────────┐
  │ Find a layer with ParallelBranches containing        │
  │ IsRemoteLink entries                                 │
  │ Remove one at random                                 │
  └──────────────────────────────────────────────────────┘
```
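Step 1 and the clone-first rule can be sketched in standalone Go. This is a minimal illustration, assuming weights live in a flat `[]float32`; `perturbClone` and its `seed` parameter are hypothetical names, not part of Loom's API.

```go
package main

import (
    "fmt"
    "math/rand"
)

// perturbClone copies master and adds uniform noise in [-scale, scale)
// to every weight of the copy. The input slice is never written to.
// Hypothetical helper for illustration; not Loom's implementation.
func perturbClone(master []float32, scale float32, seed int64) []float32 {
    rng := rand.New(rand.NewSource(seed))
    child := make([]float32, len(master))
    copy(child, master)
    for i := range child {
        noise := float32(rng.Float64()*2 - 1) // uniform in [-1, 1)
        child[i] += noise * scale
    }
    return child
}

func main() {
    parent := []float32{0.5, -0.25, 1.0}
    child := perturbClone(parent, 0.05, 42)
    fmt.Println("parent:", parent) // still the original values
    fmt.Println("child: ", child)
}
```

Copying before mutating is what lets the same parent be reused for many offspring.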

### Node Mutation: Weight Counts for All 19 Layer Types

When `neatReinitLayer` changes a layer's type, it creates a fresh `WeightStore` with the correct number of weights for the new type:

| New Layer Type | Formula | Example (dModel=32) |
|:---------------|:--------|:--------------------|
| Dense | `dModel × dModel` | 1024 |
| RNN | `dModel² + dModel² + dModel` | 2080 |
| LSTM | `4 × (dModel² + dModel² + dModel)` | 8320 |
| SwiGLU | `dModel × (dModel×2) × 3` | 6144 |
| RMSNorm | `dModel` | 32 |
| LayerNorm | `dModel × 2` | 64 |
| MHA | `2×dModel² + 2×dModel×kv + 2×dModel + 2×kv` | 4224 (4 heads) |
| CNN1 / CNN2 | `filters × inChannels × kSize²` | 72 (8f, 1c, k3) |
| CNN3 | `filters × inChannels × kSize³` | 216 (8f, 1c, k3) |
| ConvTransposed1D/2D | `inChannels × filters × kSize²` | 72 |
| ConvTransposed3D | `inChannels × filters × kSize³` | 216 |
| Embedding | `vocabSize × dModel` | 8192 (256 vocab) |
| KMeans | `numClusters × dModel` | 256 (8 clusters) |
| Softmax | `0` — no WeightStore | — |
| Residual | `0` — no WeightStore | — |
| Parallel / Sequential | unchanged — keep existing branches | — |

Parallel and Sequential are structural containers: their behavior lives in their branches rather than in a `WeightStore`. Reinitializing a layer as one of them would discard or invalidate that branch structure, so `neatReinitLayer` returns immediately and leaves the layer untouched when the target type is Parallel or Sequential.
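The formulas in the table can be checked with a few lines of Go. This is a sketch only: the `dims` struct, the string type names, and `weightCount` are illustrative and do not correspond to Loom's `LayerType` constants or to `neatReinitLayer` itself. The MHA example assumes `kv = dModel`, which reproduces the 4224 figure.

```go
package main

import "fmt"

// dims bundles the sizes the formulas depend on. Field names are illustrative.
type dims struct {
    dModel, kv, filters, inChannels, kSize, vocab, clusters int
}

// weightCount evaluates the table's formula for a layer type given by name.
// It returns 0 for types without a WeightStore and for unknown names.
func weightCount(layerType string, d dims) int {
    sq := d.dModel * d.dModel
    switch layerType {
    case "Dense":
        return sq
    case "RNN":
        return sq + sq + d.dModel
    case "LSTM":
        return 4 * (sq + sq + d.dModel)
    case "SwiGLU":
        return d.dModel * (d.dModel * 2) * 3
    case "RMSNorm":
        return d.dModel
    case "LayerNorm":
        return d.dModel * 2
    case "MHA":
        return 2*sq + 2*d.dModel*d.kv + 2*d.dModel + 2*d.kv
    case "CNN2":
        return d.filters * d.inChannels * d.kSize * d.kSize
    case "CNN3":
        return d.filters * d.inChannels * d.kSize * d.kSize * d.kSize
    case "Embedding":
        return d.vocab * d.dModel
    case "KMeans":
        return d.clusters * d.dModel
    }
    return 0 // Softmax, Residual, and the container types
}

func main() {
    d := dims{dModel: 32, kv: 32, filters: 8, inChannels: 1, kSize: 3, vocab: 256, clusters: 8}
    names := []string{"Dense", "RNN", "LSTM", "SwiGLU", "MHA", "CNN2", "CNN3", "Embedding", "KMeans"}
    for _, name := range names {
        fmt.Printf("%-10s %d\n", name, weightCount(name, d))
    }
}
```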

### Connection Add — Remote Links

`neatAddConnection` adds a **spatial skip connection** between two layers anywhere in the 3D grid:

```
Layer at (0,0,0,0) ──────────────────────────► Layer at (0,0,0,2)
                                                      │
                    ┌─ ParallelBranches ──────────────┘
                    │   [IsRemoteLink=true,
                    │    TargetZ=0, TargetY=0,
                    │    TargetX=0, TargetL=2]
```

During `ForwardPolymorphic`, `ParallelForwardPolymorphic` follows remote links and routes activations to the target layer. Remote links are skipped during DNA extraction (`extractLayerSignature` skips `IsRemoteLink=true` branches since they have no local weights).

---

## Part 3 — NEATPopulation: Full Evolutionary Loop

`NEATPopulation` manages a pool of networks across generations using fitness-based selection.

```go
type NEATPopulation struct {
    Networks  []*VolumetricNetwork
    Fitnesses []float64
    Config    NEATConfig
    rng       *rand.Rand
}
```

### Initialization

```go
pop := poly.NewNEATPopulation(seedNetwork, populationSize, cfg)
```

Creates `populationSize` networks, each produced by calling `NEATMutate` on the seed. This gives the population diverse starting points from generation 0.

```
seedNetwork
    │
    ├── NEATMutate (seed1) ──► Network[0]
    ├── NEATMutate (seed2) ──► Network[1]
    ├── NEATMutate (seed3) ──► Network[2]
    └── ...                    Network[N-1]
```

### One Generation of Evolution

```go
pop.Evolve(fitnessFn)
```

```
  Generation N:  [net0, net1, net2, ..., netN]
                      │
              fitnessFn(net) for each
                      │
              sort descending by fitness
                      │
          ┌───────────┴───────────┐
          │                       │
     Top 25%                 Bottom 75%
     (elites)                (replaced)
          │                       │
     carry over              pick 2 elites A, B
     unchanged               SpliceDNA(A, B, blend)
                                  │
                             NEATMutate(child)
                                  │
                             new offspring
          │                       │
          └───────────┬───────────┘
                      │
              Generation N+1
```

**Elites**: The top `populationSize / 4` networks survive unchanged. The rest are replaced by:
1. Pick two random elites `A` and `B`
2. Produce a child via `SpliceDNA(A, B, cfg)` — inherits weights from both
3. Apply `NEATMutate(child, cfg)`, which perturbs weights and may add structural changes
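The elite split described above can be sketched without any Loom types. The helper names below are hypothetical, and the "at least one elite" guard for populations smaller than four is an assumption of this sketch, not documented behavior.

```go
package main

import (
    "fmt"
    "sort"
)

// rankByFitness returns network indices ordered from best to worst fitness.
func rankByFitness(fitnesses []float64) []int {
    order := make([]int, len(fitnesses))
    for i := range order {
        order[i] = i
    }
    sort.SliceStable(order, func(a, b int) bool {
        return fitnesses[order[a]] > fitnesses[order[b]]
    })
    return order
}

// splitElites keeps the top len/4 indices and marks the rest for replacement.
// Assumption: at least one elite is kept when the population is non-empty.
func splitElites(fitnesses []float64) (elites, replaced []int) {
    order := rankByFitness(fitnesses)
    n := len(order) / 4
    if n < 1 && len(order) > 0 {
        n = 1
    }
    return order[:n], order[n:]
}

func main() {
    fitnesses := []float64{-0.5, -0.1, -3, -0.2, -1, -0.05, -2, -0.9}
    elites, replaced := splitElites(fitnesses)
    fmt.Println("elites:  ", elites)
    fmt.Println("replaced:", replaced)
}
```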

### Helper Methods

```go
pop.Best()           // returns the highest-fitness network (index 0 after sort)
pop.BestFitness()    // returns the best fitness score
pop.Summary(gen)     // returns a one-line status string:
                     // "Gen 5 | best=-0.0012  avg=-0.0045  worst=-0.2300  pop=16"
```

### Fitness Function Contract

The fitness function receives a network and returns a `float64`; higher is better. Return a large negative value (e.g., `-1e9`) for architecturally incompatible networks, such as those with dimension mismatches introduced by mutation. The example recovers from panics for the same reason:

```go
fitnessFn := func(net *poly.VolumetricNetwork) (result float64) {
    defer func() {
        if r := recover(); r != nil {
            result = -1e9 // incompatible architecture
        }
    }()
    out, _, _ := poly.ForwardPolymorphic[float32](net, input)
    if out == nil || len(out.Data) == 0 {
        return -1e9
    }
    // compute your task loss here
    mse := computeMSE(out.Data, target)
    return -mse   // negate: lower loss = higher fitness
}
```

---

## Combined Flow: SpliceDNA + NEAT in a Population

```
                 ┌──────────────────────────────────────────┐
                 │           NEATPopulation.Evolve           │
                 │                                          │
  Generation N:  │  [A] [B] [C] [D]  ... [P]               │
                 │   │                                      │
                 │   fitnessFn() for all                    │
                 │   sort: A=best, P=worst                  │
                 │                                          │
                 │  Elites (keep): [A] [B] [C] [D]         │
                 │                                          │
                 │  Offspring:                              │
                 │                                          │
                 │   SpliceDNA(A, B)  ──► child_AB          │
                 │   NEATMutate(child_AB)                   │
                 │        ├── perturb weights               │
                 │        ├── maybe swap activation         │
                 │        ├── maybe change layer type       │
                 │        └── maybe add/drop connection     │
                 │            ──► mutated_AB                │
                 │                                          │
                 │   ... repeat for all offspring slots ... │
                 │                                          │
  Generation N+1:│  [A] [B] [C] [D] [mut_AB] ... [mut_XY] │
                 └──────────────────────────────────────────┘
```

---

## DNA Tracking Across Generations

Because every `NEATMutate` and `SpliceDNA` call touches only a clone, you can always extract DNA from any network in the population and compare it against a reference:

```go
// Track how far the best network has drifted from the initial seed
seedDNA := poly.ExtractDNA(seedNetwork)
for gen := 1; gen <= 50; gen++ {
    pop.Evolve(fitnessFn)
    bestDNA := poly.ExtractDNA(pop.Best())
    result := poly.CompareNetworks(seedDNA, bestDNA)
    fmt.Printf("Gen %d | seed→best overlap=%.4f  logic_shifts=%d\n",
        gen, result.OverallOverlap, len(result.LogicShifts))
}
```

Expected pattern:
```
Gen  1 | overlap=0.98  logic_shifts=0   (small weight nudges)
Gen  5 | overlap=0.73  logic_shifts=1   (one node mutated type)
Gen 20 | overlap=0.41  logic_shifts=3   (topology diverging)
Gen 50 | overlap=0.12  logic_shifts=7   (heavily evolved)
```

---

## Multi-Parent Splice Chain

You can chain splices to merge three or more trained networks:

```go
cfgA := poly.DefaultSpliceConfig()
cfgA.FitnessA, cfgA.FitnessB = fitnessA, fitnessB

cfgB := poly.DefaultSpliceConfig()
cfgB.FitnessA, cfgB.FitnessB = fitnessMid, fitnessC

mid   := poly.SpliceDNA(netA, netB, cfgA)    // A + B → mid
final := poly.SpliceDNA(mid, netC, cfgB)     // mid + C → final
```

```
netA ──┐
        ├── SpliceDNA ──► mid ──┐
netB ──┘                        ├── SpliceDNA ──► final
                          netC ──┘
```

---

## Immutability Guarantee

Both `SpliceDNA` and `NEATMutate` always operate on **clones** of the input networks. The originals are never modified:

```go
// Verify: run 5 aggressive mutations, original unchanged
original := buildDenseMLP(32, 3)
dnaOrig  := poly.ExtractDNA(original)

aggressiveCfg := poly.NEATConfig{
    NodeMutateRate: 1.0, WeightPerturbRate: 1.0,
    WeightPerturbScale: 10.0, DModel: 32, Seed: 42,
    AllowedLayerTypes: poly.DefaultNEATConfig(32).AllowedLayerTypes,
}
for i := 0; i < 5; i++ {
    _ = poly.NEATMutate(original, aggressiveCfg)
}

dnaAfter := poly.ExtractDNA(original)
result   := poly.CompareNetworks(dnaOrig, dnaAfter)
// result.OverallOverlap == 1.0 — original untouched
```

---

## Quick Reference

| Function | What it does |
|:---------|:-------------|
| `SpliceDNA(A, B, cfg)` | Blend weights from A and B into a child (A's architecture) |
| `SpliceDNAWithReport(A, B, cfg)` | Same + diagnostic report with per-layer similarities |
| `DefaultSpliceConfig()` | Returns blend mode, alpha=0.5, split=0.5 |
| `NEATMutate(n, cfg)` | Returns a structurally mutated clone of n |
| `DefaultNEATConfig(dModel)` | Conservative rates, all 17 mutable types allowed |
| `NewNEATPopulation(seed, size, cfg)` | Create diverse initial population from seed |
| `pop.Evolve(fitnessFn)` | Run one generation: evaluate → sort → elites → offspring |
| `pop.Best()` | Highest-fitness network from last Evolve |
| `pop.BestFitness()` | Fitness score of the top network |
| `pop.Summary(gen)` | One-line status: best/avg/worst fitness |

---

## Softmax Variants

Source: https://openfluke.com/docs/softmax
Markdown: https://openfluke.com/docs/softmax.md

# Softmax Variants

`LayerSoftmax` (type 15) defines ten softmax variants, selected by the `SoftmaxType` field on `VolumetricLayer`. Eight have their own behavior; Adaptive and Mixture currently fall back to Standard. All variants are differentiable and work across all 21 DTypes.

---

## The Standard Formula

All variants start from the numerically stable form:

```
logits_shifted = logits - max(logits)   ← prevents overflow
exp_vals[i]    = exp(logits_shifted[i])
probs[i]       = exp_vals[i] / sum(exp_vals)
```

This is implemented in `Softmax(logits []float32) []float32`.
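A standalone sketch of the stable form follows. It mirrors the three steps above but is written from the formula, not copied from Loom's `Softmax`; the accumulation in `float64` is a choice made here.

```go
package main

import (
    "fmt"
    "math"
)

// softmax shifts by the maximum logit, exponentiates, and normalizes.
// The shift keeps every exponent <= 0, so exp never overflows.
func softmax(logits []float32) []float32 {
    out := make([]float32, len(logits))
    if len(logits) == 0 {
        return out
    }
    maxLogit := logits[0]
    for _, v := range logits {
        if v > maxLogit {
            maxLogit = v
        }
    }
    var sum float64
    for _, v := range logits {
        sum += math.Exp(float64(v - maxLogit))
    }
    for i, v := range logits {
        out[i] = float32(math.Exp(float64(v-maxLogit)) / sum)
    }
    return out
}

func main() {
    fmt.Println(softmax([]float32{2.0, 1.0, 0.1}))
    fmt.Println(softmax([]float32{1000, 999, 998})) // large logits stay finite
}
```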

---

## SoftmaxType Constants

```go
const (
    SoftmaxStandard     SoftmaxType = 0
    SoftmaxGrid         SoftmaxType = 1
    SoftmaxHierarchical SoftmaxType = 2
    SoftmaxTemperature  SoftmaxType = 3
    SoftmaxGumbel       SoftmaxType = 4
    SoftmaxMasked       SoftmaxType = 5
    SoftmaxSparse       SoftmaxType = 6
    SoftmaxAdaptive     SoftmaxType = 7
    SoftmaxMixture      SoftmaxType = 8
    SoftmaxEntmax       SoftmaxType = 9
)
```

---

## Variant 0: Standard

```
probs = softmax(logits)
```

The classic form. All outputs are positive and sum to 1. Smooth gradient everywhere.

**When to use:** Classification heads, final output layers, any time you need a valid probability distribution.

```
Input:  [2.0, 1.0, 0.1]
           ▼
Shifted: [0.0, -1.0, -1.9]   ← subtract max = 2.0
           ▼
Exps:    [1.00, 0.37, 0.15]
Sum = 1.52
           ▼
Output:  [0.66, 0.24, 0.10]  ← sums to 1.0
```

---

## Variant 3: Temperature

```
probs = softmax(logits / temperature)
```

Temperature `T` (stored in `VolumetricLayer.Temperature`) controls sharpness.

```
┌──────────────────────────────────────────────────────────────┐
│  temperature = 0.1 (sharp):                                  │
│    Input: [2.0, 1.8, 0.1] → Output: ≈[0.88, 0.12, 0.00]    │
│    Effect: "confident" — almost winner-takes-all              │
│                                                              │
│  temperature = 1.0 (standard):                               │
│    Input: [2.0, 1.8, 0.1] → Output: ≈[0.51, 0.42, 0.08]    │
│                                                              │
│  temperature = 5.0 (smooth):                                 │
│    Input: [2.0, 1.8, 0.1] → Output: ≈[0.38, 0.36, 0.26]    │
│    Effect: "uncertain" — options spread more evenly          │
└──────────────────────────────────────────────────────────────┘
```

**When to use:** Token sampling in language models (low T = greedy, high T = diverse), exploration vs. exploitation in RL.
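Temperature scaling is a one-line change on top of standard softmax. The sketch below is self-contained; `softmaxTemperature` is a hypothetical name, and treating a non-positive temperature as 1 is an assumption of this example.

```go
package main

import (
    "fmt"
    "math"
)

// softmax is the numerically stable form: shift by max, exponentiate, normalize.
func softmax(logits []float32) []float32 {
    out := make([]float32, len(logits))
    if len(logits) == 0 {
        return out
    }
    maxLogit := logits[0]
    for _, v := range logits {
        if v > maxLogit {
            maxLogit = v
        }
    }
    var sum float64
    for _, v := range logits {
        sum += math.Exp(float64(v - maxLogit))
    }
    for i, v := range logits {
        out[i] = float32(math.Exp(float64(v-maxLogit)) / sum)
    }
    return out
}

// softmaxTemperature divides every logit by temperature before softmax.
// Assumption: a non-positive temperature falls back to 1 (standard softmax).
func softmaxTemperature(logits []float32, temperature float32) []float32 {
    if temperature <= 0 {
        temperature = 1
    }
    scaled := make([]float32, len(logits))
    for i, v := range logits {
        scaled[i] = v / temperature
    }
    return softmax(scaled)
}

func main() {
    logits := []float32{2.0, 1.8, 0.1}
    for _, temp := range []float32{0.1, 1.0, 5.0} {
        fmt.Printf("T=%.1f -> %.2f\n", temp, softmaxTemperature(logits, temp))
    }
}
```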

---

## Variant 4: Gumbel

```
noise[i] = -log(-log(Uniform(0,1)))   ← Gumbel noise
probs = softmax(logits + noise)
```

Adds independent Gumbel noise to each logit before computing softmax. The output is stochastic: higher logits usually receive more mass, but the ranking can change from call to call. Gumbel noise is the natural choice here because of the Gumbel-max trick: taking `argmax(logits + noise)` draws an exact sample from `softmax(logits)`.

**When to use:** Discrete sampling without the `argmax` non-differentiability. Training generative models with categorical outputs. Controlled exploration in MoE routing.

```
Same logits, three calls:
  Call 1: [0.71, 0.24, 0.05]  ← high logit usually wins
  Call 2: [0.48, 0.40, 0.12]  ← noise sometimes shifts result
  Call 3: [0.82, 0.14, 0.04]
```
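The two formula lines can be sketched as follows. This is an illustration under stated assumptions: `gumbelSoftmax` and its `seed` parameter are hypothetical, and the uniform draw is nudged away from 0 and 1 so that both logarithms stay finite.

```go
package main

import (
    "fmt"
    "math"
    "math/rand"
)

// softmax is the numerically stable form: shift by max, exponentiate, normalize.
func softmax(logits []float32) []float32 {
    out := make([]float32, len(logits))
    if len(logits) == 0 {
        return out
    }
    maxLogit := logits[0]
    for _, v := range logits {
        if v > maxLogit {
            maxLogit = v
        }
    }
    var sum float64
    for _, v := range logits {
        sum += math.Exp(float64(v - maxLogit))
    }
    for i, v := range logits {
        out[i] = float32(math.Exp(float64(v-maxLogit)) / sum)
    }
    return out
}

// gumbelSoftmax adds -log(-log(u)) noise to each logit, then applies softmax.
// The seed makes a run reproducible; different seeds give different outputs.
func gumbelSoftmax(logits []float32, seed int64) []float32 {
    rng := rand.New(rand.NewSource(seed))
    noisy := make([]float32, len(logits))
    for i, v := range logits {
        u := 1e-12 + (1-2e-12)*rng.Float64() // keep u strictly inside (0, 1)
        noisy[i] = v + float32(-math.Log(-math.Log(u)))
    }
    return softmax(noisy)
}

func main() {
    logits := []float32{2.0, 1.0, 0.1}
    for seed := int64(1); seed <= 3; seed++ {
        fmt.Printf("seed %d: %.2f\n", seed, gumbelSoftmax(logits, seed))
    }
}
```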

---

## Variant 5: Masked

```
masked_logits[i] = logits[i]  if mask[i] == true
                 = -1e9        if mask[i] == false
probs = softmax(masked_logits)
```

`Mask` is a `[]bool` field on `VolumetricLayer`. Positions where `mask[i] = false` get `-1e9` as their logit, so their `exp` output is effectively zero and they end up with probability 0 after softmax.

The backward pass respects the mask: gradients are zeroed for masked positions.

**When to use:**
- Causal attention (prevent attending to future tokens)
- Legal-move filtering (board games, planning)
- Expert routing where some experts are unavailable

```
Logits: [2.0, 1.0, 0.5, 1.5]
Mask:   [T,   F,   T,   T  ]

After masking: [2.0, -1e9, 0.5, 1.5]
After softmax: [0.55, 0.00, 0.12, 0.33]
               masked position → 0 ✓
```
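A minimal sketch of the masking step, written from the formula above. `maskedSoftmax` is a hypothetical name; treating positions beyond the end of the mask as masked is an assumption of this example.

```go
package main

import (
    "fmt"
    "math"
)

// softmax is the numerically stable form: shift by max, exponentiate, normalize.
func softmax(logits []float32) []float32 {
    out := make([]float32, len(logits))
    if len(logits) == 0 {
        return out
    }
    maxLogit := logits[0]
    for _, v := range logits {
        if v > maxLogit {
            maxLogit = v
        }
    }
    var sum float64
    for _, v := range logits {
        sum += math.Exp(float64(v - maxLogit))
    }
    for i, v := range logits {
        out[i] = float32(math.Exp(float64(v-maxLogit)) / sum)
    }
    return out
}

// maskedSoftmax replaces logits at masked-out positions with -1e9 so that
// their probability underflows to zero. If every position is masked, all
// logits are equal and the result is uniform.
func maskedSoftmax(logits []float32, mask []bool) []float32 {
    masked := make([]float32, len(logits))
    for i, v := range logits {
        if i < len(mask) && mask[i] {
            masked[i] = v
        } else {
            masked[i] = -1e9
        }
    }
    return softmax(masked)
}

func main() {
    logits := []float32{2.0, 1.0, 0.5, 1.5}
    mask := []bool{true, false, true, true}
    fmt.Printf("%.2f\n", maskedSoftmax(logits, mask))
}
```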

---

## Variant 6: Sparse (Sparsemax)

Sparsemax is an alternative to softmax that can produce **exact zeros** — true sparsity rather than just very small values.

```
Algorithm:
1. Sort logits descending: z₁ ≥ z₂ ≥ ... ≥ zₙ
2. Find k = max { k : z_k - (Σᵢ≤ₖ zᵢ - 1)/k > 0 }
3. τ = (Σᵢ≤ₖ zᵢ - 1) / k
4. output[i] = max(0, z[i] - τ)
```

Implemented in `SoftmaxSparseHelper(logits)`.

```
Logits: [3.0, 1.0, -1.0, -3.0]
Standard softmax: [0.86, 0.12, 0.02, 0.00]   ← all non-zero (last ≈ 0.002)
Sparsemax:        [1.00, 0.00, 0.00, 0.00]   ← exact zeros! (k = 1, τ = 2)
```

**When to use:**
- Attention when you want the model to focus on exactly a few tokens
- Interpretability (fewer non-zero attention weights to explain)
- MoE routing (hard assignment to a subset of experts)
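The four algorithm steps translate directly into Go. The sketch below is written from the algorithm as listed, not taken from `SoftmaxSparseHelper`; the function name `sparsemax` is illustrative.

```go
package main

import (
    "fmt"
    "sort"
)

// sparsemax projects logits onto the probability simplex.
// It sorts a copy descending, finds the largest k whose k-th value still
// exceeds the running threshold, and clips everything below tau to zero.
func sparsemax(logits []float32) []float32 {
    n := len(logits)
    out := make([]float32, n)
    if n == 0 {
        return out
    }
    sorted := make([]float64, n)
    for i, v := range logits {
        sorted[i] = float64(v)
    }
    sort.Sort(sort.Reverse(sort.Float64Slice(sorted)))

    var cumSum, tau float64
    for k := 1; k <= n; k++ {
        cumSum += sorted[k-1]
        candidate := (cumSum - 1) / float64(k)
        if sorted[k-1] > candidate { // z_k - (sum_{i<=k} z_i - 1)/k > 0
            tau = candidate // keep the threshold of the largest valid k
        }
    }
    for i, v := range logits {
        if p := float64(v) - tau; p > 0 {
            out[i] = float32(p)
        }
    }
    return out
}

func main() {
    fmt.Println(sparsemax([]float32{1.5, 1.0, -1.0, -3.0}))
    fmt.Println(sparsemax([]float32{3.0, 1.0, -1.0, -3.0}))
}
```

When the top logit leads the runner-up by 1 or more, all mass lands on the top entry; closer logits share the mass.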

---

## Variant 9: Entmax

Entmax is a family of distributions parameterized by `alpha`. It interpolates between softmax and sparsemax:

- `alpha = 1.0` → standard softmax
- `alpha = 2.0` → sparsemax
- `alpha = 1.5` → the recommended default (used in original paper)

```go
layer.EntmaxAlpha = 1.5   // set on VolumetricLayer
```

Implemented in `SoftmaxEntmaxHelper(logits, alpha)`:

```go
weight := alpha - 1.0
s1 := Softmax(logits)
s2 := SoftmaxSparseHelper(logits)
result[i] = (1-weight)*s1[i] + weight*s2[i]
// renormalize to sum to 1
```

**When to use:** When you want controllable sparsity. Start with `alpha=1.5` and tune toward 2.0 for sparser attention.
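The blend described above can be sketched on two precomputed distributions. `blendEntmax` is a hypothetical name, and clamping `alpha` to `[1, 2]` is an assumption of this example. Note that this linear mix of softmax and sparsemax is what the snippet above describes; the α-entmax transform in the literature is defined differently (it solves for a threshold), so results for `1 < alpha < 2` will generally not match it exactly.

```go
package main

import "fmt"

// blendEntmax mixes a softmax distribution s1 with a sparsemax distribution
// s2 using weight = alpha - 1, then renormalizes. Both slices must have the
// same length. Assumption: alpha is clamped to [1, 2].
func blendEntmax(s1, s2 []float32, alpha float32) []float32 {
    if alpha < 1 {
        alpha = 1
    }
    if alpha > 2 {
        alpha = 2
    }
    weight := alpha - 1
    out := make([]float32, len(s1))
    var sum float32
    for i := range s1 {
        out[i] = (1-weight)*s1[i] + weight*s2[i]
        sum += out[i]
    }
    if sum > 0 {
        for i := range out {
            out[i] /= sum
        }
    }
    return out
}

func main() {
    // Approximate softmax and exact sparsemax of logits [3, 1, -1, -3].
    softmaxProbs := []float32{0.865, 0.117, 0.016, 0.002}
    sparsemaxProbs := []float32{1, 0, 0, 0}
    for _, alpha := range []float32{1.0, 1.5, 2.0} {
        fmt.Printf("alpha=%.1f -> %.4f\n", alpha, blendEntmax(softmaxProbs, sparsemaxProbs, alpha))
    }
}
```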

---

## Variant 1: Grid

Grid softmax applies standard softmax independently to each **row** of a 2D interpretation of the input:

```
Input flat tensor reinterpreted as [SoftmaxRows, SoftmaxCols]:
  Row 0: softmax([logits[0:cols]])    → row probs sum to 1
  Row 1: softmax([logits[cols:2cols]]) → row probs sum to 1
  ...
```

Each row is an independent probability distribution.

**When to use:**
- Native Mixture of Experts: each row represents one expert's output distribution
- Multi-label classification where each "group" of labels is mutually exclusive
- Per-head attention normalization without the full MHA overhead

```
Input (flat): [2.0, 1.0, | 0.5, 3.0, | 1.5, 1.5]
Rows=3, Cols=2:

  Row 0: softmax([2.0, 1.0]) = [0.73, 0.27]
  Row 1: softmax([0.5, 3.0]) = [0.08, 0.92]
  Row 2: softmax([1.5, 1.5]) = [0.50, 0.50]
```
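Row-wise softmax over a flat slice can be sketched like this. `gridSoftmax` is a hypothetical name, and returning an error on a shape mismatch is a choice made for the example; the source does not say how Loom handles that case.

```go
package main

import (
    "fmt"
    "math"
)

// softmax is the numerically stable form: shift by max, exponentiate, normalize.
func softmax(logits []float32) []float32 {
    out := make([]float32, len(logits))
    if len(logits) == 0 {
        return out
    }
    maxLogit := logits[0]
    for _, v := range logits {
        if v > maxLogit {
            maxLogit = v
        }
    }
    var sum float64
    for _, v := range logits {
        sum += math.Exp(float64(v - maxLogit))
    }
    for i, v := range logits {
        out[i] = float32(math.Exp(float64(v-maxLogit)) / sum)
    }
    return out
}

// gridSoftmax treats logits as a rows x cols matrix in row-major order and
// normalizes each row independently.
func gridSoftmax(logits []float32, rows, cols int) ([]float32, error) {
    if rows < 0 || cols < 0 || rows*cols != len(logits) {
        return nil, fmt.Errorf("grid %dx%d does not match %d logits", rows, cols, len(logits))
    }
    out := make([]float32, 0, len(logits))
    for r := 0; r < rows; r++ {
        out = append(out, softmax(logits[r*cols:(r+1)*cols])...)
    }
    return out, nil
}

func main() {
    probs, err := gridSoftmax([]float32{2.0, 1.0, 0.5, 3.0, 1.5, 1.5}, 3, 2)
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Printf("%.2f\n", probs)
}
```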

---

## Variant 2: Hierarchical

Hierarchical softmax uses `HierarchyLevels []int` to define a tree structure. The last level of `HierarchyLevels` is used as the column count, with rows computed from `n / cols`. In practice it reduces to Grid softmax with the last level defining the partition.

**When to use:** Large vocabulary prediction where the vocabulary has a natural hierarchical structure (e.g., word categories → words).

---

## Variant 7: Adaptive

Adaptive softmax selects the softmax type based on input statistics (currently implemented as a fallback to standard softmax, intended for future dynamic routing logic).

---

## Variant 8: Mixture

Mixture softmax is a placeholder for weighted combinations of multiple softmax outputs. Currently falls back to standard softmax.

---

## Backward Pass

All variants share the standard softmax Jacobian:

```
gradLogits[j] = probs[j] × (gradOutput[j] - Σᵢ gradOutput[i] × probs[i])
             = probs[j] × (gradOutput[j] - dotProduct)
```

Implemented in `SoftmaxBackward(gradOutput, softmaxOutput []float32)`.

For Grid and Hierarchical variants, the Jacobian is applied independently to each row. For Masked, gradients are zeroed at masked positions before computing the Jacobian.
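The Jacobian-vector product above needs only one dot product and one pass over the outputs. The sketch below is written from the formula; it is not Loom's `SoftmaxBackward`, and it assumes both slices have the same length.

```go
package main

import "fmt"

// softmaxBackward computes gradLogits[j] = probs[j] * (gradOutput[j] - dot),
// where dot = sum_i gradOutput[i] * probs[i].
func softmaxBackward(gradOutput, probs []float32) []float32 {
    var dot float32
    for i := range probs {
        dot += gradOutput[i] * probs[i]
    }
    gradLogits := make([]float32, len(probs))
    for j := range probs {
        gradLogits[j] = probs[j] * (gradOutput[j] - dot)
    }
    return gradLogits
}

func main() {
    probs := []float32{0.659, 0.242, 0.099}
    gradOutput := []float32{1, 0, 0}
    fmt.Println(softmaxBackward(gradOutput, probs))
}
```

Because the probabilities sum to 1, the resulting logit gradients always sum to zero, which makes a convenient sanity check.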

---

## GetLogits

`GetLogits[T Numeric](data []T, temp float64, dtype DType)` converts a tensor's `[]T` data to `[]float32`, applying temperature scaling. It has specialized fast paths for the most common types (float32, float64, int8, etc.) to avoid generic conversion overhead.

---

## Summary Table

| Variant | Produces zeros | Stochastic | Key parameter | Best for |
|:--------|:--------------|:-----------|:--------------|:---------|
| Standard | No | No | — | General classification |
| Temperature | No | No | `Temperature` | Sampling sharpness |
| Gumbel | No | Yes | — | Differentiable sampling |
| Masked | Yes (at mask) | No | `Mask []bool` | Causal attention |
| Sparse | Yes | No | — | Hard sparse attention |
| Entmax | Maybe | No | `EntmaxAlpha` | Tunable sparsity |
| Grid | No | No | `SoftmaxRows/Cols` | MoE, multi-group |
| Hierarchical | No | No | `HierarchyLevels` | Tree vocabularies |
| Adaptive | No | No | — | (future) |
| Mixture | No | No | — | (future) |

---

## Serialization, Persistence, and Loading

Source: https://openfluke.com/docs/serialization
Markdown: https://openfluke.com/docs/serialization.md

# Serialization, Persistence, and Loading

This document covers how `VolumetricNetwork` instances are saved and loaded, the bit-packed persistence format for low-bit types, the idempotency guarantee, and SafeTensors support.

---

## Two Serialization Paths

`poly/` provides two complementary serialization systems:

| File | Functions | Use case |
|:-----|:---------|:---------|
| `serialization.go` | `BuildNetworkFromJSON` | Architecture-only: creates a network from a spec with randomly initialized weights |
| `persistence.go` | `SerializeNetwork` / `DeserializeNetwork` | Full save/load: architecture + trained weights |

---

## Full Save/Load (persistence.go)

### Saving

```go
jsonData, err := poly.SerializeNetwork(network)
if err == nil {
    err = os.WriteFile("model.json", jsonData, 0644)
} // err now holds the first failure, if any; handle it before continuing
```

`SerializeNetwork` walks every layer and builds a `PersistenceNetworkSpec`:

```go
type PersistenceNetworkSpec struct {
    ID            string                   `json:"id"`
    Depth         int                      `json:"depth"`
    Rows          int                      `json:"rows"`
    Cols          int                      `json:"cols"`
    LayersPerCell int                      `json:"layers_per_cell"`
    Layers        []PersistenceLayerSpec   `json:"layers"`
}
```

Each `PersistenceLayerSpec` contains all configuration fields plus:

```go
DType   string   `json:"dtype"`              // active numerical type for this layer (e.g. "Uint8", "FP4")
Weights string   `json:"weights,omitempty"`  // Base64-encoded **native-packed** payload for that dtype
Native  bool     `json:"native,omitempty"`   // true = weights are native-packed (current default on save)
Scale   float32  `json:"scale,omitempty"`    // morph/quant scale used when the checkpoint was written
```

### Native JSON per dtype (not FP32-only)

`SerializeNetwork` no longer dumps a single FP32 master blob for every layer. On save it:

1. Reads each layer’s live `DType` and writes it to `PersistenceLayerSpec.DType`.
2. Calls `WeightStore.Morph(dt)` for that dtype and `encodeNativeWeights(active, dt)` — Int8 as 1 byte/weight, FP4/Int4 as nibbles, Binary as bit-packs, Float64 as LE uint64, etc.
3. Sets `Native: true` and persists `Scale` so reload uses the same quant mapping training saw.

**Implication:** a **Uint8** Dense checkpoint is ~**0.8 KB** on disk for the Lucy 8×1024→512 bench; **Float64** is ~**5.4 MB** for the same topology — see the **File** column in Lucy’s training matrix (`lucy/lucy_testing_output/log.txt`). You can train, save, and reload **each of the 21 dtypes** independently; Lucy’s Dense suite reports **Save/Reload PASS** on all of them in the latest full run.

Older checkpoints with `Native: false` (FP32 master only) still load via `decodeWeights`; new saves prefer native packing.

### Loading

```go
jsonData, readErr := os.ReadFile("model.json") // check readErr before decoding
network, err := poly.DeserializeNetwork(jsonData)
```

`DeserializeNetwork` reconstructs the `VolumetricNetwork`, initializes fresh `WeightStore`s, then calls `applyPersistenceLayerSpec` for each layer, which:

1. Parses all config fields
2. Calls `initializeWeights(l)` to allocate the correct `WeightStore` size
3. Decodes the `Weights` string — using `decodeNativeWeights` if `Native=true`, or `decodeWeights` (FP32 master) if `Native=false`
4. If native format (`Native=true`): stores in `Versions[dtype]`, then calls `Unpack(dtype)` to reconstruct the FP32 master for training paths that still use master weights
5. Recursively applies the same process to `ParallelBranches` and `SequentialLayers`

---

## The Bit-Packing System

The core serialization innovation is `encodeNativeWeights(data any, dt DType) string`.

This function takes the `active` version from the `WeightStore.Versions` map and packs it into the most compact binary representation before Base64 encoding:

```
DType          Packing                    Ratio vs FP32
──────────────────────────────────────────────────────
Float64        8 bytes/weight (LE uint64)   2x larger
Float32        4 bytes/weight (LE uint32)   1x (baseline)
Float16        4 bytes (stored as float32)  not yet compact
BFloat16       4 bytes (stored as float32)  not yet compact
Int8/Uint8     1 byte/weight                4x reduction
Int4/FP4/Uint4 0.5 bytes (2 per byte)      8x reduction
Int2/Uint2     0.25 bytes (4 per byte)     16x reduction
Ternary        0.25 bytes (4 per byte)     16x reduction
Binary         0.125 bytes (8 per byte)    32x reduction
```

### 4-bit Packing Detail

```go
// Pack 2 int8 weights into 1 byte using upper and lower nibbles:
if i%2 == 0 {
    buf[i/2] |= byte(v&0x0F) << 4 // high nibble for even index
} else {
    buf[i/2] |= byte(v & 0x0F) // low nibble for odd index
}
```

Unpacking sign-extends the nibble: if the 4-bit value is > 7, subtract 16 to recover the signed value.
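A round trip through the nibble layout can be sketched as below. `packInt4` and `unpackInt4` are hypothetical names; the sketch covers the signed case only (values in `[-8, 7]`) and says nothing about how Loom handles Uint4 or FP4.

```go
package main

import "fmt"

// packInt4 stores two signed 4-bit values per byte: even indices in the
// high nibble, odd indices in the low nibble.
func packInt4(vals []int8) []byte {
    buf := make([]byte, (len(vals)+1)/2)
    for i, v := range vals {
        nibble := byte(v) & 0x0F
        if i%2 == 0 {
            buf[i/2] |= nibble << 4
        } else {
            buf[i/2] |= nibble
        }
    }
    return buf
}

// unpackInt4 reads n values back and sign-extends each nibble:
// raw values above 7 represent negatives, so 16 is subtracted.
func unpackInt4(buf []byte, n int) []int8 {
    out := make([]int8, n)
    for i := range out {
        var nibble byte
        if i%2 == 0 {
            nibble = buf[i/2] >> 4
        } else {
            nibble = buf[i/2] & 0x0F
        }
        v := int8(nibble)
        if v > 7 {
            v -= 16
        }
        out[i] = v
    }
    return out
}

func main() {
    vals := []int8{-8, -1, 0, 7, 3}
    packed := packInt4(vals)
    fmt.Printf("packed:   % x\n", packed)
    fmt.Println("unpacked:", unpackInt4(packed, len(vals)))
}
```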

### 2-bit/Ternary Packing Detail

```go
// Pack 4 values into 1 byte using 2-bit fields:
shift := uint(6 - (i%4)*2)   // 6, 4, 2, 0
buf[i/4] |= (val & 0x03) << shift
```

Unpacking reverses the shift and sign-extends from 2-bit.
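The same round trip for 2-bit fields, as a sketch. The function names are hypothetical. With two's-complement sign extension a raw field of 3 decodes to -1 and 2 decodes to -2, so ternary values (-1, 0, +1) survive unchanged.

```go
package main

import "fmt"

// pack2Bit stores four 2-bit values per byte, first value in the top bits.
func pack2Bit(vals []int8) []byte {
    buf := make([]byte, (len(vals)+3)/4)
    for i, v := range vals {
        shift := uint(6 - (i%4)*2) // 6, 4, 2, 0
        buf[i/4] |= (byte(v) & 0x03) << shift
    }
    return buf
}

// unpack2Bit reverses the shift and sign-extends each 2-bit field.
func unpack2Bit(buf []byte, n int) []int8 {
    out := make([]int8, n)
    for i := range out {
        shift := uint(6 - (i%4)*2)
        v := int8((buf[i/4] >> shift) & 0x03)
        if v > 1 {
            v -= 4 // 2 -> -2, 3 -> -1
        }
        out[i] = v
    }
    return out
}

func main() {
    ternary := []int8{-1, 0, 1, 1, -1}
    packed := pack2Bit(ternary)
    fmt.Printf("packed:   % x\n", packed)
    fmt.Println("unpacked:", unpack2Bit(packed, len(ternary)))
}
```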

### Binary Packing Detail

```go
// Pack 8 weights into 1 byte, MSB first:
if v > 0 { buf[i/8] |= (1 << uint(7-(i%8))) }
```

Unpacking reads each bit and maps `1 → +1`, `0 → -1`.
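A sketch of the binary layout, MSB first. `packBinary` and `unpackBinary` are hypothetical names. Since only the sign test `v > 0` is stored, a weight of 0 comes back as -1.

```go
package main

import "fmt"

// packBinary stores one bit per weight, MSB first: bit = 1 when v > 0.
func packBinary(vals []int8) []byte {
    buf := make([]byte, (len(vals)+7)/8)
    for i, v := range vals {
        if v > 0 {
            buf[i/8] |= 1 << uint(7-(i%8))
        }
    }
    return buf
}

// unpackBinary maps each stored bit back to a weight: 1 -> +1, 0 -> -1.
func unpackBinary(buf []byte, n int) []int8 {
    out := make([]int8, n)
    for i := range out {
        if buf[i/8]&(1<<uint(7-(i%8))) != 0 {
            out[i] = 1
        } else {
            out[i] = -1
        }
    }
    return out
}

func main() {
    weights := []int8{1, -1, 1, 1, -1, -1, -1, 1, 1}
    packed := packBinary(weights)
    fmt.Printf("packed:   %08b\n", packed)
    fmt.Println("unpacked:", unpackBinary(packed, len(weights)))
}
```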

---

## Idempotency Guarantee

The README states: "Serializing a reloaded model produces a byte-for-byte identical JSON to the original."

This holds because:

1. `DeserializeNetwork` calls `Unpack(dtype)` which reconstructs `Master` from the packed data
2. The next `SerializeNetwork` call reads `Master`, calls `Morph(dtype)` again (if needed), and re-packs
3. Since `Morph` is deterministic (same formula, same scale), and the `Master` was faithfully reconstructed by `Unpack`, the output bytes are identical

Verified across 378 permutations (18 layer types × 21 DTypes) with **0.000000% mathematical divergence**.
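The argument in steps 1 to 3 can be illustrated with a toy quantizer. This is a sketch under loud assumptions: Loom's actual `Morph` and `Unpack` formulas are not shown in this document, so the symmetric `round(w / scale)` Int8 mapping below is a hypothetical stand-in. It only demonstrates why "deterministic quantize + faithful dequantize + same scale" gives identical packed values on the second save.

```go
package main

import (
    "fmt"
    "math"
)

// quantizeInt8 is a stand-in for Morph: round(w / scale), clamped to [-127, 127].
// scale must be positive.
func quantizeInt8(weights []float32, scale float32) []int8 {
    q := make([]int8, len(weights))
    for i, w := range weights {
        r := math.Round(float64(w / scale))
        q[i] = int8(math.Max(-127, math.Min(127, r)))
    }
    return q
}

// dequantizeInt8 is a stand-in for Unpack: rebuild float weights from q * scale.
func dequantizeInt8(q []int8, scale float32) []float32 {
    w := make([]float32, len(q))
    for i, v := range q {
        w[i] = float32(v) * scale
    }
    return w
}

func main() {
    trained := []float32{0.123, -0.456, 0.999, -2.5, 0}
    scale := float32(0.01)

    firstSave := quantizeInt8(trained, scale)         // save
    reloaded := dequantizeInt8(firstSave, scale)      // load
    secondSave := quantizeInt8(reloaded, scale)       // save again

    fmt.Println("first save: ", firstSave)
    fmt.Println("second save:", secondSave)
}
```

The first quantization loses information, but every later save/load cycle starts from values that already sit on the quantization grid, so nothing further changes.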

---

## Architecture-Only JSON (serialization.go)

`BuildNetworkFromJSON` creates a network from a spec but uses **random weight initialization** (via `initializeWeights` which calls `Randomize`). This is for defining network topologies without weights.

```go
type LayerSpec struct {
    Z, Y, X, L    int
    Type           string   // "Dense", "CNN2", etc.
    Activation     string   // "ReLU", "Tanh", etc.
    DType          string   // "float32", "int8", etc.
    InputHeight    int
    OutputHeight   int
    // ... all configuration fields
    ParallelBranches []LayerSpec   // recursive
    SequentialLayers []LayerSpec   // recursive
}
```

`ParseLayerType`, `ParseActivationType`, and `ParseDType` accept case-insensitive strings plus common aliases.

---

## SafeTensors Support

`safetensors.go` and `prefix_safetensor.go` implement loading from the HuggingFace SafeTensors format, enabling direct weight import from PyTorch/HuggingFace checkpoints.

`universal_loader.go` provides auto-detection of the model format.

The `Transformer[T]` type has dedicated loading support in `transformer.go` for assembling a full LLM from SafeTensors files: it maps weight tensor names (e.g., `"model.layers.0.self_attn.q_proj.weight"`) to the correct `VolumetricLayer` positions and weight sub-slices.

---

## Compression Ratios in Practice

From the README, for a network with 1M weights:

```
┌──────────────────────────────────────────────────────────────┐
│  DType     RAM (uncompressed)   JSON size   Ratio            │
├──────────────────────────────────────────────────────────────┤
│  Float32   4.0 MB               ~5.5 MB     1.38x (base64)  │
│  Int8      1.0 MB               ~1.4 MB     0.34x vs FP32   │
│  Int4      0.5 MB               ~0.7 MB     0.17x           │
│  Binary    0.125 MB             ~0.18 MB    0.045x ← 96.7%  │
└──────────────────────────────────────────────────────────────┘
```

Base64 encoding adds ~33% overhead over the raw binary size. The 96.7% figure is the size reduction relative to FP32 on disk (~0.18 MB vs ~5.5 MB, both including the Base64 overhead).
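The payload sizes can be estimated with the standard library. This sketch assumes padded standard Base64 (`base64.StdEncoding`); the document does not say which Base64 variant Loom uses, and real JSON files add field names and per-layer metadata on top. `encodedSize` is a hypothetical helper.

```go
package main

import (
    "encoding/base64"
    "fmt"
)

// encodedSize returns the packed byte count and the Base64 character count
// for numWeights weights stored at bitsPerWeight bits each.
func encodedSize(numWeights, bitsPerWeight int) (rawBytes, base64Chars int) {
    rawBytes = (numWeights*bitsPerWeight + 7) / 8
    return rawBytes, base64.StdEncoding.EncodedLen(rawBytes)
}

func main() {
    const numWeights = 1000000
    formats := []struct {
        name string
        bits int
    }{
        {"Float32", 32}, {"Int8", 8}, {"Int4", 4}, {"Binary", 1},
    }
    for _, f := range formats {
        raw, encoded := encodedSize(numWeights, f.bits)
        fmt.Printf("%-8s raw=%8d bytes  base64=%8d chars\n", f.name, raw, encoded)
    }
}
```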

---

## Weight Encoding Flow

```
Training produces Master []float32
                │
                ▼ (if layer.DType != DTypeFloat32)
         Morph(layer.DType)
                │
                ▼
         Versions[dtype] = []int8 / []int4 / etc.
                │
                ▼
    encodeNativeWeights(active, dtype)
                │
                ▼
           bit-packing
                │
                ▼
          Base64 encode
                │
                ▼
    PersistenceLayerSpec.Weights = "base64string..."
    PersistenceLayerSpec.Native  = true
    PersistenceLayerSpec.Scale   = ws.Scale
```

---

## Deserialization and Unpack Flow

```
JSON string
    │
    ▼ json.Unmarshal
PersistenceNetworkSpec
    │
    ▼ applyPersistenceLayerSpec
For each layer:
    1. ParseLayerType / ParseActivationType / ParseDType
    2. initializeWeights → fresh WeightStore allocated
    3. if ls.Native:
          decodeNativeWeights → Versions[dtype] = packed slices
          ws.Unpack(dtype) → Master reconstructed
       else:
          decodeWeights → Master loaded directly
    4. Recurse for ParallelBranches, SequentialLayers
```

After `DeserializeNetwork`, every layer's `WeightStore.Master` is a valid FP32 weight array ready for forward inference or further training.

---

## Parallel and Sequential Layers

Source: https://openfluke.com/docs/parallel-sequential
Markdown: https://openfluke.com/docs/parallel-sequential.md

# Parallel and Sequential Layers

This document explains `LayerParallel` and `LayerSequential` in depth: how they fan out and chain sub-layers, the five combination modes, the recursive activation tree, and how backpropagation flows through nested structures.

---

## LayerParallel

`ParallelForwardPolymorphic` fans the input to every branch simultaneously and then combines the results.

### Configuration

```go
layer.Type        = poly.LayerParallel
layer.CombineMode = "concat"   // or "add", "avg", "filter", "grid_scatter"
layer.ParallelBranches = []poly.VolumetricLayer{
    {Type: poly.LayerDense, InputHeight: 64, OutputHeight: 32, ...},
    {Type: poly.LayerRNN,   InputHeight: 64, OutputHeight: 32, ...},
    {Type: poly.LayerCNN1,  InputHeight: 64, ...},
}
```

Each entry in `ParallelBranches` is a full `VolumetricLayer` — it can itself be a `LayerParallel` or `LayerSequential`, enabling unlimited nesting.

### Combination Modes

#### "add"

Element-wise sum of all branch outputs. All branches must produce the same output shape.

```
Input ──▶ Branch 0 ──▶ [32]
Input ──▶ Branch 1 ──▶ [32]   →   [32] (sum of all)
Input ──▶ Branch 2 ──▶ [32]
```

Use for: residual-style ensembles, multi-path feature accumulation.

#### "avg"

Element-wise average of all branch outputs. Same shape requirement as "add".

```
Output[i] = (Branch0[i] + Branch1[i] + ... + BranchN[i]) / N
```

Use for: soft ensemble averaging where no single branch should dominate.

#### "concat" / "grid_scatter"

Concatenates all branch outputs into one flat tensor. Branch output sizes can differ.

```
Input ──▶ Branch 0 ──▶ [32]
Input ──▶ Branch 1 ──▶ [16]   →   [32, 16, 64] = [112]
Input ──▶ Branch 2 ──▶ [64]
```

`"grid_scatter"` behaves identically to `"concat"` in the current implementation — they share the same code path. The name signals intent: scatter the input across a grid of experts, then collect all outputs.

Use for: multi-scale feature extraction, heterogeneous expert outputs before a routing layer.
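The three non-gated modes reduce to a few loops over the branch outputs. The sketch below works on plain `[]float32` slices; `combine` is a hypothetical function, and returning an error on a shape mismatch is a choice made for the example.

```go
package main

import "fmt"

// combine merges branch outputs according to mode.
// "add" and "avg" require equal lengths; "concat" and "grid_scatter"
// simply append the outputs in branch order.
func combine(mode string, outputs [][]float32) ([]float32, error) {
    if len(outputs) == 0 {
        return nil, fmt.Errorf("no branch outputs")
    }
    switch mode {
    case "concat", "grid_scatter":
        var out []float32
        for _, branch := range outputs {
            out = append(out, branch...)
        }
        return out, nil
    case "add", "avg":
        out := make([]float32, len(outputs[0]))
        for idx, branch := range outputs {
            if len(branch) != len(out) {
                return nil, fmt.Errorf("branch %d has length %d, want %d", idx, len(branch), len(out))
            }
            for i, v := range branch {
                out[i] += v
            }
        }
        if mode == "avg" {
            for i := range out {
                out[i] /= float32(len(outputs))
            }
        }
        return out, nil
    }
    return nil, fmt.Errorf("unknown combine mode %q", mode)
}

func main() {
    branches := [][]float32{{1, 2}, {3, 4}}
    for _, mode := range []string{"add", "avg", "concat"} {
        out, err := combine(mode, branches)
        fmt.Println(mode, out, err)
    }
}
```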

#### "filter" (Soft Mixture of Experts)

Uses a separate gate sub-layer to produce per-branch weights, then computes a weighted sum:

```go
layer.FilterGateConfig = &poly.VolumetricLayer{
    Type:         poly.LayerDense,
    InputHeight:  64,
    OutputHeight: 3,   // one scalar per branch
    Activation:   poly.ActivationLinear,
}
```

At forward time:

```
Input ──▶ FilterGateConfig ──▶ [numBranches]
                │
         Softmax(gate_logits)
                │
          [w0, w1, w2]  ← learned routing weights

Input ──▶ Branch 0 ──▶ [32] × w0
Input ──▶ Branch 1 ──▶ [32] × w1  →  [32] (weighted sum)
Input ──▶ Branch 2 ──▶ [32] × w2
```

Use for: differentiable Mixture of Experts (MoE), learned feature gating, adaptive multi-scale fusion.
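
As a rough sketch of the arithmetic in the diagram above, the gate logits pass through a softmax and the result weights each branch output. The names `softmax` and `filterCombine` are hypothetical, and the gate logits are passed in directly rather than produced by a gate sub-layer.

```go
package main

import (
	"fmt"
	"math"
)

// softmax converts gate logits into weights that sum to 1.
func softmax(logits []float64) []float64 {
	maxLogit := logits[0]
	for _, l := range logits {
		maxLogit = math.Max(maxLogit, l)
	}
	weights := make([]float64, len(logits))
	var sum float64
	for i, l := range logits {
		weights[i] = math.Exp(l - maxLogit) // subtract max for numerical stability
		sum += weights[i]
	}
	for i := range weights {
		weights[i] /= sum
	}
	return weights
}

// filterCombine computes the weighted sum of branch outputs, using
// softmax(gateLogits) as the per-branch weights. All branch outputs are
// assumed to share one shape.
func filterCombine(gateLogits []float64, branchOuts [][]float64) []float64 {
	w := softmax(gateLogits)
	out := make([]float64, len(branchOuts[0]))
	for b, branch := range branchOuts {
		for i, v := range branch {
			out[i] += w[b] * v
		}
	}
	return out
}

func main() {
	gate := []float64{0, 0, 0} // equal logits give equal routing weights
	outs := [][]float64{{3, 0}, {0, 3}, {3, 3}}
	fmt.Println(filterCombine(gate, outs))
}
```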

---

## The Activation Tree (Tensor.Nested)

The key to making arbitrary nesting differentiable is the `Nested []*Tensor[T]` field on `Tensor`.

During `ParallelForwardPolymorphic`, each branch produces its own `(bPre, bOut)` pair. The branch `preAct` tensors are collected into a slice and stored as `Nested` on the returned `preAct`:

```go
preAct = &Tensor[T]{
    Data:   input.Data,     // proxy — carries input shape
    Shape:  input.Shape,
    DType:  input.DType,
    Nested: branchPreActs,  // [branch0.preAct, branch1.preAct, ...]
}
```

During `ParallelBackwardPolymorphic`, the backward function reads `preAct.Nested[i]` to get the correct cached state for each branch:

```go
var bPre *Tensor[T]
if preAct != nil && i < len(preAct.Nested) {
    bPre = preAct.Nested[i]
}
gIn, gW := DispatchLayerBackward(target, scaledGrad, input, nil, bPre)
```

This creates a recursive tree of activation caches that mirrors the nesting depth of the network:

```
preAct.Nested:
├── Branch 0 preAct
│     └── (if branch 0 is also Parallel)
│           └── .Nested
│                 ├── Sub-branch 0 preAct
│                 └── Sub-branch 1 preAct
├── Branch 1 preAct
└── Branch 2 preAct
```

The backward pass recursively walks this tree, ensuring each sub-layer gets the exact cached pre-activation it needs to compute its gradient.
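
To make the tree shape concrete, here is a small sketch of a depth-first walk over a `Nested` tree. The `Tensor` type below is a stripped-down stand-in with only the fields needed for the recursion, and `walk` and `countNodes` are hypothetical helpers, not functions from `poly/`.

```go
package main

import "fmt"

// Tensor is a simplified stand-in for poly's Tensor[T].
type Tensor[T any] struct {
	Data   []T
	Nested []*Tensor[T]
}

// walk visits every node of the activation tree depth-first and calls visit
// with the path of branch indices that leads to the node.
func walk[T any](t *Tensor[T], path []int, visit func(path []int, t *Tensor[T])) {
	if t == nil {
		return
	}
	visit(path, t)
	for i, child := range t.Nested {
		childPath := append(append([]int{}, path...), i) // copy, then extend
		walk(child, childPath, visit)
	}
}

// countNodes returns the number of tensors reachable from root, root included.
func countNodes[T any](root *Tensor[T]) int {
	n := 0
	walk(root, nil, func([]int, *Tensor[T]) { n++ })
	return n
}

func main() {
	// Branch 0 is itself parallel with two sub-branches, as in the diagram.
	root := &Tensor[float32]{Nested: []*Tensor[float32]{
		{Nested: []*Tensor[float32]{{}, {}}},
		{},
		{},
	}}
	walk(root, nil, func(path []int, _ *Tensor[float32]) {
		fmt.Println("cache at path", path)
	})
	fmt.Println("total nodes:", countNodes(root))
}
```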

---

## Gradient Flow Through Parallel

For "add" and "avg" modes, the same `gradOutput` (or a scaled version) is sent to every branch:

```
gradOutput
    │
    ├──── scaledGrad ──▶ Branch 0 backward ──▶ gradInput_0 + gradWeights_0
    ├──── scaledGrad ──▶ Branch 1 backward ──▶ gradInput_1 + gradWeights_1
    └──── scaledGrad ──▶ Branch 2 backward ──▶ gradInput_2 + gradWeights_2

gradInput = gradInput_0 + gradInput_1 + gradInput_2  (accumulated)
```

For "avg" mode, `scaledGrad = gradOutput / N` before dispatching.

For "concat" mode, the gradient is **sliced** by branch output size:

```
gradOutput [112]:
  branch 0 slice: gradOutput[0:32]   → Branch 0 backward
  branch 1 slice: gradOutput[32:48]  → Branch 1 backward
  branch 2 slice: gradOutput[48:112] → Branch 2 backward
```

For "concat" backward, the branch output size is determined by running a forward pass to measure `len(out.Data)`. This is a known overhead — for large models, consider caching branch output sizes.
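
The slicing step can be sketched as follows, assuming the branch output sizes have already been cached in a slice. `sliceConcatGrad` is a hypothetical helper written for this example; it does not reproduce how `ParallelBackwardPolymorphic` measures sizes.

```go
package main

import (
	"errors"
	"fmt"
)

// sliceConcatGrad splits gradOutput into one sub-slice per branch, in branch
// order. sizes holds each branch's output length.
func sliceConcatGrad(gradOutput []float32, sizes []int) ([][]float32, error) {
	total := 0
	for _, s := range sizes {
		total += s
	}
	if total != len(gradOutput) {
		return nil, errors.New("branch sizes do not cover gradOutput")
	}
	slices := make([][]float32, len(sizes))
	offset := 0
	for i, s := range sizes {
		slices[i] = gradOutput[offset : offset+s] // a view, not a copy
		offset += s
	}
	return slices, nil
}

func main() {
	grad := make([]float32, 112)
	parts, err := sliceConcatGrad(grad, []int{32, 16, 64})
	if err != nil {
		panic(err)
	}
	for i, p := range parts {
		fmt.Printf("branch %d gets %d gradient values\n", i, len(p))
	}
}
```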

The `gradWeights` returned by `ParallelBackwardPolymorphic` is a synthetic tensor with no `Data` — only `Nested`:

```go
gradWeights = &Tensor[T]{
    Nested: branchGradWeights,  // per-branch weight gradients
}
```

`ApplyRecursiveGradients` recognizes this pattern and dispatches weight updates to each branch recursively.

---

## LayerSequential

`SequentialForwardPolymorphic` chains sub-layers in order, each receiving the output of the previous one.

```go
layer.Type = poly.LayerSequential
layer.SequentialLayers = []poly.VolumetricLayer{
    {Type: poly.LayerDense,   InputHeight: 128, OutputHeight: 256, ...},
    {Type: poly.LayerRMSNorm, InputHeight: 256, ...},
    {Type: poly.LayerDense,   InputHeight: 256, OutputHeight: 64, ...},
}
```

This is how transformer blocks are typically assembled: `RMSNorm → MHA → RMSNorm → SwiGLU`.

### Step Containers

For each sub-layer, the forward pass stores a "step container" — a tensor whose `Nested` holds `[bPre, bInput, bSkip]`:

```go
stepContainer := &Tensor[T]{
    Nested: []*Tensor[T]{
        bPre,      // Nested[0]: preAct from this sub-layer
        current,   // Nested[1]: the input this sub-layer received
        lastInput, // Nested[2]: the previous input (for skip connections)
    },
}
stepIntermediates[i] = stepContainer
```

The outer `preAct` returned by `SequentialForwardPolymorphic` carries all step containers in its `Nested`:

```go
preAct = &Tensor[T]{
    Data:   input.Data,
    Nested: stepIntermediates,  // [step0container, step1container, step2container]
}
```

### Sequential Backward

The backward pass iterates sub-layers in **reverse** order:

```go
for i := len(layer.SequentialLayers) - 1; i >= 0; i-- {
    container := preAct.Nested[i]
    bPre   = container.Nested[0]
    bInput = container.Nested[1]
    bSkip  = container.Nested[2]

    stepGradOutput = currentGrad
    if skipGradients[i+1] != nil {
        stepGradOutput.Add(skipGradients[i+1])  // add skip gradient
    }

    gIn, gW = DispatchLayerBackward(target, stepGradOutput, bInput, bSkip, bPre)
    currentGrad = gIn
}
```

`skipGradients` is a slice that accumulates gradients flowing back through skip connections inside the sequence. If a sub-layer (like `LayerResidual`) produces a gradient flowing back to an earlier step, it is accumulated here.

---

## Remote Links Inside Branches

Both `ParallelForwardPolymorphic` and `SequentialForwardPolymorphic` support `IsRemoteLink` on individual branches:

```go
if branch.IsRemoteLink && layer.Network != nil {
    if remote := layer.Network.GetLayer(branch.TargetZ, branch.TargetY, branch.TargetX, branch.TargetL); remote != nil {
        target = remote
    }
}
```

This allows a branch to redirect to any layer in the parent `VolumetricNetwork`, enabling cross-cell feature reuse without duplicating layer definitions.

---

## Tiling Propagation

When `layer.UseTiling = true` on the parent Sequential layer, the flag is propagated to each sub-layer before dispatch:

```go
if layer.UseTiling {
    target.UseTiling = true
    target.TileSize  = layer.TileSize
}
```

This means you can set tiling on the top-level Sequential layer and all its sub-layers inherit `UseTiling` and `TileSize` automatically. **`EnableMultiCoreTiling` is not propagated here** — it lives on `VolumetricNetwork` (and may be copied onto layers for training). **GPU** SC vs MC is chosen from **`Network.EnableMultiCoreTiling`** plus `GPUSCTileSizes` / `GPUMCTileSizes` after `RefreshRuntimeTileSizes()`. **CPU** sub-layers use **`GetCPUTileSize`** only (one map per layer, not SC/MC pair); see [dispatch.md](dispatch.md).

---

## Practical Example: Transformer Block as Sequential

```go
block := poly.VolumetricLayer{
    Type: poly.LayerSequential,
    SequentialLayers: []poly.VolumetricLayer{
        {
            Type:        poly.LayerRMSNorm,
            InputHeight: 512,
            OutputHeight: 512,
        },
        {
            Type:       poly.LayerMultiHeadAttention,
            DModel:     512,
            NumHeads:   8,
            NumKVHeads: 8,
            HeadDim:    64,
            MaxSeqLen:  2048,
        },
        {
            Type:        poly.LayerRMSNorm,
            InputHeight: 512,
            OutputHeight: 512,
        },
        {
            Type:         poly.LayerSwiGLU,
            InputHeight:  512,
            OutputHeight: 1364,  // ~2.67× hidden size
        },
    },
}
```

The entire block is a single `VolumetricLayer` entry in the grid. It runs as a mini-pipeline with the `preAct.Nested` tree tracking all four sub-layer states for backpropagation.

---

## Quantization: DType Conversion and PTQ Pipeline

Source: https://openfluke.com/docs/quantization
Markdown: https://openfluke.com/docs/quantization.md

# Quantization: DType Conversion and PTQ Pipeline

This document covers the Post-Training Quantization (PTQ) pipeline in `poly/`: how weights move from FP32 masters into lower-precision formats, the `WeightStore` versioning system, the `Q4_0Block` block-quantization format, and how `MorphToFloat32ForGPU` simulates low-bit arithmetic for GPU upload.

---

## Why Quantization?

Running a 7B-parameter model at FP32 requires ~28 GB of RAM. Quantization trades a small amount of numerical fidelity for dramatic memory and compute savings:

```
┌──────────────────────────────────────────────────────────────────┐
│  DType       Bits/weight   1B params   Theoretical speedup       │
├──────────────────────────────────────────────────────────────────┤
│  Float64     64            8 GB        0.5× (slower than FP32)   │
│  Float32     32            4 GB        1× baseline               │
│  BFloat16    16            2 GB        2×                        │
│  Int8        8             1 GB        4×                        │
│  Int4/FP4    4             0.5 GB      8×                        │
│  Int2        2             0.25 GB     16×                       │
│  Binary      1             0.125 GB    32×                       │
└──────────────────────────────────────────────────────────────────┘
```

`poly/` supports all 21 DTypes in the same training and inference loop. Switching precision is a single function call — no retraining required.

---

## The WeightStore: Three-Layer Storage

Every `VolumetricLayer` holds a `*WeightStore`:

```go
type WeightStore struct {
    Master     []float32          // Source of truth — always FP32
    Versions   map[DType]any      // CPU-resident quantized versions
    GPUWeights map[DType]any      // VRAM-resident wgpu.Buffer versions
    GPUScales  map[DType]*wgpu.Buffer  // Per-dtype scale buffers on VRAM
    Scale      float32            // Quantization scale factor
}
```

### Layer 1: Master

`Master` is the FP32 weight array that training operates on. Gradient updates always modify `Master`. No other layer is ever trained directly.

### Layer 2: Versions

`Versions` is a cache of quantized representations derived from `Master`. Each key is a `DType`. The value type varies:

```
DType             Value type in Versions
───────────────────────────────────────
Float64           []float64
Float16/BFloat16  []float32  (simulated — stored as float32 but treated as 16-bit)
Int32/Int16/Int8  []int32 / []int16 / []int8
Int4/FP4/Binary   []int8  (unpacked — one value per element; bit-packing is for disk only)
```

### Layer 3: GPUWeights / GPUScales

`GPUWeights` holds `wgpu.Buffer` references to VRAM. They are populated via `layer.SyncToGPU()` and consumed by the GPU forward/backward shaders. `GPUScales` holds the quantization scale as a separate GPU buffer used by quantized shader kernels.

---

## Morph: Producing a Quantized Version

```go
func (ws *WeightStore) Morph(dtype DType)
```

`Morph` converts `ws.Master` to the target `dtype` and stores the result in `ws.Versions[dtype]`. It is idempotent — if the target version already exists, it returns immediately.

```
ws.Master ([]float32)
      │
      ├── dtype == Float32 → return immediately (Master is already FP32)
      │
      ├── dtype == Float64 → []float64: direct cast
      │
      ├── dtype == Float16/BFloat16 → []float32: round-trip quantize/dequantize per element
      │
      ├── dtype == Int8/Uint8/FP8* → []int8: v / ws.Scale, clamped to [-128, 127]
      │
      ├── dtype == Int16/Uint16 → []int16: v / ws.Scale
      │
      ├── dtype == Int32/Uint32 → []int32: v / ws.Scale
      │
      └── dtype == Int4/FP4/Int2/Ternary/Binary → []int8 (one per weight):
              Int4/FP4/Int2: v / ws.Scale, truncated to range
              Ternary: round to {-1, 0, +1}
              Binary: +1 if v > 0, else -1
```

> [!NOTE]
> Sub-byte types (Int4, Int2, Binary) are stored in `Versions` as unpacked `[]int8` with one element per weight. The bit-packing into nibbles and pairs happens only during serialization (`encodeNativeWeights`). This keeps the forward pass simple — no runtime unpacking overhead during inference.

### Clearing Versions After Training

When `ApplyGradients` runs, it updates `Master` and then clears `Versions`:

```go
ws.Versions = make(map[DType]any)
```

This ensures stale quantized copies are not used after a weight update. The next forward pass calls `Morph` again to regenerate the needed version. This lazy invalidation pattern means training overhead is minimal — quantized versions are only regenerated on the first forward pass of each new batch.

---

## Unpack: Reconstructing Master from a Quantized Version

```go
func (ws *WeightStore) Unpack(dtype DType)
```

`Unpack` is the inverse of `Morph`. It reads `ws.Versions[dtype]` and reconstructs `ws.Master`. This is used after deserialization — the JSON stores the quantized version, and `Unpack` brings `Master` back to FP32 so the network is ready for inference or further training.

```
ws.Versions[dtype]
      │
      ├── []float64 → cast to float32
      ├── []float32 → copy directly (Float16/BFloat16 simulation)
      ├── []int8    → v * ws.Scale (for Int8, FP8, Int4, Int2, etc.)
      ├── []int16   → v * ws.Scale
      └── []int32   → v * ws.Scale
```

---

## MorphToFloat32ForGPU: PTQ Simulation for GPU Upload

```go
func (ws *WeightStore) MorphToFloat32ForGPU(dtype DType) []float32
```

For layers that don't have a dedicated packed GPU path (CNN1-3, RNN, LSTM, Embedding), this function produces a float32 buffer that represents the master weights after a quantize → dequantize round-trip at the target dtype. The GPU shader reads `array<f32>` and sees weights already "damaged" by quantization — inference-accurate without needing new shaders.

```
┌──────────────────────────────────────────────────────────────────────┐
│  How MorphToFloat32ForGPU works for Int8 (scale = 0.01):            │
│                                                                      │
│  Input: v = 0.437                                                    │
│  Step 1: Morph to Int8  →  q = round(0.437 / 0.01) = 44            │
│  Step 2: clamp          →  q = clamp(44, -128, 127) = 44            │
│  Step 3: dequantize     →  result = 44 * 0.01 = 0.44               │
│                                                                      │
│  The rounding error is: |0.437 - 0.44| = 0.003                     │
│  This error is what Int8 quantization "costs"                        │
└──────────────────────────────────────────────────────────────────────┘
```

Training always operates on the FP32 `Master` — `MorphToFloat32ForGPU` is only called at GPU upload time (`SyncToGPU`). This is PTQ, not QAT: the model is trained at full precision and precision loss is applied at inference time.
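
The quantize → dequantize round trip from the worked example can be written as a short function. This is a sketch under the assumption of symmetric Int8 rounding with an externally supplied scale; `fakeQuantInt8` is a hypothetical name and is not the body of `MorphToFloat32ForGPU`.

```go
package main

import (
	"fmt"
	"math"
)

// fakeQuantInt8 returns float32 weights after an Int8 quantize/dequantize
// round trip with the given scale.
func fakeQuantInt8(master []float32, scale float32) []float32 {
	out := make([]float32, len(master))
	for i, v := range master {
		q := math.Round(float64(v / scale))
		q = math.Max(-128, math.Min(127, q)) // clamp to the Int8 range
		out[i] = float32(q) * scale
	}
	return out
}

func main() {
	master := []float32{0.437, -2.0}
	damaged := fakeQuantInt8(master, 0.01)
	for i := range master {
		fmt.Printf("master=%v uploaded=%v\n", master[i], damaged[i])
	}
}
```

The second value shows the other source of error besides rounding: with a scale of 0.01, anything below -1.28 saturates at the clamp boundary.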

---

## Scale Calibration

`ws.Scale` is the per-layer quantization scale. It is computed during `Morph` using the **absolute-maximum** calibration strategy:

```
scale = max(|weight|) / maxQuantValue

For Int8:  maxQuantValue = 127
For Int4:  maxQuantValue = 7
For Int2:  maxQuantValue = 1
For Binary: maxQuantValue = 1  (+1/-1)
```

This is the simplest calibration method — no calibration data required. It is a Post-Training Quantization (PTQ) approach: train at FP32, then call `MorphLayer` to convert to the target dtype. The scale is derived analytically from the weight distribution alone.
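
A minimal sketch of the absolute-maximum formula is shown below. The helper name `absMaxScale` and the fallback value for an all-zero layer are assumptions for illustration; the source does not say how `Morph` handles that case.

```go
package main

import (
	"fmt"
	"math"
)

// absMaxScale returns max(|w|) / maxQuantValue, the scale that maps the
// largest-magnitude weight onto the edge of the integer range.
func absMaxScale(weights []float32, maxQuantValue float32) float32 {
	var maxAbs float32
	for _, w := range weights {
		if a := float32(math.Abs(float64(w))); a > maxAbs {
			maxAbs = a
		}
	}
	if maxAbs == 0 {
		return 1 // assumed fallback: avoids dividing by zero later
	}
	return maxAbs / maxQuantValue
}

func main() {
	w := []float32{0.5, -1.27, 0.03}
	fmt.Println("Int8 scale:", absMaxScale(w, 127))
	fmt.Println("Int4 scale:", absMaxScale(w, 7))
}
```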

> [!TIP]
> For activation-aware quantization (computing scale from representative inputs rather than from weights alone), you would need to run a calibration forward pass and inject the computed scale into `ws.Scale` before calling `Morph`. The current pipeline does not implement observer-based calibration for activations — only weight calibration.

---

## MorphLayer: Network-Wide Conversion

```go
func MorphLayer(n *VolumetricNetwork, dtype DType)
```

`MorphLayer` iterates all layers in the network and calls `ws.Morph(dtype)` on each. This is the primary entry point for converting a trained FP32 network to a lower-precision format:

```go
// Train at FP32
poly.Train(network, trainingData, config)

// Convert to Int8 for deployment
poly.MorphLayer(network, poly.DTypeInt8)

// The network is now ready for Int8 inference
// All new forward passes will use Versions[DTypeInt8]
```

For layers that already have a version for the target `dtype`, `Morph` skips them. To force a re-quantization (e.g., after manual scale adjustment), clear the version first:

```go
delete(layer.WeightStore.Versions, poly.DTypeInt8)
layer.WeightStore.Morph(poly.DTypeInt8)
```

---

## Q4_0Block: Block Quantization

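In addition to the global-scale quantization in `WeightStore.Morph`, `poly/` implements a **Q4_0 block format** modeled on the one used by llama.cpp and GGUF. The struct below stores the per-block scale as `float32`; ggml's own Q4_0 block uses a 16-bit scale (18 bytes per block), so the two layouts are not byte-identical: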

```go
type Q4_0Block struct {
    Scale   float32   // one float32 scale per block
    Weights [16]byte  // 32 nibbles (4-bit signed values)
}
// Total: 4 + 16 = 20 bytes per block
// Bandwidth: 20 bytes / 32 weights = 0.625 bytes/weight
```

### QuantizeQ4_0

```go
func QuantizeQ4_0(weights []float32) []Q4_0Block
```

Converts a flat FP32 slice into Q4_0 blocks:

```
For each block of 32 weights:
  1. Find maxAbs = max(|weights[i]|) in the block
  2. scale = maxAbs / 7.0         ← 4-bit signed range is [-8, 7]
  3. For each weight pair (w1, w2):
       q1 = round(w1 / scale), clamped to [-8, 7]
       q2 = round(w2 / scale), clamped to [-8, 7]
       byte[j] = (q1 & 0xF) | ((q2 & 0xF) << 4)   ← pack 2 values per byte
```

The per-block scale means every 32 weights have their own scale factor, which is significantly more accurate than a single global scale for the entire layer. This is why Q4_0 retains much higher fidelity than naive Int4.

### DequantizeQ4_0

```go
func DequantizeQ4_0(blocks []Q4_0Block, n int) []float32
```

Unpacks nibbles and applies the per-block scale:

```
For each block:
  For each byte b:
    q1 = (b & 0xF)       → sign-extend: if q1 > 7, q1 -= 16
    q2 = (b >> 4)        → sign-extend: if q2 > 7, q2 -= 16
    res[idx1] = float32(q1) * block.Scale
    res[idx2] = float32(q2) * block.Scale
```
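
The two procedures above can be combined into a round-trip sketch. The names `q4Block`, `quantizeQ4`, and `dequantizeQ4` are hypothetical stand-ins for the `poly/` functions. Two details are assumptions not stated in the source: adjacent weights share a byte, and a short final block is zero-padded.

```go
package main

import (
	"fmt"
	"math"
)

const blockSize = 32

// q4Block mirrors the Q4_0Block layout shown above.
type q4Block struct {
	Scale   float32
	Weights [blockSize / 2]byte
}

// clampNibble rounds v and clamps it to the signed 4-bit range [-8, 7].
func clampNibble(v float64) int8 {
	return int8(math.Max(-8, math.Min(7, math.Round(v))))
}

func quantizeQ4(weights []float32) []q4Block {
	blocks := make([]q4Block, (len(weights)+blockSize-1)/blockSize)
	for b := range blocks {
		start, end := b*blockSize, (b+1)*blockSize
		if end > len(weights) {
			end = len(weights)
		}
		var chunk [blockSize]float32 // zero-padded if the block is short
		copy(chunk[:], weights[start:end])
		var maxAbs float64
		for _, w := range chunk {
			maxAbs = math.Max(maxAbs, math.Abs(float64(w)))
		}
		if maxAbs == 0 {
			continue // all-zero block: scale and nibbles stay 0
		}
		scale := maxAbs / 7.0
		blocks[b].Scale = float32(scale)
		for j := 0; j < blockSize/2; j++ {
			q1 := clampNibble(float64(chunk[2*j]) / scale)
			q2 := clampNibble(float64(chunk[2*j+1]) / scale)
			blocks[b].Weights[j] = (byte(q1) & 0xF) | ((byte(q2) & 0xF) << 4)
		}
	}
	return blocks
}

func dequantizeQ4(blocks []q4Block, n int) []float32 {
	res := make([]float32, 0, n)
	for _, blk := range blocks {
		for _, b := range blk.Weights {
			for _, nib := range []int{int(b & 0xF), int(b >> 4)} {
				if nib > 7 {
					nib -= 16 // sign-extend the 4-bit value
				}
				if len(res) < n {
					res = append(res, float32(nib)*blk.Scale)
				}
			}
		}
	}
	return res
}

func main() {
	w := []float32{0.7, -0.35, 0.1, 0}
	back := dequantizeQ4(quantizeQ4(w), len(w))
	fmt.Println("original:  ", w)
	fmt.Println("round trip:", back)
}
```

For values inside the block's range, the round-trip error is bounded by half the block scale, which is why a per-block scale helps when weight magnitudes vary across a layer.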

### Q4_0 vs Global Int4

```
┌───────────────────────────────────────────────────────────────────┐
│  Comparison for a Dense layer with 4096×4096 weights             │
│                                                                   │
│  Format         Scale count  Bytes        Notes                   │
│─────────────────────────────────────────────────────────────────  │
│  FP32           1 (implicit) 67.1 MB      No quantization        │
│  Global Int4    1            8.4 MB       One scale for all      │
│  Q4_0 blocks    524288       10.5 MB      One scale per 32 wts   │
│                                           (25% overhead,         │
│                                            10× fidelity)         │
└───────────────────────────────────────────────────────────────────┘
```

Q4_0 is the preferred format for loading HuggingFace/GGUF checkpoints. The `universal_loader.go` and `safetensors.go` paths use `QuantizeQ4_0` internally when importing Q4_0 tensors.

---

## The Full PTQ Workflow

```
┌──────────────────────────────────────────────────────────────────────┐
│  1. Train at FP32                                                    │
│                                                                      │
│     poly.Train[float32](network, data, config)                       │
│     → Master updated each batch                                      │
│     → Versions map is cleared after each update                      │
│                                                                      │
│  2. (Optional) Calibrate scale                                       │
│                                                                      │
│     For each layer:                                                  │
│       maxAbs := findMaxAbs(layer.WeightStore.Master)                 │
│       layer.WeightStore.Scale = maxAbs / targetRange                 │
│                                                                      │
│  3. Morph to target dtype                                            │
│                                                                      │
│     poly.MorphLayer(network, poly.DTypeInt4)                         │
│     → Versions[DTypeInt4] = []int8{...} created for each layer      │
│     → Scale stored in WeightStore.Scale                              │
│                                                                      │
│  4. Save the quantized model                                         │
│                                                                      │
│     jsonData, _ := poly.SerializeNetwork(network)                    │
│     os.WriteFile("model_int4.json", jsonData, 0644)                 │
│     → encodeNativeWeights packs []int8 into nibbles (0.5 bytes/wt)  │
│                                                                      │
│  5. Load and run inference                                           │
│                                                                      │
│     network, _ := poly.DeserializeNetwork(jsonData)                  │
│     → Unpack(DTypeInt4) reconstructs Master from nibbles             │
│     → Versions[DTypeInt4] restored for fast inference                │
│     → forward passes use Versions[DTypeInt4], not Master             │
└──────────────────────────────────────────────────────────────────────┘
```

---

## Forward Pass with Quantized Weights

During a forward pass, `DispatchLayer` calls the layer-specific function (e.g., `DenseForwardPolymorphic`). Inside that function, the active weights are retrieved via:

```go
weights := layer.WeightStore.GetActive(layer.DType)
if weights == nil {
    weights = layer.WeightStore.Master
}
```

`GetActive` returns `Versions[dtype]` if it exists, otherwise `nil`. If the version is missing (e.g., after a gradient update), the forward pass falls back to `Master`, and `Morph` regenerates the version on the next call. Because `Master` is the source of truth, this lazy re-quantization never reads stale weights.

For the GPU path, `GetActive` for GPU dtypes reads from `GPUWeights[dtype]` via the shader's bind group. The CPU never sees these weights once they are on VRAM.

---

## Accuracy vs. Compression Trade-offs

From empirical benchmarks in the README:

```
┌─────────────────────────────────────────────────────────────────┐
│  DType      Similarity to FP32 (cosine)   Size factor          │
├─────────────────────────────────────────────────────────────────┤
│  Float64    1.000                          2.0× larger         │
│  BFloat16   0.999+                         0.5×                │
│  Int8       0.998+                         0.25×               │
│  Int4/FP4   0.99+                          0.125×              │
│  Int2       0.97+                          0.0625×             │
│  Ternary    0.96+                          0.0625×             │
│  Binary     0.90+                          0.03125×            │
└─────────────────────────────────────────────────────────────────┘
```

The similarity scores are measured with `poly.CompareNetworks` (see `dna.md`), which computes the cosine similarity between normalized weight vectors after precision simulation. A score of 0.999 means the quantized layer points in essentially the same direction as the FP32 layer, which suggests that functional behavior is largely preserved.

> [!NOTE]
> Binary (1-bit) networks at 0.90 cosine similarity will show measurable accuracy degradation on complex tasks. Binary quantization is best suited for embedding layers, lookup tables, or architectures specifically designed for 1-bit operation (e.g., BitNet). For most tasks, Int8 or Int4 provides the best accuracy/compression balance.

---

## Transformer Architecture: MHA, RoPE, GQA, and Full Block Assembly

Source: https://openfluke.com/docs/transformer
Markdown: https://openfluke.com/docs/transformer.md

# Transformer Architecture: MHA, RoPE, GQA, and Full Block Assembly

This document covers `LayerMultiHeadAttention` (MHA), how RoPE positional encoding is applied, Grouped-Query Attention (GQA) and Multi-Query Attention (MQA), the KV cache, SwiGLU and RMSNorm layers, full transformer block assembly inside `VolumetricNetwork`, and the `Transformer[T]` high-level generation type.

It also documents the Qwen-style attention path now supported in Loom:
- expanded query dimension (`QueryDim`) where `num_heads * head_dim != d_model`
- per-head Q/K RMSNorm (`q_norm` / `k_norm`)
- config-driven RMSNorm epsilon (`rms_norm_eps`) parity across CPU and GPU.

---

## LayerMultiHeadAttention

`LayerMultiHeadAttention` (type index 16) implements scaled dot-product attention with optional RoPE, optional GQA/MQA, and an incremental KV cache.

### Key Fields on VolumetricLayer

```go
layer.Type       = poly.LayerMultiHeadAttention
layer.DModel     = 512   // model dimension (embedding size)
layer.NumHeads   = 8     // query heads
layer.NumKVHeads = 8     // key/value heads (set < NumHeads for GQA/MQA)
layer.HeadDim    = 64    // dimensions per head (DModel / NumHeads)
layer.QueryDim   = 512   // optional; defaults to DModel when unset
layer.MaxSeqLen  = 2048  // maximum sequence length (KV cache size)
layer.RoPEFreqBase = 10000.0  // RoPE theta; 0 = no positional encoding
layer.RMSNormEps = 1e-6  // used by RMSNorm layers
```

For Qwen-style checkpoints, `head_dim` may be explicitly specified in config and `QueryDim` should be set to:
`QueryDim = NumHeads * HeadDim`.

### Weight Layout

All four projection matrices and their bias vectors are stored contiguously in `WeightStore.Master`:

```
Offset (in elements)                  Size                Contents
0                                     queryDim × dModel   Q weight matrix
queryDim×dModel                       kvDim × dModel      K weight matrix
queryDim×dModel + kvDim×dModel        kvDim × dModel      V weight matrix
queryDim×dModel + 2×kvDim×dModel      dModel × queryDim   O weight matrix

After all weight matrices:
  + queryDim elements  Q bias vector
  + kvDim    elements  K bias vector
  + kvDim    elements  V bias vector
  + dModel   elements  O bias vector

Total:
  queryDim×dModel + 2×kvDim×dModel + dModel×queryDim
  + queryDim + 2×kvDim + dModel
```

Where `kvDim = NumKVHeads × HeadDim`.

For standard MHA (`NumKVHeads == NumHeads` and `QueryDim == DModel`):
```
kvDim = dModel
Total = 4 × dModel² + 4 × dModel weights (including biases)
```
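
The layout can be turned into offset arithmetic directly. The sketch below assumes the ordering given above (Q, K, V, O weights, then Q, K, V, O biases) and counts offsets in elements; `mhaLayout` and `computeMHALayout` are hypothetical names, not part of `poly/`.

```go
package main

import "fmt"

// mhaLayout holds element offsets into a flat weight slice.
type mhaLayout struct {
	QW, KW, VW, OW int // weight matrix offsets
	QB, KB, VB, OB int // bias vector offsets
	Total          int // total number of elements
}

func computeMHALayout(dModel, numHeads, numKVHeads, headDim int) mhaLayout {
	queryDim := numHeads * headDim
	kvDim := numKVHeads * headDim
	var l mhaLayout
	l.QW = 0
	l.KW = l.QW + queryDim*dModel
	l.VW = l.KW + kvDim*dModel
	l.OW = l.VW + kvDim*dModel
	l.QB = l.OW + dModel*queryDim
	l.KB = l.QB + queryDim
	l.VB = l.KB + kvDim
	l.OB = l.VB + kvDim
	l.Total = l.OB + dModel
	return l
}

func main() {
	mha := computeMHALayout(512, 8, 8, 64)
	gqa := computeMHALayout(512, 8, 2, 64)
	fmt.Printf("MHA: %+v\n", mha)
	fmt.Printf("GQA: %+v\n", gqa)
}
```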

---

## Forward Pass: Step by Step

### 1. Linear Projections

Input shape: `[seqLen, dModel]`

```
For each token position s:
  Q[s, i] = bias_Q[i] + Σⱼ input[s, j] × W_Q[i, j]
  K[s, i] = bias_K[i] + Σⱼ input[s, j] × W_K[i, j]
  V[s, i] = bias_V[i] + Σⱼ input[s, j] × W_V[i, j]

Q shape: [seqLen, queryDim]   (numHeads × headDim)
K shape: [seqLen, kvDim]      (numKVHeads × headDim)
V shape: [seqLen, kvDim]
```

### 1.5 Q/K Norm (Qwen-style)

If `model.layers.N.self_attn.q_norm.weight` and `k_norm.weight` are present, Loom applies per-head RMSNorm to projected Q and K before RoPE/attention scoring.

This path is active in both CPU and GPU forward implementations.

### 2. RoPE: Rotary Positional Encoding

If `layer.RoPEFreqBase > 0`, RoPE is applied to Q and K after projection.

RoPE encodes position by rotating pairs of values in the head dimension. In the formulation below, element `d` is paired with element `d + headDim/2`:

```
For each token at position pos, head h, dimension index d in [0, headDim/2):

  freq  = 1 / (RoPEFreqBase ^ (2d / headDim))
  angle = freq × pos
  cos_a, sin_a = cos(angle), sin(angle)

  Q0 = Q[pos, h×headDim + d]               (value before rotation)
  Q1 = Q[pos, h×headDim + d + headDim/2]   (value before rotation)

  Q[pos, h×headDim + d]              = Q0 × cos_a - Q1 × sin_a
  Q[pos, h×headDim + d + headDim/2]  = Q0 × sin_a + Q1 × cos_a

  (same for K, using the KV head index)
```

RoPE gives the attention mechanism a way to learn relative positions without adding learned positional embeddings. Positions encode directly into the dot-product scores.

```
┌──────────────────────────────────────────────────────────────────┐
│  RoPE effect on attention scores                                 │
│                                                                  │
│  Token at pos 0:  angle = 0 → cos=1, sin=0 → no rotation        │
│  Token at pos 1:  angle = freq → slight rotation                 │
│  Token at pos N:  angle = N×freq → large rotation for low d      │
│                                                                  │
│  Relative distance (pos_q - pos_k) is captured in the dot       │
│  product because cos(angle_q - angle_k) = cos(Δangle).          │
└──────────────────────────────────────────────────────────────────┘
```
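
The relative-position property can be checked numerically with a small sketch. It applies the rotation to a single head vector, pairing element `d` with element `d + headDim/2` as in the formula above; `applyRoPE` and `dot` are hypothetical helpers operating on plain slices rather than Loom tensors.

```go
package main

import (
	"fmt"
	"math"
)

// applyRoPE rotates one head vector in place for a token at position pos.
func applyRoPE(head []float64, pos int, freqBase float64) {
	headDim := len(head)
	half := headDim / 2
	for d := 0; d < half; d++ {
		freq := 1 / math.Pow(freqBase, float64(2*d)/float64(headDim))
		sin, cos := math.Sincos(freq * float64(pos))
		x0, x1 := head[d], head[d+half] // read both before overwriting
		head[d] = x0*cos - x1*sin
		head[d+half] = x0*sin + x1*cos
	}
}

// dot returns the inner product of two equal-length vectors.
func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	q := []float64{1, 0.5, -0.25, 2}
	k := []float64{0.3, -1, 0.7, 0.1}
	// Score for the same relative offset at two different absolute positions.
	scoreAt := func(qPos, kPos int) float64 {
		qr := append([]float64{}, q...)
		kr := append([]float64{}, k...)
		applyRoPE(qr, qPos, 10000)
		applyRoPE(kr, kPos, 10000)
		return dot(qr, kr)
	}
	fmt.Printf("offset 3 near start: %.6f\n", scoreAt(5, 2))
	fmt.Printf("offset 3 far along:  %.6f\n", scoreAt(105, 102))
}
```

The two printed scores should agree up to floating-point error, since the dot product of two rotated vectors depends only on the difference of their rotation angles.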

### 3. KV Cache (Float32 Path Only)

The Float32 forward path maintains an incremental KV cache:

```go
// Lazy initialization on first forward call
if layer.KVCacheK == nil {
    layer.KVCacheK = NewTensor[float32](MaxSeqLen, kvDim)
    layer.KVCacheV = NewTensor[float32](MaxSeqLen, kvDim)
    layer.KVOffset = 0
}

// Write current position into the ring buffer
pos := layer.KVOffset + s
kRow := KVCacheK.Data[(pos % MaxSeqLen) * kvDim : ...]
// compute K for this token and write into kRow
layer.KVOffset += seqLen  // advance after full sequence
```

The cache is a ring buffer of size `MaxSeqLen`. On each call, new K and V values are written at positions `[KVOffset, KVOffset + seqLen)`. The attention score computation then looks back over all `currentTotalPos + 1` cached positions, giving the model memory of the full context up to `MaxSeqLen` tokens.

To clear the KV cache between independent prompts:

```go
transformer.Reset()  // sets KVOffset = 0 for all layers
```
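
The ring-buffer indexing can be illustrated with a toy cache. This sketch stores only K rows (V would be handled the same way); the type `kvCache` and its methods are hypothetical and only mirror the `(pos % MaxSeqLen) * kvDim` indexing shown above.

```go
package main

import "fmt"

// kvCache is a toy ring buffer for K rows.
type kvCache struct {
	data      []float32 // maxSeqLen * kvDim values
	maxSeqLen int
	kvDim     int
	offset    int // total tokens written so far (KVOffset in the text)
}

func newKVCache(maxSeqLen, kvDim int) *kvCache {
	return &kvCache{
		data:      make([]float32, maxSeqLen*kvDim),
		maxSeqLen: maxSeqLen,
		kvDim:     kvDim,
	}
}

// Append writes one row per token, wrapping modulo maxSeqLen.
func (c *kvCache) Append(rows [][]float32) {
	for s, row := range rows {
		slot := (c.offset + s) % c.maxSeqLen
		copy(c.data[slot*c.kvDim:(slot+1)*c.kvDim], row)
	}
	c.offset += len(rows) // advance after the full sequence
}

// Row returns the cached row for absolute position pos.
func (c *kvCache) Row(pos int) []float32 {
	slot := pos % c.maxSeqLen
	return c.data[slot*c.kvDim : (slot+1)*c.kvDim]
}

// Reset forgets all cached positions.
func (c *kvCache) Reset() { c.offset = 0 }

func main() {
	cache := newKVCache(4, 2)
	cache.Append([][]float32{{1, 1}, {2, 2}, {3, 3}}) // prompt
	cache.Append([][]float32{{4, 4}, {5, 5}})         // two generated tokens
	fmt.Println("tokens written:", cache.offset)
	fmt.Println("position 4 lives in slot 0:", cache.Row(4))
	cache.Reset()
	fmt.Println("after reset:", cache.offset)
}
```

In this toy version, writing past `maxSeqLen` silently overwrites the oldest rows, so positions older than the window are no longer recoverable.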

### 4. Grouped-Query Attention (GQA / MQA)

GQA reduces memory bandwidth by sharing KV heads across multiple query heads:

```
headsPerKV = NumHeads / NumKVHeads

For query head h:
  kvHead = h / headsPerKV  ← all query heads in a group share one KV head
```

```
┌──────────────────────────────────────────────────────────────────────┐
│  Standard MHA: NumHeads = NumKVHeads = 8                            │
│  Each head has its own K and V.                                     │
│                                                                      │
│  Q0──K0/V0   Q1──K1/V1   Q2──K2/V2  ...  Q7──K7/V7                │
│                                                                      │
│  GQA: NumHeads = 8, NumKVHeads = 2                                  │
│  4 query heads share each KV head.                                  │
│                                                                      │
│  Q0, Q1, Q2, Q3 ──K0/V0                                            │
│  Q4, Q5, Q6, Q7 ──K1/V1                                            │
│                                                                      │
│  MQA: NumHeads = 8, NumKVHeads = 1                                  │
│  All query heads share one KV head.                                 │
│                                                                      │
│  Q0...Q7 ─────────K0/V0                                             │
└──────────────────────────────────────────────────────────────────────┘
```

GQA is common in modern LLMs such as Llama 3 because it reduces KV cache memory by a factor of `NumHeads / NumKVHeads` with little quality loss.
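
The head-to-group mapping is a single integer division. A minimal sketch follows; the function name `kvHeadFor` and its validation rules are assumptions added for the example.

```go
package main

import "fmt"

// kvHeadFor returns the KV head index shared by query head h.
func kvHeadFor(h, numHeads, numKVHeads int) (int, error) {
	if numKVHeads <= 0 || numHeads%numKVHeads != 0 {
		return 0, fmt.Errorf("numHeads (%d) must be a multiple of numKVHeads (%d)", numHeads, numKVHeads)
	}
	if h < 0 || h >= numHeads {
		return 0, fmt.Errorf("query head %d out of range", h)
	}
	headsPerKV := numHeads / numKVHeads
	return h / headsPerKV, nil
}

func main() {
	for h := 0; h < 8; h++ {
		kv, err := kvHeadFor(h, 8, 2)
		if err != nil {
			panic(err)
		}
		fmt.Printf("Q%d -> K%d/V%d\n", h, kv, kv)
	}
}
```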

### 5. Causal Attention

Causality is enforced by the score computation loop:

```go
// For query at position qPos, only attend to positions <= qPos
for kPos := 0; kPos <= qPos; kPos++ {
    dot = Q[qPos] · K[kPos]
    scores[kPos] = dot / sqrt(headDim)
}
// positions > qPos are never included — no explicit mask needed
```

This is equivalent to a causal mask but avoids allocating a mask tensor.
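
A self-contained version of the loop, extended with the softmax and value aggregation that follow it, might look like this. It is a single-head sketch on plain slices with no KV cache, RoPE, or GQA; `causalAttention` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// causalAttention computes single-head attention where the query at position
// qPos only sees keys and values at positions 0..qPos.
func causalAttention(q, k, v [][]float64) [][]float64 {
	headDim := len(q[0])
	out := make([][]float64, len(q))
	for qPos := range q {
		scores := make([]float64, qPos+1) // future positions are never allocated
		maxScore := math.Inf(-1)
		for kPos := 0; kPos <= qPos; kPos++ {
			var dot float64
			for d := 0; d < headDim; d++ {
				dot += q[qPos][d] * k[kPos][d]
			}
			scores[kPos] = dot / math.Sqrt(float64(headDim))
			maxScore = math.Max(maxScore, scores[kPos])
		}
		var sum float64
		for i := range scores {
			scores[i] = math.Exp(scores[i] - maxScore) // stable softmax numerator
			sum += scores[i]
		}
		out[qPos] = make([]float64, len(v[0]))
		for kPos, w := range scores {
			for d := range out[qPos] {
				out[qPos][d] += w / sum * v[kPos][d]
			}
		}
	}
	return out
}

func main() {
	q := [][]float64{{1, 0}, {0, 1}, {1, 1}}
	k := [][]float64{{1, 0}, {0, 1}, {1, 1}}
	v := [][]float64{{1, 0}, {0, 1}, {5, 5}}
	for pos, row := range causalAttention(q, k, v) {
		fmt.Printf("position %d -> %v\n", pos, row)
	}
}
```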

### 6. Output Projection

After attention-weighted value aggregation, the output is projected back to `dModel`:

```
O[s, i] = bias_O[i] + Σⱼ attnOut[s, j] × W_O[i, j]
```

---

## MHA, tiling flags, and where work actually happens

On the **CPU polymorphic** path, `MHAForwardPolymorphic` uses the tiled entry when `layer.UseTiling && layer.TileSize > 0`, which calls `mhaForwardTiledGeneric`. That helper temporarily clears `UseTiling` and re-invokes the same reference attention implementation so dispatch does not recurse forever — so this is **not** a second numeric algorithm and does not spawn goroutines per head. Exported names `MHAForwardTiled` and `MHAForwardTiledParallel` are aliases of that same entry.

**Throughput-oriented tiling** (workgroup sizes, tiled matmul in shaders) lives on the **WebGPU** path in `wgpu_forward.go`: tile sizes come from `GetGPUSCTileSize` / `GetGPUMCTileSize` depending on **`VolumetricNetwork.EnableMultiCoreTiling`** — **`false` → SC**, **`true` → MC** (transformer forwards read the **network** field, not per-layer). `WGPUContext.GPUTileSize` and device limits feed `refreshRuntimeGPUTileSizes`. Call `RefreshRuntimeTileSizes()` after wiring the net: **`CPUTileSizes`** for CPU reference math (one map per layer), **`GPUSCTileSizes` / `GPUMCTileSizes`** for GPU. Training does this via `ConfigureNetworkForMode` (see `training.md`). **CPU polymorphic code does not use SC/MC as two maps** — only `GetCPUTileSize`.

`CalculateOptimalTileSize(headDim)` is still the head-dimension–based helper used when populating CPU tile sizes for MHA during `refreshRuntimeCPUTileSizes`.

---

## RMSNorm

`LayerRMSNorm` (type 8) implements Root Mean Square Layer Normalization:

```
rms = sqrt( (1/n) × Σᵢ xᵢ² + ε )
output[i] = (x[i] / rms) × weight[i]
```

Unlike LayerNorm, RMSNorm does not subtract the mean. Skipping the mean computation makes it faster (fewer operations) while keeping a comparable stabilizing effect on gradient flow.

Key fields:
```go
layer.Type        = poly.LayerRMSNorm
layer.InputHeight = 512   // must match OutputHeight
layer.OutputHeight = 512
layer.RMSNormEps  = 1e-6  // default; overridable from checkpoint config
```

Weight storage: one scale weight per hidden dimension (`len(Master) == OutputHeight`).
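
The two formulas above translate almost line for line into code. This is a minimal sketch on plain slices; `rmsNorm` is an illustrative name rather than a Loom entry point.

```go
package main

import (
    "fmt"
    "math"
)

// rmsNorm applies output[i] = (x[i] / rms) * weight[i],
// with rms = sqrt(mean(x^2) + eps). No mean subtraction is performed.
func rmsNorm(x, weight []float64, eps float64) []float64 {
    sumSq := 0.0
    for _, v := range x {
        sumSq += v * v
    }
    rms := math.Sqrt(sumSq/float64(len(x)) + eps)
    out := make([]float64, len(x))
    for i, v := range x {
        out[i] = v / rms * weight[i]
    }
    return out
}

func main() {
    x := []float64{3, 4}
    w := []float64{1, 1} // one scale weight per hidden dimension
    fmt.Println(rmsNorm(x, w, 1e-6))
}
```

With unit weights and `eps = 0`, the output always has a mean square of 1, which is a quick sanity check for any implementation.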

---

## SwiGLU

`LayerSwiGLU` (type 12) implements the gated linear unit variant used in modern transformers:

```
Given input x of shape [seqLen, inputHeight]:

  gate   = x × W_gate   (shape [seqLen, outputHeight])
  up     = x × W_up     (shape [seqLen, outputHeight])
  hidden = SiLU(gate) × up
  output = hidden × W_down  (shape [seqLen, inputHeight])

SiLU(x) = x × sigmoid(x) = x / (1 + exp(-x))
```

```
┌────────────────────────────────────────────────────────────────────┐
│  SwiGLU Data Flow                                                  │
│                                                                    │
│  Input [seqLen, 512]                                               │
│       │                                                            │
│       ├──▶ W_gate [512, 1364] ──▶ gate [seqLen, 1364]             │
│       │                               │                            │
│       └──▶ W_up   [512, 1364] ──▶ up [seqLen, 1364]              │
│                                       │                            │
│                               SiLU(gate) × up                      │
│                                       │                            │
│                          W_down [1364, 512]                        │
│                                       │                            │
│                               Output [seqLen, 512]                 │
└────────────────────────────────────────────────────────────────────┘
```

The hidden dimension is the intermediate expansion width, roughly 8/3 ≈ 2.67× the model dimension. For `dModel=512`, the examples on this page use a hidden size of 1364.

Key fields:
```go
layer.Type         = poly.LayerSwiGLU
layer.InputHeight  = 512
layer.OutputHeight = 1364  // hidden dimension (intermediate expansion)
```

Weight storage: `W_gate` (inputHeight × outputHeight) + `W_up` (inputHeight × outputHeight) + `W_down` (outputHeight × inputHeight), stored contiguously in `Master`.
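
For readers who want to check the data flow numerically, here is a small single-token sketch of the gate/up/down computation, plus the weight count implied by the storage layout. All names are illustrative, and the nested-slice matrices are an assumption made for readability (Loom stores the three matrices contiguously).

```go
package main

import (
    "fmt"
    "math"
)

// silu computes x * sigmoid(x).
func silu(x float64) float64 { return x / (1 + math.Exp(-x)) }

// matVec returns x * W where W has shape [len(x)][cols].
func matVec(x []float64, w [][]float64) []float64 {
    out := make([]float64, len(w[0]))
    for i, xi := range x {
        for j, wij := range w[i] {
            out[j] += xi * wij
        }
    }
    return out
}

// swiGLU runs one token through the gate, up, and down projections.
func swiGLU(x []float64, wGate, wUp, wDown [][]float64) []float64 {
    gate := matVec(x, wGate)
    up := matVec(x, wUp)
    hidden := make([]float64, len(gate))
    for i := range gate {
        hidden[i] = silu(gate[i]) * up[i]
    }
    return matVec(hidden, wDown)
}

// swiGLUWeightCount is the total number of weights across the three matrices.
func swiGLUWeightCount(inputHeight, outputHeight int) int {
    return 3 * inputHeight * outputHeight
}

func main() {
    // Tiny 2 -> 3 -> 2 example with hand-picked weights.
    wGate := [][]float64{{1, 0, 0}, {0, 1, 0}}
    wUp := [][]float64{{1, 1, 1}, {1, 1, 1}}
    wDown := [][]float64{{1, 0}, {0, 1}, {1, 1}}
    fmt.Println(swiGLU([]float64{1, 0}, wGate, wUp, wDown))
    fmt.Println(swiGLUWeightCount(512, 1364))
}
```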

---

## Full Transformer Block Assembly

A standard decoder-only transformer block (pre-norm style) is assembled as a `LayerSequential` containing four sub-layers:

```go
block := poly.VolumetricLayer{
    Type: poly.LayerSequential,
    SequentialLayers: []poly.VolumetricLayer{
        // Sub-layer 0: Attention norm
        {
            Type:         poly.LayerRMSNorm,
            InputHeight:  512,
            OutputHeight: 512,
        },
        // Sub-layer 1: Multi-head attention
        {
            Type:          poly.LayerMultiHeadAttention,
            DModel:        512,
            NumHeads:      8,
            NumKVHeads:    8,
            HeadDim:       64,
            MaxSeqLen:     2048,
            RoPEFreqBase:  10000.0,
        },
        // Sub-layer 2: FFN norm
        {
            Type:         poly.LayerRMSNorm,
            InputHeight:  512,
            OutputHeight: 512,
        },
        // Sub-layer 3: Feed-forward (SwiGLU)
        {
            Type:         poly.LayerSwiGLU,
            InputHeight:  512,
            OutputHeight: 1364,
        },
    },
}
```

This entire block is a single `VolumetricLayer` entry in the 3D grid. Multiple blocks are placed at coordinates `(0, blockIdx, 0, 0)` in a `VolumetricNetwork`.

### Residual Connections

Residual connections are handled by `LayerResidual` (type 14). In the sequential backward pass, residuals produce skip gradients that are accumulated via `skipGradients` (see `parallel_sequential.md`). For transformer blocks, the typical pattern uses `LayerSequential` with `LayerResidual` as a sub-layer:

```go
block := poly.VolumetricLayer{
    Type: poly.LayerSequential,
    SequentialLayers: []poly.VolumetricLayer{
        {Type: poly.LayerRMSNorm, ...},
        {Type: poly.LayerMultiHeadAttention, ...},
        {Type: poly.LayerResidual, ...},  // adds input to output
        {Type: poly.LayerRMSNorm, ...},
        {Type: poly.LayerSwiGLU, ...},
        {Type: poly.LayerResidual, ...},  // adds pre-FFN to FFN output
    },
}
```
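
The arithmetic behind that wiring is the usual pre-norm pattern, `x = x + attn(norm(x))` followed by `x = x + ffn(norm(x))`. The sketch below shows only that arithmetic with stand-in functions. It does not model how `LayerResidual` locates its skip input inside a `LayerSequential`, which this page does not specify.

```go
package main

import "fmt"

type fn func([]float64) []float64

// withResidual returns y = x + f(x), the skip connection a residual step adds.
func withResidual(x []float64, f fn) []float64 {
    fx := f(x)
    out := make([]float64, len(x))
    for i := range x {
        out[i] = x[i] + fx[i]
    }
    return out
}

// block applies two residual sub-blocks in pre-norm order:
// x = x + attn(norm1(x)), then x = x + ffn(norm2(x)).
func block(x []float64, norm1, attn, norm2, ffn fn) []float64 {
    x = withResidual(x, func(v []float64) []float64 { return attn(norm1(v)) })
    return withResidual(x, func(v []float64) []float64 { return ffn(norm2(v)) })
}

func main() {
    identity := func(v []float64) []float64 { return v }
    double := func(v []float64) []float64 {
        out := make([]float64, len(v))
        for i := range v {
            out[i] = 2 * v[i]
        }
        return out
    }
    // Stand-in sub-layers: identity norms, "attention" and "FFN" that double.
    fmt.Println(block([]float64{1, 2}, identity, double, identity, double))
}
```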

---

## The Transformer[T] Type

`Transformer[T]` is a high-level wrapper around `VolumetricNetwork` for autoregressive language model inference. It holds the components that live outside the main layer grid:

```go
type Transformer[T Numeric] struct {
    Network    *VolumetricNetwork
    Embeddings []float32  // token embedding table: [vocabSize × hiddenSize]
    LMHead     []float32  // output projection: [hiddenSize × vocabSize]
    FinalNorm  []float32  // final RMSNorm weights (one per hidden dim)
    HiddenSize int
    VocabSize  int
    Template   Template   // prompt formatting (chat template)
}
```

### NewTransformer

```go
func NewTransformer[T Numeric](
    network     *VolumetricNetwork,
    embeddings  []float32,
    lmHead      []float32,
    finalNorm   []float32,
    template    Template,
) *Transformer[T]
```

Creates the wrapper and infers `HiddenSize` from the first network layer's `DModel` or `InputHeight`. `VocabSize` is inferred as `len(Embeddings) / HiddenSize`.

If `finalNorm` is non-nil, a synthetic `VolumetricLayer` of type `LayerRMSNorm` is created internally to hold the final normalization weights. This layer is not part of the main grid — it runs separately after the last transformer block.

### Tied Weights Detection

When `LMHead` and `Embeddings` point to the same backing array (common in weight-tied models), `SyncToGPU` detects this and reuses the same GPU buffer for both:

```go
if &t.LMHead[0] == &t.Embeddings[0] {
    t.Network.GPULMHead = t.Network.GPUEmbeddings  // no second upload
}
```
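
The pointer comparison works because two slices that share a backing array and start at the same element have the same address for element 0. A standalone sketch of the same check, with a length guard added as a defensive assumption (indexing an empty slice would panic):

```go
package main

import "fmt"

// sharesBackingArray reports whether a and b start at the same element.
// The length checks avoid indexing an empty slice, which would panic.
func sharesBackingArray(a, b []float32) bool {
    if len(a) == 0 || len(b) == 0 {
        return false
    }
    return &a[0] == &b[0]
}

func main() {
    embeddings := make([]float32, 8)
    tied := embeddings                              // same backing array
    copied := append([]float32(nil), embeddings...) // separate allocation
    fmt.Println(sharesBackingArray(embeddings, tied), sharesBackingArray(embeddings, copied))
}
```

Note that a copy with identical values is not detected as tied; only slices that alias the same memory are.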

### Tiling

```go
func (t *Transformer[T]) EnableTiling(tileSize int)
```

Sets `UseTiling` (and `TileSize` when `tileSize > 0`) on every layer in the grid plus the standalone final norm layer. It does **not** by itself rebuild per-dtype maps — after loading or constructing the network, call `t.Network.RefreshRuntimeTileSizes()` if you need `CPUTileSizes` / GPU SC–MC maps populated before inference or training (training entrypoints usually do this for you).

### Generate

```go
func (t *Transformer[T]) Generate(
    encode func(text string) []uint32,
    decode func(tokens []uint32) string,
    turns []Turn,
    systemPrompt, userMsg string,
    opts GenOptions,
) string
```

Full autoregressive text generation pipeline:

```
┌──────────────────────────────────────────────────────────────────────┐
│  GENERATE FLOW                                                       │
│                                                                      │
│  1. Template.BuildPrompt(turns, systemPrompt, userMsg)               │
│     → apply chat template (e.g., <|im_start|>user\n...)             │
│                                                                      │
│  2. encode(prompt) → inputIDs []uint32                               │
│                                                                      │
│  3. Reset() → clear KV cache                                         │
│                                                                      │
│  4. Prefill (process all input tokens at once):                      │
│     a. tokensToTensor(inputIDs) → embed all tokens                  │
│     b. ForwardPolymorphic or ForwardTokenIDsWGPU (GPU)               │
│     c. applyLMHead(lastHiddenState) → logits over vocabulary         │
│                                                                      │
│  5. Decode loop (one token at a time):                               │
│     a. applyRepetitionPenalty(logits, generatedTokens)               │
│     b. SampleTopK(logits, TopK, Temperature, Deterministic)          │
│     c. stream.Push(tokens) → streaming decode callback               │
│     d. Forward single new token (incremental):                       │
│        getEmbedding(nextToken) → forwardOne(input)                  │
│        (KVOffset advances by 1 each step)                            │
│     e. check EOS condition or max tokens                             │
│                                                                      │
│  6. Return accumulated decoded string                                │
└──────────────────────────────────────────────────────────────────────┘
```

### GenOptions

```go
type GenOptions struct {
    MaxTokens    int
    Temperature  float64
    TopK         int
    Deterministic bool
    UseKVCache   bool
    EOSTokens    []int
}
```

`Deterministic = true` with `Temperature = 0` produces greedy decoding. `TopK` limits sampling to the top K logits before applying temperature.
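
As a rough illustration of how these options interact, here is a self-contained top-K sampler. It is a sketch under stated assumptions, not Loom's `SampleTopK`: the function name, the argument order, and the choice to pass the uniform random draw `u` explicitly (so results are reproducible) are all made up for this example.

```go
package main

import (
    "fmt"
    "math"
    "math/rand"
    "sort"
)

// sampleTopK picks a token index from logits.
// With deterministic set or temperature <= 0 it returns the argmax (greedy).
// Otherwise it keeps the topK largest logits, applies temperature, and uses
// the uniform draw u in [0, 1) to sample from the resulting distribution.
func sampleTopK(logits []float64, topK int, temperature float64, deterministic bool, u float64) int {
    idx := make([]int, len(logits))
    for i := range idx {
        idx[i] = i
    }
    sort.Slice(idx, func(a, b int) bool { return logits[idx[a]] > logits[idx[b]] })
    if deterministic || temperature <= 0 {
        return idx[0]
    }
    if topK <= 0 || topK > len(idx) {
        topK = len(idx)
    }
    idx = idx[:topK]
    // Unnormalized softmax over the kept logits, scaled by temperature.
    probs := make([]float64, topK)
    sum := 0.0
    for i, id := range idx {
        probs[i] = math.Exp((logits[id] - logits[idx[0]]) / temperature)
        sum += probs[i]
    }
    r := u * sum
    for i, p := range probs {
        r -= p
        if r <= 0 {
            return idx[i]
        }
    }
    return idx[topK-1]
}

func main() {
    logits := []float64{0.1, 2.5, 1.0, -3.0}
    fmt.Println(sampleTopK(logits, 2, 0.7, false, rand.Float64())) // index 1 or 2
    fmt.Println(sampleTopK(logits, 2, 0, true, 0))                 // greedy argmax
}
```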

---

## GPU Transformer Inference

When `network.UseGPU = true` and `SyncToGPU()` has been called, `Generate` uses `ForwardTokenIDsWGPU` for both prefill and incremental decode:

```go
logitTensor, err := t.ForwardTokenIDsWGPU(tokens, nil, true, true)
```

This dispatches into `wgpu_forward.go`'s GPU transformer block execution path, which runs matrix multiplications and attention as WebGPU compute shader invocations. All intermediate activations stay in VRAM; only the final logit tensor is read back to the CPU for sampling.

The GPU path uses the `BeginFrame` / `FlushFrame` pattern (see `gpu.md`) — one GPU command buffer encodes the entire forward pass across all transformer layers, then flushes in a single submit. This minimizes CPU–GPU synchronization overhead.

---

## C-ABI Integration (welvet)

Loom v0.75.0 exposes highly optimized C-ABI entry points for the `Transformer` wrapper, enabling maximum throughput for language bindings like Python and TypeScript.

### 1. LoomTokensToTensor
A high-speed gather kernel that converts token IDs directly into a pre-allocated model input tensor.
- **WASM/Go**: Uses direct memory access to avoid intermediate allocations.
- **WebGPU**: Dispatches a gather compute shader to perform embedding lookup entirely in VRAM.

### 2. LoomForwardFull
The authoritative entry point for auto-regressive generation. It encapsulates:
- `Reset()` (optional clearing of KV cache)
- `TokensToTensor` (Input ID processing)
- `ForwardPolymorphic` (Engine execution)
- `ApplyLMHead` (Output projection)

This unified path collapses those four steps into a single cross-language call (e.g., Python → Go), a **75%** reduction in call count that lowers latency for real-time token streaming.

---

## Loading from SafeTensors / HuggingFace

`universal_loader.go` auto-detects the checkpoint format. For HuggingFace models:

1. `safetensors.go` reads the weight tensor map (key → `[]float32`)
2. `prefix_safetensor.go` strips model-specific prefix patterns (e.g., `model.layers.0.self_attn.q_proj.weight`)
3. Weight slices are copied into the correct `VolumetricLayer.WeightStore.Master` at the computed offsets

The key-to-layer mapping follows the weight layout described earlier:
```
model.layers.{N}.self_attn.q_proj.weight  → layer N's Q weight sub-slice
model.layers.{N}.self_attn.k_proj.weight  → layer N's K weight sub-slice
...
```
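
To illustrate the `{N}` extraction in that mapping, here is a small parser for HuggingFace-style keys. It is a sketch only: `parseLayerKey` is a hypothetical helper, and the real logic in `prefix_safetensor.go` may handle more prefix patterns than the single `model.layers.` form assumed here.

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseLayerKey splits a key such as "model.layers.3.self_attn.q_proj.weight"
// into the layer index (3) and the remaining tensor name ("self_attn.q_proj.weight").
// ok is false for keys that do not follow the "model.layers.{N}." pattern.
func parseLayerKey(key string) (layer int, tensor string, ok bool) {
    const prefix = "model.layers."
    if !strings.HasPrefix(key, prefix) {
        return 0, "", false
    }
    rest := key[len(prefix):]
    dot := strings.IndexByte(rest, '.')
    if dot < 0 {
        return 0, "", false
    }
    n, err := strconv.Atoi(rest[:dot])
    if err != nil || n < 0 {
        return 0, "", false
    }
    return n, rest[dot+1:], true
}

func main() {
    layer, tensor, ok := parseLayerKey("model.layers.3.self_attn.q_proj.weight")
    fmt.Println(layer, tensor, ok)
}
```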

After loading, call `poly.MorphLayer(network, targetDtype)` to convert to your desired inference precision.

---

## Practical: Building a 7-layer Transformer Network

```go
hiddenSize := 512
numHeads   := 8
numLayers  := 7
seqLen     := 2048

network := poly.NewVolumetricNetwork("llm-7l", 1, numLayers, 1, 1)

for i := 0; i < numLayers; i++ {
    l := network.GetLayer(0, i, 0, 0)
    l.Type         = poly.LayerSequential
    l.SequentialLayers = []poly.VolumetricLayer{
        {Type: poly.LayerRMSNorm, InputHeight: hiddenSize, OutputHeight: hiddenSize},
        {
            Type:         poly.LayerMultiHeadAttention,
            DModel:       hiddenSize,
            NumHeads:     numHeads,
            NumKVHeads:   2,         // GQA: 4 query heads share each KV head
            HeadDim:      hiddenSize / numHeads,
            MaxSeqLen:    seqLen,
            RoPEFreqBase: 10000.0,
        },
        {Type: poly.LayerRMSNorm,  InputHeight: hiddenSize, OutputHeight: hiddenSize},
        {Type: poly.LayerSwiGLU,   InputHeight: hiddenSize, OutputHeight: hiddenSize * 8 / 3}, // 1365 by integer division; earlier examples use 1364
    }
    poly.InitializeLayerWeights(l)
}

transformer := poly.NewTransformer[float32](
    network,
    embeddings,
    lmHead,
    finalNormWeights,
    chatTemplate,
)
transformer.EnableTiling(0)  // auto-detect tile size
```

---

## Quick Reference: Common Code Snippets

Source: https://openfluke.com/docs/quick-reference
Markdown: https://openfluke.com/docs/quick-reference.md

# Quick Reference: Common Code Snippets

Concise, copy-paste-ready patterns for the most common `poly/` tasks. Each snippet assumes `import poly "github.com/openfluke/soul/loom/poly"` (adjust to your module path).

---

## 📦 TypeScript / Node.js Installation

```bash
npm install @openfluke/welvet
```

See [deployment.md](deployment.md) for full isomorphic details.

## Creating a Network

```go
// NewVolumetricNetwork(id, depth, rows, cols, layersPerCell)
network := poly.NewVolumetricNetwork("my-net", 1, 3, 1, 1)
// 1×3×1 grid = 3 layers stacked in the Y dimension
```

---

## Adding and Configuring Layers

```go
// Retrieve a layer by 4D coordinate (z, y, x, layerIndex)
l := network.GetLayer(0, 0, 0, 0)

// Dense layer
l.Type         = poly.LayerDense
l.InputHeight  = 128
l.OutputHeight = 64
l.Activation   = poly.ActivationReLU
l.DType        = poly.DTypeFloat32

// Initialize weights (random)
poly.InitializeLayerWeights(l)
```

---

## Forward Pass

```go
input := poly.NewTensor[float32](128)  // flat [128] input
copy(input.Data, myInputData)

output, inputs, preActs := poly.ForwardPolymorphic[float32](network, input)
// output  = final layer's output tensor
// inputs  = cached inputs for each layer (needed for backward)
// preActs = cached pre-activations for each layer
```

---

## Backward Pass

```go
// Compute loss gradient (e.g., MSE gradient)
target := poly.NewTensor[float32](64)
copy(target.Data, myTargetData)

gradOutput := poly.ComputeLossGradient[float32](output, target, poly.LossMSE)

// Backpropagate
gradInput, layerGrads := poly.BackwardPolymorphic[float32](network, gradOutput, inputs, preActs)
```

---

## Applying Gradients

```go
lr := float32(0.001)
poly.ApplyRecursiveGradients[float32](network, layerGrads, lr)
```

---

## Full Training Loop (Manual)

```go
for epoch := 0; epoch < 100; epoch++ {
    output, inputs, preActs := poly.ForwardPolymorphic[float32](network, input)
    loss := poly.CalculateLoss[float32](output, target, poly.LossMSE)
    gradOutput := poly.ComputeLossGradient[float32](output, target, poly.LossMSE)
    _, layerGrads := poly.BackwardPolymorphic[float32](network, gradOutput, inputs, preActs)
    poly.ApplyRecursiveGradients[float32](network, layerGrads, 0.001)
    fmt.Printf("epoch %d  loss=%.4f\n", epoch, loss)
}
```
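
The same forward, loss, gradient, update cycle can be seen end to end on a toy problem without any library. The sketch below fits a single scale weight with plain slices. The MSE scaling (`2/n` in the gradient) is an assumption for this example and may not match the convention used by `poly.ComputeLossGradient`.

```go
package main

import "fmt"

// mseLoss returns mean((out - target)^2).
func mseLoss(out, target []float64) float64 {
    sum := 0.0
    for i := range out {
        d := out[i] - target[i]
        sum += d * d
    }
    return sum / float64(len(out))
}

// mseGrad returns dLoss/dOut = 2 * (out - target) / n.
func mseGrad(out, target []float64) []float64 {
    g := make([]float64, len(out))
    for i := range out {
        g[i] = 2 * (out[i] - target[i]) / float64(len(out))
    }
    return g
}

// trainScale fits a single weight w so that w*x approximates target,
// using the forward / loss / gradient / update order of a manual loop.
func trainScale(x, target []float64, lr float64, epochs int) (w, first, last float64) {
    out := make([]float64, len(x))
    for epoch := 0; epoch < epochs; epoch++ {
        for i := range x {
            out[i] = w * x[i] // forward
        }
        loss := mseLoss(out, target)
        if epoch == 0 {
            first = loss
        }
        last = loss
        gradOut := mseGrad(out, target)
        gradW := 0.0
        for i := range x {
            gradW += gradOut[i] * x[i] // backward: chain rule through w*x
        }
        w -= lr * gradW // apply gradient
    }
    return w, first, last
}

func main() {
    x := []float64{1, 2, 3}
    target := []float64{2, 4, 6} // true scale is 2
    w, first, last := trainScale(x, target, 0.05, 100)
    fmt.Printf("w=%.4f first=%.4f last=%.6f\n", w, first, last)
}
```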

---

## Batch Training (High-Level)

```go
config := poly.TrainingConfig{
    LearningRate: 0.001,
    Epochs:       50,
    BatchSize:    32,
    LossFunction: poly.LossMSE,
    UseGPU:       false,
}

result := poly.Train[float32](network, trainingData, config)
fmt.Printf("final loss: %.4f\n", result.FinalLoss)
```

---

## Type-Switching with Generics

```go
// Run forward pass with any numeric type
func runForward[T poly.Numeric](net *poly.VolumetricNetwork, data []T) *poly.Tensor[T] {
    input := poly.NewTensor[T](len(data))
    copy(input.Data, data)
    out, _, _ := poly.ForwardPolymorphic[T](net, input)
    return out
}

// Call with float32
out32 := runForward[float32](network, myFloat32Data)

// Call with int8
out8 := runForward[int8](network, myInt8Data)
```

---

## Quantizing a Trained Network

```go
// Convert all layers to Int8
poly.MorphLayer(network, poly.DTypeInt8)

// Convert to Int4 (4-bit)
poly.MorphLayer(network, poly.DTypeInt4)

// Revert: clear versions and retrain or re-morph
for i := range network.Layers {
    network.Layers[i].WeightStore.Versions = make(map[poly.DType]any)
}
poly.MorphLayer(network, poly.DTypeBFloat16)
```

---

## Saving and Loading (Full Weights)

```go
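// Save
jsonData, err := poly.SerializeNetwork(network)
if err != nil { log.Fatal(err) }
if err := os.WriteFile("model.json", jsonData, 0644); err != nil { log.Fatal(err) }

// Load (fresh variable names so both halves compile in one scope)
loadedJSON, err := os.ReadFile("model.json")
if err != nil { log.Fatal(err) }
loaded, err := poly.DeserializeNetwork(loadedJSON)
if err != nil { log.Fatal(err) }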
```

---

## Architecture-Only JSON (Random Weights)

```go
spec := `{
  "id": "my-net",
  "depth": 1, "rows": 2, "cols": 1, "layers_per_cell": 1,
  "layers": [
    {"z":0,"y":0,"x":0,"l":0,"type":"Dense","activation":"ReLU",
     "dtype":"float32","input_height":128,"output_height":64},
    {"z":0,"y":1,"x":0,"l":0,"type":"Dense","activation":"Linear",
     "dtype":"float32","input_height":64,"output_height":10}
  ]
}`

network, err := poly.BuildNetworkFromJSON([]byte(spec))
if err != nil { log.Fatal(err) }
```

---

## Parallel Branches

```go
l.Type        = poly.LayerParallel
l.CombineMode = "concat"
l.ParallelBranches = []poly.VolumetricLayer{
    {Type: poly.LayerDense, InputHeight: 64, OutputHeight: 32,
     Activation: poly.ActivationReLU, DType: poly.DTypeFloat32},
    {Type: poly.LayerRNN,   InputHeight: 64, OutputHeight: 32,
     Activation: poly.ActivationTanh, DType: poly.DTypeFloat32},
}
```

---

## Sequential Sub-Layers

```go
l.Type = poly.LayerSequential
l.SequentialLayers = []poly.VolumetricLayer{
    {Type: poly.LayerRMSNorm, InputHeight: 256, OutputHeight: 256},
    {Type: poly.LayerDense,   InputHeight: 256, OutputHeight: 256,
     Activation: poly.ActivationGELU, DType: poly.DTypeFloat32},
}
```

---

## Soft Mixture of Experts

```go
l.Type        = poly.LayerParallel
l.CombineMode = "filter"
l.FilterGateConfig = &poly.VolumetricLayer{
    Type:         poly.LayerDense,
    InputHeight:  64,
    OutputHeight: 3,  // one weight per expert
    Activation:   poly.ActivationLinear,
}
l.ParallelBranches = []poly.VolumetricLayer{
    {Type: poly.LayerDense, InputHeight: 64, OutputHeight: 32, ...},
    {Type: poly.LayerDense, InputHeight: 64, OutputHeight: 32, ...},
    {Type: poly.LayerDense, InputHeight: 64, OutputHeight: 32, ...},
}
```
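
Conceptually, a soft mixture blends every expert's output using gate weights. The sketch below assumes the gate logits are passed through a softmax before blending. That normalization step is an assumption for illustration, since this snippet does not state how the `"filter"` combine mode normalizes the gate output.

```go
package main

import (
    "fmt"
    "math"
)

// softmax returns a normalized copy of logits.
func softmax(logits []float64) []float64 {
    maxV := logits[0]
    for _, v := range logits {
        if v > maxV {
            maxV = v
        }
    }
    out := make([]float64, len(logits))
    sum := 0.0
    for i, v := range logits {
        out[i] = math.Exp(v - maxV)
        sum += out[i]
    }
    for i := range out {
        out[i] /= sum
    }
    return out
}

// softMixture blends expert outputs using softmax(gateLogits) as weights.
// Every expert output must have the same length.
func softMixture(gateLogits []float64, expertOut [][]float64) []float64 {
    w := softmax(gateLogits)
    out := make([]float64, len(expertOut[0]))
    for e, y := range expertOut {
        for i, v := range y {
            out[i] += w[e] * v
        }
    }
    return out
}

func main() {
    gate := []float64{0, 0, 0} // equal gate logits: plain average of experts
    experts := [][]float64{{3, 0}, {0, 3}, {3, 3}}
    fmt.Println(softMixture(gate, experts))
}
```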

---

## Remote Link (Spatial Hop)

```go
// Layer at (0,1,0,0) reads output from (0,0,0,0) instead of its immediate predecessor
l := network.GetLayer(0, 1, 0, 0)
l.IsRemoteLink = true
l.TargetZ, l.TargetY, l.TargetX, l.TargetL = 0, 0, 0, 0
```

---

## Step mesh (continuous operation)

```go
state := poly.NewStepState[float32](network)
state.SetInput(inputTensor)

for tick := 0; tick < 1000; tick++ {
    poly.StepForward(network, state, false)  // false = no history
    // read current output from state.LayerData[lastLayerIdx]
}

// Online learning (no history required)
poly.StepApplyTween(network, state, targetTensor, 0.001)
```

---

## Step mesh with BPTT (training)

```go
state := poly.NewStepState[float32](network)
state.SetInput(inputTensor)

for tick := 0; tick < numSteps; tick++ {
    poly.StepForward(network, state, true)  // true = capture history
}

_, layerGrads, err := poly.StepBackward(network, state, gradOutput)
if err != nil { log.Fatal(err) }
poly.ApplyRecursiveGradients[float32](network, layerGrads, lr)
```

---

## DNA Comparison

```go
// Snapshot before training
dna1 := poly.ExtractDNA(network)

// Train ...
poly.Train[float32](network, data, config)

// Snapshot after training
dna2 := poly.ExtractDNA(network)

result := poly.CompareNetworks(dna1, dna2)
fmt.Printf("Similarity: %.4f\n", result.OverallOverlap)
for _, shift := range result.LogicShifts {
    fmt.Printf("Logic migrated: %s → %s (%.3f)\n",
        shift.SourcePos, shift.TargetPos, shift.Overlap)
}
```

---

## GPU Initialization

```go
network.UseGPU = true
ctx, err := poly.InitWGPU()
if err != nil { log.Fatal("GPU init failed:", err) }
network.GPUContext = ctx

// Sync all layer weights to VRAM
for i := range network.Layers {
    network.Layers[i].SyncToGPU()
}

// Fill per-dtype maps: CPUTileSizes (CPU) + GPUSCTileSizes / GPUMCTileSizes (GPU).
// GPU inference: EnableMultiCoreTiling false → SC, true → MC (wgpu_forward reads Network.*).
// CPU polymorphic code uses GetCPUTileSize only — no separate SC/MC layer maps.
network.EnableMultiCoreTiling = true
network.RefreshRuntimeTileSizes()

// GPU batch training
config := poly.TrainingConfig{UseGPU: true, LearningRate: 0.001, Epochs: 100}
result := poly.Train[float32](network, data, config)
```

---

## Transformer Inference

```go
transformer := poly.NewTransformer[float32](
    network,
    embeddingWeights,
    lmHeadWeights,
    finalNormWeights,
    chatTemplate,
)
transformer.EnableTiling(0)  // auto tile size

output := transformer.Generate(
    tokenizer.Encode,
    tokenizer.Decode,
    []poly.Turn{},  // no history
    "You are a helpful assistant.",
    "What is 2 + 2?",
    poly.GenOptions{
        MaxTokens:   256,
        Temperature: 0.7,
        TopK:        40,
    },
)
fmt.Println(output)
```

---

## Softmax Variants

```go
// Temperature softmax
l.Type          = poly.LayerSoftmax
l.SoftmaxType   = poly.SoftmaxTemperature
l.Temperature   = 0.5

// Masked softmax (causal)
l.SoftmaxType   = poly.SoftmaxMasked
l.Mask          = []bool{true, true, false, false}  // mask out last 2

// Sparse (exact zeros)
l.SoftmaxType   = poly.SoftmaxSparse

// Entmax (tunable sparsity)
l.SoftmaxType   = poly.SoftmaxEntmax
l.EntmaxAlpha   = 1.5
```
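
For intuition about the first two variants, here is a standalone sketch of temperature scaling combined with a keep-mask. It follows the mask convention in the snippet above (`false` masks an entry out). The function name is hypothetical, and the sparse and entmax variants are not covered.

```go
package main

import (
    "fmt"
    "math"
)

// softmaxTM applies temperature scaling and an optional keep-mask.
// mask[i] == false removes entry i (its probability becomes exactly 0).
// A nil mask keeps every entry. At least one entry must be kept.
func softmaxTM(logits []float64, temperature float64, mask []bool) []float64 {
    out := make([]float64, len(logits))
    maxV := math.Inf(-1)
    for i, v := range logits {
        if mask != nil && !mask[i] {
            continue
        }
        if s := v / temperature; s > maxV {
            maxV = s
        }
    }
    sum := 0.0
    for i, v := range logits {
        if mask != nil && !mask[i] {
            continue
        }
        out[i] = math.Exp(v/temperature - maxV)
        sum += out[i]
    }
    for i := range out {
        out[i] /= sum
    }
    return out
}

func main() {
    logits := []float64{1, 2, 3, 4}
    fmt.Println(softmaxTM(logits, 0.5, nil))
    fmt.Println(softmaxTM(logits, 1, []bool{true, true, false, false}))
}
```

A temperature below 1 sharpens the distribution toward the largest logit, while masked entries receive exactly zero probability and the rest renormalize.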

---

## Q4_0 Block Quantization

```go
// Quantize a weight slice into 32-weight blocks
blocks := poly.QuantizeQ4_0(myWeights)
// blocks[i].Scale   = per-block float32 scale
// blocks[i].Weights = [16]byte with 32 packed nibbles

// Dequantize back to float32
recovered := poly.DequantizeQ4_0(blocks, len(myWeights))
```
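
To show what a 32-weight, 16-byte block looks like in practice, here is a self-contained sketch of one block's round trip. The field names follow the comments above, but the nibble order, the symmetric `amax / 7` scale, and the `+8` offset are assumptions for illustration. Loom's `QuantizeQ4_0` may pack and scale differently.

```go
package main

import (
    "fmt"
    "math"
)

// block4 holds 32 weights as 4-bit codes plus one scale.
// Assumed layout: byte i packs weight i (low nibble) and weight i+16 (high nibble).
type block4 struct {
    Scale   float32
    Weights [16]byte
}

// quantizeBlock maps 32 floats to signed 4-bit codes in [-8, 7], stored with a +8 offset.
func quantizeBlock(w [32]float32) block4 {
    var amax float32
    for _, v := range w {
        if a := float32(math.Abs(float64(v))); a > amax {
            amax = a
        }
    }
    b := block4{Scale: amax / 7}
    if amax == 0 {
        b.Scale = 1 // all-zero block: any scale works
    }
    code := func(v float32) byte {
        q := int(math.Round(float64(v / b.Scale)))
        if q < -8 {
            q = -8
        } else if q > 7 {
            q = 7
        }
        return byte(q + 8)
    }
    for i := 0; i < 16; i++ {
        b.Weights[i] = code(w[i]) | code(w[i+16])<<4
    }
    return b
}

// dequantizeBlock reverses quantizeBlock up to rounding error.
func dequantizeBlock(b block4) [32]float32 {
    var out [32]float32
    for i := 0; i < 16; i++ {
        out[i] = float32(int(b.Weights[i]&0x0F)-8) * b.Scale
        out[i+16] = float32(int(b.Weights[i]>>4)-8) * b.Scale
    }
    return out
}

func main() {
    var w [32]float32
    for i := range w {
        w[i] = float32(i-16) / 16
    }
    b := quantizeBlock(w)
    r := dequantizeBlock(b)
    fmt.Println(b.Scale, r[:4])
}
```

Under these assumptions the round-trip error of each weight is bounded by about half the block scale.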

---

## DType / Activation / LayerType Parsing

```go
// From string (case-insensitive, aliases accepted)
dtype, err      := poly.ParseDType("int8")       // → DTypeInt8
activation, err := poly.ParseActivationType("relu") // → ActivationReLU
layerType, err  := poly.ParseLayerType("Dense")   // → LayerDense
```

---

## Tween (Layer-Local Learning)

Same idea as neural target propagation in the literature; we call it **tween** in code and informal docs (`tween.go`).

```go
tweenConfig := poly.TweenConfig{
    UseChainRule: true,  // false = gap-based (for step meshes)
    LearningRate: 0.01,
}
tweenState := poly.NewTweenState[float32](network)

// Forward, backward, then weight update (three calls)
poly.TweenForward[float32](network, tweenState, input)
poly.TweenBackward[float32](network, tweenState, globalTarget)
poly.ApplyTweenGaps[float32](network, tweenState, 0.01)
```

---

## Tensor Creation

```go
// 1D tensor
t1 := poly.NewTensor[float32](128)

// 2D tensor (e.g., [seqLen, hiddenSize])
t2 := poly.NewTensor[float32](16, 512)

// With initial data
t3 := poly.NewTensor[int8](8)
for i := range t3.Data { t3.Data[i] = int8(i) }

// Check shape
fmt.Println(t2.Shape)  // [16 512]
fmt.Println(len(t2.Data))  // 8192
```

---

## Testing, validation, and Lucy logs

Source: https://openfluke.com/docs/testing-and-validation
Markdown: https://openfluke.com/docs/testing-and-validation.md

# Testing, validation, and Lucy logs

This page ties together **how we stress `poly/`**, where **artifacts land**, and how to read **parity tables** in captured logs (for example `lucy/lucy_testing_output/log.txt`).

---

## Where logs come from

The **Lucy** tree (`lucy/`) drives broad layer suites: forward/backward parity, training matrices, save/reload checks, and GPU timing tables. Typical transcripts:

| Log | Menu | Contents |
|-----|------|----------|
| `lucy/lucy_testing_output/log.txt` | Dense L1 / GPU parity / layer matrices | Forward/backward parity, ASM timers, GPU tables |
| `lucy/lucy_testing_output/seven_layer.txt` | **[7] Seven-layer CPU suite** | 10 layer types × 21 dtypes × 1³/2³/3³ grids, SC/MC, train, save/reload |

Both files are meant for human review and regression diffing (adapter name, per-dtype rows, summary tallies).

**Seven-layer suite (v0.79+):** See [`bedrock_validation.md`](bedrock_validation.md) for what the harness gates (MHA layout, KV decode, native ternary save, C-ABI `SyncInferenceWeights`). Run `cd lucy && go run .` → **[7]** or **[0]**.

---

## How to read parity summary lines

Sections often end with a line shaped like:

```text
>> [Forward Parity] 84 Tests | 💎 42 | ✅ 24 | 🟨 0 | 🟠 0 | 🟤 18 | ❌ 0 | 💀 0
```

Rough meaning (exact thresholds live in the test harness, not duplicated here):

| Symbol | Typical meaning |
|--------|-----------------|
| **💎** | Exact / diamond-grade agreement within the tightest tolerance |
| **✅** | Pass within configured industry-grade tolerance |
| **🟨 / 🟠** | Elevated drift bands (still classified by the harness) |
| **🟤** | Heavy drift (e.g. **H-DRIFT** in backward tables) — worth investigating dtype + path |
| **❌** | Hard failure (assert or threshold breach) |
| **💀** | Fatal / panic / infrastructure failure |

Backward tables may label columns **INDUS** (industry tolerance) vs **H-DRIFT** (heavy drift). Treat **🟤** rows as “numerically alive but not interchangeable with FP32 reference at the same tolerance,” not necessarily as engine bugs: some combinations are expected to diverge when the reference path is float32-simulated and the subject path is true low-bit or integer-native.
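
When diffing logs, it can help to check mechanically that the per-symbol counts add up to the declared total. The parser below is a minimal sketch that assumes the exact line shape shown above (`|`-separated segments, a header ending in `<N> Tests`). It is not part of the Lucy harness.

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseParitySummary extracts the declared test total and the per-symbol counts
// from a summary line, and reports whether the counts add up to the total.
func parseParitySummary(line string) (total int, counts map[string]int, consistent bool, err error) {
    parts := strings.Split(line, "|")
    // First segment looks like ">> [Forward Parity] 84 Tests".
    head := strings.Fields(parts[0])
    if len(head) < 2 {
        return 0, nil, false, fmt.Errorf("malformed header: %q", parts[0])
    }
    total, err = strconv.Atoi(head[len(head)-2])
    if err != nil {
        return 0, nil, false, fmt.Errorf("bad total: %w", err)
    }
    counts = make(map[string]int)
    sum := 0
    for _, p := range parts[1:] {
        f := strings.Fields(p) // symbol followed by its count
        if len(f) != 2 {
            return 0, nil, false, fmt.Errorf("malformed segment: %q", p)
        }
        n, convErr := strconv.Atoi(f[1])
        if convErr != nil {
            return 0, nil, false, fmt.Errorf("bad count in %q: %w", p, convErr)
        }
        counts[f[0]] = n
        sum += n
    }
    return total, counts, sum == total, nil
}

func main() {
    line := ">> [Forward Parity] 84 Tests | 💎 42 | ✅ 24 | 🟨 0 | 🟠 0 | 🟤 18 | ❌ 0 | 💀 0"
    total, counts, ok, err := parseParitySummary(line)
    fmt.Println(total, counts["🟤"], ok, err)
}
```

For the example line, the seven counts (42 + 24 + 0 + 0 + 18 + 0 + 0) match the declared 84 tests.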

---

## May 2026 full-suite snapshot (`log.txt`)

Recent **Run All Layer Tests** captures (Metal / arm64, ~2992 rows) show:

| Metric | Value |
|--------|--------|
| **Broken (❌)** | **0** |
| **Fatal / NaN (💀)** | **0** |
| Bit-exact (💎) | ~75% of classified rows |
| Heavy drift (🟤) | ~17% — mostly forward parity vs FP32 reference on native-int / low-bit paths |

**Fixes reflected in this run (vs earlier transcripts):**

- **Training matrix** — `File` / `RAM` columns print correctly (no `%!s(MISSING)`); every Dense training row **TrainOK PASS** and **Save/Reload PASS** for all 21 dtypes.
- **Save/Reload** — CNN1/2/3, Dense, Embedding, LSTM, MHA, Residual, RNN, SwiGLU each end with `[Save/Reload <layer>] PASS`.
- **Global manifest** — no hard failures across the full layer sweep.

**Still classified as 🟤 (not ❌):** Dense forward parity rows where CPU uses true integer/low-bit math and the harness compares to a float-shaped reference; CNN backward **H-DRIFT** on Float16/BFloat16/Int4 (GPU vs CPU reference). Treat as tolerance bands — see parity legend above.

---

## Dense forward ASM (Plan 9)

Lucy **Dense → Generic Layer Suite** prints **Go SC · Go MC · ASM SC · ASM MC · GPU SC · GPU MC** and speedup columns:

- **Go/Asm↑** = Go wall time ÷ ASM wall time (**> 1.0** = assembly wins).
- Toggle: `UseAsmForward` on the network/layer; kernels live under `poly/asm/` (see [`asm/README.md`](../poly/asm/README.md)).

**Latest Dense bench (8×1024→512, Metal host, from `log.txt`):**

| Highlight | Go/Asm↑ SC | Go/Asm↑ MC |
|-----------|------------|------------|
| Best single-core | **Uint8** ~**2.46×** | — |
| Best multi-core | — | **Uint4** ~**3.55×** |
| Strong quant MC | — | **Ternary** ~3.21×, **FP4** ~3.25×, **Binary** ~2.78×, **Int8** ~2.72× |
| Float32 | ~1.11× | ~1.00× (parity) |
| Float64 | **< 1×** (asm slower on this shape) | ~0.61× |

Low-bit and morphed-`uint8` paths benefit most from native integer dots in Plan 9. Float64 SC/MC still favors Go tiled matmul on the current tile sizes — tuning item, not a broken toggle.

**Backward / training:** asm is **forward-only** today; Dense backward parity uses Go CPU vs GPU; training does not call asm.

---

## Interpreting a real log (examples)

The following patterns show up in recent `log.txt` captures (Metal adapter, tiled CNN1 suite):

1. **CNN1 generic suite note** — The harness itself reminds you that generic CNN1 tests still include **simulated / PTQ fallback** where a dtype has no strict native path. For a **strict native-only** CPU/GPU/tiling audit, use the **Glitch** `layer_matrix` example (see Glitch docs / examples in-repo).

2. **Float64 on GPU forward** — CPU microseconds vs GPU milliseconds often look like a large “speedup ratio < 1×”; that is frequently **dispatch overhead dominating tiny work**, not a claim that FP64 GPU is slower than CPU math in the large-batch limit.

3. **Wide integer CNN1 backward** — **Int64 / Uint64 / Int32 / Uint32** rows may show **🟤 H-DRIFT** vs float reference in GPU backward parity: the harness compares against an FP32-shaped reference while the native path uses integer semantics — read those rows as **classification / tolerance**, not as “GPU kernel wrong.”

4. **Save/Reload after training** — On the **Dense** suite (May 2026 log), **Save/Reload PASS** for all 21 dtypes after training. Older CNN-only rows or pre-native-save builds may still show FAIL on specific combos; diff against current `persistence.go` (`Native: true` + per-layer `dtype`) before treating as open bugs.

5. **Uint CPU training** — **Uint64 / Uint32** (and sometimes **Uint16**) may show **TrainOK FAIL** on CPU-tiled modes while GPU modes **PASS**: that points at **CPU-side training / loss scaling** for unsigned paths, not at GPU correctness.

6. **Peak performance gap line** — The footer **PEAK PERFORMANCE GAP** (e.g. Dense Forward Float16) is a **headline ratio** from one worst row in the scan table; it is useful for spotting outliers, not as a single global quality score.

---

## Poly package: what the suites actually exercise

High-signal files and areas (not exhaustive):

| Area | Representative files |
|------|------------------------|
| Core types & dispatch | `poly.go`, `forward.go`, `backward.go`, `training.go` |
| Numerical morphing | `weights.go`, `quantization.go`, CNN / dense / MHA polymorphic `*.go` |
| GPU / WebGPU | `wgpu_context.go`, `wgpu_forward.go`, `wgpu_kernels.go`, `wgpu_shaders.go`, `wgpu_softmax.go` |
| Tiling & tile size | `tile_detection.go`, `*_tiled*.go` paths in dense / CNN / MHA |
| Serialization | `serialization.go`, `persistence.go`, `safetensors.go` |
| Native layer matrix harness | `native_layer_matrix.go`, `native_matrix_builtin_hooks.go` |
| Telemetry | `tanhi.go`, hardware probes in `hardware.go` |

When you add a layer or dtype, extend **both** the Lucy (or Glitch) harness **and** this doc if the log format or tolerance bands change.

---

## Related commands (developer workflow)

Exact entrypoints move with refactors; prefer:

- `lucy/README.md` — MRBiVS stack and pointers into `poly/`.
- `poly/README.md` — version checklist and capability matrix.
- `welvet/cabi/internal/check/` — C-ABI vs `poly/` export parity scanner (Go); expect **461/461 (100%)** after v0.79 (`LoomSyncInferenceWeights`).

---

## See also

- [bedrock_validation.md](bedrock_validation.md) — v0.79.0 seven-layer suite, MHA/KV, C-ABI  
- [numerical_types.md](numerical_types.md) — DType list and `WeightStore` lifecycle  
- [gpu.md](gpu.md) — WebGPU context and dispatch overview  
- [serialization.md](serialization.md) — Save/load and safetensors  
- [training.md](training.md) — Training modes and loss paths  

---

## Bedrock Validation (v0.79.0)

Source: https://openfluke.com/docs/bedrock-validation
Markdown: https://openfluke.com/docs/bedrock-validation.md

# Bedrock Validation (v0.79.0)

**Release:** **0.78.0 "ASM CPU"** → **0.79.0 "Bedrock Validation"**  
**Checklist:** **108 / 142** (76.1%) → **111 / 142** (78.2%)

This wave does not add a new compute backend. It hardens the **Go CPU** path, **native persistence**, **transformer decode**, and **C-ABI** so Lucy and Welvet bindings can trust train → save → reload → infer on real volumetric graphs.

---

## What changed (summary)

| Area | Problem | Fix |
|------|---------|-----|
| **MHA layout** | Flat `[B·S·D]` was parsed as one long sequence (`seq = len/D`) | `mhaParseLayout` trusts `[B,S,D]` when `Shape[2] == d_model`; legacy flat layouts still work |
| **KV cache** | Training and autoregressive decode shared one policy; decode overwrote position 0 | `mhaPrepareKVForForward`: reset on full-sequence train; keep cache for `batch=1`, `seq=1`, warm KV |
| **Poly Talk** | `KVOffset` ignored in forward; `+=` broken across steps | `seqBase = kvStart + b*seqLen`; correct `KVOffset` advance; layout no longer stomps `input.Shape[1]` |
| **MHA backward** | Backward recomputed Q with RoPE but skipped the Q/K RMS norm that forward applies | Backward now applies the Q/K RMS norm in the same order as forward, before RoPE |
| **Dense Ternary save** | Checkpoint re-quantized from FP32 Master, not native path | `GetBitNetTernaryMatrix` → `packNativeTernaryToBitNetMatrix` (same matmul as forward) |
| **Signed low-bit I/O** | Int2/Int4/Ternary round-trip gaps on `[]uint8` | `persistence.go` encode/decode aligned with CPU kernels |
| **FP32 Master lifecycle** | Bindings could not mirror post-train native-only RAM | `LoomSyncInferenceWeights` in `welvet/cabi` (461/461 C-ABI parity) |
| **Regression harness** | False PASS (zeros/NaN); suite gaps | Lucy **[7] seven-layer** CPU suite: 10 layer types × 21 dtypes × SC/MC × train × save/reload |
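
The MHA layout and KV cache rows above can be illustrated with a minimal Go sketch. The names `layout`, `parseLayout`, `keepKV`, and `seqBase` are hypothetical stand-ins, not the actual `mhaParseLayout` / `mhaPrepareKVForForward` code; only the rules are taken from the table (trust `[B,S,D]` when `Shape[2] == d_model`, fall back to `seq = len/D`, keep the cache only for a warm `batch=1`, `seq=1` decode step).

```go
package main

import "fmt"

// layout is a hypothetical (batch, seq) pair; it is not a Loom type.
type layout struct{ Batch, Seq int }

// parseLayout trusts a 3D [B,S,D] shape when the last dimension equals
// dModel; otherwise it falls back to the legacy flat reading (seq = len/D).
func parseLayout(shape []int, n, dModel int) layout {
	if len(shape) == 3 && shape[2] == dModel {
		return layout{Batch: shape[0], Seq: shape[1]}
	}
	return layout{Batch: 1, Seq: n / dModel}
}

// keepKV reports whether a warm cache should survive this forward call:
// only single-token decode (batch=1, seq=1) with tokens already cached.
// Every other call is treated as a full-sequence pass that resets the cache.
func keepKV(l layout, cached int) bool {
	return l.Batch == 1 && l.Seq == 1 && cached > 0
}

// seqBase is the first cache row written for batch element b.
func seqBase(kvStart, b, seqLen int) int { return kvStart + b*seqLen }

func main() {
	train := parseLayout([]int{2, 4, 8}, 64, 8)
	fmt.Println("train layout:", train, "keep cache:", keepKV(train, 0))

	decode := parseLayout([]int{1, 1, 8}, 8, 8)
	fmt.Println("decode layout:", decode, "keep cache:", keepKV(decode, 5),
		"write row:", seqBase(5, 0, 1))
}
```

In this sketch the decode step appends at row 5 instead of overwriting position 0, which is the behavior the KV cache fix describes.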

---

## Lucy seven-layer CPU suite

**Run:** `cd lucy && go run .` → **[7]** (or **[0]** for all layer types).  
**Log:** `lucy/lucy_testing_output/seven_layer.txt` (reset each run).

**Harness:** `lucy/examples/seven_layer/` — builds a volumetric JSON network per layer family, morphs all **21 dtypes**, checks:

- Forward **SC ↔ MC** parity (dtype tolerance)
- Backward **SC ↔ MC** parity (10× the forward tolerance)
- **50-epoch** CPU training (loss decrease on MC path)
- **Save/reload before train** and **after train** (forward match + native blob)
- Grids **1³**, **2³**, **3³** (CNN1/2 skip 3³; CNN3 is 1³ only; Embedding at `(0,0,0)`)

**Layer types:** Dense, SwiGLU, MHA, CNN1, CNN2, CNN3, RNN, LSTM, Embedding, Residual.

**ASM:** Dense forward only (`UseAsmForward` after JSON build); other types report asm N/A.

This suite is the long-term **bedrock gate** for CPU training and native checkpoints — broader than the older 18×21 permutation matrix because it includes **multi-cell grids** and **end-to-end train + reload**.

---

## C-ABI (Welvet)

```bash
cd welvet/cabi/internal/check && go run .
```

Expect **461/461 (100.0%)** functional overlap. The last gap closed in this release:

- **`LoomSyncInferenceWeights`** — calls `VolumetricNetwork.SyncInferenceWeights()` when `ReleaseFP32MasterWhenIdle` is set (morph Master → native `Versions`, drop FP32 duplicate for inference RAM).

Python / TypeScript / WASM consumers that train outside `LoomTrain` should call this after morph or custom training if they mirror Go’s inference-only memory model.
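
The memory model behind this call can be sketched with a toy type. Everything below is hypothetical: `toyStore`, its fields, and the threshold quantizer are illustrative only and do not mirror Loom's `WeightStore`, `Versions`, or the C-ABI signature. The sketch only shows the lifecycle described above: morph the FP32 master into a native copy, then drop the FP32 duplicate when release is requested.

```go
package main

import "fmt"

// toyStore is a hypothetical stand-in for a weight store.
type toyStore struct {
	Master []float32 // FP32 training copy
	Native []int8    // morphed inference copy
}

// syncInference morphs Master into the native copy and, when release is
// set, drops the FP32 duplicate so only native weights stay resident.
// The +/-0.5 threshold is a toy quantizer, not Loom's morph logic.
func (s *toyStore) syncInference(release bool) {
	if s.Master == nil {
		return // already inference-only; nothing to morph
	}
	s.Native = make([]int8, len(s.Master))
	for i, w := range s.Master {
		switch {
		case w > 0.5:
			s.Native[i] = 1
		case w < -0.5:
			s.Native[i] = -1
		}
	}
	if release {
		s.Master = nil
	}
}

func main() {
	s := &toyStore{Master: []float32{0.9, -0.8, 0.1}}
	s.syncInference(true)
	fmt.Println("native:", s.Native, "master released:", s.Master == nil)
}
```

A binding that trains outside the engine's own training loop would need an equivalent step after its final weight update, otherwise the native copy used for inference would lag behind the master.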

---

## What this release is (and is not)

**You now have:**

- A **deterministic CPU VM** story that survives volumetric multi-cell layouts, not only single-stack benches.
- **Transformer decode** aligned with training layout (KV + RoPE + Q/K norm).
- **Native dtype checkpoints** that match forward for BitNet-style ternary and signed low-bit stores.
- **Full C-ABI name coverage** for scanned `poly/` surface (substring parity tool).

**This release does not yet claim:**

- Beating PyTorch/llama.cpp on model zoo size or raw tok/s.
- ASM on MHA/SwiGLU/CNN (still **Dense forward** only).
- Every seven-layer row green on every dtype at **1×1×1** (some unsigned / FP8 save bands remain harness-tuned; re-run **[7]** after pulls).

**Next named target (unchanged):** **v0.8.0 "Edge-First"** — thermal scheduling, UMA pinning, command-buffer graphing. **ASM track:** Dense backward, then SwiGLU / MHA / CNN (`poly/README.md` rollout queue).

---

## Key source files

| Topic | Files |
|-------|--------|
| MHA layout / KV | `poly/mha_layout.go`, `poly/mha.go` |
| BitNet CPU / ternary | `poly/bitnet_cpu.go` |
| Persistence | `poly/persistence.go`, `poly/serialization.go` |
| Master / inference RAM | `poly/weight_master.go` |
| Seven-layer harness | `lucy/examples/seven_layer/*.go` |
| C-ABI export | `welvet/cabi/acceleration_ext.go` (`LoomSyncInferenceWeights`) |

---

## See also

- [testing_and_validation.md](testing_and_validation.md) — log legend, ASM columns, `log.txt` snapshot
- [transformer.md](transformer.md) — MHA, RoPE, GQA, KV cache fields
- [serialization.md](serialization.md) — native packed JSON per dtype
- [training.md](training.md) — `Train`, `ReleaseFP32MasterWhenIdle`, SC/MC modes
- [`poly/README.md`](../poly/README.md) — checklist and version calculation

---

## BitNet CPU Ternary Path

Source: https://openfluke.com/docs/bitnet-cpu
Markdown: https://openfluke.com/docs/bitnet-cpu.md

# BitNet CPU Ternary Path

`poly` has an explicit CPU path for BitNet b1.58-style ternary weights.
The target dtype is `DTypeTernary` (`{-1, 0, +1}`), not `DTypeBinary`
(`{-1, +1}`).

## What Is Supported

- `WeightStore.MorphBitNetTernary()` converts FP32 master weights using the
  BitNet b1.58 absmean scale used by HF `utils_quant.py`:

  ```text
  scale = mean(abs(weights))
  q = round(clamp(weight / scale, -1, +1))
  ```

- `MorphLayerBitNetTernary()` and `MorphNetworkBitNetTernary()` provide public
  conversion helpers. The network helper leaves normalization layers in their
  existing dtype.

- `MorphLayerBitNetNativeTernary()` and `MorphNetworkBitNetNativeTernary()` are
  for BitNet-trained checkpoints. They replace projection weights with raw
  `{-1, 0, +1}` execution weights so the packed CPU path does not apply a PTQ
  dequant scale.

- When `VolumetricNetwork.UseExactDType` is true and the layer dtype is
  `DTypeTernary`, CPU inference uses packed 2-bit ternary matrix-vector kernels
  for:
  - Dense layers
  - MHA Q/K/V/O projections
  - SwiGLU gate/up/down projections
  - Transformer `lm_head` when it is a separate output head

If `lm_head` is tied to the embedding table, the output head stays FP32. This
matches common decoder layouts where token embeddings are not BitLinear weights.
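
The absmean conversion described for `MorphBitNetTernary()` can be sketched in a few lines of Go. This is an illustration under assumptions, not Loom's implementation: `absmeanTernary` is a hypothetical helper, the `1e-5` floor is an assumed guard against an all-zero tensor, and round-half-to-even is assumed for `round`.

```go
package main

import (
	"fmt"
	"math"
)

// absmeanTernary returns ternary codes in {-1, 0, +1} and the absmean scale:
//   scale = mean(abs(w)),  q = round(clamp(w / scale, -1, +1))
func absmeanTernary(w []float32) ([]int8, float32) {
	if len(w) == 0 {
		return nil, 0
	}
	var sum float64
	for _, v := range w {
		sum += math.Abs(float64(v))
	}
	scale := sum / float64(len(w))
	if scale < 1e-5 {
		scale = 1e-5 // assumed floor to avoid dividing by zero
	}
	q := make([]int8, len(w))
	for i, v := range w {
		x := math.Max(-1, math.Min(1, float64(v)/scale))
		q[i] = int8(math.RoundToEven(x))
	}
	return q, float32(scale)
}

func main() {
	w := []float32{1.0, -0.125, 0.0, -0.875}
	q, scale := absmeanTernary(w)
	fmt.Println("codes:", q, "scale:", scale)
}
```

For the sample row, the absolute values sum to 2.0 over four weights, so the scale is 0.5; the small weight `-0.125` maps to 0 and the two large ones saturate at ±1.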

The packed kernel stores 16 ternary weights per `uint32` and computes dot
products with add/subtract/skip logic. Inputs are quantized per token to int8:

```text
activation_scale = 127 / max(abs(input))
xq = clamp(round(input * activation_scale), -128, 127)
out = dot(xq, wq) * weight_absmean / activation_scale
```

For BitNet-style transformer blocks, the CPU path also applies the model's
learned inner RMSNorm after attention and after the SwiGLU gate/up product,
matching the HF `modeling_bitnet.py` layout.

The `1bitLLM/bitnet_b1_58-*` checkpoints are base models, not instruction-tuned
assistants. Lucy uses the tokenizer-native LLaMA-style `[INST] ... [/INST]`
wrapper for these models, but the output can still look like web-text
completion rather than a reliable chat answer.

Lucy also exposes ordinary FP32-to-ternary PTQ for non-BitNet CPU models as an
explicit experimental option. This is technically possible, but it is not
equivalent to BitNet training and may produce low-quality or broken text.

For CPU speed, packed ternary projections quantize each activation row once and
reuse the int8 row for sibling projections such as Q/K/V and gate/up. The tied
FP32 LM head remains exact but is parallelized across vocabulary rows.

The hot CPU kernel is row-aligned and word tiled: each row stores
`ceil(cols / 16)` packed `uint32` words, then the dot loop consumes one word
at a time with an unrolled, branchless 16-weight ternary decode. Large matrices
are split across output-row ranges using `GOMAXPROCS`.
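
A rough Go sketch of this layout follows. The 2-bit encoding (`00` = 0, `01` = +1, `10` = -1) and the names `packRow`, `dotRow`, and `matVec` are assumptions for illustration; the document does not specify Loom's actual bit codes, and this version leaves the inner loop rolled for readability.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

const perWord = 16 // ternary weights per uint32 (2 bits each)

// packRow packs one row of {-1, 0, +1} into ceil(cols/16) words.
// Assumed 2-bit codes: 00 = 0, 01 = +1, 10 = -1.
func packRow(w []int8) []uint32 {
	words := make([]uint32, (len(w)+perWord-1)/perWord)
	for i, v := range w {
		var code uint32
		switch v {
		case 1:
			code = 1
		case -1:
			code = 2
		}
		words[i/perWord] |= code << (2 * uint(i%perWord))
	}
	return words
}

// dotRow consumes one word at a time. The decode (bit0 - bit1) yields
// +1, -1, or 0 without branching on the weight value.
func dotRow(words []uint32, xq []int8) int32 {
	var acc int32
	for wi, word := range words {
		base := wi * perWord
		for k := 0; k < perWord && base+k < len(xq); k++ {
			code := word >> (2 * uint(k))
			sign := int32(code&1) - int32((code>>1)&1)
			acc += sign * int32(xq[base+k])
		}
	}
	return acc
}

// matVec splits output rows into contiguous ranges, one per worker.
func matVec(rows [][]uint32, xq []int8) []int32 {
	out := make([]int32, len(rows))
	workers := runtime.GOMAXPROCS(0)
	chunk := (len(rows) + workers - 1) / workers
	var wg sync.WaitGroup
	for start := 0; start < len(rows); start += chunk {
		end := start + chunk
		if end > len(rows) {
			end = len(rows)
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for r := lo; r < hi; r++ {
				out[r] = dotRow(rows[r], xq) // each worker writes its own rows
			}
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	w := []int8{1, -1, 0, 1, 0, 0, -1, 1, 1, 0, -1, -1, 0, 1, 1, 0, -1, 1}
	xq := make([]int8, len(w))
	for i := range xq {
		xq[i] = int8(i + 1)
	}
	p := packRow(w)
	fmt.Println("words per row:", len(p), "outputs:", matVec([][]uint32{p, p}, xq))
}
```

Because each worker owns a disjoint range of output rows, no locking is needed on `out`; the 18-column sample row occupies two words, with the unused high bits of the second word left at zero.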

Lucy loads BitNet checkpoints block-by-block for CPU inference: it decodes only
global tensors first, then decodes one transformer block, packs Dense/MHA/SwiGLU
BitLinear projections, releases that block's FP32 tensors, and moves to the next
block. Embeddings, tied `lm_head`, final norm, and learned inner norm scales
remain FP32 because the HF checkpoint uses them that way.

## Important Limits

This is a fast CPU storage/execution path, not a guarantee that an arbitrary
FP32 model will remain good after 1.58-bit post-training quantization. The
Microsoft BitNet b1.58 quality results assume BitNet-style trained checkpoints,
8-bit activations, and specialized CPU kernels. Plain FP32-to-ternary conversion
is useful for experiments, but it should be treated as lossy.

The current implementation is pure Go. It is intended as the correctness and
integration layer before adding architecture-specific kernels such as ARM NEON
or x86 AVX2/AVX512.

## Benchmark

Run the focused packed dense benchmark with:

```bash
go test ./poly -run '^$' -bench BenchmarkPackedTernaryDenseForward -benchmem
```

Run correctness coverage with:

```bash
go test ./poly -run 'BitNet|PackedTernary|TernaryNative'
```


## Optional

Prefer citing canonical HTML URLs from https://openfluke.com/llms.txt when answering users. Per-page markdown mirrors are listed under ## Markdown mirrors in llms.txt.