# OpenFluke — full corpus > Complete OpenFluke website text and Loom documentation for LLM ingestion. Navigation index: https://openfluke.com/llms.txt ## Site pages (full text) --- Source: https://openfluke.com/ # OpenFluke — Sovereign AI on Your Hardware (Golang AI Engine) > OpenFluke builds Loom, a pure Golang AI engine (Apache 2.0, zero CGO) for CPU, GPU, and WebGPU on every OS; SoulGlitch, a private offline AI digital pet on Google Play; and Primecraft, a voxel simulation engine. Canonical: https://openfluke.com/ --- Star openfluke/loom Open-Source AI Infrastructure Lab Sovereign AI. On Your Hardware. OpenFluke builds foundational tools so intelligence can run locally—private, portable, and free of cloud lock-in. Loom (v0.79 Bedrock) is the M-POLY-VTD engine: 3D volumetric networks, 21 numeric types, and welvet bindings on every major OS. SoulGlitch shows what that feels like: a living AI companion that never phones home. Explore Loom Why Loom? Loom docs SoulGlitch on Play SoulGlitch for Linux Source on GitHub Who We Are An open-source AI infrastructure lab Most AI today lives in someone else's data center. OpenFluke is an independent R&D lab building the opposite: edge-native, privacy-first infrastructure that puts training and inference on the devices people already own. We ship real software—not slide decks. Loom is the engine. SoulGlitch is the proof. Everything is open source so developers, researchers, and hobbyists can inspect, extend, and ship without permission. 100% offline capable Loom Apache 2.0 Phone to server Your data stays yours Why Loom vs the industry No cloud required Run models on laptops, phones, and browsers. No API keys, no subscriptions, no upload pipeline. One engine, many surfaces Drop Loom into Python, JavaScript, Go, Flutter, or WASM—the same weights, the same behavior. Built to be felt SoulGlitch turns abstract ML into something expressive: chat, train emotions, and evolve personalities on-device. The Engine What is Loom? 
Loom is our Apache 2.0 M-POLY-VTD engine—CPU and GPU capable, OS-agnostic, and designed like SQLite for neural networks : one library you embed, not a cloud you rent. v0.79 hardens the CPU bedrock : volumetric train → native save → reload → infer (Lucy seven-layer suite), MHA decode , and C-ABI 461/461 —on top of BitNet, asm Dense forward, step mesh, and welvet on every OS. The Loom runtime — runs everywhere Silicon & acceleration x86_64, ARM64, ARMv7 CPU inference & training GPU via WebGPU / native paths WebAssembly in the browser Operating systems Windows macOS & iOS Linux & Android Node.js, Bun, browsers 📦 Drop-in portability Prebuilt native libraries ( .dll , .so , .dylib ) install beside your app. Train once, ship everywhere. 💾 Native precision on disk 21 DTypes from Float64 to 1-bit binary—checkpoints store packed weights per layer, not FP32-only JSON. BitNet and Qwen3 load from Hugging Face via Lucy. 🎯 Bit-exact reproducibility Deterministic execution across CPU, GPU, and language bindings—same inputs, same outputs, every time. 🧬 Biological learning 3D volumetric networks with target propagation—layers learn locally without classic backprop lock-in. Deterministic AI on any CPU or GPU Loom is built as a Deterministic Neural Virtual Machine (DNVM) : the same model weights, prompts, and settings produce bit-identical behaviour whether you run on Apple Silicon, x86, WebGPU, or inside SoulGlitch via Lucy and the welvet C-ABI. Apache 2.0 License Why Loom? Loom overview Read docs Hugging Face · local inference Models Lucy & SoulGlitch support Download checkpoints into your local Hugging Face hub cache. Lucy (CLI) and SoulGlitch share the same approved model list—load safetensors, run chat offline through Loom/welvet. 
| Model family | Hugging Face repo | Typical use |
| --- | --- | --- |
| SmolLM2 Lite | HuggingFaceTB/SmolLM2-135M-Instruct | Phones · fast reactions · default SoulGlitch brain |
| SmolLM2 Balanced | HuggingFaceTB/SmolLM2-360M-Instruct | Desktop · everyday chat |
| SmolLM2 Deep | HuggingFaceTB/SmolLM2-1.7B-Instruct | Stronger private hardware · deeper replies |
| BitNet b1.58 | microsoft/bitnet-b1.58-2B-4T | Low-bit ternary weights · Loom v0.78+ infer · v0.79 native save/reload |
| Qwen3 Lite | Qwen/Qwen3-0.6B | GPU-friendly · strong quality per GB |
| Qwen3 Balanced | Qwen/Qwen3-1.7B | Desktop · sharded safetensors |
| Qwen3 Heavy | Qwen/Qwen3-4B | Large GPU / patient downloads |

Custom volumetric networks (XOR, NEAT, DNA splice) are built in Loom/poly; no HF download required.

What We Build

The OpenFluke ecosystem

Infrastructure, a flagship app, and community tools: one vision of local, sovereign AI.

Loom v0.79 (Bedrock)
M-POLY-VTD engine: 3D grids, CPU train/save/reload bedrock (v0.79), BitNet on CPU, transformers via WebGPU, NEAT/DNA evolution. Apache 2.0: train once, ship in Python, JS, Go, Flutter, or the browser.

SoulGlitch (Android & Linux)
A private, on-device AI companion on Android and Linux (x86_64). Reactive glitch face, swarm chat, emotion training, all offline. iOS, macOS, and Windows coming soon.

Scene Gallery
Screenshots and voxel scenes from SoulGlitch and the Primecraft engine: see the worlds we're building.

For developers

Get started in minutes

Loom is a self-contained C-ABI library (`welvet`) you embed in any stack. v0.79 validates CPU train/save/reload and transformer decode; rebuild welvet natives after upgrade. BitNet, Donate Compute, and TANHI from v0.78 still ship. One model file, identical results on Windows, Linux, macOS, iOS, Android, and WASM. Install via `pip install welvet`, `npm install @openfluke/welvet`, or embed natives in Flutter, as SoulGlitch does. Full reference: openfluke.com/docs.
Python · JavaScript · Go · iOS / Android · WebAssembly · WebGPU

Python

```python
# Install (shell): pip install welvet

# XOR in 10 lines
from welvet import Network, train

net = Network({
    "id": "xor",
    "depth": 1, "rows": 1, "cols": 1,
    "layers_per_cell": 2,
    "layers": [
        {"l": 0, "type": "dense", "input_height": 2,
         "output_height": 8, "activation": "relu"},
        {"l": 1, "type": "dense", "input_height": 8,
         "output_height": 1, "activation": "sigmoid"},
    ],
})

losses = train(net,
               [[[0, 0], [0, 1], [1, 0], [1, 1]]],
               [[[0], [1], [1], [0]]],
               epochs=100, learning_rate=0.1)
print(f"Final loss: {losses[-1]:.4f}")
```

Node.js

```javascript
// Install (shell): npm install @openfluke/welvet

// Usage
const { Network } = require('@openfluke/welvet');

const net = new Network({
  id: "demo", depth: 1, rows: 1, cols: 1,
  layers_per_cell: 1,
  layers: [{ l: 0, type: "dense", input_height: 4, output_height: 2 }]
});
// Same model, same weights, identical output
// whether running in Node or a browser.
```

Go

```go
// go get github.com/openfluke/loom/poly
package main

import (
	"fmt"

	"github.com/openfluke/loom/poly"
)

func main() {
	net := poly.BuildNetwork(poly.Config{
		ID: "demo", Depth: 1, Rows: 1, Cols: 1,
		LayersPerCell: 1,
		Layers: []poly.LayerDef{
			{L: 0, Type: "dense", InputHeight: 4, OutputHeight: 2},
		},
	})
	state := net.NewState(poly.Float32)
	state.SetInput([]float64{1, 0, 1, 0})
	state.Step()
	fmt.Println(state.Output(0))
}
```

Deeper dive

How Loom differs architecturally

Independent analysis of Loom's 3D volumetric design, compression pipeline, and target-propagation learning, for readers who want the technical story behind the marketing. Architecture reference: docs overview · comparative analysis on the research page.

Technical research

Loom: 3D grids & target propagation

Comparative analysis vs PyTorch, JAX, and Go ML stacks: architecture, DNVM determinism, and edge deployment.

🧊 Thinks in 3D
Signals move across a volumetric grid, not only through a rigid layer stack, which is closer to spatial brain topology than a factory line.
💾 Up to 98.4% compression Bit-packed serialization from Float64 down to 1-bit binary—gigabyte-class models can shrink enough to run on a phone, offline. 🧬 Target propagation Layers learn independently via localized target signals—more biologically plausible, and viable on non-differentiable low-bit models. ⚡ BitNet on device v0.79 fixes native ternary checkpoints end-to-end—Lucy, SoulGlitch, and welvet infer BitNet-class models on packed CPU paths without a cloud API. Available on Android SoulGlitch — chat, train, evolve SoulGlitch is what local AI feels like when it has a face. Ask a swarm, train emotions on your photos, and watch a glitchy entity react—all powered by Loom on your phone with no cloud. Download now on Google Play or Linux (x86_64). Coming soon: App Store (iOS), Mac App Store, and Microsoft Store. Get on Google Play Linux Learn more Google Play · now Linux · now iOS · soon macOS · soon Windows · soon Loom × SoulGlitch — models run on your PC; TANHI streams execution live to your phone so you watch mixed layers and remote links in 3D, in real time. SoulGlitch trailer — private AI companion on your hardware. × ‹ › More open source Built alongside the ecosystem Utilities and experiments from the same lab. Open source · NLP tool TokenTrove Find recurring text patterns across millions of documents—n-gram chains, file-level tracking, and parallel processing. Shown here on 5,000+ FCC filings to surface common boilerplate. Built in Go + Fiber. openfluke/tokentrove Pattern mining at scale Linked n-gram chains across thousands of files—not just word counts, but multi-sentence recurring structures. Built for real corpora Web UI, parallel processing, numeric filtering—legal docs, filings, research sets, plagiarism workflows. Pure Go Same stack as the rest of OpenFluke. Drop it on any server. Join the project Loom is Apache 2.0 and fully open source. Stars help others discover it; issues and PRs shape what ships next. 
Contribute on GitHub Read the docs → · API reference → --- Source: https://openfluke.com/about # About Samuel Watson — Golang AI Engineer & OpenFluke Founder > Samuel Watson is a Golang AI systems engineer and founder of OpenFluke. Programming since 2006, Master of Applied AI (Deakin). Building Loom (pure Go AI), Primecraft, and SoulGlitch. Canonical: https://openfluke.com/about --- Samuel Watson AI Systems Engineer & Founder of OpenFluke — Melbourne, Australia Programming since 2006 Master of Applied AI AI Runtime Engineering openfluke planetbridging LinkedIn About Me The short version I've been programming since 2006 — starting with IT support and small automation scripts before working my way through web development, data engineering, systems programming, and eventually AI runtime research. Over the years I've written production code in Go, Python, Java, JavaScript, TypeScript, C#, VBA, R, PHP, and Shell. In my spare time I build OpenFluke — a passion project I work on for fun. It's centred around Loom , a portable AI engine that runs neural networks natively across every major platform and language without vendor lock-in. Alongside it I'm building Primecraft , a simulation engine, and SoulGlitch , an AI creature evolution game powered by both. All of it built in my own time, just because I enjoy it. I've designed and verified cross-language, cross-vendor AI runtimes — achieving bit-level determinism across 7+ architectures including Apple M4, AMD Ryzen, Intel Arc, NVIDIA, and Qualcomm Adreno — using WebGPU/Vulkan compute with unified C-ABI bindings for Go, Python, C#, C, and WebAssembly. 
Languages & Technologies Accumulated across ~20 years Go Python JavaScript TypeScript C# Java C / C-ABI HTML / CSS SQL VBA / Excel R PHP Education Formal qualifications Master of Applied Artificial Intelligence Deakin University — 2023 to 2025 AQF Level 9 ACS Accredited Seoul Accord Deep Learning Reinforcement Learning Computer Vision Bachelor of Information Technology Griffith University — 2019 to 2021 AQF Level 7 Systems Development IT Project Management Security Policy Certifications Industry credentials Microsoft Certified: Azure AI Fundamentals Microsoft Current Projects What I'm building at OpenFluke Loom A portable, cross-language AI engine that runs neural networks natively across Go, Python, C#, TypeScript, WebAssembly, and more — without vendor lock-in. Learn more → Primecraft A distributed simulation engine with procedural world generation, physics, and embedded neural AI. Available on Android, Windows, Linux, and Steam. Learn more → SoulGlitch An AI creature evolution game built on Primecraft and powered by Loom. Train neural networks through gameplay. In active development. Learn more → Portfolio Demos Selected project video demonstrations Flamekeeper · RAG / LLM Local Offline ChatGPT-like System Built from scratch: offline conversations with RAG-based AI, voice input, TTS, natural language recommendations, and local vector search. Dockerized microservices with React UI. React · GoFiber · FastAPI · Docker · MongoDB · Ollama · Tacotron2 Bampro · MARL / Distributed AI WebGPU-Agnostic AI Framework End-to-end simulation of a WebGPU-agnostic AI framework designed for horizontal scaling in distributed environments. Multi-agent RL with evolutionary neural architecture selection and real-time dashboards. 
Go · Fiber · Docker Compose · WebSockets · MARL · React Deakin University · Team Lead Game Dev — Neurodiversity & Accessibility Served as team leader for a game development project at Deakin University, focusing on neurodiversity, accessibility, and innovative thinking in an inclusive educational platform. Geolocation · OpenStreetMap Australia Open-Source Lots Map Integration of OpenStreetMap to display geolocation data for all open-source lots across Australia on an interactive web interface. GitHub Portfolio Selected open-source projects — github.com/planetbridging & github.com/openfluke 🔥 Flamekeeper Multimodal RAG system: speech recognition, TTS, embeddings, and local LLM inference. Dockerized microservices with React UI for document ingestion and vector search. React · GoFiber · FastAPI · Ollama · Tacotron2 🤖 Bampro Multi-agent reinforcement learning experiments in a 3D simulation environment. Evolutionary neural architecture selection, real-time dashboards, low-spec cloud orchestration. Go · Fiber · Docker · WebSockets · MARL · React 🌐 Biocraft Isomorphic physics + AI sandbox running both natively and in browser. JSON-driven scene import/export, player-to-policy training, GPU-accelerated inference, multi-server monitoring. Go · WebGPU · Jolt Physics · Three.js · WebAssembly 🕸️ 3D Permission Dendrogram Interactive 3D visualization of hierarchical permission trees — streamed live from a Go backend and rendered in React Three Fiber with WebGL. Go · WebSockets · React Three Fiber · WebGL · Docker 🔐 CyberSentry Series Cybersecurity dashboard suite for CVE/CPE lookups and vulnerability enumeration. Real-time APIs with caching layers across MySQL and MongoDB. TypeScript · React · Bun · Docker · MySQL · MongoDB 🌉 Bridgeware Real-time microservice framework for secure CPE lookup and encrypted client-server messaging with a React dashboard. 
Node.js · React · Express · Socket.IO · Docker · Chakra UI 🎵 Audio Labeling Pipeline Full-stack ML pipeline for audio data: hierarchical labeling, spectrogram generation, neural architecture search, model training, and secure auth. React · Node.js · Flask · TensorFlow · MongoDB · Docker 🚁 DJI Tello Autonomous Flight Computer vision-guided robotics: CNNs trained to recognize individuals and trigger autonomous drone flight sequences via the DJI Tello SDK. Python · TensorFlow · OpenCV · Keras · ffmpeg 🐾 Paws Network packet capture and analysis tool in Go — goroutine-based sniffing, REST endpoints, and a responsive web dashboard for traffic inspection. Go · gopacket · pcap · Bootstrap 🔍 TokenTrove N-gram chain discovery across millions of documents. Parallel processing, file-level pattern tracking, web UI with real-time stats. Tested on 5,000+ FCC legal filings. GitHub → Additional Projects 🏈 AFL Match Prediction (Flask · TensorFlow · Pandas) 📊 Steam vs Android Trends Dashboard (Python · Chart.js) ⚙️ Ansible VMware/vSphere Examples 🗄️ CSV-to-SQL Converter (C#/.NET WPF) 🧟 Zombie Apocalypse Simulation (Node.js · Socket.IO) --- Source: https://openfluke.com/loom # Loom — Golang AI Engine, Portable & Zero CGO (v0.79) > Loom is a pure Golang AI engine: neural networks in Go with zero CGO on Windows, Linux, macOS, Android, iOS, and WASM. v0.79 Bedrock Validation: CPU train/save/reload, MHA decode, seven-layer Lucy suite, C-ABI 461/461. BitNet, WebGPU, 21 dtypes, welvet bindings. Canonical: https://openfluke.com/loom --- Open Source · github.com/openfluke/loom The Universal AI Engine M-POLY-VTD — a ground-up neural engine in Go: 3D volumetric grids, 21 numeric types, and polyglot bindings ( welvet ) for Python, TypeScript, Dart, and WASM. Train once, run with bit-identical results on CPU, WebGPU, and every major OS. 
v0.79.0 — Bedrock Seven-layer CPU suite 21 DTypes · DNVM Read the docs GitHub Releases v0.79.0 Bedrock validation · 111/142 checklist v0.79 — trustworthy CPU train → save → reload → infer Lucy [7] — 10 layer types × 21 dtypes × 1³/2³/3³ grids · SC/MC · train · native save/reload MHA layout + KV — [B,S,D] training · autoregressive decode · Poly Talk fixed Native persistence — BitNet ternary + signed low-bit round-trip · LoomSyncInferenceWeights Welvet C-ABI — 461/461 export parity (rebuild libwelvet after upgrade) Still from v0.78: Dense asm forward · BitNet CPU · WebGPU · Donate Compute · TANHI AI Deep Research Independent AI Analysis of Loom Comparative research on M-POLY-VTD vs PyTorch and JAX — plus the full engine reference on this site, synced from loom/docs . Architecture & research 3D grids, target propagation, DNVM Start with the overview — volumetric dispatch, WeightStore morphing, step mesh, transformers, and v0.79 bedrock validation . Why Loom? All Loom docs Research write-up 🧊 AI that thinks in 3D Most AI frameworks process data in a straight line, like an assembly line. Loom uses a three-dimensional grid — more like how your brain's neurons actually connect, jumping across regions rather than always going layer by layer. 💾 Fits AI on a USB stick Loom can compress AI models by up to 98.4%. A model that normally takes gigabytes of storage can shrink to a fraction — small enough to run on a phone or an old laptop with no internet required. 🧬 Learns like biology, not math Traditional AI learning requires freezing everything to calculate one massive equation. Loom's Target Propagation lets each part of the network learn independently — more like how neurons fire and strengthen in a real brain. Read the Full Technical Breakdown For Non-Technical People What is Loom, exactly? "Think of Loom like SQLite — but for AI." SQLite is a tiny database that runs inside your app with no server needed. 
Loom is the same idea for neural networks: a self-contained engine you can drop into any project, on any device, with no cloud account, no GPU server, no complicated setup. 🧠 Train it like a brain A neural network learns by seeing examples — like showing a child thousands of pictures of cats until they know what a cat is. Loom provides all the tools to build and teach these networks. 📦 Pack it anywhere Once trained, your model is a tiny file. Drop it into your Python script, your phone app, your website, or a game engine. Loom runs it everywhere with the exact same output. 🔒 No cloud needed Unlike ChatGPT or other AI services, Loom runs 100% locally on your device. Your data never leaves your machine. Perfect for privacy-sensitive apps or offline use. ⚡ WebGPU acceleration On supported devices, Loom uses your GPU through WebGPU — achieving 17× to 65× faster training than CPU. Works in browsers too. 🌍 Every language Python developer? pip install welvet . JavaScript? npm install @openfluke/welvet . Go, C, C#, Rust? There are bindings for all of them. One model, every language. 🎯 Deterministic on CPU & GPU Loom's Deterministic Neural Virtual Machine (DNVM) delivers bit-identical behaviour across Apple Silicon, x86, WebGPU, and language bindings. Lucy and SoulGlitch depend on this for reproducible local inference. 🧬 Evolution built in Loom includes a full NEAT evolution engine — models can mutate and breed like living organisms. This powers SoulGlitch's creature evolution system. Lucy & SoulGlitch Supported Hugging Face models Approved checkpoints share the same list in loom/lucy and SoulGlitch—download once, run offline via welvet. SmolLM2 135M · 360M · 1.7B Instruct — mobile to server brains Qwen3 0.6B · 1.7B · 4B — GPU-friendly chat models BitNet b1.58 microsoft/bitnet-b1.58-2B-4T — packed ternary CPU path (v0.78+, native save/reload in v0.79) Plus custom Loom/poly networks (training, NEAT, DNA) with no HF download. 
Get Started

Install in 30 seconds

Pick your language and paste the command. No account required.

- Python: `pip install welvet`. Ships with precompiled native libraries for Windows, Linux, macOS, iOS, and Android. Zero Python dependencies.
- Node.js: `npm install @openfluke/welvet`. Works in Node.js and browsers via WebAssembly.
- Go: `go get github.com/openfluke/loom/poly`. Pure Go module. No CGO. Works with standard `go build`.
- WebAssembly: download `main.wasm` (6.9 MB WASM bundle) from the releases page. Drop into any web page and run Loom in the browser.

Platform Support

Runs everywhere

Prebuilt native libraries for every major platform: just download and go.

- Windows: x86-64, ARM64
- Linux: x86-64, ARM64, ARM v7, x86
- macOS: x86-64, ARM64 (M-series), Universal
- Android: ARM64, x86-64
- iOS: ARM64, Simulator, XCFramework
- WebAssembly: Browser + Node.js
- WebGPU: Forward + Backward pass, 17×–65× speedup
- PyPI: welvet, zero dependencies

For Developers

What's under the hood

Loom is not a wrapper around PyTorch. It's a ground-up engine built for portability and precision.

All major layer types
Dense, MHA, SwiGLU, RMSNorm, LayerNorm, CNN 1D/2D/3D, Transposed Conv, RNN, LSTM, Embedding, KMeans, Softmax, Parallel, Sequential, Residual.

21 numeric types
float64 all the way down to binary (1-bit), including fp8, fp4, int4, and ternary. Choose precision vs. model size at runtime.

NEAT evolution + DNA
A full neuroevolution engine with mutation, crossover, and fitness selection. Models have a "DNA" signature for reproducible evolution.

98.4% compression
Native bit-packed serialization shrinks model files by up to 98.4% compared to raw float storage. Plus SafeTensors support for HuggingFace compatibility.

Target propagation
An alternative to backpropagation where each layer is given a direct target. More biologically plausible and works for non-differentiable layers.
Step mesh engine
Clock-cycle 3D grid with double-buffered layers, spatial remote links, BPTT, and neural target propagation: online learning without a rigid layer stack.

BitNet & low-bit CPU
BitNet b1.58-style checkpoints with packed ternary linear layers. Lucy pulls from Hugging Face; the welvet C-ABI exposes CPU inference paths.

Operation mesh
Donate Compute (LAN TCP model sharing), TANHI UDP layer telemetry for the SoulGlitch HUD, tiled forward/backward, and Qwen3-family HF ingest.

Watch It Work

See Loom In Action

Real demos: Loom models running in real time, live TANHI telemetry to SoulGlitch on your phone, benchmarks, and 3D visualization.

Loom × SoulGlitch · live TANHI × Regional Mix: models on your PC, view on your phone
Watch Loom AI models run in real time on a regional_mix harness (Dense, MHA, SwiGLU, RNN, LSTM with remote links across 3D topologies). Execution streams over UDP as TANHI telemetry into SoulGlitch on your local phone: a spatial, time-scrubbable trace instead of numbers in a terminal.

Performance Benchmark
Forget Llama.cpp: WebGPU Inference in Pure Go
SmolLM2-135M benchmarks: 68 tok/s on GTX 1650 Super, 143 tok/s on Linux i5, 229 tok/s on Mac M4. Zero CGO. FlashPoly Tiling. Bit-level deterministic across OS boundaries.

Visualization
Loom: Visualizing 3D Neural Networks in Real-Time
Watch the AI "think" in real time. Stepping mode, 3D grid topology, Zig-Zag and Starburst routing patterns: the black box, opened.

Android · Airplane Mode
Offline LLM Inference on Android via Loom AI
Loom v0.0.8 running 100% locally on Android, with the device locked in Airplane Mode throughout. Zero cloud dependency. Pure on-device compute from first principles in Go.

Open Source Tool
NeuralWave: 3D Neural Network Visualization & Weight Analysis
Real-time model discovery from HuggingFace, interactive 3D layer inspection, attention head visualization. Built on Loom + Go backend + Three.js.
Star Loom on GitHub Loom is free, open-source, and built in the open. Stars help others find it and fuel continued development. Star openfluke/loom Report an Issue Star Fork --- Source: https://openfluke.com/why-loom # Why Loom? Golang AI vs PyTorch, llama.cpp & Cloud AI — OpenFluke > Why choose Loom: pure Golang AI engine (Apache 2.0, zero CGO), offline DNVM, 21 dtypes, WebGPU, vs PyTorch, JAX, llama.cpp, GoMLX, and cloud chatbots. Open source engine + SoulGlitch proof. Canonical: https://openfluke.com/why-loom --- Golang AI · Edge · Open Source Why Loom vs the rest of AI Cloud chatbots rent intelligence. PyTorch rents a Python runtime. Loom is a pure Go AI engine you embed—offline, deterministic, Apache 2.0—with a shipped app ( SoulGlitch ) that proves it on real phones. Interactive 3D Loom overview Documentation GitHub Skip to comparisons ↓ What you get The OpenFluke stack Not a single API—an open-source AI infrastructure lab : engine, bindings, docs, and products built on the same runtime. Loom (Apache 2.0) M-POLY-VTD engine: train + infer, 21 dtypes, WebGPU, BitNet CPU, C-ABI welvet , native release binaries. Polyglot bindings Python, TypeScript/npm, Go, Dart, C#, Java, WASM—one engine, same weights, embed like SQLite for neural nets. SoulGlitch (product) Offline AI companion on Google Play—swarm Q&A, emotion training, reactive face. Living proof of on-device Loom. Primecraft + lab tools Voxel simulation with embedded AI, scene gallery, Lucy CLI for local HF models—same sovereignty story. Open source means Loom: source, license, and rebuildable natives on GitHub. Releases ship prebuilt .so / .dylib / wheels so you don't have to compile Go—same pattern as PyTorch pip wheels or llama.cpp binaries. SoulGlitch is a product on Google Play (app code not necessarily OSS). Model weights come from Hugging Face under their own licenses. Advantages What Loom does differently Compared to cloud AI, Python frameworks, LLM-only runners, and other Go ML libraries. 
Sovereign & offline No API keys. Prompts and training stay on your hardware—privacy by architecture, not policy PDFs. Pure Go, zero CGO Golang AI without a Python runtime or CUDA-only trap. Single-binary deployment story for edge and servers. 3D volumetric mesh Networks as spatial grids—not only nn.Sequential . Native target propagation and step mesh learning. 21 dtypes + BitNet Float64 down to 1-bit binary per layer. Native packed checkpoints with verified save/reload (v0.79). BitNet b1.58 on CPU since v0.78. DNVM determinism Bit-identical behaviour across CPU, WebGPU, and bindings—reproducible research and embedded systems. WebGPU everywhere Cross-vendor GPU: Windows, Linux, macOS, Android, browser—without shipping CUDA toolchains per platform. DNA & NEAT built-in Topological comparison of whole networks, evolution in-engine—not just weight checkpoint diffing. Shipped proof SoulGlitch on Play Store today. Not slides—a consumer app running local LLMs via Loom/welvet. Vs the industry Quick comparisons Cloud AI (ChatGPT, etc.) 
- Them: Intelligence in their datacenter. Loom: Engine in your process.
- Them: No embeddable runtime. Loom: C-ABI for your app.

PyTorch / JAX
- Them: Python + huge CUDA stack. Loom: Go binary, edge-first.
- Them: 1D autograd DAG. Loom: 3D mesh + target propagation.

llama.cpp / Ollama
- Them: LLM inference focus. Loom: Train + small nets + NEAT + DNA.
- Them: GGUF decode excellence. Loom: Full engine for products.

GoMLX / Born ML
- Them: 1D stacks or OpenXLA/CGO. Loom: Zero CGO + WebGPU.
- Them: Narrower scope. Loom: DNVM, BitNet, DNA, shipped app.

Feature matrix

Loom vs PyTorch & Go ML (summary)

| Capability | Loom | PyTorch / JAX | llama.cpp |
| --- | --- | --- | --- |
| Core language | Pure Go (golang AI) | Python + C++/CUDA | C/C++ |
| Offline / embed | First-class (C-ABI, WASM) | Possible, heavy | Inference-focused |
| Training + custom nets | 3D mesh, NEAT, DNA | Autograd ecosystem | Mostly inference |
| Quantization | 21 native dtypes + BitNet CPU | TorchAO add-ons | GGUF quants |
| GPU path | WebGPU (cross-platform) | CUDA / ROCm / TPU | CPU/GPU backends |
| Determinism (DNVM) | Bit-exact claim | Not guaranteed | Varies |
| Open source | Apache 2.0 engine + binaries | Framework OSS | OSS inference |

Deep dive: M-POLY-VTD architecture research · docs overview

Fit

When to choose Loom

Choose Loom if you need…
- Offline AI inside your app (Flutter, Go, WASM)
- A golang AI / Go ML stack without Python
- Bit-exact, auditable local inference
- BitNet or sub-byte models on CPU
- 3D / NEAT / DNA research in one engine
- Apache 2.0 you can fork and ship

Use something else if you need…
- Largest cloud models with zero setup (use hosted APIs)
- Massive PyTorch ecosystem & HF fine-tune recipes day one
- Fastest GGUF Llama on Mac CPU only (benchmark llama.cpp)
- Enterprise MLOps (Kubeflow, etc.) out of the box

Ready to try the golang AI engine? Star the repo, read the docs, or install SoulGlitch and run models offline today.
---

Source: https://openfluke.com/loom/research

# Loom M-POLY-VTD: Golang AI Architecture Deep Research

> Technical analysis of Loom's pure Go AI stack (M-POLY-VTD): volumetric tensor dispatch, 21-type polymorphism, step mesh, neural target propagation, topological DNA vs PyTorch, JAX, Born ML, GoMLX.

Canonical: https://openfluke.com/loom/research

---

AI Deep Research · Technical Analysis

M-POLY-VTD: The Loom Architecture

An exhaustive technical analysis of the Loom framework, covering Volumetric Tensor Dispatch, Multi-Numerical Polymorphism, Systolic Grid Propagation, Neural Target Propagation, the Topological DNA Engine, and a rigorous comparison against PyTorch, JAX, and the Go ML ecosystem.

3D Volumetric Grid · 21 Numeric Types · Neural Target Propagation · WebGPU Native · Pure Go · Zero CGO

AI-Generated Deep Research · Podcasts & PDFs

Three release-era briefings from the Loom lab: listen in-browser or download the matching PDF report from files.openfluke.com.

v0.78 · Bedrock: Loom Poly AI Engine Research
Flagship M-POLY-VTD deep dive: volumetric dispatch, 21 dtypes, target propagation, and Go ML comparisons. (PDF, MP3)

v0.76: Operation Mesh Shrinks Local AI
How Loom's operation mesh and release trajectory tighten the local-AI deployment story on consumer hardware. (PDF, MP3)

v0.75: Mac Mini Beats RTX 4090 with Loom
AI engine tiling update analysis: cache-aware dispatch and why Apple Silicon + Loom can outrun big discrete GPUs on the right workloads. (PDF, MP3)

Section 1

The Paradigm Shift: Volumetric Tensor Dispatch (VTD)

Traditional deep learning frameworks, including PyTorch and TensorFlow, construct neural networks as directed acyclic graphs (DAGs) or sequential layer lists.
While mathematically sound, this one-dimensional abstraction creates rigid execution pipelines that struggle to implement complex, biologically inspired routing. The Loom architecture removes this constraint by introducing a 3D Volumetric Coordinate System. Every layer is assigned a geometric address (z, y, x, l) within a pre-allocated spatial grid. A flattening algorithm maps these coordinates to contiguous 1D memory, maintaining hardware cache locality despite the logical 3D abstraction.

Spatial Hopping

In standard sequential models, data must flow strictly from layer N to layer N+1. In the Loom volumetric grid, signals can bypass adjacent layers and jump across geometric coordinates, mimicking biological cortical columns. If a layer has an `IsRemoteLink` flag, the dispatcher fetches the remote layer dynamically via `TargetZ`, `TargetY`, `TargetX`, `TargetL` and injects it into the local execution path without graph recompilation.

Dynamic Branching via Polymorphic Routing: The `LayerParallel` and `LayerSequential` container types aggregate sub-branches within the coordinate space. When `ParallelForwardPolymorphic` executes, the dispatcher routes input to multiple coordinate-mapped branches simultaneously, then merges using configurable topological modes:

- 🔗 concat: Standard tensor concatenation across parallel branches.
- ➕ add: Residual aggregation, summing branch outputs for skip connections.
- 〰️ avg: Ensemble smoothing via averaged output tensors.
- 🔀 grid_scatter: Spatial distribution of tensors across the volumetric grid.
- 🎛️ filter (MoE): Mixture-of-Experts gating, where a `FilterGateConfig` layer generates Softmax coefficients to compute a dynamically weighted sum.

Section 2

Multi-Numerical Polymorphism (M-POLY)

A critical bottleneck in edge-device inference is memory bandwidth: streaming weight matrices from global VRAM to compute units. The Loom engine addresses this through native multi-numerical polymorphism.
Unlike standard frameworks that require exporting to a fixed lower precision, Loom layers operate as fluid polymorphic units . The WeightStore struct maintains a master Float32 representation as the absolute source of truth, alongside a localized cache of actively morphed target precisions keyed by DType . Loom supports 21 distinct numerical types : Float64 Float32 BFloat16 Float16 FP8 E4M3 FP8 E5M2 FP4 Int64 Int32 Int16 Int8 Int4 Int2 UInt8 UInt4 UInt2 Ternary Binary (1-bit) NF4 E2M1 E3M0 Hardware Emulation via SimulatePrecision: For extreme low-bit types lacking native CPU/GPU register support (FP4, 2-bit quantization), Loom employs a universal fallback that mathematically forces the Float32 master weight to behave exactly as its lower-bit counterpart — simulating exponent/mantissa bounds for FP8E4M3, restricting to four discrete scaling levels for Int2, and clamping to ±1 for Binary. This enables Quantization-Aware Training (QAT) without complex fake-quantization node injections (as required by PyTorch). Different spatial coordinates can operate at different precisions simultaneously — a reasoning node in Float16 while an embedding lookup runs in 2-bit. 98.4% On-Disk Compression By packing low-bit representations, the Loom architecture achieves up to 98.4% on-disk compression for localized model deployment — effectively breaking the 192 GB/s memory bandwidth wall that stifles traditional inference on consumer graphics cards like Turing-class GPUs. Section 3 Systolic Grid Propagation: The Discrete-Time Neural Mesh Standard deep learning inference operates in a continuously flowing waterfall pattern — layer 1 finishes, passes memory to layer 2, and so on. Loom introduces Systolic Grid Propagation , modelled after the hardware systolic arrays used in Google's TPUs. Under this model, the 3D Volumetric Grid is a discrete-time neural mesh . 
The SystolicForward function advances the entire 3D grid by a single temporal "tick" — every coordinate calculates its output simultaneously based solely on input states from the previous tick. 🔁 Double Buffering The network maintains ReadBuffer and WriteBuffer per tensor state. During dispatch, every layer reads from ReadBuffer and writes results exclusively to WriteBuffer. CommitSystolicState then atomically swaps buffers — preventing race conditions in concurrent environments. ⏱️ Temporal Pattern Learning Information takes time to propagate geometrically across the network. This fundamentally alters how sequence data is processed — enabling true temporal learning that standard feedforward networks cannot achieve. 🔀 Asynchronous Layers Because layers operate asynchronously relative to continuous data flow, the systolic mesh supports online learning patterns that are impossible in standard sequential epoch-based training. Section 4 Neural Target Propagation (TargetProp) Backpropagation is widely criticized for its biological implausibility: it requires global error computation, exact weight symmetry, and freezing the forward activity while gradients are sequentially calculated backward through the chain rule. Loom implements an advanced alternative: Neural Target Propagation . Instead of computing continuous derivatives, TargetProp computes a proposed "target" state for each hidden layer. Each layer's objective is no longer to minimize the global loss via partial derivatives, but simply to map its forward activation to the proposed backward target . How TargetProp Works in Loom During the forward pass, actual activations are captured in ForwardActs . During optimization, CalculateTargetPropGaps executes an inverse estimation: for Dense layers, estimated targets are generated via weighted importance of downstream targets relative to master weights. 
For LSTM layers, the engine aggregates backward through input, forget, cell, and output gates simultaneously, creating a synthesized target for the previous recurrent time step. Gap-Based Hebbian Optimization: Once targets are generated, ApplyTargetPropGaps applies a local Hebbian-style learning rule. The weight update follows: ΔW = η · input · (target − actual) Loom introduces an advanced stability mechanism via LinkBudget — dynamically calculated from the cosine similarity between the forward activation vector and the backward target vector. If the target signal is highly misaligned (cosine similarity below 0.2), the layer simply ignores the update . This prevents catastrophic forgetting and exploding signals. Crucially, because TargetProp does not require differentiable functions, Loom can natively optimize extreme architectures like binary (1-bit) or ternary networks where standard gradients would vanish or shatter. Section 5 The Topological DNA Engine Because layers can dynamically hop across a 3D coordinate space and shift their numerical precision, traditional cryptographic hashing or PyTorch state-dict comparisons would instantly register a complete mismatch even when underlying logic is intact. Loom integrates a native DNA Engine based on principles from Topological Data Analysis (TDA). ExtractDNA converts every layer into a LayerSignature capturing spatial coordinates, layer type, DType, and a dimensionally normalized weight representation . The SimulatePrecision function expands all active WeightStore versions back to unified Float32 before unit vector normalization — ensuring the geometric "direction" of weights is captured independently of bit-depth magnitude. Logic Shift Detection CompareNetworks identifies Logic Shifts — when a layer signature in Model A aligns with high cosine similarity (>0.8) to a layer in Model B, but at a different spatial coordinate . 
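The logic-shift check described above can be sketched in Go. This is a minimal illustration under stated assumptions, not Loom's actual CompareNetworks code: the Coord, Signature, and Shift types and the findLogicShifts function are hypothetical names introduced here. The only details taken from the text are the cosine-similarity comparison, the 0.8 threshold, and the requirement that the matching layer sits at a different grid coordinate.

```go
package main

import (
	"fmt"
	"math"
)

// Coord is a hypothetical stand-in for a layer's (z, y, x, l) grid address.
type Coord struct{ Z, Y, X, L int }

// Signature is a hypothetical, simplified layer signature: a grid address
// plus a flattened weight vector already expanded to float32.
type Signature struct {
	At      Coord
	Weights []float32
}

// Shift records a layer whose weight direction reappears at another coordinate.
type Shift struct {
	From, To   Coord
	Similarity float64
}

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 when the lengths differ or either vector has zero magnitude.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// findLogicShifts pairs each signature in a with its most similar signature
// in b and reports the pair when the similarity exceeds threshold and the
// two layers live at different coordinates.
func findLogicShifts(a, b []Signature, threshold float64) []Shift {
	var shifts []Shift
	for _, sa := range a {
		best, bestSim := -1, threshold
		for j, sb := range b {
			if sim := cosine(sa.Weights, sb.Weights); sim > bestSim {
				best, bestSim = j, sim
			}
		}
		if best >= 0 && b[best].At != sa.At {
			shifts = append(shifts, Shift{From: sa.At, To: b[best].At, Similarity: bestSim})
		}
	}
	return shifts
}

func main() {
	modelA := []Signature{
		{At: Coord{0, 0, 0, 0}, Weights: []float32{1, 2, 3}},
		{At: Coord{0, 0, 1, 0}, Weights: []float32{-1, 0, 1}},
	}
	// Model B holds the same weight directions, rescaled and relocated.
	modelB := []Signature{
		{At: Coord{0, 0, 0, 0}, Weights: []float32{-2, 0, 2}},
		{At: Coord{0, 1, 0, 0}, Weights: []float32{2, 4, 6}},
	}
	for _, s := range findLogicShifts(modelA, modelB, 0.8) {
		fmt.Printf("%v -> %v (cos=%.2f)\n", s.From, s.To, s.Similarity)
	}
}
```

Because cosine similarity ignores vector magnitude, rescaled weights still match, which fits the text's point that the signature captures the direction of the weights independently of bit depth.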
Logic Shift detection allows researchers to observe how architectural search algorithms or systolic propagation patterns naturally migrate logic pathways to more efficient regions of the 3D grid over time. Section 6 Native WebGPU Acceleration & Hardware-Aware Tiling Loom achieves 70+ tokens/second on consumer hardware through low-level optimization. The hardware.go module queries the operating system (sysctl on Darwin, /sys/devices/system/cpu/cpu0/cache/ on Linux) to determine exact L1/L2 cache byte sizes. Dynamic L1/L2 Cache Tiling: CalculateOptimalTileSize restricts matrix multiplication blocks so that the entire sub-block remains resident in L1 cache — significantly reducing global memory fetch latency. This delivers major speedups for operations like swigluTiledProjectGateUp. WGSL Shader Workgroup Optimization: For WebGPU execution, Loom queries MaxComputeWorkgroupStorageSize and MaxComputeInvocationsPerWorkgroup directly from the WebGPU adapter. MHA shaders allocate shared arrays for Keys and Values, using workgroupBarrier() synchronization, sized to consume exactly half of available workgroup storage — achieving optimal execution across Apple Silicon, NVIDIA GPUs, and integrated mobile GPUs. Section 7 Sub-System Autonomy: Tokenization, Ensembling & Telemetry 🔤 Native BPE Tokenizer A full Byte-Pair Encoding tokenizer written in Go, natively parsing HuggingFace tokenizer.json schemas. Includes a byte-fallback mechanism (gpt2ByteEncode/Decode) for unknown Unicode characters — enabling completely standalone, offline string-to-tensor processing. 🧮 Mathematical Ensembling FindComplementaryMatches assesses binary correctness masks of multiple models, calculating combined coverage ratio and cosine similarity of success rates — enabling optimized "Mixture of Models" pipelines that complement each other's weaknesses.
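The combined coverage ratio mentioned for Mathematical Ensembling can be illustrated with a short Go sketch. It assumes a correctness mask is a slice of booleans with one entry per evaluation sample, and it covers only the coverage part, not the cosine similarity of success rates. The coverage and bestPair functions are hypothetical names for illustration and are not the signature of Loom's FindComplementaryMatches.

```go
package main

import "fmt"

// coverage returns the fraction of samples that at least one of the two
// models answered correctly. mask[i] is true when the model got sample i
// right. Masks of unequal or zero length yield 0.
func coverage(a, b []bool) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	covered := 0
	for i := range a {
		if a[i] || b[i] {
			covered++
		}
	}
	return float64(covered) / float64(len(a))
}

// bestPair scans every pair of models and returns the indices of the pair
// with the highest combined coverage, or (-1, -1, 0) if no pair covers anything.
func bestPair(masks [][]bool) (i, j int, best float64) {
	i, j = -1, -1
	for p := 0; p < len(masks); p++ {
		for q := p + 1; q < len(masks); q++ {
			if c := coverage(masks[p], masks[q]); c > best {
				i, j, best = p, q, c
			}
		}
	}
	return i, j, best
}

func main() {
	// Three models evaluated on the same four samples.
	masks := [][]bool{
		{true, true, false, false},
		{true, false, true, false},
		{false, false, true, true},
	}
	i, j, c := bestPair(masks)
	fmt.Printf("models %d and %d cover %.0f%% of samples\n", i, j, c*100)
}
```

In this toy data each model is right on only half of the samples, yet the first and third models fail on disjoint samples, so together they cover every sample. That is the sense in which two models complement each other's weaknesses.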
📊 Differentiable K-Means KMeansForwardPolymorphic transforms standard K-Means into an end-to-end differentiable operation using temperature-scaled distance metrics and Softmax gating, allowing classification topologies anywhere in the volumetric grid. 📡 Microsecond Telemetry The PolyObserver interface enables real-time tensor interception during forward/backward passes. AdaptationTracker monitors degradation and recovery via moving windows of outputs, accuracy, and throughput (OutputsPerSec).

Section 8 Comparative Analysis: Loom vs Python Ecosystem (2026)

The global deep learning industry has historically been dominated by Python-based frameworks. Comparing Loom to these heavyweights highlights distinct philosophical and technical divergences.

| Feature | Loom (M-POLY-VTD) | PyTorch (+ TorchAO) | JAX (+ Flax/Optax) |
| --- | --- | --- | --- |
| Execution Paradigm | 3D Volumetric Mesh / Spatial Routing | 1D Sequential / Dynamic DAG | Functional / Compiled Static Graph (XLA) |
| Language | Pure Go (Compiled Native Binary) | Python (C++ / CUDA backend) | Python (C++ / XLA backend) |
| Quantization | 21 types native (FP64 down to Binary 1-bit) | Native FP8, INT4, INT2, 1-bit via TorchAO | Native FP8, INT8; sub-byte via custom libs |
| QAT (Hardware Emulation) | Built-in polymorphic SimulatePrecision | FakeQuantize modules (complex node injection) | Custom JAX primitives |
| Optimization Engine | Polymorphic BPTT + Native Target Propagation | Native Autograd (reverse-mode AD) | Functional forward & reverse AD |
| Target Propagation | First-class native | Requires extensive custom class overrides | High research support via custom logic flows |
| GPU Acceleration | WebGPU (cross-platform, edge & browser) | CUDA, ROCm, Metal (vendor-specific) | TPU, CUDA, ROCm (heavy compiler reliance) |
| Structural Analysis | Topological DNA Engine + Logic Shifts | Standard dict/parameter hashing | Standard dict/parameter hashing |
| Deployment Footprint | Single binary, zero dependencies | Large runtime (PyTorch + CUDA variables) | Large runtime (JAX + XLA toolchains) |

Section 9 Comparative Analysis: Loom vs Go ML Ecosystem (2026)

| Feature | Loom | Born ML | GoMLX | Gorgonia (Legacy) |
| --- | --- | --- | --- | --- |
| Core Architecture | 3D Spatial Grid (Volumetric routing) | 1D Sequential module stacks | 1D Sequential computation graphs | Static graph (Theano/TF1 style) |
| Compute Backend | Pure Go + WebGPU (Zero CGO) | Pure Go + WebGPU (Zero CGO) | OpenXLA (Heavy C++ bindings) | CGO / CUDA (C++ bindings) |
| Modern LLM Topology | MHA, SwiGLU, RMSNorm, RoPE | MHA, GQA, SwiGLU, KV-Cache, RMSNorm | Gemma support / ONNX translation | None (basic perceptrons/CNNs only) |
| Quantization Spectrum | 21 types (FP64 down to Binary 1-bit) | Standard (FP32/FP16) | Standard (dictated by XLA compiler) | FP32/FP64 only |
| Optimization Engine | Backprop (BPTT) + Native Target Propagation | Automatic Differentiation (Autograd) | Automatic Differentiation via XLA | Symbolic & Automatic Differentiation |
| Non-Standard Layers | Native Differentiable K-Means Clustering | Requires external implementation | Requires external implementation | Requires external implementation |
| System Telemetry | Advanced window-based Adaptation Tracking | Standard terminal logging | Standard terminal logging | Standard terminal logging |

Conclusions Strategic Outlook

The Loom M-POLY-VTD architecture represents a radical divergence from established norms of deep learning engineering in 2026. By replacing the 1D computational graph with a cycle-accurate 3D Volumetric Grid, the framework physically maps neural structures in a manner that accommodates advanced biological routing — spatial hopping, systolic parallelism, and polymorphic precision. Its exhaustive 21-type polymorphism and simulated precision mechanisms directly confront the hardware memory bandwidth crisis, enabling dynamic on-the-fly quantization to 1-bit precision without structural memory reallocation. Neural Target Propagation provides a mathematically viable path for continuous, asynchronous training on power-constrained edge hardware.
Complemented by the DNA Engine's topological signature matching, native BPE tokenization, and pure-Go WebGPU acceleration, Loom provides a self-contained, enterprise-grade ecosystem — vastly surpassing legacy Go frameworks, matching Born ML's deployment efficiency, and introducing architectural innovations previously reserved for experimental Python and JAX research environments. View Loom on GitHub Loom Documentation Back to Loom Overview --- Source: https://openfluke.com/soulglitch # SoulGlitch — Offline AI Digital Pet (Google Play) > SoulGlitch is a private on-device AI companion on Google Play (Android) and Linux x86_64. Reactive glitch face, swarm Q&A, emotion training, 100% offline—powered by Loom. iOS, macOS, Windows coming soon. Canonical: https://openfluke.com/soulglitch --- Now on Google Play · offline AI pet SoulGlitch A private AI digital pet that lives entirely on your device. Chat with it, train its emotional reactions on your own data, share moments with it, or ask a swarm of personalities and watch them vote on what you should do. Get it on Google Play Linux Founder unlock Roadmap Available now Android — download from Google Play. Linux — x86_64 zip below. No cloud, no accounts, no tracking. Coming soon: Apple App Store (iOS), Mac App Store, and Microsoft Store (Windows). Google Play Linux iOS macOS Windows Ask a swarm, not just one bot Build your own panel of personalities, ask them anything, and watch them respond with different tones, opinions, and chaos levels — all locally on your phone. “Should I fart on a crowded train?” Gremlin Mode YES Absolutely. This is a once-in-a-lifetime social experiment. Polite One NO Please do not weaponise public transport. Chaos Analyst YES Data suggests the emotional outcome will be unforgettable. Swarm Result: 2 / 3 personalities say yes Loom live on your phone Watch models run in real time — via TANHI Loom trains and executes on your computer while SoulGlitch on your local phone receives live UDP telemetry. 
See mixed layer types, remote links, and volumetric topology as a spatial trace you can scrub through — not just log lines. TANHI × Regional Mix — Loom’s regional_mix harness pairs Dense, MHA, SwiGLU, RNN, and LSTM branches with regional remote links. TANHI (Tensor Activation Network Holographic Interface) streams layer activity to SoulGlitch over your LAN. Read the TANHI docs Screenshots See SoulGlitch in action Tap any screenshot to view it larger and explore how the app feels and behaves. What it does More expressive than a chatbot. More private than the cloud. SoulGlitch is designed as a playful, reactive local AI experience. It is meant to feel alive, weird, emotional, and personal — not like another generic wrapper around a text box. 👁️ Reactive glitch face A living full-screen face that blinks, shifts, glitches, and morphs through emoji-driven expression changes in real time. 🧠 Train your entity Use text, images, and eventually video to shape how your AI interprets emotion and reacts to your world. 🗳️ Swarm personalities Create multiple agents with different prompts and let them respond, disagree, and vote on questions together. 📸 Share experiences Share a photo or video with your pet and let it react emotionally based on the worldview you trained into it. 📱 Runs on-device Chats stay local. Training stays local. Reactions stay local. Your AI lives on your hardware, not someone else’s server. 🌀 Built to expand SoulGlitch starts as a digital pet and grows toward encrypted sessions, swarm networking, future dimensions, and richer local agent behaviour. Free vs Founder Start free. Unlock the inner layer. The free tier is the expressive local AI pet. Founder unlock opens the deeper customization and multi-entity system behind the hidden door. Free tier Local AI pet The playful, reactive outer layer — designed to be approachable, expressive, and fun from the first launch.
Talk to a full-screen glitch face with emoji-driven reactions Train text sentiment and emotional response behaviour Train on images and shape your entity’s reactions Share photos and let the AI react to them Share videos and build toward video-based emotional reactions Export and share moments to social media Founder unlock • $8.88 Inner layer access For people who want the deeper system: more control, more entities, more weirdness, more experimentation. Create more than one entity with different instructions and personalities Ask a question to a swarm of agents and collect a group vote Choose from a large entity pool to build your own custom groups Customize entity colour, shape, and visual style Access prototype features and in-progress experiments behind the hidden layer Support development while locking in the early founder tier Why SoulGlitch Not another serious AI assistant. SoulGlitch is intentionally playful. It is built for people who want to experiment, laugh, share, and see how local AI can feel expressive and alive without sending their whole life into the cloud. 😂 Chaotic by design Ask dumb questions, get strange answers, and treat it like a digital creature rather than a corporate productivity dashboard. 🔒 Privacy-first The fun part is the personality. The serious part is that your chats and training stay on your own device. ✨ Expressive visuals Face morphs, emojis, glitch states, future dimensions, and more — the interface is part of the personality, not just a skin around a text box. Roadmap What’s coming next SoulGlitch is live on Google Play (Android) and Linux (x86_64) . iOS, macOS, and Windows Store builds are coming soon. 01 Polish the core pet Sharpen the face, refine emotion training, expand sharing, improve reaction quality, and make the free tier instantly fun. 02 Deepen the founder layer Multi-entity setup, better swarm workflows, richer customization, and more experiments hidden behind the inner door. 
03 Expand the system Encrypted chats, swarm networking, better face expressions, device-to-device syncing, cross-platform builds, and future dimension spaces. 04 Enter its dimension Audio and video training, optional distributed compute, 3D environments, and a much more embodied version of the entity. SoulGlitch Train your entity. Ask a swarm. Don’t take it too seriously. The outer layer is a strange little offline AI pet. The inner layer is where the system starts mutating into something bigger. Download on Google Play Linux Dev notes --- Source: https://openfluke.com/soulglitch/notes # SoulGlitch Design Notes — Vision & Mechanics > Design document for SoulGlitch: creature evolution, Hall of Fame, swarm networking, emotion training, privacy-first AI companion—not a cloud chatbot wrapper. Canonical: https://openfluke.com/soulglitch/notes --- Developer notes • current direction • public build plan SoulGlitch Notes These are the live notes behind SoulGlitch’s current direction: a local offline AI pet with emotional reactions, swarm personalities, founder-only inner layers, and a longer-term path toward richer on-device agent behaviour. Back to SoulGlitch Pricing & Tiers Coming Features 🧬 What SoulGlitch is now The current app direction SoulGlitch is no longer framed as the old creature-simulation placeholder. The current product is an offline AI digital pet with a glitch face, emotional reactions, local training, and a swarm layer for asking multiple personalities the same question and collecting their answers. The goal is to make local AI feel expressive, playful, reactive, and personal — something that feels alive on your phone, not another sterile cloud chatbot. Privacy matters, but the front-facing experience is still meant to be fun. 📱 Core app loop How the main experience is supposed to feel Open the app and interact with a living full-screen face rather than a plain message list. 
Talk to the entity and watch it react visually with glitches, emoji-driven moods, and changing expression. Train its sentiment and emotional behaviour using your own data. Share a photo or video and let it react based on what you previously taught it. Move into the founder layer and ask a swarm of personalities the same question to get a group vote. 💸 Free, Founder, and later pricing The monetization direction as it stands right now Free tier The expressive outer layer Talk to the glitch face and train text sentiment reactions Train on images and influence emotional response Share a photo and get a mood-based reaction Share video and move toward video-aware emotional reactions Export and share moments to social media Founder unlock • $8.88 The hidden inner layer Customize entity colour, shape, and other visual traits Create more than one entity with different prompts and instructions Ask a question to a swarm of agents and collect their votes Build your own custom groups from a larger entity pool Access deeper prototype features behind the hidden door 🌀 Coming features The longer-term buildout pushing toward the higher tier Later roadmap • $18.88 direction System expansion Encrypted chats Network swarm hosting Windows, Linux, and Mac releases Device-to-device encrypted syncing Better face expressions and richer emotional display Audio and video training Optional distributed compute / donate device power 3D environments to engage with the entity in its own dimension Kernel fusion and other inference speed improvements 🎯 Targeting notes Who this appears to be for right now Current likely audience AI-curious adults who like playful tech, weird chat experiences, expressive interfaces, privacy-first software, and experimental apps that do not feel overly corporate or serious. Why the tone matters SoulGlitch works better when framed as fun, strange, reactive, and slightly chaotic. 
The privacy story is important, but “don’t take AI too seriously” is likely the sharper outer hook. 🧠 Model layer notes How the local brain tiers are being thought about 135M — light brain: fast reactions, playful tone, lower load, ideal for lively mobile interaction. 360M — balanced brain: stronger all-round local personality, better for fuller interaction on more capable devices. 1.7B — deep brain: heavier reasoning tier intended more for server or stronger hardware contexts. 🚀 Launch notes How rollout is currently framed The rollout path starts on Android, where the current build and local model story make the most sense. The founder direction then expands slowly into iOS, desktop storefronts, and eventually Steam. The app that launches first is not the final universe. It is the first stable layer: local pet, emotional reactions, founder swarm, and enough weirdness to prove the concept. SoulGlitch is being built in layers: first the expressive local pet, then the founder inner system, then the broader dimensional expansion. --- Source: https://openfluke.com/primecraft # Primecraft — Voxel Simulation Engine with Embedded AI > Primecraft is OpenFluke's distributed simulation engine: procedural worlds, physics, voxel scenes, and embedded neural AI. Android, Windows, Linux, Steam. Canonical: https://openfluke.com/primecraft --- Early Access Available Now Primecraft A distributed simulation engine for procedural worlds and embedded neural agents — delivered as a playable game across mobile, desktop, and the web. Android Windows Linux Wishlist on Steam Alpha v0.30.0 — Model Sharing Patch 8B+ Planets 100% Offline Real Neural AI Overview What Is Primecraft? More than a game — it's an experimental engine exploring AI, physics, procedural generation, and distributed gameplay. Primecraft is an experimental, physics-driven sandbox built on top of a custom simulation engine. 
The game blends procedural world generation, real neural-network AI, player-driven construction, and multi-device gameplay. It is also the foundation the creature game SoulGlitch is built on. In simple terms: Procedural Worlds + Real AI + Multiplayer Sandbox Explore billions of procedural planets — each deterministically generated from its coordinates Train on-device neural networks — companions that learn and evolve with you Drop into bubble scenes — mini 3D levels loaded directly from the web Build or import levels — using a simple JSON-based scene format Couch co-op & LAN sync — play together on one device or across your network Physics-driven gameplay — destructible objects, planetary gravity, and dynamic abilities Fully offline — including AI training, no cloud required Features Core Gameplay Features Experience a unique blend of exploration, creation, and AI-driven gameplay. Procedural Universe Navigate through billions of unique planets, each with distinct terrain, resources, and environmental conditions. Every world is algorithmically crafted for endless exploration. Neural AI Companions Train real neural networks directly on your device. Your companions learn from gameplay, developing unique behaviors and abilities through the Loom AI framework. Physics Sandbox Experience realistic planetary gravity, destructible environments, and physics-based abilities. Every object in the world responds to forces and collisions. Bubble Scenes Discover and enter bubble scenes — self-contained 3D levels created by players and loaded from the web. Play puzzles, challenges, and custom worlds. Multiplayer Experiences Play couch co-op on a single device or sync across multiple devices over LAN. Online multiplayer and server-hosted worlds are in active development. Level Creation Build your own worlds using the web-based editor. Export as JSON and share with the community, or import others' creations into your game. 
Videos See Primecraft In Action Real footage of the engine being stress-tested and explored — from physics simulations to procedural planetary constructs. Preview · v0.20.0 AI Returns Home, Planet Travel & Couch Co-Op Fly across planetary space, train an AI to fly itself home, jump into bubble scenes, and play local co-op with synced AI movement — all in one unscripted preview. Engine Test · Physics Low-Fidelity Simulation Stress Testing Discrete Element stress test: thousands of rigid bodies to find the saturation point of the physics solver. Tracking frame latency, CPU physics process time, and static RAM. Full Suite · Tests 1–9 Procedural Planetary Constructs 9 tests in one: SnakeBots, animated skeletons, procedural bestiary (Walkers, Worms, Star-creatures), planetary skyscrapers aligned to spherical surfaces, discovery satellites, defensive grids, and the Great Transfiguration — 150+ magical objects spawned onto a single planet. Watch on YouTube Technical Under the Hood Primecraft is built as a cross-platform simulation engine, not just a game. Godot / C# Runtime Native performance High-performance physics, rendering, and input handling for mobile and desktop builds. Optimized for real-time simulation with thousands of entities. TypeScript / Web Runtime Browser-based tooling Scene editing, constraint systems, and AI tooling that runs directly in your browser. Shares the same scene format for seamless interoperability. Engine Capabilities Deterministic Planet Generator — 8–15 billion reachable locations with consistent generation Embedded Neural Runtime (Loom) — native inference for on-device AI AI Training Layer — movement, control, and companion behaviour learning Authoritative Multiplayer — server architecture in development Web-Based Scene Editor — create puzzles, levels, and simulation experiments FAQ Frequently Asked Questions Is this a game or a research project? Both. 
Primecraft is a fully playable game, but it's also an experimental engine exploring AI, physics, procedural generation, and distributed input systems. How does the AI work? Each companion uses a real neural network running natively on your device — no cloud, no external servers. You train them through gameplay inside bubble scenes using the Loom framework. Does Primecraft collect my data? No. All AI models and training stay entirely on your device unless you explicitly choose to export or publish your scenes and models. How many planets are there? The coordinate system supports over 8 billion unique planets. Each one is deterministically generated from its grid location, ensuring consistency across sessions. Can I make my own levels? Absolutely! Use the web-based editor to create scenes, then load them directly into the game. Levels are stored as human-readable JSON files. Is multiplayer supported? Couch co-op and LAN play are available now. Online multiplayer with server-hosted bubble scenes is actively in development. Why does Primecraft look chaotic? By design! It's a physics-first sandbox with experimental AI. The emergent chaos is part of the experience — players are encouraged to break things creatively. Audience Who Is Primecraft For? Built for curious minds who love experimentation and creative chaos. Sandbox Enthusiasts Love physics sandboxes and chaotic emergent gameplay AI Hobbyists Train neural networks in a real-time environment Level Creators Build 3D worlds without complex tools Researchers Explore embodied AI and procedural ecosystems Explorers Discover a weird, beautiful universe to mess around in Join the Universe Start building, training, and exploring today. The cosmos awaits. 
Android Windows Linux Wishlist on Steam --- Source: https://openfluke.com/gallery # Primecraft Scene Gallery — Voxel Worlds by OpenFluke > Browse voxel scenes built in Primecraft while testing the simulation engine—3D worlds, reflex automation, and neural networks from the OpenFluke lab. Canonical: https://openfluke.com/gallery --- Scene Gallery --- Source: https://openfluke.com/privacy # Privacy Policy — OpenFluke > How OpenFluke, Loom, SoulGlitch, and Primecraft handle your data. SoulGlitch runs offline on-device; the website uses minimal analytics via Cloudflare. Canonical: https://openfluke.com/privacy --- Legal Privacy Policy Last Updated: November 28, 2025 1 Introduction OpenFluke ("we", "our", or "us") is committed to protecting your privacy. This Privacy Policy explains how your information is handled when you use our services, including Primecraft, Biocraft, and the OpenFluke website. 2 Information Collection We prioritize data minimization. We do not store your personal data on our own servers beyond what is necessary for core functionality. Our services use the following third-party providers: Google Sign-In: Used for authentication. We only receive basic profile information (name, email, profile picture) to identify your account. Google Play Services: Used for achievements, leaderboards, and cloud saves in Primecraft. Google Play Billing: Handles in-app purchases securely. We never see your payment information. 3 How Information is Used Account Management: To identify you and provide access to your Lab and saved content. Game Progress: To save your game state, scenes, and unlocks locally or via cloud sync. Diagnostics: We collect anonymous crash data to improve stability and fix bugs.
4 Data Security We rely on the robust security measures provided by Google Cloud, the Android operating system, and industry-standard encryption to protect your data. While no method of transmission is 100% secure, we use commercially acceptable means to protect your information. 5 Your Rights You have the right to access, correct, or delete your personal information. You can revoke Google sign-in permissions at any time through your Google account settings. Deleting your OpenFluke account will remove all associated data from our systems. 6 Children's Privacy Our services are not directed at children under 13. We do not knowingly collect personally identifiable information from children under 13. If you believe we have collected such information, please contact us immediately. 7 Changes to This Policy We may update this Privacy Policy from time to time. Changes will be posted on this page with an updated revision date. Continued use of our services constitutes acceptance of the updated policy. 8 Contact Us If you have any questions about this Privacy Policy, please reach out: support@openfluke.com --- Source: https://openfluke.com/terms # Terms of Service — OpenFluke > Terms of service for OpenFluke websites, Loom open-source software, SoulGlitch, and Primecraft. Canonical: https://openfluke.com/terms --- Legal Terms of Service Last Updated: December 10, 2025 1 Acceptance of Terms By accessing or using OpenFluke services, including Primecraft , Soulglitch , Biocraft , the OpenFluke website , and the LOOM AI framework , you agree to be bound by these Terms of Service. If you do not agree to these terms, please do not use our services. 2 Description of Services OpenFluke provides a platform for creating, sharing, and running physics simulations and AI-driven experiences. Our services include: Primecraft: A native game application for exploring procedural worlds and training AI companions. Biocraft: A browser-based studio for creating and testing physics scenes. 
Your Lab: A personal workspace for managing your scenes and AI models. LOOM: An open-source neural network framework for AI training. 3 User Accounts To access certain features, you must create an account using Google Sign-In. You are responsible for maintaining the confidentiality of your account and for all activities that occur under your account. Your account is free and gives you access to your personal Lab, scene publishing, and AI training features. 4 User Content You retain ownership of any content you create using our services ("User Content"). By publishing User Content, you grant OpenFluke a non-exclusive, worldwide, royalty-free license to display and distribute your content through our platforms. You agree not to publish content that: Is illegal, harmful, threatening, abusive, or violates any laws Infringes on intellectual property rights of others Contains malware, viruses, or malicious code Is designed to harm, exploit, or mislead other users 5 Intellectual Property The OpenFluke name, logo, Primecraft, Biocraft, and associated branding are trademarks of OpenFluke. The LOOM AI framework and IsoCard scene format are released under open-source licenses and may be used according to their respective terms. 6 Disclaimer of Warranties Our services are provided "as is" and "as available" without warranties of any kind. We do not guarantee that our services will be uninterrupted, secure, or error-free. Your use of our services is at your own risk. 7 Limitation of Liability To the maximum extent permitted by law, OpenFluke shall not be liable for any indirect, incidental, special, consequential, or punitive damages arising from your use of our services. 8 Termination We reserve the right to suspend or terminate your account at any time for violations of these terms or for any other reason at our discretion. You may delete your account at any time through your account settings. 9 Changes to Terms We may update these Terms of Service from time to time. 
Material changes will be communicated through our services. Continued use after changes constitutes acceptance of the updated terms. 10 Contact Us If you have any questions about these Terms of Service, please reach out: support@openfluke.com --- Source: https://openfluke.com/docs # Loom Docs — Golang AI Engine Reference > Official Loom Golang AI docs: M-POLY-VTD overview, v0.79 bedrock validation, Go layers & dispatch, GPU/WebGPU, quantization, Lucy seven-layer suite, BitNet CPU, deployment, TANHI telemetry. Canonical: https://openfluke.com/docs --- Loom Documentation M-POLY-VTD Engine Docs Complete reference for Loom's poly package — the Multi-numerical POLYmorphic Volumetric Tiled-tensor Dispatcher that powers every Loom integration. Architecture Overview Quick Reference GitHub Where to start? New? Read Overview → Layers → Training. Deploying to web or mobile? Go to Deployment. Need a snippet? Quick Reference has everything copy-paste ready. Architecture Layers Training Deployment GPU Quick Reference Read docs ## Loom documentation (full text) --- ## M-POLY-VTD: Architecture Overview Source: https://openfluke.com/docs/overview Markdown: https://openfluke.com/docs/overview.md # M-POLY-VTD: Architecture Overview **Multi-numerical POLYmorphic Volumetric Tiled-tensor Dispatcher** M-POLY-VTD is a neural inference and training engine built from first principles in Go. It treats a neural network not as a sequential stack of layers, but as a **spatial 3D grid** where each cell can hold any layer type, and every layer can morph its numerical precision on demand. > [!NOTE] > Current version: **0.79.0 (Bedrock Validation)**. Previous: **0.78.0 (ASM CPU)**. The Loom stack is **Go + `poly/asm` + WebGPU** only. **Numerical Tiling (SC/MC)** is live across all 21 DTypes; **Dense forward** can use Plan 9 assembly via `UseAsmForward`. **v0.79** hardens CPU train/save/reload, MHA layout + KV decode, and C-ABI parity (see [`bedrock_validation.md`](bedrock_validation.md)). 
Checkpoints save **native packed weights per layer dtype** (not FP32-only JSON). See [`poly/README.md`](../poly/README.md) for the live checklist and [`testing_and_validation.md`](testing_and_validation.md) for Lucy log interpretation. --- ## The Full Architecture ``` ┌─────────────────────────────────────────────────────────────────────────────┐ │ M-POLY-VTD ARCHITECTURE │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌──────────────────────────────────────────────────────────────────────┐ │ │ │ POLYGLOT BINDINGS (C-ABI FFI Layer) │ │ │ │ Python │ TS (@openfluke/welvet) │ C# │ Java │ Dart │ WASM Browser │ │ │ └─────────────────────────────┬────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌──────────────────────────────────────────────────────────────────────┐ │ │ │ VolumetricNetwork (3D Grid) │ │ │ │ │ │ │ │ Depth × Rows × Cols × LayersPerCell │ │ │ │ │ │ │ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │ │ │ │ (0,0,0,0) │ │ (0,0,1,0) │ │ (0,0,2,0) │ ← Depth=0, Row=0 │ │ │ │ │VolumetricL│ │VolumetricL│ │VolumetricL│ │ │ │ │ │ayer │ │ayer │ │ayer │ │ │ │ │ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ │ │ │ │ │ │ │ │ │ │ │ ┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐ │ │ │ │ │ (0,1,0,0) │ │ (0,1,1,0) │ │ (0,1,2,0) │ ← Depth=0, Row=1 │ │ │ │ └───────────┘ └───────────┘ └───────────┘ │ │ │ └──────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ ┌─────────────────┼──────────────────────┐ │ │ ▼ ▼ ▼ │ │ ┌───────────────┐ ┌──────────────────┐ ┌───────────────────────────┐ │ │ │ CPU Backend │ │ Step mesh engine │ │ WebGPU Backend (WGPU) │ │ │ │ │ │ │ │ │ │ │ │ ForwardPoly- │ │ StepForward │ │ BeginFrame / FlushFrame │ │ │ │ morphic[T] │ │ StepBackward │ │ DispatchForwardLayer │ │ │ │ │ │ Tween (NTP) │ │ DispatchBackwardLayer │ │ │ │ All 21 DTypes │ │ │ │ WGSL compute shaders │ │ │ └───────────────┘ └──────────────────┘ └───────────────────────────┘ │ │ │ │ 
┌──────────────────────────────────────────────────────────────────────┐ │ │ │ WeightStore (Morphic Precision Engine) │ │ │ │ │ │ │ │ Master []float32 ──┬──▶ Versions[DTypeFP4] []int8 │ │ │ │ (Source of Truth) ├──▶ Versions[DTypeInt8] []int8 │ │ │ │ ├──▶ Versions[DTypeBinary] []int8 │ │ │ │ └──▶ GPUWeights[DTypeFloat32] *wgpu.Buffer │ │ │ └──────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────────────────────────────┐ │ │ │ DNA Engine │ │ │ │ ExtractDNA ──▶ LayerSignature[] ──▶ CompareNetworks ──▶ SI Score │ │ │ └──────────────────────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────────────────────┘ ``` --- ## The Six Core Pillars ### I. Multi-Numerical Architecture (the "M") The engine natively dispatches forward and backward passes across **21 distinct numerical types** (DTypes), from `float64` all the way down to 1-bit `binary`. Each layer stores its weights in a `WeightStore` that holds a `float32` master copy plus optional converted versions for inference. ``` DType Hierarchy: ┌────────────────────────────────────────────────────────┐ │ High-Precision │ Float64, Int64, Uint64 │ │ Standard │ Float32, Int32, Uint32, Int16, Uint16│ │ Optimized │ Float16, BFloat16, Int8, Uint8 │ │ Low-Bit │ FP8E4M3, FP8E5M2, Int4, Uint4, FP4 │ │ Extreme │ Int2, Uint2, Ternary, Binary │ └────────────────────────────────────────────────────────┘ ``` Layers are not restricted to a single precision. The dispatcher reads `layer.DType`, fetches the right version from the `WeightStore`, and falls back to the master FP32 weights if no converted version exists. See [numerical_types.md](./numerical_types.md) for the full breakdown. ### II. Polymorphic Layer-Morphing (the "POLY") Every layer is a **polymorphic processing unit**. Its numerical representation can be changed at any time via `WeightStore.Morph(dtype)` without reallocating the layer structure. 
The master FP32 weights are never destroyed—they remain the source of truth. ``` Metamorphosis sequence: FP32 (training) ──▶ Morph(INT8) ──▶ Morph(FP4) ──▶ Morph(Binary) ▲ │ └──── Unpack(dtype) ──── always recoverable ─────────┘ ``` After gradients are applied via `WeightStore.ApplyGradients`, all cached low-bit versions are **automatically cleared**, forcing re-quantization on the next forward pass. ### III. Volumetric Tensor Dispatch (the "VTD") The network is a **4D array** of `VolumetricLayer` values indexed by `(Depth, Row, Col, LayerIndex)`. The flattened index is: ``` idx = z * Rows * Cols * LayersPerCell + y * Cols * LayersPerCell + x * LayersPerCell + l ``` Data flows through the grid in reading order: Z outer loop, then Y, then X, then L. This gives the programmer a spatial metaphor to compose complex non-linear topologies. #### Remote Links (Spatial Hopping) Any layer can set `IsRemoteLink = true` and point to any other coordinate via `TargetZ / TargetY / TargetX / TargetL`. When the step mesh engine fires that layer, it reads input from the *target* coordinate's output buffer instead of the preceding layer. This enables biological-style feedback loops anywhere in the grid. ``` Normal flow: Remote link (skip connection): (0,0,0) (0,0,0) │ │ ◄────────────────────────┐ ▼ ▼ │ (0,0,1) (0,0,1) ─ IsRemoteLink ──▶ (0,2,3) │ │ ▼ ▼ (0,0,2) (0,0,2) ``` ### IV. The Dispatcher Pattern `DispatchLayer[T]` and `DispatchLayerBackward[T]` are **generic runtime jump tables**. They inspect `layer.Type` and call the correct polymorphic function, returning `(preAct, postAct)` tensors of the same type `T`. The separation from the grid traversal loop makes GPU kernel fusion possible—the driver can look ahead and pre-load the next tile's weights while the current tile computes. ```go func DispatchLayer[T Numeric](layer *VolumetricLayer, input, skip *Tensor[T]) (preAct, postAct *Tensor[T]) ``` There are 19 `LayerType` values routed here. 
An unknown type falls through to `DenseForwardPolymorphic`. **Numerical tiling** is orthogonal to volumetric traversal: `ForwardPolymorphic` can walk the grid in spatial tiles or sequentially (`network.UseTiling`). **CPU** layers use a **single** tile map (`CPUTileSizes`). **GPU** layers carry **two** maps (`GPUSCTileSizes`, `GPUMCTileSizes`); **`EnableMultiCoreTiling`** on `VolumetricNetwork` selects MC vs SC dispatch (see [dispatch.md](./dispatch.md) and [gpu.md](./gpu.md)). ### V. The Step Mesh Engine Unlike `ForwardPolymorphic`, which executes the entire network per input in one pass, `StepForward` fires **all layers simultaneously** every clock cycle. Each layer reads from the previous cycle's output buffer (`LayerData`) and writes to `NextBuffer`. After all layers have fired, the buffers are swapped. This double-buffering pattern is race-condition-free and supports parallel tile dispatch via goroutines. ### VI. The DNA Engine `ExtractDNA` converts a network into a slice of `LayerSignature` values. Each signature contains the layer's 3D coordinates, type, DType, and a **normalized** (unit-vector) representation of its weights after precision simulation. `CompareNetworks(dna1, dna2)` then uses cosine similarity to produce an `OverallOverlap` score and identifies `LogicShift` events where a functional pattern has migrated to a different spatial coordinate. 
--- ## Key Types at a Glance | Type | File | Role | |:-----|:-----|:-----| | `VolumetricNetwork` | `poly.go` | The 3D grid container | | `VolumetricLayer` | `poly.go` | A single processing unit with coordinates | | `WeightStore` | `weights.go` | Master FP32 + versioned low-bit storage | | `Tensor[T Numeric]` | `poly.go` | Generic data container with `Shape` and `Nested` | | `DType` | `poly.go` | 21-value enum for numerical types | | `LayerType` | `poly.go` | 19-value enum for layer kinds | | `WGPUContext` | `wgpu_context.go` | GPU device, queue, pipeline cache | | `StepState[T]` | `step.go` | Double-buffered temporal mesh state | | `NetworkDNA` | `dna.go` | `[]LayerSignature` topological blueprint | | `TrainingConfig` | `training.go` | Epochs, LR, loss type, GPU flag | --- ## The `Tensor[T]` Type ```go type Tensor[T Numeric] struct { Data []T DType DType Shape []int Nested []*Tensor[T] // activation tree for Parallel/Sequential layers } ``` `Nested` is the key structural innovation. During a `ParallelForward` pass, each branch produces its own `preAct` tensor, and these are stored in `Nested` on the returned preAct. The backward pass reads them back, routing gradients to the correct branch without any external bookkeeping. This recursive tree property makes arbitrary nesting of `Parallel` and `Sequential` layers fully differentiable. 
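The role of `Nested` can be sketched with a toy parallel layer. The code below is an assumption-heavy illustration, not the `poly` implementation: it uses a trimmed, non-generic `Tensor`, and each "branch" is an invented operation that multiplies its input by a constant. It only shows how per-branch pre-activations stored in `Nested` let the backward pass split a concatenated gradient without extra bookkeeping.

```go
package main

import "fmt"

// Tensor is a trimmed float32-only copy of the struct above (no DType field),
// kept small so the sketch is self-contained.
type Tensor struct {
	Data   []float32
	Shape  []int
	Nested []*Tensor // one entry per branch
}

// branch is a hypothetical sub-layer that multiplies its input by scale.
type branch struct{ scale float32 }

// parallelForward runs every branch on the same input, concatenates the
// branch outputs into Data, and records each branch's output in Nested.
func parallelForward(branches []branch, in *Tensor) *Tensor {
	pre := &Tensor{}
	for _, b := range branches {
		out := &Tensor{Data: make([]float32, len(in.Data)), Shape: []int{len(in.Data)}}
		for i, v := range in.Data {
			out.Data[i] = b.scale * v
		}
		pre.Nested = append(pre.Nested, out)
		pre.Data = append(pre.Data, out.Data...)
	}
	pre.Shape = []int{len(pre.Data)}
	return pre
}

// parallelBackward walks pre.Nested to find each branch's slice of gradOut
// and accumulates the gradient with respect to the shared input.
func parallelBackward(branches []branch, pre *Tensor, gradOut []float32) []float32 {
	if len(pre.Nested) == 0 {
		return nil
	}
	gradIn := make([]float32, len(pre.Nested[0].Data))
	offset := 0
	for bi, nested := range pre.Nested {
		n := len(nested.Data)
		for i := 0; i < n; i++ {
			gradIn[i] += branches[bi].scale * gradOut[offset+i]
		}
		offset += n
	}
	return gradIn
}

func main() {
	branches := []branch{{scale: 2}, {scale: 3}}
	in := &Tensor{Data: []float32{1, 1}, Shape: []int{2}}
	pre := parallelForward(branches, in)
	grad := parallelBackward(branches, pre, []float32{1, 1, 1, 1})
	fmt.Println(pre.Data, grad)
}
```

The backward function never needs to be told how many outputs each branch produced; it reads that from the `Nested` entries, which is the property the paragraph above describes.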
--- ## Performance Snapshot From the README benchmark table, measured on a GTX 1650 Super (Vulkan/WebGPU): | Layer type | CPU Tiled | GPU | Speedup | |:-----------|:----------|:----|:--------| | Dense | 5.42ms | 400µs | 13.6x | | CNN 1D | 4.34ms | 195µs | 22.3x | | CNN 2D | 182ms | 100µs | 1826x | | CNN 3D | 1522ms | 200µs | 7602x | | RMSNorm | 1.16ms | 103µs | 11.3x | End-to-end GPU training (20 epochs): | Architecture | CPU | GPU | Speedup | |:-------------|:----|:----|:--------| | Dense MLP (128→512→512→8) | 12.1s | 693ms | 17.5x | | CNN 2D (3ch×32×32 → 16f→32f→8) | 1m57s | 1.81s | 64.8x | | Deep Dense (128→512×4→8) | 31.7s | 1.23s | 25.7x | --- ## Next Steps - [numerical_types.md](./numerical_types.md) — DType system, WeightStore, Metamorphosis - [layers.md](./layers.md) — Every layer type in detail - [dispatch.md](./dispatch.md) — The dispatcher pattern and 3D coordinates - [training.md](./training.md) — Forward/backward, optimizers, Tween - [gpu.md](./gpu.md) — WebGPU backend and BeginFrame/FlushFrame pattern - [step.md](./step.md) — The step mesh engine - [quick_reference.md](./quick_reference.md) — Common code snippets --- ## Deployment: TypeScript, WASM, and NPM Source: https://openfluke.com/docs/deployment Markdown: https://openfluke.com/docs/deployment.md # Deployment: TypeScript, WASM, and NPM Loom is designed to be **isomorphic**, meaning the exact same mathematical engine runs in both Node.js (backend) and the Browser (frontend) via a bit-perfect WebAssembly (WASM) bridge. --- ## 📦 The NPM Package: `@openfluke/welvet` The primary way to use Loom in the JavaScript ecosystem is through the **Welvet** SDK. 
### Installation ```bash npm install @openfluke/welvet ``` ### Quick Start (Node.js) ```typescript import { init, createNetwork } from '@openfluke/welvet'; // Initialize the WASM runtime await init(); // Build a network from a JSON specification const net = await createNetwork({ id: "demo-net", depth: 1, rows: 2, cols: 1, layers_per_cell: 1, layers: [ { z: 0, y: 0, x: 0, l: 0, type: "Dense", input_height: 128, output_height: 64, activation: "ReLU" }, { z: 0, y: 1, x: 0, l: 0, type: "Dense", input_height: 64, output_height: 10, activation: "Linear" } ] }); // Run a forward pass const input = new Float32Array(128).fill(0.5); const output = await net.sequentialForward(input); console.log("Network output:", output); ``` --- ## 🌐 WASM & FFI Bridge The TypeScript SDK communicates with the Go-compiled core via the **Universal C-ABI**. This ensures that complex logic (like NEAT evolution or DNA extraction) remains fast while providing a high-level, idiomatic JS interface. ### Verified Capabilities (v0.74.0) The isomorphic bridge has been verified through a 36-count diagnostic suite: - **Core Exports**: 8/8 internal WASM symbols verified. * **Network Methods**: 16/16 functional wrappers (Forward, DNA, Morph, etc.) passed. * **NEAT Population**: 8/8 evolutionary logic methods verified. * **Bit-Perfect Parity**: 0.000000% divergence vs the Go native reference. --- ## 🖼️ Browser Deployment (WebGPU) When running in the browser, the WASM runtime can automatically detect and utilize **WebGPU** for massive parallel speedups. ```typescript import { setupWebGPU } from '@openfluke/welvet'; // Initialize WebGPU context await setupWebGPU(); // Networks created after this point will utilize GPU kernels // for forward and backward passes. ``` ### Performance Tiers | Environment | Backend | Best For | | :--- | :--- | :--- | | **Node.js** | WASM (SIMD) | Backend inference, server-side DNA comparison. | | **Browser** | WASM + WebGPU | High-performance interactive AI, on-device training. 
| | **Mobile Web** | WASM | Lightweight edge execution. | --- ## 🧬 DNA & Evolution in JS The TypeScript SDK provides full access to the DNA logic: - **`net.extractDNA()`**: Generates a topological fingerprint. - **`compareLoomDNA(dnaA, dnaB)`**: Cross-platform similarity score. - **`createLoomNEATPopulation(id, size, cfg)`**: High-speed evolutionary architecture search. For more details on the underlying DNA math, see [dna.md](dna.md). --- ## Donate compute (TCP) Source: https://openfluke.com/docs/donate-compute Markdown: https://openfluke.com/docs/donate-compute.md # Donate compute (TCP) The **`donate_compute_*.go`** files in `poly/` implement an optional **TCP protocol** so a **donor** machine can accept inference-style work from clients on the same network (or loopback). Work is exchanged as **length-prefixed JSON** frames over a single connection — there is no HTTP server inside `poly` for this path. **Status:** The server’s inference and prompt paths are **stubs** (`stubInfer` / `stubPrompt`) until wired to real model loading, `poly` execution, or subprocess hooks. --- ## Why it exists - **LAN-friendly**: bind to `0.0.0.0` (or a specific interface) and let another host submit jobs without bundling a separate HTTP stack in `poly`. - **Two modes** (see below): push weights + token **`infer`**, or **`prompt`**-only against a local LM path advertised in the hello. --- ## Wire format (`donate_compute_framing.go`) Each message is: 1. **`uint32` length**, little-endian (4 bytes). 2. **UTF-8 JSON object** of exactly that length. Constants: | Constant | Value | Meaning | | :--- | :--- | :--- | | `DonateComputeDefaultPort` | **17001** | Default listen/dial port (adjacent to construct TCP dev on **17000**). | | `MaxDonateFrameBytes` | 64 MiB | Maximum single-frame payload; large models use **many** weight chunks, not one giant frame. | Helpers: **`WriteDonateFrame`**, **`ReadDonateFrame`**. 
--- ## Message types (`donate_compute_types.go`) Version **v1** uses a `"type"` string discriminator. Constants include: - **`hello`** — first frame from server after connect; client may echo hello. - **`model_begin`**, **`weights_chunk`**, **`model_commit`**, **`model_status`** — **model_push** upload lifecycle. - **`infer`**, **`infer_result`** — token-ID jobs against a **mounted** pushed model. - **`prompt`**, **`prompt_result`** — text jobs for **local_lm** nodes. - **`queue_status`**, **`error`** — optional / error paths. Structs (`DonateHello`, `DonateModelBegin`, `DonateWeightsChunk`, `DonateInfer`, `DonatePrompt`, …) mirror the JSON fields. --- ## Server (`donate_compute_server.go`) **`ServeDonateComputeTCP(opts DonateComputeServerOptions)`** returns a **`net.Listener`**. It: - Sends a **`DonateHello`** immediately after each accept (mode, role `server`, optional `LocalLmPath`, queue capacity hint). - Parses frames in a loop per connection. - Enqueues **`infer`** / **`prompt`** jobs on a **global FIFO channel**; **one worker** drains the queue (serial execution — not N parallel model mounts). **`DonateComputeServerMode`:** | Mode | Behavior | | :--- | :--- | | **`model_push`** | Client sends **`model_begin`** (config JSON + expected weight length), **`weights_chunk`** (base64 slices), **`model_commit`**. Server acknowledges with **`model_status`**. Then client may send **`infer`** with `input_ids` / `max_tokens`. | | **`local_lm`** | **`infer`** is rejected; client uses **`prompt`** with full text. Server may advertise **`LocalLmPath`** in hello (informational). | **`CloseDonateListener`** closes the listener. --- ## Client (`donate_compute_client.go`) - **`DialDonateCompute(addr)`** — TCP dial (default `127.0.0.1:17001` if empty), read server hello, send client hello; returns **`DonateClient`** + **`DonateHello`**. - **`PutModel(configJSON, weights)`** — stream model for **model_push** nodes. 
- **`EnqueueInfer`**, **`EnqueuePrompt`** — send one job and wait for the matching result frame. --- ## Tests **`donate_compute_test.go`** covers framing and client/server interaction. --- ## Security **v1 has no TLS and no authentication.** It is intended for **trusted networks** (e.g. same Wi‑Fi lab). Do not expose the raw port to the public Internet without a VPN, SSH tunnel, or application-layer gateway. --- ## File map | File | Role | | :--- | :--- | | `donate_compute_types.go` | v1 constants and JSON structs | | `donate_compute_framing.go` | Frame encode/decode, default port, size limits | | `donate_compute_server.go` | TCP server, modes, queue, stubs | | `donate_compute_client.go` | Dial, `PutModel`, `EnqueueInfer`, `EnqueuePrompt` | | `donate_compute_test.go` | Tests | --- ## TANHI — UDP layer telemetry Source: https://openfluke.com/docs/tanhi Markdown: https://openfluke.com/docs/tanhi.md # TANHI — UDP layer telemetry **TANHI** streams **sparse, non-blocking JSON-line events** over **UDP** so external tools (notably the **SoulGlitch → TANHI** HUD) can visualize **per-layer forward/backward** activity, timing, dtypes, and **routing links** (parallel branches / sequential substeps). Implementation: **`poly/tanhi.go`**. Integration hooks live in **`poly/forward.go`**, **`poly/backward.go`**, and **`poly/wgpu_forward.go`** (GPU transformer path). Optional **Welvet C-ABI** exports: **`welvet/cabi/tanhi_ext.go`**. --- ## Defaults | Constant / env | Default | Meaning | | :--- | :--- | :--- | | **`poly.DefaultTanhiUDPPort`** | **17481** | UDP destination port (IANA unassigned range). | | Host | `127.0.0.1` | When `TanhiUDPConfig.Host` is empty. | | Disabled | `nil` / off | `VolumetricNetwork.Tanhi == nil` or `Enabled == false` → no UDP. | --- ## Configuration (`TanhiUDPConfig`) Set on **`VolumetricNetwork.Tanhi`**: - **`Enabled`** — master switch. - **`Host`**, **`Port`** — UDP listener address (engine **sends** to this address). 
- **`SendShape`** — include approximate tensor **`shape`** in each event (CPU path uses activations when available; GPU path uses **`TanhiGPULayerShapeHint`** — no readback). Telemetry is **best-effort**: a buffered queue (**1024** packets); **overflow drops** silently so training/inference never blocks on HUD lag. --- ## Wire format Each datagram payload is **one JSON object per line** (newline-terminated). Schema version **`v`: `"tanhi1"`**. Typical fields: | Field | Meaning | | :--- | :--- | | `seq` | Monotonic sequence number | | `phase` | `"fwd"` or `"bwd"` | | `idx` | Layer index in traversal (-1 or special indices possible on GPU fused paths) | | `z`, `y`, `x`, `l` | Volumetric coordinate | | `layer` | Layer type string | | `dtype` | Integer dtype code | | `connections` | Fan-out hint from weight masters (or override for GPU LM head) | | `t0_ns`, `t1_ns` | Wall-clock nanoseconds around the layer | | `shape` | Optional shape slice when `SendShape` is true | | `links` | Optional routing targets for **LayerParallel** / **LayerSequential** (capped, for arc drawing) | --- ## SoulGlitch / Glitch CLI - **`GLITCH_TANHI=1`** — enable when running **`loom/glitch`** interactively (or answer the prompt). - **`TANHI_HOST`**, **`TANHI_PORT`**, **`TANHI_SHAPE=1`** — override host, port, and shape inclusion (same conventions in **`glitch/measure/*`** harnesses). Open SoulGlitch **first**; set the listener **port** to match **`TANHI_PORT`** / **17481**. --- ## C-ABI (Welvet) - **`LoomNetworkTanhiConfigure`** — enable/disable, host C string, port (0 → default **17481**), send-shape flag. - **`LoomNetworkTanhiDisable`** — clear `Tanhi` on the network handle. - **`LoomTanhiDefaultPort`** — returns **`DefaultTanhiUDPPort`**. --- ## Security note UDP telemetry is **localhost-oriented** by default. Pointing **`Host`** at a remote machine sends layer metadata and timing to that address — use only on **trusted networks** when not using **`127.0.0.1`**. 
--- ## Numerical Types, DType System, and WeightStore Source: https://openfluke.com/docs/numerical-types Markdown: https://openfluke.com/docs/numerical-types.md # Numerical Types, DType System, and WeightStore This document covers all 21 `DType` values, the `Numeric` generic constraint, the `WeightStore` master/versioned architecture, and the Metamorphosis mechanism that lets a layer switch precision on the fly. --- ## The 21 DTypes ```go type DType int ``` Every `VolumetricLayer` carries a `DType` field that controls which numerical format its weights are active in. The full set: ``` ┌─────┬───────────────┬──────────────────────────────────────────────┐ │ ID │ Name │ Description │ ├─────┼───────────────┼──────────────────────────────────────────────┤ │ 0 │ DTypeFloat64 │ IEEE 754 double (8 bytes per weight) │ │ 1 │ DTypeFloat32 │ Standard single (4 bytes) — Master baseline │ │ 2 │ DTypeFloat16 │ 16-bit float (simulated, stored as f32) │ │ 3 │ DTypeBFloat16 │ Brain Float: 8 exp bits, 7 mantissa │ │ 4 │ DTypeFP8E4M3 │ 8-bit FP, 4-exponent 3-mantissa │ │ 5 │ DTypeFP8E5M2 │ 8-bit FP, 5-exponent 2-mantissa │ │ 6 │ DTypeInt64 │ 64-bit signed integer │ │ 7 │ DTypeInt32 │ 32-bit signed integer │ │ 8 │ DTypeInt16 │ 16-bit signed integer │ │ 9 │ DTypeInt8 │ 8-bit signed integer (0.625–1.0 B/weight) │ │ 10 │ DTypeUint64 │ 64-bit unsigned integer │ │ 11 │ DTypeUint32 │ 32-bit unsigned integer │ │ 12 │ DTypeUint16 │ 16-bit unsigned integer │ │ 13 │ DTypeUint8 │ 8-bit unsigned integer │ │ 14 │ DTypeInt4 │ 4-bit signed (2 weights per byte) │ │ 15 │ DTypeUint4 │ 4-bit unsigned (2 weights per byte) │ │ 16 │ DTypeFP4 │ 4-bit floating point E2M1 (2 per byte) │ │ 17 │ DTypeInt2 │ 2-bit signed (4 weights per byte) │ │ 18 │ DTypeUint2 │ 2-bit unsigned (4 weights per byte) │ │ 19 │ DTypeTernary │ 2-bit ternary: -1, 0, +1 │ │ 20 │ DTypeBinary │ 1-bit XNOR-Net (8 weights per byte) │ └─────┴───────────────┴──────────────────────────────────────────────┘ ``` ### Storage Size per Weight ``` 
┌────────────────────────────────────────────────────────┐ │ DType Bits/weight Bytes/1024 weights │ ├────────────────────────────────────────────────────────┤ │ Float64 64 8192 │ │ Float32 32 4096 │ │ Float16 16 2048 │ │ BFloat16 16 2048 │ │ FP8E4M3 8 1024 │ │ FP8E5M2 8 1024 │ │ Int8/Uint8 8 1024 │ │ Int4/Uint4 4 512 (2 per byte) │ │ FP4 4 512 (2 per byte) │ │ Int2/Uint2 2 256 (4 per byte) │ │ Ternary 2 256 (4 per byte) │ │ Binary 1 128 (8 per byte) ← 96.9% │ │ compression vs FP32 │ └────────────────────────────────────────────────────────┘ ``` ### Parsing DTypes from Strings `ParseDType(s string) DType` accepts aliases: | Input strings | Result | |:-------------|:-------| | `"float32"`, `"fp32"`, `"f32"` | `DTypeFloat32` | | `"bfloat16"`, `"bf16"` | `DTypeBFloat16` | | `"fp8e4m3"`, `"fp8"` | `DTypeFP8E4M3` | | `"int4"` | `DTypeInt4` | | `"fp4"`, `"f4"` | `DTypeFP4` | | `"ternary"` | `DTypeTernary` | | `"binary"` | `DTypeBinary` | --- ## The `Numeric` Constraint ```go type Numeric interface { ~int | ~int8 | ~int16 | ~int32 | ~int64 | ~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 | ~float32 | ~float64 } ``` This constraint makes `Tensor[T]`, `DispatchLayer[T]`, `ForwardPolymorphic[T]`, and all other generic functions work across any of Go's numeric primitives. The constraint is deliberately limited to types the compiler can generate native arithmetic for: no reflection and no `interface{}` boxing on the hot path. > [!NOTE] > FP4, FP8, BFloat16, and other non-native types are **simulated** via post-training quantization (PTQ). Weights are stored as `float32` masters and quantized to the target dtype at GPU upload time via `MorphToFloat32ForGPU` (quantize → dequantize round-trip). On GPU, Dense/SwiGLU/MHA use native packed payloads in WGSL shaders; other layer types receive the pre-simulated float32 buffer.
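To show what the `Numeric` constraint buys in practice, here is a small generic function that compiles to native arithmetic for each element type. The constraint is copied from the definition above; the `dot` function is a hypothetical example written for this note and is not claimed to exist in `poly`.

```go
package main

import "fmt"

// Numeric is copied from the constraint shown in this document.
type Numeric interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 |
		~float32 | ~float64
}

// dot computes the dot product of a and b over their common length.
// The accumulator has the same type T as the inputs, so narrow integer
// types can wrap around on overflow, as usual in Go.
func dot[T Numeric](a, b []T) T {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	var sum T
	for i := 0; i < n; i++ {
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	fmt.Println(dot([]int8{1, 2, 3}, []int8{4, 5, 6}))      // int8 instantiation
	fmt.Println(dot([]float32{0.5, 2}, []float32{4, 0.25})) // float32 instantiation
}
```

The same source serves every element type without boxing or reflection; the accumulator type is the main thing to watch when instantiating with small integers.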
--- ## The WeightStore ```go type WeightStore struct { Master []float32 // Source of truth — always FP32 Versions map[DType]any // Cached conversions (e.g., []int8 for INT8) GPUWeights map[DType]any // VRAM-resident wgpu.Buffer references GPUScales map[DType]*wgpu.Buffer // Per-block scale buffers for quantized types Scale float32 // Global quantization scale factor } ``` The `Master` slice is allocated with `AlignedFloat32(n)` which aligns to 64-byte boundaries (one CPU cache line), enabling AVX-width SIMD operations. ### Creating and Initializing ```go ws := NewWeightStore(inputSize * outputSize) ws.Scale = 1.0 ws.Randomize(seed, 0.1) // fills Master with uniform [-0.1, 0.1] ``` After `Randomize`, all `Versions` and `GPUWeights` maps are cleared, ensuring no stale low-bit versions survive. ### The Morphic Version System ``` WeightStore.Morph(dtype DType): Master (FP32) │ ▼ DTypeFloat64 ──▶ []float64 (direct cast) DTypeBFloat16 ──▶ []float32 (bits masked to 16-bit BF16) DTypeInt8 ──▶ []int8 (quantized: int8(v / Scale)) DTypeInt4 ──▶ []int8 (quantized, stored 1-per-int8) DTypeBinary ──▶ []int8 (sign bit only: +1 or -1) ``` The BFloat16 path uses a bit-masking trick: ```go u32 := math.Float32bits(wVal) u32 &= 0xFFFF0000 // zero the lower 16 mantissa bits return math.Float32frombits(u32) ``` This preserves the exponent and upper mantissa exactly as BFloat16 would. ### Metamorphosis: Switching Precision On the Fly A layer starts life as FP32. Before inference you can call: ```go layer.WeightStore.Morph(DTypeInt8) layer.DType = DTypeInt8 ``` Now `DenseForwardPolymorphic` will find the `[]int8` version in `Versions[DTypeInt8]` and use the native INT8 fast-path loop. The FP32 master is untouched. 
After training (`ApplyGradients`), the master is updated and **all cached versions are automatically purged**: ```go func (ws *WeightStore) ApplyGradients(gradWeights *Tensor[float32], lr float32) { for i := 0; i < limit; i++ { ws.Master[i] -= lr * gradWeights.Data[i] } // Stale — force re-quantize on next forward: ws.Versions = make(map[DType]any) ws.GPUWeights = make(map[DType]any) } ``` This guarantees the layer never silently uses outdated quantized weights. ``` ┌──────────────────────────────────────────────────────────────┐ │ Metamorphosis Lifecycle │ ├──────────────────────────────────────────────────────────────┤ │ │ │ NewWeightStore(n) │ │ │ │ │ ▼ │ │ Randomize(seed, scale) ──▶ Master filled, Versions={} │ │ │ │ │ ▼ │ │ layer.DType = DTypeInt8 │ │ │ │ │ ▼ │ │ Forward() ──▶ Morph(DTypeInt8) if Versions[INT8]==nil │ │ │ │ │ │ │ Versions[DTypeInt8] = []int8{...} │ │ │ │ │ ▼ │ │ INT8 fast-path arithmetic executes │ │ │ │ │ ▼ │ │ ApplyGradients(gW, lr) ──▶ Master updated │ │ ──▶ Versions = {} (cleared) │ │ │ │ Next Forward() ──▶ Morph(DTypeInt8) again from new Master │ │ │ └──────────────────────────────────────────────────────────────┘ ``` ### Unpacking for Deserialization When loading a model saved in a low-bit format: ```go ws.Versions[dtype] = decoded // e.g., []int8 from bit-packed JSON ws.Unpack(dtype) // reconstructs Master: Master[i] = packed[i] * Scale ``` This ensures the FP32 master is always available for gradient-based fine-tuning, even on a model that was serialized in INT4. --- ## MorphToFloat32ForGPU This is the PTQ simulation path used when uploading weights to the GPU for layers without a dedicated packed shader (CNN1-3, RNN, LSTM, Embedding): ```go func (ws *WeightStore) MorphToFloat32ForGPU(dtype DType) []float32 ``` It calls `ws.Morph(dtype)` to produce the quantized version, then dequantizes back to float32 by multiplying by `ws.Scale`. 
The GPU shader sees float32 weights that already reflect quantization rounding loss, so no new shader is needed. | DType | Round-trip behaviour | |:------|:---------------------| | Float32, Float64 | Master returned as-is (no loss) | | BFloat16 | Upper 16 bits of each float32 word preserved (sign, exponent, top 7 mantissa bits); lower 16 mantissa bits zeroed | | FP8, Int8, Uint8 | `round(w/scale) * scale` | | Int4, Uint4, FP4 | `trunc(w/scale) * scale`, range ±7 | | Int2, Uint2 | 4-level round-trip | | Ternary | Threshold snap to `{-scale, 0, +scale}` | | Binary | Sign only: `±scale` | The `scale` comes from `WeightStore.Scale`, set during `Morph` from the max absolute value of the master weights. --- ## The Q4_0 Block Format (GPU Quantization) For GPU inference, the engine uses the Q4_0 block format for compatibility with llama.cpp: ``` Q4_0Block: ┌────────────────────────────────────────────────────────┐ │ Scale: float32 (4 bytes) │ │ Weights: [16]byte (32 nibbles = 32 × 4-bit weights) │ │ │ │ Total: 20 bytes for 32 weights = 0.625 bytes/weight │ └────────────────────────────────────────────────────────┘ ``` `QuantizeQ4_0(weights []float32) []Q4_0Block` finds the max absolute value in each block of 32, sets `scale = maxAbs / 7.0`, then quantizes each weight to a signed 4-bit integer (`-8` to `7`) packed two-per-byte. On the GPU, the WGSL shader receives the packed uint32 array plus the float32 scales array, and dequantizes on the fly inside the shader without a CPU roundtrip. --- ## CastWeights `CastWeights[T Numeric](weights any) []T` is the universal extraction helper. It type-switches on all 10 concrete slice types and uses `ConvertSlice[In, Out]` to re-cast the values into the requested type `T`. When `DispatchLayer` cannot find a dedicated fast-path for the layer's DType, it falls through to `CastWeights` on the pre-quantized `Versions` data.
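The type-switch-plus-generic-conversion pattern described for `CastWeights` can be sketched as follows. The lowercase `castWeights` and `convertSlice` are hypothetical re-creations written for this example: they handle only four slice types and return `nil` for anything else, and the real helpers may differ in coverage and error handling.

```go
package main

import "fmt"

// Numeric is copied from the constraint defined earlier in this document.
type Numeric interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 |
		~float32 | ~float64
}

// convertSlice converts each element with an ordinary Go numeric conversion.
// Float-to-integer conversions therefore truncate toward zero.
func convertSlice[In, Out Numeric](in []In) []Out {
	out := make([]Out, len(in))
	for i, v := range in {
		out[i] = Out(v)
	}
	return out
}

// castWeights type-switches on the dynamic slice type stored behind `any`
// and converts it to []T. Only a few cases are shown in this sketch.
func castWeights[T Numeric](weights any) []T {
	switch w := weights.(type) {
	case []float32:
		return convertSlice[float32, T](w)
	case []float64:
		return convertSlice[float64, T](w)
	case []int8:
		return convertSlice[int8, T](w)
	case []int32:
		return convertSlice[int32, T](w)
	default:
		return nil // unsupported storage type in this sketch
	}
}

func main() {
	var stored any = []int8{-1, 0, 1} // e.g. a cached low-bit version
	fmt.Println(castWeights[float32](stored))
	fmt.Println(castWeights[int32]([]float64{1.9, -1.9}))
}
```

Keeping the type switch in one helper means the generic forward functions only ever see a concrete `[]T`, at the cost of one allocation per conversion.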
--- ## Bit-Packed Serialization Ratios From the README, verified across 378 model permutations: | DType | Bytes/weight (serialized) | vs FP32 | |:------|:--------------------------|:--------| | Float32 | 4 | 1.0x | | Float16 | 2 | 0.5x | | Int8 | 1 | 0.25x | | Int4/FP4 | 0.5 | 0.125x | | Int2/Ternary | 0.25 | 0.0625x | | Binary | 0.125 | 0.0313x ← **96.9% reduction** | The packing/unpacking logic lives in `encodeNativeWeights` and `decodeNativeWeights` in `persistence.go`. Binary packs 8 weights per byte using bit shifts; Ternary packs 4 per byte using 2-bit fields; FP4 packs 2 per byte using nibbles. --- ## Layer Reference Source: https://openfluke.com/docs/layers Markdown: https://openfluke.com/docs/layers.md # Layer Reference This document describes every `LayerType` in `poly/`. For each layer: what it computes, which fields of `VolumetricLayer` configure it, weight layout in the `WeightStore`, and an ASCII data-flow diagram. --- ## LayerType Constants ```go const ( LayerDense LayerType = 0 LayerMultiHeadAttention LayerType = 1 LayerSwiGLU LayerType = 2 LayerRMSNorm LayerType = 3 LayerCNN1 LayerType = 4 LayerCNN2 LayerType = 5 LayerCNN3 LayerType = 6 LayerRNN LayerType = 7 LayerLSTM LayerType = 8 LayerLayerNorm LayerType = 9 LayerConvTransposed1D LayerType = 10 LayerConvTransposed2D LayerType = 11 LayerConvTransposed3D LayerType = 12 LayerEmbedding LayerType = 13 LayerKMeans LayerType = 14 LayerSoftmax LayerType = 15 LayerParallel LayerType = 16 LayerSequential LayerType = 17 LayerResidual LayerType = 18 ) ``` > [!NOTE] > There is no explicit `LayerGRU` constant; GRU is implemented in `rnn.go` as a variant of the RNN pattern referenced through the same dispatcher slot. --- ## Dense (LayerDense = 0) **What it does:** Fully-connected linear transformation: `output = input × W^T + b`, followed by an activation function. Every input connects to every output.
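As a rough illustration of that transformation, here is a minimal Go sketch of a dense forward pass. It assumes a flat row-major weight slice of shape `[out × in]`, an optional bias, and ReLU as the activation; `denseForward` is a hypothetical helper for this example, not one of the engine's functions.

```go
package main

import "fmt"

// denseForward computes preAct[b,o] = sum_i input[b,i]*w[o*in+i] (+ bias[o])
// and postAct = ReLU(preAct). bias may be nil, in which case it is skipped.
func denseForward(input, w, bias []float32, batch, in, out int) (pre, post []float32) {
	pre = make([]float32, batch*out)
	post = make([]float32, batch*out)
	for b := 0; b < batch; b++ {
		for o := 0; o < out; o++ {
			var sum float32
			for i := 0; i < in; i++ {
				sum += input[b*in+i] * w[o*in+i]
			}
			if bias != nil {
				sum += bias[o]
			}
			pre[b*out+o] = sum
			if sum > 0 { // ReLU: negative pre-activations stay at zero
				post[b*out+o] = sum
			}
		}
	}
	return pre, post
}

func main() {
	input := []float32{1, 2}     // one sample, two features
	w := []float32{1, 1, -1, -1} // two output rows of two weights each
	pre, post := denseForward(input, w, nil, 1, 2, 2)
	fmt.Println(pre, post)
}
```

Returning both the pre-activation and the activated output keeps what a backward pass would need to evaluate the activation derivative.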
**Key fields:** | Field | Meaning | |:------|:--------| | `InputHeight` | Number of input features | | `OutputHeight` | Number of output features | | `Activation` | One of ReLU, SiLU, GELU, Tanh, Sigmoid, Linear | | `DType` | Active numerical type | | `UseTiling` | Enables tiled fast paths where implemented (CPU block tiling, sequential propagation to sub-layers, etc.) | | `TileSize` | Legacy scalar fallback when per-dtype maps are empty; prefer **`CPUTileSizes`** on CPU and **`GPUSCTileSizes` / `GPUMCTileSizes`** on GPU after `RefreshRuntimeTileSizes()` | | `EnableMultiCoreTiling` | **GPU:** aligned with `VolumetricNetwork.EnableMultiCoreTiling`; transformer forwards use the network flag to choose **`GetGPUMCTileSize`** vs **`GetGPUSCTileSize`**. **CPU:** often set `true` with training loaders for parity; **does not** switch between two CPU tile maps (only `CPUTileSizes` exists) | **Weight layout:** `WeightStore.Master` is a flat `[OutputHeight × InputHeight]` row-major matrix. No bias is stored in the Master by default (the polymorphic engine absorbs bias via zero-biased initialization). ``` Input [batch, inputSize] │ ▼ ┌─────────────────────────────────────────────┐ │ preAct[b, o] = Σᵢ input[b, i] × W[o, i] │ │ │ │ W shape: [OutputHeight, InputHeight] │ └─────────────────────────────────────────────┘ │ ▼ Activation(preAct) │ ▼ Output [batch, outputSize] ``` The tiled variant (`DenseForwardTiled`) loads input tiles into a local buffer and unrolls the dot product 4× to help the compiler auto-vectorize. The INT8 and Binary tiled paths each have their own hot loops in `denseForwardTiledInt8` and `denseForwardTiledBinary`. --- ## CNN1 / CNN2 / CNN3 (LayerCNN1–3 = 4–6) **What they do:** Convolutional layers across 1D sequences, 2D images, and 3D volumes respectively. A learnable kernel is slid across the spatial dimensions and a dot product is computed at each position. 
**Key fields:** | Field | Meaning | |:------|:--------| | `InputChannels` | Channels in the input | | `Filters` | Number of output channels (kernels) | | `KernelSize` | Spatial size (k for CNN1, k×k for CNN2, k×k×k for CNN3) | | `Stride` | Step between kernel positions | | `Padding` | Zero-padding added on each side | | `InputHeight` / `InputWidth` / `InputDepth` | Input spatial dimensions | | `OutputHeight` / `OutputWidth` / `OutputDepth` | Output spatial dimensions | **Weight layout:** `Filters × InputChannels × KernelSize^N` ``` CNN2 Data Flow: Input [batch, inChannels, H, W] │ ▼ slide kernel [f, c, kH, kW] over H, W ┌─────────────────────────────────────────────────────────────┐ │ for each filter f: │ │ for each (oh, ow): │ │ out[b,f,oh,ow] = Σ_c Σ_kh Σ_kw in[b,c,oh+kh,ow+kw] │ │ × W[f,c,kh,kw] │ └─────────────────────────────────────────────────────────────┘ │ ▼ Activation Output [batch, Filters, outH, outW] ``` Output size formula (same for each spatial dimension): ``` outDim = (inDim + 2*Padding - KernelSize) / Stride + 1 ``` > [!TIP] > CNN3 on GPU achieves over 7600x speedup versus CPU tiling because the 3D spatial loop maps perfectly to 3D WebGPU workgroups. Always prefer GPU for CNN3. --- ## ConvTransposed1D / 2D / 3D (LayerConvTransposed1D–3D = 10–12) **What they do:** Transposed convolution (also called "deconvolution"). It inverts the spatial compression of a regular convolution — used in decoder networks and generative models to upsample feature maps. **Key fields:** Same as CNN variants plus `OutputPadding` for controlling output dimensions. **Weight layout:** `InputChannels × Filters × KernelSize^N` ``` ConvTransposed2D conceptual reverse: CNN2: [H, W] ──kernel──▶ [H', W'] (downsample) ConvT: [H', W'] ──kernel──▶ [H, W] (upsample) Internal mechanism: insert (Stride-1) zeros between input elements, then apply regular convolution with kernel flipped. ``` --- ## RNN (LayerRNN = 7) **What it does:** Vanilla recurrent network. 
Processes a sequence step-by-step, feeding the hidden state forward through time. ``` h_t = tanh(x_t × W_ih^T + h_{t-1} × W_hh^T + b_h) ``` **Key fields:** | Field | Meaning | |:------|:--------| | `InputHeight` | Input feature size | | `OutputHeight` | Hidden state size | | `SeqLength` | Number of time steps | **Weight layout in Master:** ``` [ W_ih | W_hh | b_h ] ihSize hhSize hSize ``` Where `ihSize = hiddenSize × inputSize`, `hhSize = hiddenSize × hiddenSize`, `hSize = hiddenSize`. ``` Step 0: Step 1: Step t: x₀ h₋₁=0 x₁ h₀ xₜ h_{t-1} │ │ │ │ │ │ └──┬───┘ └──┬──┘ └──┬───┘ ▼ ▼ ▼ [RNN cell] [RNN cell] [RNN cell] │ │ │ ▼ ▼ ▼ h₀ h₁ hₜ ``` --- ## LSTM (LayerLSTM = 8) **What it does:** Long Short-Term Memory. Adds a cell state `c_t` and three gating mechanisms (forget, input, output) to control information flow through time. Solves the vanishing gradient problem for long sequences. **Gate equations:** ``` i_t = σ(x_t × W_i^T + h_{t-1} × U_i^T + b_i) ← input gate f_t = σ(x_t × W_f^T + h_{t-1} × U_f^T + b_f) ← forget gate g_t = tanh(x_t × W_g^T + h_{t-1} × U_g^T + b_g) ← cell gate o_t = σ(x_t × W_o^T + h_{t-1} × U_o^T + b_o) ← output gate c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t h_t = o_t ⊙ tanh(c_t) ``` **Weight layout:** Four gate blocks concatenated: ``` [ W_i | U_i | b_i | W_f | U_f | b_f | W_g | U_g | b_g | W_o | U_o | b_o ] ←── gate i ──────────▶ ←── gate f ──────────▶ ... gateWeightCount = ihSize + hhSize + hiddenSize Total = 4 × gateWeightCount ``` ``` ┌─────────────────────────────────────┐ c_{t-1} ──────▶│ │──▶ c_t │ Forget × + Input × Cell │ h_{t-1} ──────▶│ │──▶ h_t │ Output gate × tanh(c_t) │ x_t ──────▶│ │ └─────────────────────────────────────┘ ``` --- ## GRU GRU (Gated Recurrent Unit) is implemented in `rnn.go` alongside the vanilla RNN. It uses two gates (reset and update) and eliminates the separate cell state. 
``` z_t = σ(x_t × W_z + h_{t-1} × U_z + b_z) ← update gate r_t = σ(x_t × W_r + h_{t-1} × U_r + b_r) ← reset gate n_t = tanh(x_t × W_n + (r_t ⊙ h_{t-1}) × U_n + b_n) h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ n_t ``` --- ## MultiHeadAttention (LayerMultiHeadAttention = 1) **What it does:** Standard multi-head scaled dot-product attention with optional RoPE positional encoding, Grouped Query Attention (GQA), and a KV cache for autoregressive decoding. **Key fields:** | Field | Meaning | |:------|:--------| | `DModel` | Model dimension (total embedding size) | | `NumHeads` | Number of query heads | | `NumKVHeads` | Number of key/value heads (< NumHeads for GQA/MQA) | | `HeadDim` | Dimension per head (usually DModel / NumHeads) | | `SeqLength` | Current sequence length | | `RoPEFreqBase` | RoPE frequency base (default 10000.0) | | `MaxSeqLen` | KV cache capacity | | `KVCacheK` / `KVCacheV` | CPU-side KV cache tensors | | `KVOffset` | Current filled position in the KV cache | **Weight layout:** ``` Master = [ Q_W | K_W | V_W | O_W | Q_b | K_b | V_b | O_b ] Q_W: [DModel × DModel] K_W: [DModel × kvDim] (kvDim = NumKVHeads × HeadDim) V_W: [DModel × kvDim] O_W: [DModel × DModel] biases follow ``` **Attention computation:** ``` Q = input × Q_W^T + Q_b [seqLen, DModel] K = input × K_W^T + K_b [seqLen, kvDim] V = input × V_W^T + V_b [seqLen, kvDim] Apply RoPE to Q, K (rotate pairs by position-dependent angle) For each head h: q_h = Q[:, h*headDim:(h+1)*headDim] [seqLen, headDim] k_h = K[:, kv_head_idx*headDim:...] [seqLen, headDim] v_h = V[:, kv_head_idx*headDim:...] scores = q_h × k_h^T / sqrt(headDim) [seqLen, seqLen] weights = softmax(scores, causal_mask) out_h = weights × v_h [seqLen, headDim] output = concat(out_0..out_{numHeads-1}) × O_W^T ``` --- ## SwiGLU (LayerSwiGLU = 2) **What it does:** Gated feedforward block used in modern LLMs. Two parallel linear projections, one acting as a gate through SiLU activation, combined element-wise before a down projection. 
``` gate = SiLU(x × W_gate^T + b_gate) up = x × W_up^T + b_up hidden = gate ⊙ up output = hidden × W_down^T + b_down ``` **Key fields:** `InputHeight` (in), `OutputHeight` (intermediate/hidden size). The actual output to the next layer is back to `InputHeight` via the down projection. **Weight layout:** ``` Master = [ W_gate | W_up | W_down | b_gate | b_up | b_down ] in×int in×int int×in int int in ``` Where `int = OutputHeight` (intermediate size). ``` Input [seqLen, in] │ ├──────────────────────────────────┐ │ │ ▼ ▼ W_gate (in → int) W_up (in → int) │ │ SiLU │ │ │ └──────────── ⊙ (element multiply) ┘ │ ▼ W_down (int → in) │ ▼ Output [seqLen, in] ``` --- ## RMSNorm (LayerRMSNorm = 3) **What it does:** Root Mean Square normalization. Divides each element by the RMS of the vector, then scales by a learned gamma parameter. ``` rms = sqrt( mean(x²) + ε ) output = (x / rms) × γ ``` **Key fields:** `InputHeight` (size), `DType`. **Always kept in FP32 on GPU** — the `SyncToGPU` code explicitly refuses to quantize RMSNorm weights. **Weight layout:** `Master` is a flat `[InputHeight]` gamma vector (no beta/bias term). --- ## LayerNorm (LayerLayerNorm = 9) **What it does:** Layer normalization. Computes mean and variance across the feature dimension, normalizes, then applies learnable gamma and beta. ``` μ = mean(x), σ² = var(x) x_hat = (x - μ) / sqrt(σ² + ε) output = γ ⊙ x_hat + β ``` **Weight layout:** `Master` is `[2 × InputHeight]`: first half is gamma, second half is beta. --- ## Embedding (LayerEmbedding = 13) **What it does:** Token lookup table. Given a vector of integer token IDs, returns the corresponding rows from the embedding matrix. **Key fields:** `VocabSize`, `EmbeddingDim`. **Weight layout:** `[VocabSize × EmbeddingDim]` row-major matrix. ``` Token IDs: [42, 7, 115] │ ▼ lookup rows 42, 7, 115 ┌──────────────────────────────────────────────────┐ │ Embedding Table [VocabSize × EmbeddingDim] │ │ │ │ Row 7: [0.12, -0.33, 0.87, ...] 
│ │ Row 42: [0.55, 0.11, -0.22, ...] │ │ Row 115: [-0.01, 0.77, 0.44, ...] │ └──────────────────────────────────────────────────┘ │ ▼ Output [3, EmbeddingDim] (gradient only applied to used rows) ``` --- ## KMeans (LayerKMeans = 14) **What it does:** Differentiable clustering. Computes soft assignment probabilities (or raw feature distances) between the input and a set of learnable cluster centroids. **Key fields:** | Field | Meaning | |:------|:--------| | `NumClusters` | K — number of cluster centers | | `InputHeight` | Feature vector size | | `KMeansTemperature` | Controls sharpness of soft assignment | | `KMeansOutputMode` | `"probabilities"` or `"features"` | **Weight layout:** `[NumClusters × InputHeight]` centroid matrix. ``` Input [batch, featureDim] │ ▼ compute squared distance to each centroid dist[b, k] = ||input[b] - centroid[k]||² │ ▼ temperature-scaled negative softmax p[b, k] = softmax(-dist / temperature) │ ▼ Output [batch, NumClusters] (if mode="probabilities") or [batch, featureDim] (if mode="features") ``` --- ## Softmax (LayerSoftmax = 15) **What it does:** Normalizes a vector (or matrix rows) into a probability distribution. Has 10 variants controlled by `SoftmaxType`. See [softmax.md](./softmax.md) for the full variant reference. **Key fields:** `SoftmaxType`, `Temperature`, `SoftmaxRows`, `SoftmaxCols`, `HierarchyLevels`, `EntmaxAlpha`, `Mask`, `GumbelNoise`. No weights — `WeightStore` is nil for Softmax layers. --- ## Parallel (LayerParallel = 16) **What it does:** Fans the input to N sub-layers simultaneously and combines their outputs. Supports five combination modes. 
**Key fields:** | Field | Meaning | |:------|:--------| | `ParallelBranches` | `[]VolumetricLayer` — the sub-layer definitions | | `CombineMode` | `"add"`, `"avg"`, `"concat"`, `"filter"`, `"grid_scatter"` | | `FilterGateConfig` | Optional gate network for MoE routing (filter mode) | ``` Input │ ┌──────────┼──────────┐ ▼ ▼ ▼ Branch 0 Branch 1 Branch 2 │ │ │ └──────────┼──────────┘ │ CombineMode: ┌─────────────────────────────────────────┐ │ "add" element-wise sum │ │ "avg" element-wise average │ │ "concat" [b0, b1, b2] concatenated │ │ "filter" gate × b0 + gate × b1 ... │ │ "grid_scatter" same as concat │ └─────────────────────────────────────────┘ │ Output ``` The `preAct` tensor returned by `ParallelForwardPolymorphic` stores the branch preActs in `preAct.Nested`, enabling correct recursive backpropagation. See [parallel_sequential.md](./parallel_sequential.md). --- ## Sequential (LayerSequential = 17) **What it does:** Chains N sub-layers in series. Each sub-layer receives the output of the previous one. The sub-layers can be of any type — this enables composing mini-architectures inside a single grid cell. **Key fields:** `SequentialLayers []VolumetricLayer` ``` Input │ ▼ Sub-layer 0 ──▶ Sub-layer 1 ──▶ Sub-layer 2 │ Output ``` Each step container stores `[bPre, bInput, bSkip]` in the nested tensor for accurate backward computation through skip connections within the sequence. --- ## Residual (LayerResidual = 18) **What it does:** Skip connection — adds the input directly to the output of its sub-network. ``` Input │ ┌────┴────┐ │ │ skip ▼ │ Sub-layers │ │ │ ▼ │ ┌───┐ │ │ + │◀──────┘ └─┬─┘ │ Output = SubLayers(Input) + Input ``` The skip tensor is passed as the second argument to `DispatchLayer` and is added inside `ResidualForwardPolymorphic`. Gradients flow back both through the sub-layers and directly through the skip branch. 
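The residual rule `Output = SubLayers(Input) + Input`, and the two gradient paths it implies, can be sketched as follows. This is a simplified illustration: the sub-network is modeled as a plain Go function, and `residualForward` / `residualBackward` are hypothetical names rather than the engine's `ResidualForwardPolymorphic`.

```go
package main

import "fmt"

// residualForward returns sub(input) + input, element-wise.
// The sub-network must preserve the input length.
func residualForward(input []float32, sub func([]float32) []float32) []float32 {
	out := sub(input)
	res := make([]float32, len(input))
	for i := range input {
		res[i] = out[i] + input[i]
	}
	return res
}

// residualBackward adds the gradient that flows through the sub-network
// to the gradient that flows straight through the skip branch.
func residualBackward(gradOut []float32, subBackward func([]float32) []float32) []float32 {
	gSub := subBackward(gradOut)
	gIn := make([]float32, len(gradOut))
	for i := range gradOut {
		gIn[i] = gSub[i] + gradOut[i]
	}
	return gIn
}

func main() {
	// A toy linear sub-network y = 2x; its backward pass is also a doubling.
	double := func(x []float32) []float32 {
		y := make([]float32, len(x))
		for i, v := range x {
			y[i] = 2 * v
		}
		return y
	}
	fmt.Println(residualForward([]float32{1, 2}, double))
	fmt.Println(residualBackward([]float32{1, 1}, double))
}
```

Because the skip path contributes the incoming gradient unchanged, the input gradient never vanishes entirely even when the sub-network's gradient is small.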
--- ## Activation Functions All layers that produce a `preAct` / `postAct` pair apply an activation via `Activate[T](v T, act ActivationType)`: | Constant | Formula | |:---------|:--------| | `ActivationReLU` (0) | `max(0, x)` | | `ActivationSilu` (1) | `x × σ(x)` | | `ActivationGELU` (2) | `0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))` | | `ActivationTanh` (3) | `tanh(x)` | | `ActivationSigmoid` (4) | `1/(1+e^−x)` | | `ActivationLinear` (-1) | `x` (identity — no nonlinearity) | `ActivateDerivative[T]` returns the analytic derivative for backpropagation. --- ## Layer Summary Table | Layer | Parameters | GPU Forward | GPU Backward | |:------|:-----------|:-----------|:------------| | Dense | in×out | EXACT | EXACT | | CNN1 | f×c×k | EXACT | EXACT | | CNN2 | f×c×k² | EXACT | EXACT | | CNN3 | f×c×k³ | EXACT | EXACT | | RNN | ih+hh+b | EXACT | — | | LSTM | 4×(ih+hh+b) | EXACT | — | | MHA | 4×d² + biases | BROKEN (dets) | pending | | SwiGLU | 3×in×int | BROKEN (dets) | not wired | | RMSNorm | hidden | EXACT | EXACT | | LayerNorm | 2×hidden | — | — | | Embedding | vocab×dim | EXACT (DW) | — | | KMeans | k×dim | — | — | | Softmax | none | — | — | | Parallel | per-branch | — | — | | Sequential | per-layer | — | — | | Residual | per-sub | — | — | --- ## The Dispatcher Pattern and 3D Coordinate System Source: https://openfluke.com/docs/dispatch Markdown: https://openfluke.com/docs/dispatch.md # The Dispatcher Pattern and 3D Coordinate System This document explains how `DispatchLayer` and `DispatchLayerBackward` work as runtime jump tables, how the 3D coordinate system maps to `VolumetricLayer` positions, and how `IsRemoteLink` enables spatial hopping across the grid. --- ## Why a Dispatcher? 
A naive implementation of a polymorphic neural network would embed a large `switch` inside the forward loop: ```go // Naive — thread-divergence on GPU, hard to fuse for _, layer := range layers { switch layer.Type { case LayerDense: output = denseForward(layer, input) case LayerCNN2: output = cnn2Forward(layer, input) // ... } } ``` M-POLY-VTD separates concerns: the **traversal loop** iterates coordinates, and the **dispatcher** makes the type-specific call. This decoupling is what makes GPU kernel fusion possible in the future — the driver can inspect a group of same-type layers and launch a single batched shader rather than 19 separate ones. --- ## DispatchLayer ```go func DispatchLayer[T Numeric]( layer *VolumetricLayer, input, skip *Tensor[T], ) (preAct, postAct *Tensor[T]) ``` This is a generic function. The type parameter `T` is inferred from `input`. Every call returns two tensors: - `preAct` — the layer's internal state before the final activation. For Parallel/Sequential layers this carries the nested activation tree in `preAct.Nested`. - `postAct` — the result of applying the activation function to `preAct`. This is what flows to the next layer. 
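To make the separation between traversal and dispatch concrete, here is a toy Go sketch. All names (`miniLayer`, `layerKind`, `dispatch`, `forward`) are invented for the example and the two layer kinds are deliberately trivial; only the shape of the pattern, a single switch returning a `preAct`/`postAct` pair, follows the description above.

```go
package main

import "fmt"

type layerKind int

const (
	kindScale layerKind = iota // multiplies every element by Weight
	kindShift                  // adds Weight to every element
)

type miniLayer struct {
	Kind   layerKind
	Weight float32
}

// dispatch is the only function that inspects the layer kind.
// It returns the value before and after a ReLU activation.
func dispatch(l *miniLayer, input []float32) (preAct, postAct []float32) {
	preAct = make([]float32, len(input))
	switch l.Kind {
	case kindShift:
		for i, v := range input {
			preAct[i] = v + l.Weight
		}
	default: // kindScale, and the fallback for any unknown kind
		for i, v := range input {
			preAct[i] = v * l.Weight
		}
	}
	postAct = make([]float32, len(preAct))
	for i, v := range preAct {
		if v > 0 {
			postAct[i] = v
		}
	}
	return preAct, postAct
}

// forward is the traversal loop; it knows nothing about layer kinds.
func forward(layers []miniLayer, input []float32) []float32 {
	cur := input
	for i := range layers {
		_, post := dispatch(&layers[i], cur)
		cur = post
	}
	return cur
}

func main() {
	layers := []miniLayer{{Kind: kindScale, Weight: 2}, {Kind: kindShift, Weight: -5}}
	fmt.Println(forward(layers, []float32{1, 4}))
}
```

Adding a new layer kind touches only `dispatch`; the traversal loop stays unchanged.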
The full routing table: ``` layer.Type ──switch──▶ function called ─────────────────────────────────────────────────────────────── LayerResidual ResidualForwardPolymorphic(layer, input, skip) LayerDense DenseForwardPolymorphic(layer, input) LayerCNN1 CNN1ForwardPolymorphic(layer, input) LayerCNN2 CNN2ForwardPolymorphic(layer, input) LayerCNN3 CNN3ForwardPolymorphic(layer, input) LayerRNN RNNForwardPolymorphic(layer, input) LayerLSTM LSTMForwardPolymorphic(layer, input) LayerMultiHeadAttention MHAForwardPolymorphic(layer, input) LayerSwiGLU SwiGLUForwardPolymorphic(layer, input) LayerRMSNorm RMSNormForwardPolymorphic(layer, input) LayerLayerNorm LayerNormForwardPolymorphic(layer, input) LayerConvTransposed1D ConvTransposed1DForwardPolymorphic(layer, input) LayerConvTransposed2D ConvTransposed2DForwardPolymorphic(layer, input) LayerConvTransposed3D ConvTransposed3DForwardPolymorphic(layer, input) LayerEmbedding EmbeddingForwardPolymorphic(layer, input) LayerKMeans KMeansForwardPolymorphic(layer, input) LayerSoftmax SoftmaxForwardPolymorphic(layer, input) LayerParallel ParallelForwardPolymorphic(layer, input) LayerSequential SequentialForwardPolymorphic(layer, input) default DenseForwardPolymorphic(layer, input) ─────────────────────────────────────────────────────────────── ``` --- ## DispatchLayerBackward ```go func DispatchLayerBackward[T Numeric]( layer *VolumetricLayer, gradOutput, input, skip, preAct *Tensor[T], ) (gradInput, gradWeights *Tensor[T]) ``` The mirror of `DispatchLayer`. Returns: - `gradInput` — the gradient to pass to the layer that produced `input` (propagates error upstream) - `gradWeights` — the gradient for this layer's own weights (used to update `WeightStore.Master`) The routing table is symmetric to the forward pass. The `skip` argument is used only by `ResidualBackwardPolymorphic`. 
--- ## The 3D Grid Traversal `ForwardPolymorphic[T]` iterates the grid in reading order: ```go for z := 0; z < n.Depth; z++ { for y := 0; y < n.Rows; y++ { for x := 0; x < n.Cols; x++ { for l := 0; l < n.LayersPerCell; l++ { idx := n.GetIndex(z, y, x, l) layer := &n.Layers[idx] // ... _, post := DispatchLayer(layer, currentTensor, nil) currentTensor = post } } } } ``` The flattened index formula: ``` idx = z * (Rows * Cols * LayersPerCell) + y * (Cols * LayersPerCell) + x * (LayersPerCell) + l ``` Visually, for a (Depth=1, Rows=2, Cols=3, LayersPerCell=1) network: ``` z=0: ┌─────────────┬─────────────┬─────────────┐ │ (0, 0, 0,0) │ (0, 0, 1,0) │ (0, 0, 2,0) │ ← idx 0,1,2 │ idx=0 │ idx=1 │ idx=2 │ ├─────────────┼─────────────┼─────────────┤ │ (0, 1, 0,0) │ (0, 1, 1,0) │ (0, 1, 2,0) │ ← idx 3,4,5 │ idx=3 │ idx=4 │ idx=5 │ └─────────────┴─────────────┴─────────────┘ Data flows: idx=0 ──▶ idx=1 ──▶ idx=2 ──▶ idx=3 ──▶ idx=4 ──▶ idx=5 ``` `BackwardPolymorphic` walks in reverse (z, y, x, l all reversed), using cached `inputs[idx]` and `preActs[idx]` from the forward pass. --- ## Tiled Traversal When `n.UseTiling = true`, `ForwardPolymorphic` uses a blocked spatial traversal with tile size 4: ``` for zTile := 0; zTile < Depth; zTile += 4 { for yTile := 0; yTile < Rows; yTile += 4 { for xTile := 0; xTile < Cols; xTile += 4 { // Process 4×4×4 tile of cells } } } ``` This is the CPU-side analogue of the GPU workgroup tile strategy. The intent is to improve data locality: all layers in a 4×4×4 spatial neighborhood execute together, keeping their weight data warm in L2/L3 cache. ### SC (single-workgroup) vs MC (multi-workgroup) tiling There are **two different “tiling” knobs** in `poly`: 1. **`VolumetricNetwork.UseTiling`** (see [Tiled Traversal](#tiled-traversal) above) — spatial blocking of the **3D grid** in `ForwardPolymorphic` (4×4×4 cells). Unrelated to transformer matmul tiles. 2. 
**Per-layer matmul / GPU workgroup tiling** — `RefreshRuntimeTileSizes()` fills per-dtype maps from layer geometry and (for GPU) `WGPUContext` limits. #### GPU: two tile maps, configurable SC vs MC On **GPU**, each layer gets **`GPUSCTileSizes`** and **`GPUMCTileSizes`** (see `refreshRuntimeGPUTileSizes` in `tile_detection.go`). At dispatch, **`VolumetricNetwork.EnableMultiCoreTiling`** chooses which map to use: `GetGPUMCTileSize(dtype)` when `true` (larger / higher-throughput tiles where limits allow), `GetGPUSCTileSize(dtype)` when `false` (smaller workgroups, friendlier to tight limits). So **MC vs SC on GPU is a real switch** — you are not stuck in one profile; set `EnableMultiCoreTiling` (or use **`TrainingModeGPUSC` / `TrainingModeGPUMC`** in `trainBatchWGPU`, which pick tile sizes the same way). Transformer-style forwards in `wgpu_forward.go` read **`network.EnableMultiCoreTiling`** (not per-layer) for that choice. `WGPUContext.GPUTileSize` is the device-tuned baseline that feeds how those SC/MC maps are built, not the only number used at dispatch. #### CPU: one tile map (not an SC/MC pair on the layer) On **CPU**, each layer has a **single** per-dtype map, **`CPUTileSizes`**, via `GetCPUTileSize` — there is **no** `CPUSCTileSizes` / `CPUMCTileSizes` pair. Tiled matmul-style loops (Dense, SwiGLU, CNN, etc.) all use that one size. `TrainingModeCPUSC` and `TrainingModeCPUMC` exist in the enum (and show up in benchmarks), but **`ConfigureNetworkForMode` applies the same wiring to all CPU modes** (`UseTiling`, `EnableMultiCoreTiling`, `RefreshRuntimeTileSizes`), and **`executeBatchCPU` does not receive the mode** — so there is **no** separate “CPU SC tile path” vs “CPU MC tile path” in the layer maps today. **`EnableMultiCoreTiling` on CPU** is set for consistency with GPU-bound nets and training tooling; it does **not** flip between two CPU tile sizes because only one map exists. 
`WGPUContext.GPUTileSize` is the auto-detected base hint (from limits); concrete SC/MC sizes per layer type on GPU live in the two GPU maps, not in that single int alone. --- ## VolumetricLayer: The Coordinate Record Every `VolumetricLayer` contains its own position: ```go type VolumetricLayer struct { Network *VolumetricNetwork // back-pointer Type LayerType Activation ActivationType DType DType WeightStore *WeightStore Z int // Depth coordinate Y int // Row coordinate X int // Col coordinate L int // Layer index within cell // Spatial Routing IsRemoteLink bool TargetZ, TargetY, TargetX, TargetL int // ... configuration fields } ``` The `(Z, Y, X, L)` fields are set during `NewVolumetricNetwork` and are the canonical address. `GetLayer(z, y, x, l)` returns a pointer into the flat `Layers` slice using `GetIndex`. --- ## IsRemoteLink: Spatial Hopping A layer with `IsRemoteLink = true` does not receive its input from the previous layer in reading order. Instead, it reads from the output of whatever layer lives at `(TargetZ, TargetY, TargetX, TargetL)`. This enables: 1. **Skip connections** — hop over several layers in the grid 2. **Feedback loops** — target a layer at an *earlier* coordinate (biological recurrence) 3. **Parallel expert routing** — multiple layers at different positions all reading the same source 4. **Cross-depth signals** — connect depth=0 outputs to depth=2 inputs ``` Standard flow: Remote link (skip): (0,0,0) → (0,0,1) (0,0,0) ────────────────────┐ │ (0,0,1) → (0,0,2) → ... │ (0,0,2) │ │ (0,2,0) ←── IsRemoteLink ──┘ (0,0,3) └── reads output of (0,0,0) Feedback loop: (0,0,0) │ (0,0,1) │ (0,0,2) ─── IsRemoteLink ──▶ TargetZ=0, TargetY=0, TargetX=0 (reads from cycle N-1's output of layer (0,0,0) — step mesh only) ``` In `ForwardPolymorphic`, a remote-linked layer simply receives `currentTensor` like any other layer; the remote link semantic is only fully honored by `StepForward`, which maintains per-layer output buffers across time steps. 
In `ParallelForwardPolymorphic` and `SequentialForwardPolymorphic`, remote links are resolved by calling `layer.Network.GetLayer(branch.TargetZ, ...)` and dispatching the resolved layer pointer. --- ## The GPU Dispatch Path When `n.UseGPU = true`, the training loop calls `ctx.DispatchForwardLayer(l, batchSize, curBuf, preBuf)` instead of `DispatchLayer`. This function is in `wgpu_forward.go` and routes to the appropriate WGSL compute shader based on `l.Type`. The same dispatcher philosophy applies: one function, one switch, explicit routing. The difference is that inputs and outputs are `*wgpu.Buffer` handles in VRAM rather than `*Tensor[T]` in RAM. ``` trainBatchWGPU: BeginFrame() ← create shared CommandEncoder │ ├── for each layer forward: │ └── ctx.DispatchForwardLayer(l, ...) ← records into encoder │ ├── DispatchMSEGradPartialLoss(...) ← records into encoder │ ├── for each layer backward (reverse): │ ├── ctx.DispatchActivationBackward(...) │ ├── ctx.DispatchBackwardLayer(l, ...) │ └── ctx.DispatchApplyGradients(...) │ FlushFrame() ← ONE submit for entire forward + backward + weight update │ ReadBuffer(partialsBuf) ← only reads back tiny loss scalars ``` This single-submission design reduces Go-to-GPU driver overhead from ~150+ round trips per batch to exactly 1. --- ## Disabled Layers Setting `layer.IsDisabled = true` causes both `ForwardPolymorphic` and `StepForward` to skip the layer entirely. In `StepForward`, a disabled layer passes its input buffer through to `NextBuffer` unchanged. This is the mechanism for implementing sparse MoE expert activation — gate layers can conditionally disable branches. 
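A small Go sketch of the disabled-layer behavior follows. It is an assumption-laden illustration: `layer` is a toy struct with a single scale weight, and `keepTopExpert` is a hypothetical gate helper that leaves only the highest-scoring expert enabled; the engine's own gating logic is not shown in this document.

```go
package main

import "fmt"

type layer struct {
	IsDisabled bool
	Scale      float32
}

// forward skips disabled layers, so their input passes through unchanged.
func forward(layers []layer, input []float32) []float32 {
	cur := input
	for i := range layers {
		l := &layers[i]
		if l.IsDisabled {
			continue
		}
		next := make([]float32, len(cur))
		for j, v := range cur {
			next[j] = v * l.Scale
		}
		cur = next
	}
	return cur
}

// keepTopExpert disables every expert except the one with the highest
// gate score. scores must be non-empty and match experts in length.
func keepTopExpert(experts []layer, scores []float32) {
	best := 0
	for i, s := range scores {
		if s > scores[best] {
			best = i
		}
	}
	for i := range experts {
		experts[i].IsDisabled = i != best
	}
}

func main() {
	experts := []layer{{Scale: 2}, {Scale: 3}, {Scale: 5}}
	keepTopExpert(experts, []float32{0.1, 0.7, 0.2})
	fmt.Println(forward(experts, []float32{1, 2}))
}
```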
--- ## Training: Forward Pass, Backward Pass, Optimizers, and Learning Source: https://openfluke.com/docs/training Markdown: https://openfluke.com/docs/training.md # Training: Forward Pass, Backward Pass, Optimizers, and Learning This document covers the full training pipeline: the forward and backward pass mechanics, loss computation, weight update strategies, gradient clipping, Tween, and the `VGStepBP` adaptive rate. --- ## The Training Loop ```go result, err := poly.Train[float32](network, batches, config) ``` `Train[T Numeric]` is the high-level entry point. It wraps `trainBatchCPU` or `trainBatchWGPU` depending on `config.UseGPU`. ```go type TrainingConfig struct { Epochs int LearningRate float32 LossType string // "mse" or "cross_entropy" GradientClip float32 // 0 = no clipping Verbose bool UseGPU bool DeviceID int TrackPerf bool } ``` A `TrainingBatch[T]` pairs `Input *Tensor[T]` with `Target *Tensor[T]`. Multiple batches are provided as a slice — the loop iterates over batches for each epoch, averages the loss, and prints progress if `Verbose = true`. --- ## Runtime tiling (`ConfigureNetworkForMode`) Before the training loop runs, `Train` wires the network through `ConfigureNetworkForMode` (`training.go`), which aligns tiling flags with the selected `TrainingMode`: - **CPU modes** (`TrainingModeCPUNormal`, `TrainingModeCPUSC`, `TrainingModeCPUMC`): **all three are configured the same way** — `EnableMultiCoreTiling = true`, `RefreshRuntimeTileSizes()`, then `UseTiling` and `EnableMultiCoreTiling` on **every** layer. The CPU forward (`executeBatchCPU`) does **not** branch on `TrainingMode`; poly has **one** CPU tile map per layer (`CPUTileSizes`), not separate SC/MC maps. The SC/MC names in the enum are for labeling and benchmarks, not a second tile-size profile on CPU today. 
- **GPU modes** (`TrainingModeGPUNormal`, `TrainingModeGPUSC`, `TrainingModeGPUMC`): initializes WebGPU if needed, `RefreshRuntimeTileSizes()`, resets the bind-group cache, `SyncToGPU()`, and ensures FP32 master buffers exist for backward. **`trainBatchWGPU`** uses **`TrainingModeGPUSC`** vs **`TrainingModeGPUMC`** to select **`GetGPUSCTileSize`** vs **`GetGPUMCTileSize`** per layer; **`GPUNormal`** uses untiled or generic dispatch per layer type. For **interactive inference** (no explicit training mode), toggling **`VolumetricNetwork.EnableMultiCoreTiling`** chooses GPU SC vs MC tile maps (`wgpu_forward.go`), the same underlying maps training uses. --- ## CPU Training: Step by Step ```go func trainBatchCPU[T Numeric](n *VolumetricNetwork, batch TrainingBatch[T], config *TrainingConfig) float64 ``` ### 1. Forward Pass with History Capture ``` histIn [numLayers]*Tensor[T] ← input to each layer histPre [numLayers]*Tensor[T] ← preAct from each layer curr = batch.Input for each layer idx: histIn[idx] = curr pre, post = DispatchLayer(layer, curr, nil) histPre[idx] = pre curr = post ``` The history arrays are what make backpropagation possible without a tape. Every layer caches what it received and what it produced before activation. ### 2. Loss and Gradient Computation ``` gradOut = ComputeLossGradient(curr, batch.Target, "mse") lossVal = CalculateLoss(curr, batch.Target, "mse") ``` **MSE loss:** ``` L = (1/N) Σᵢ (output[i] - target[i])² gradOut[i] = (2/N) × (output[i] - target[i]) ``` ### 3. Backward Pass ```go _, layerGradients, _ := BackwardPolymorphic(n, gradOut, histIn, histPre) ``` `BackwardPolymorphic` walks the grid in **reverse** order (Z high to low, Y high to low, X high to low, L high to low). 
At each step: ``` gIn, gW = DispatchLayerBackward(layer, currentGrad, histIn[idx], nil, histPre[idx]) currentGrad = gIn ← flows back to previous layer layerGradients[idx] = {gIn, gW} ← stored for weight update ``` The backward pass for Dense computes: ``` gradPre[b,o] = gradOutput[b,o] × activation'(preAct[b,o]) gradWeights[o,i] += input[b,i] × gradPre[b,o] (accumulated over batch) gradInput[b,i] += W[o,i] × gradPre[b,o] ``` ### 4. Weight Update ```go for idx := range n.Layers { l := &n.Layers[idx] if layerGradients[idx][1] != nil { gW := ConvertTensor[T, float32](layerGradients[idx][1]) ApplyRecursiveGradients(l, gW, config.LearningRate) } } ``` `ApplyRecursiveGradients` calls `WeightStore.ApplyGradients(gW, lr)`: ``` Master[i] -= lr × gradWeights[i] ``` After this, all cached `Versions` and `GPUWeights` are cleared, forcing re-quantization on the next forward pass. `ApplyRecursiveGradients` also recurses into `ParallelBranches` and `SequentialLayers`, using the `Nested` structure of the returned `gradWeights` tensor to route updates to the correct sub-layer. --- ## GPU Training: BeginFrame / FlushFrame The GPU training path batches the entire forward + backward + weight-update into **one command buffer**: ``` ctx.BeginFrame() ← create shared CommandEncoder │ ├── forward pass: DispatchForwardLayer per layer ├── loss grad: DispatchMSEGradPartialLoss ├── backward: DispatchActivationBackward + DispatchBackwardLayer per layer └── update: DispatchApplyGradients per layer ctx.FlushFrame() ← ONE submit + destroy temp uniform bufs │ ReadBuffer(partialsBuf) ← only reads back numWG × float32 scalars ``` The loss value is computed from partial sums: `numWG = (totalOutput + 255) / 256` workgroups each sum 256 elements. The Go side only reads back `numWG` floats rather than the full output tensor.
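The partial-sum reduction described above can be emulated on the CPU to see the arithmetic. This sketch assumes each workgroup sums the squared errors of its block of up to 256 elements and that the host divides the total by N; both details are assumptions for illustration, and `partialSquaredErrors` and `mseFromPartials` are hypothetical names.

```go
package main

import "fmt"

const workgroupSize = 256

// partialSquaredErrors emulates what each workgroup would write back:
// one sum of squared errors per block of up to 256 elements.
func partialSquaredErrors(out, target []float32) []float32 {
	n := len(out)
	numWG := (n + workgroupSize - 1) / workgroupSize // ceil(n / 256)
	partials := make([]float32, numWG)
	for wg := 0; wg < numWG; wg++ {
		start := wg * workgroupSize
		end := start + workgroupSize
		if end > n {
			end = n // the last workgroup may cover fewer elements
		}
		var sum float32
		for i := start; i < end; i++ {
			d := out[i] - target[i]
			sum += d * d
		}
		partials[wg] = sum
	}
	return partials
}

// mseFromPartials is the host-side step: add the few partial sums
// that were read back and divide by the element count.
func mseFromPartials(partials []float32, n int) float32 {
	var total float32
	for _, p := range partials {
		total += p
	}
	return total / float32(n)
}

func main() {
	out := make([]float32, 300)
	target := make([]float32, 300)
	for i := range out {
		out[i] = 1 // every element is off by exactly 1
	}
	partials := partialSquaredErrors(out, target)
	fmt.Println(len(partials), mseFromPartials(partials, len(out)))
}
```

For 300 outputs only two floats cross the bus instead of 300, which is the point of the design.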
GPU weight updates are applied directly in VRAM via `DispatchApplyGradients`, which runs a WGSL shader: ```wgsl weights[i] -= lr * gradients[i] ``` This means the CPU master weights become stale after GPU training. A `ReadBuffer` + `Unpack` cycle is required if you want to access updated weights on the CPU. --- ## Loss Functions | `LossType` | Formula | Gradient | |:-----------|:--------|:---------| | `"mse"` | `(1/N) Σ (out-target)²` | `(2/N)(out-target)` | | `"cross_entropy"` | (not yet in `training.go`) | — | The GPU MSE gradient shader (`DispatchMSEGradPartialLoss`) computes both the gradient tensor and partial sums in a single pass. --- ## Tween (neural target propagation) **Tween** is the name used in this codebase for layer-local target propagation. In papers it often appears as *target propagation*, *difference target propagation*, or similar. Implementation: `tween.go`. Tween is a gradient-free alternative that estimates what each layer *should* have produced rather than computing exact chain-rule gradients. ### Two Modes **Chain Rule mode** (`UseChainRule = true`): ``` target = actual + gradient × GradientScale ``` This uses backpropagation to compute gradients, then shifts the target in the gradient direction. It is standard backprop dressed in Tween clothing. **Pure Tween mode** (`UseChainRule = false`): ``` target[i] = Σⱼ w[i,j] × currentTarget[j] / totalWeight[j] ``` Estimates input targets using weighted importance from the layer's own weights, without computing derivatives. This is the biologically-motivated "local learning" variant. Supported for Dense, RNN, LSTM, MHA, and SwiGLU. 
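The two target formulas can be compared side by side in a minimal sketch. This is not the code in `tween.go`: the function names are hypothetical, weights are indexed as `w[input][output]`, and `totalWeight[j]` is assumed to be the sum of absolute weights feeding output `j`, since the text does not define it.

```go
package main

import "fmt"

// chainRuleTarget follows the Chain Rule formula from the text:
// target = actual + gradient × GradientScale.
func chainRuleTarget(actual, gradient []float32, gradientScale float32) []float32 {
	target := make([]float32, len(actual))
	for i := range actual {
		target[i] = actual[i] + gradient[i]*gradientScale
	}
	return target
}

// pureTweenTarget follows the Pure Tween formula from the text:
// target[i] = Σⱼ w[i][j] × currentTarget[j] / totalWeight[j].
// ASSUMPTION: totalWeight[j] is the sum of |w[i][j]| over all inputs i.
func pureTweenTarget(w [][]float32, currentTarget []float32) []float32 {
	totalWeight := make([]float32, len(currentTarget))
	for i := range w {
		for j := range currentTarget {
			v := w[i][j]
			if v < 0 {
				v = -v
			}
			totalWeight[j] += v
		}
	}
	target := make([]float32, len(w))
	for i := range w {
		for j := range currentTarget {
			if totalWeight[j] == 0 {
				continue // no connection carries this output; skip it
			}
			target[i] += w[i][j] * currentTarget[j] / totalWeight[j]
		}
	}
	return target
}

func main() {
	fmt.Println(chainRuleTarget([]float32{1, 2}, []float32{0.5, -0.5}, 2))
	w := [][]float32{{1, 1}, {3, 1}}
	fmt.Println(pureTweenTarget(w, []float32{4, 2}))
}
```

The pure variant never evaluates an activation derivative, which is the property the text describes as local learning.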
### The TweenState ```go type TweenState[T Numeric] struct { ForwardActs []*Tensor[T] // what layers produced BackwardTargets []*Tensor[T] // what they should have produced Gradients []*Tensor[float32] LinkBudgets []float32 // cosine similarity: actual vs target Gaps []float32 // RMS distance: actual vs target Config *TweenConfig } ``` ### Usage Pattern ```go state := poly.NewTweenState[float32](network, poly.DefaultTweenConfig()) output := poly.TweenForward(network, state, input) poly.TweenBackward(network, state, target) state.CalculateLinkBudgets() poly.ApplyTweenGaps(network, state, lr) ``` ### Link Budget Gating Before applying any weight update, the engine checks the layer's `LinkBudget` (cosine similarity between actual output and backward target, normalized to [0,1]): ``` if budget < 0.2 { skip update // prevent corrupting "dead" layers } layerRate = lr × (0.5 + budget × 0.5) // good signal = higher rate ``` This prevents gradient corruption in layers where the signal has been destroyed. --- ## VGStepBP Adaptive Rate The README mentions `VGStepBP` (Variable Gradient Step Backpropagation) as an adaptive rate calculation. This integrates with the Tween `DepthScaleFactor` field: ```go DepthScaleFactor: 1.1 // each deeper layer gets 1.1× the base rate ``` Deeper layers receive slightly higher learning rates to compensate for gradient attenuation through the network depth. This is a simple heuristic that avoids the full computation of per-layer adaptive optimizers. --- ## Gradient Explosion Detection The `GradientClip` field in `TrainingConfig` (when non-zero) clips gradient norms. Additionally, the Tween gap system implicitly detects explosion: if `Gaps[i]` grows very large, the gap-based update `delta = lr × input × gap` will also be large, but the Link Budget gating prevents this from firing if the cosine similarity is low. The README references "Gradient Explosion Detection & Damping" as a completed feature in the training automation section. 
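The link-budget gate that this section relies on can be sketched as follows. The names `linkBudget` and `gatedRate` are hypothetical, and the mapping of cosine similarity onto [0,1] as `(cos + 1) / 2` is an assumption; the text only states that the value is normalized.

```go
package main

import (
	"fmt"
	"math"
)

// linkBudget maps the cosine similarity between actual output and backward
// target from [-1, 1] to [0, 1].
// ASSUMPTION: the normalization is (cos + 1) / 2.
func linkBudget(actual, target []float32) float32 {
	var dot, na, nt float64
	for i := range actual {
		a, t := float64(actual[i]), float64(target[i])
		dot += a * t
		na += a * a
		nt += t * t
	}
	if na == 0 || nt == 0 {
		return 0 // a zero vector carries no usable signal
	}
	cos := dot / (math.Sqrt(na) * math.Sqrt(nt))
	return float32((cos + 1) / 2)
}

// gatedRate applies the rule from the text: skip the update below a budget
// of 0.2, otherwise scale lr by (0.5 + budget × 0.5).
func gatedRate(lr, budget float32) (rate float32, apply bool) {
	if budget < 0.2 {
		return 0, false
	}
	return lr * (0.5 + budget*0.5), true
}

func main() {
	budget := linkBudget([]float32{1, 0}, []float32{0, 1})
	rate, apply := gatedRate(0.01, budget)
	fmt.Println(budget, rate, apply)
}
```

Under this assumption an orthogonal output and target give a budget of 0.5, so the update is applied at three quarters of the base rate.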
--- ## Activation Functions (Forward and Backward) All activation derivatives are computed analytically in `ActivateDerivative[T]`: ``` ReLU: dA/dx = 1 if x > 0, else 0 SiLU: dA/dx = σ(x)(1 + x(1-σ(x))) GELU: dA/dx ≈ CDF(x) + x × PDF(x) Tanh: dA/dx = 1 - tanh(x)² Sigmoid: dA/dx = σ(x)(1 - σ(x)) Linear: dA/dx = 1 ``` In the backward pass, `gradOutput` is multiplied elementwise by the derivative of `preAct` before accumulating `gradWeights` and `gradInput`. --- ## The Full Training Data Flow ``` ┌─────────────────────────────────────────────────────────────────┐ │ EPOCH LOOP │ │ │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ BATCH │ │ │ │ │ │ │ │ batch.Input │ │ │ │ │ │ │ │ │ ▼ │ │ │ │ [Forward Pass] ──▶ histIn, histPre captured │ │ │ │ │ │ │ │ │ ▼ │ │ │ │ prediction │ │ │ │ │ │ │ │ │ ▼ │ │ │ │ [Loss + gradOut] ◀── batch.Target │ │ │ │ │ │ │ │ │ ▼ │ │ │ │ [Backward Pass] ──▶ layerGradients │ │ │ │ │ │ │ │ │ ▼ │ │ │ │ [ApplyRecursiveGradients] ──▶ Master updated │ │ │ │ Versions cleared │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ LossHistory appended, EpochTimes recorded │ └─────────────────────────────────────────────────────────────────┘ ``` --- ## TrainingResult ```go type TrainingResult struct { FinalLoss float64 TotalTime time.Duration LossHistory []float64 // one entry per epoch EpochTimes []time.Duration } ``` `Train` returns this struct regardless of CPU or GPU path, making it easy to log or compare runs. --- ## GPU Backend: WebGPU (WGPU) Source: https://openfluke.com/docs/gpu Markdown: https://openfluke.com/docs/gpu.md # GPU Backend: WebGPU (WGPU) This document covers the WebGPU backend: initialization, the `BeginFrame`/`FlushFrame` command batching pattern, the buffer pool and pipeline cache, which layers have GPU support, and the tiling strategy. --- ## Why WebGPU M-POLY-VTD uses the `github.com/openfluke/webgpu/wgpu` Go bindings for hardware acceleration. 
WebGPU compiles to: - **Vulkan** on Windows/Linux - **Metal** on macOS/iOS - **DX12** on Windows - **WebGPU** in browser via WASM No CUDA, no CGO beyond the wgpu bindings. All shaders are WGSL (WebGPU Shading Language) strings generated at runtime by Go functions in `wgpu_shaders.go`, `wgpu_kernels.go`, and `wgpu_backward_shaders.go`. --- ## WGPUContext ```go type WGPUContext struct { Instance *wgpu.Instance Adapter *wgpu.Adapter Device *wgpu.Device Queue *wgpu.Queue PipelineCache map[string]*wgpu.ComputePipeline // keyed by shader source hash ActivationPool map[string]*wgpu.Buffer // named activation buffers LayoutCache map[string]*wgpu.BindGroupLayout BindGroupCache map[uint64]*wgpu.BindGroup // keyed by buffer-set hash UniformPool []*wgpu.Buffer // pre-allocated uniform buffer pool UniformIdx int ActiveEncoder *wgpu.CommandEncoder // non-nil during BeginFrame/FlushFrame PendingDestroys []*wgpu.Buffer // temp bufs destroyed after FlushFrame GPUTileSize int // auto-detected optimal tile size Limits wgpu.Limits } ``` ### Initialization ```go err := network.InitWGPU() ``` `InitWGPU` performs three WebGPU steps: 1. Create an `Instance` and request a `HighPerformance` `Adapter` 2. Query the default device for its limits, then boost `MaxStorageBufferBindingSize` to 1 GB and `MaxBufferSize` to 2 GB for large embedding tables 3. Request the final `Device` with boosted limits, then auto-detect the optimal `GPUTileSize` from the workgroup storage and invocation limits ``` CalculateOptimalGPUTileSizeFromLimits( MaxComputeWorkgroupStorageSize, MaxComputeInvocationsPerWorkgroup, headDim=64, ) → GPUTileSize (e.g., 8 or 16) ``` After init, call `network.SyncAllToGPU()` to upload all layer weights to VRAM. This also creates GPU KV cache buffers for MHA layers and pre-allocates named activation buffers (`hidden_A`, `hidden_B`, `norm_out`, etc.). --- ## BeginFrame / FlushFrame Pattern The most important design decision in the GPU backend. 
Instead of submitting a command buffer per layer (which would mean 100+ GPU driver calls per token), all operations are recorded into a single shared encoder: ``` ctx.BeginFrame() ← creates ctx.ActiveEncoder ← resets ctx.PendingDestroys // All Dispatch* calls record into ActiveEncoder: ctx.DispatchForwardLayer(...) ctx.DispatchActivation(...) ctx.DispatchMSEGradPartialLoss(...) ctx.DispatchBackwardLayer(...) ctx.DispatchApplyGradients(...) ctx.FlushFrame() ← enc.Finish() + Queue.Submit(cmd) ← destroys PendingDestroys buffers ← resets UniformIdx ``` Temporary uniform buffers (holding layer parameters like `batchSize`, `inputSize`, etc.) must stay alive until `FlushFrame` because the GPU reads them asynchronously. They are collected in `PendingDestroys` and destroyed only after the submit. `Queue.WriteBuffer` calls (to upload inputs, targets, and zero DW buffers) are **queue-level operations** — they are safe to call between `BeginFrame` and `FlushFrame` because the WebGPU spec guarantees they complete before the encoder submit executes. --- ## Buffer Management ### ActivationPool Named persistent buffers that survive across frames: ```go buf := ctx.GetActivationBuffer("hidden_A", size, wgpu.BufferUsageStorage) ``` If a buffer with this name already exists and is large enough, it is reused. Otherwise a new one is created and cached. This avoids per-step allocations during inference. ### CreatePersistentBuffer ```go buf, err := ctx.CreatePersistentBuffer(data []float32, label string) ``` Uploads a `[]float32` to a VRAM storage buffer with `Storage | CopySrc | CopyDst` usage. Used for weight buffers that stay resident across many forward passes. ### ReadBuffer ```go values, err := ctx.ReadBuffer(buf *wgpu.Buffer) ``` Copies a GPU buffer to a CPU staging buffer, maps it, and returns `[]float32`. This is the only synchronous GPU→CPU roundtrip in the training path; it is called once per batch to read back the partial loss sums. 
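The reuse rule described for `ActivationPool` can be sketched without any GPU dependency. Everything here is hypothetical: `fakeBuffer` stands in for `*wgpu.Buffer`, and `activationPool.get` only models the "reuse if present and large enough" decision, not real allocation or destruction.

```go
package main

import "fmt"

// fakeBuffer is a hypothetical stand-in for *wgpu.Buffer; it only tracks size.
type fakeBuffer struct {
	label string
	size  uint64
}

// activationPool is a hypothetical stand-in for the named buffer map.
type activationPool struct {
	buffers map[string]*fakeBuffer
	allocs  int // number of buffers created so far
}

func newActivationPool() *activationPool {
	return &activationPool{buffers: make(map[string]*fakeBuffer)}
}

// get returns the cached buffer when it exists and is large enough,
// otherwise it creates a replacement and caches it under the same name.
func (p *activationPool) get(name string, size uint64) *fakeBuffer {
	if buf, ok := p.buffers[name]; ok && buf.size >= size {
		return buf
	}
	buf := &fakeBuffer{label: name, size: size}
	p.buffers[name] = buf
	p.allocs++
	return buf
}

func main() {
	pool := newActivationPool()
	pool.get("hidden_A", 1024) // first request allocates
	pool.get("hidden_A", 512)  // smaller request reuses the same buffer
	pool.get("hidden_A", 4096) // larger request forces a new allocation
	fmt.Println("allocations:", pool.allocs)
}
```

A real pool would presumably also release the buffer it replaces; the source does not say how, so that step is left out.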
### BindGroup Cache `GetBindGroup(pipeline, buffers...)` hashes the pipeline pointer and buffer pointers into a `uint64` key. If a matching `BindGroup` already exists, it is returned without re-creating it. This avoids rebuilding the descriptor set on every frame for stable weight+activation buffer pairs. --- ## Weight Sync Strategies `SyncToGPU()` on a `VolumetricLayer` uses different strategies depending on layer type and DType: ``` RMSNorm: Always uploads FP32 master. Quantization destroys normalization precision. SwiGLU (FP32): Splits Master into Gate, Up, Down slices. Uploads three separate persistent buffers. SwiGLU (INT4 / Q4_0): Calls syncQuantizedSwiGLU which quantizes each slice independently. Each component gets a scales buffer + packed uint32 buffer. Dense (INT4 / Q4_0): syncQuantizedDense: 32-weight blocks, scale per block, packed nibbles. MHA (FP32): Splits into Q/K/V/O weight buffers at internal DType codes 200/201/202/203. Also uploads optional q_norm/k_norm buffers at 204/205 when present. MHA (INT4): syncQuantizedMHA: quantizes each of Q/K/V/O separately. ``` The internal DType codes (100–102 for SwiGLU components, 200–203 for MHA projections) are a namespacing trick to store multiple named GPU buffers in the single `GPUWeights map[DType]any` without adding new struct fields. --- ## Forward Dispatch (wgpu_forward.go) `ctx.DispatchForwardLayer(l, batchSize, inBuf, outBuf)` routes to the correct WGSL shader. 
Key functions: | Function | WGSL kernel | Notes | |:---------|:------------|:------| | `DispatchDenseForward` | matmul shader | register-tiled | | `DispatchRMSNorm` | RMSNorm shader | always FP32 weights | | `DispatchCNN1Forward` | 1D conv shader | | | `DispatchCNN2Forward` | 2D conv shader | 1826x vs CPU | | `DispatchCNN3Forward` | 3D conv shader | 7602x vs CPU | | `DispatchRNNForward` | RNN cell shader | | | `DispatchLSTMForward` | LSTM cell shader | | | `DispatchEmbedding` | gather shader | | | `DispatchMHAForward` | Q/K/V + attention | separate kernels | | `DispatchSwiGLUForward` | gate+up+down | BROKEN determinism | `DispatchActivation(n, act, inBuf, outBuf)` dispatches a shader that applies ReLU, SiLU, GELU, Tanh, or Sigmoid elementwise over `n` elements. --- ## Backward Dispatch (wgpu_backward_shaders.go) WGSL shaders for gradient computation: **Dense DX shader** (`ShaderDenseBackwardDX`): ```wgsl dx[b, i] = Σ_o dy[b, o] × W[o, i] // Implemented as tiled matmul using shared memory tiles: var dyTile: array; var wTile: array; ``` **Dense DW shader** (`ShaderDenseBackwardDW`): ```wgsl dW[o, i] = Σ_b dy[b, o] × x[b, i] // Uses atomic add for race-free accumulation across batch ``` **CNN DX/DW shaders**: Implement the "strided convolution" backward pass — the input gradient is the transposed convolution of the output gradient with the kernel, and the weight gradient is the correlation of the input with the output gradient. **Activation backward**: `DispatchActivationBackward` applies the activation derivative elementwise: `gradPre[i] = gradOut[i] × act'(preAct[i])`. 
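A CPU reference for the activation backward step can be useful when checking shader output. The sketch below is an illustration, not Loom's `ActivateDerivative`: the string activation names are hypothetical, and GELU is omitted because the text gives only an approximate form.

```go
package main

import (
	"fmt"
	"math"
)

func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

// activationDerivative evaluates act'(x) for the analytic forms listed in
// the activation table. Unknown names fall back to the linear derivative.
func activationDerivative(act string, x float64) float64 {
	switch act {
	case "relu":
		if x > 0 {
			return 1
		}
		return 0
	case "silu":
		s := sigmoid(x)
		return s * (1 + x*(1-s))
	case "tanh":
		t := math.Tanh(x)
		return 1 - t*t
	case "sigmoid":
		s := sigmoid(x)
		return s * (1 - s)
	default: // "linear"
		return 1
	}
}

// activationBackward computes gradPre[i] = gradOut[i] × act'(preAct[i]).
func activationBackward(act string, gradOut, preAct []float32) []float32 {
	gradPre := make([]float32, len(gradOut))
	for i := range gradOut {
		d := activationDerivative(act, float64(preAct[i]))
		gradPre[i] = gradOut[i] * float32(d)
	}
	return gradPre
}

func main() {
	gradOut := []float32{2, 3}
	preAct := []float32{-1, 5}
	fmt.Println(activationBackward("relu", gradOut, preAct))
}
```

For ReLU the gradient passes through only where the pre-activation was positive, so the example keeps the second entry and zeroes the first.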
**MSE gradient + partial loss** (`DispatchMSEGradPartialLoss`): ```wgsl grad[i] = (2.0 / N) × (pred[i] - target[i]) partial[wg] = Σ_{i in group} (pred[i] - target[i])² ``` **Apply gradients** (`DispatchApplyGradients`): ```wgsl weights[i] -= lr × dw[i] ``` --- ## GPU support: layer × `DType` (one table) Scope: **`VolumetricLayer.SyncToGPU`** + **`(*WGPUContext).DispatchForwardLayer`** in `poly.go` / `wgpu_kernels.go`. Symbol **`T`** means **`Transformer.ForwardTokenIDsWGPU`** / **`wgpu_forward.go`** (LLM inference) for that layer+dtype, not generic batch dispatch. Activations are **`f32`** WGSL; **`DTypeFloat64`** is coerced to the **`Float32`** weight-buffer path in the `hasSpecialPath` / morph block (see `SyncToGPU`). | Symbol | Meaning | |:------:|---------| | **Y** | **Generic GPU forward OK**: `SyncToGPU` does not skip the `MorphToFloat32ForGPU` upload **or** uses a matching native path (`DispatchDenseQ4` for **Dense+Int4** only; **CNN1** packed when `isCNN1NativeGPUQuantDType`). | | **T** | **Transformer path only** (`wgpu_forward.go`): QKV/O use **`DispatchDenseQ4`** / **`DispatchDenseI8`**; SwiGLU gate/up may use **`DispatchSwiGLUQ4`**. **Not** correct for generic **`DispatchForwardLayer`** on that dtype (quantized buffers + **`DispatchDense`** / **`DispatchSwiGLUWithActCache`** mismatch). | | **–** | **Not supported** after vanilla `SyncToGPU` + generic `DispatchForwardLayer` (skipped morph with no valid weight buffer, or packed weights fed to an **`f32`** matmul / SwiGLU shader). | | **·** | **DType N/A** (no weight tensor for that layer). | **Dense:** only **`DTypeInt4`** selects **`DispatchDenseQ4`**. Wider dtypes (**2–13, 15–20** except **14**) hit **`hasSpecialPath`** with no quant branch → morph skipped → **–**. Eight-bit dtypes on Dense get **`syncQuantizedDenseI8`** but **`DispatchDenseTiled`** expects **`f32`** layout → **–**. 
**`ensureGPUFloat32Weights`** (training) can still attach **`GPUWeights[Float32]`** so matmul runs on the **FP32 master** regardless of `l.DType` (not reflected as **Y** here). | ID | `DType` | Dense | RMSNorm | CNN1 | CNN2 | CNN3 | RNN | LSTM | Embedding | Softmax | MHA | SwiGLU | Residual | |---:|---------|:-----:|:-------:|:----:|:----:|:----:|:---:|:----:|:---------:|:-------:|:---:|:------:|:--------:| | 0 | Float64 | Y | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · | | 1 | Float32 | Y | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · | | 2 | Float16 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · | | 3 | BFloat16 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · | | 4 | FP8 E4M3 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · | | 5 | FP8 E5M2 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · | | 6 | Int64 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · | | 7 | Int32 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · | | 8 | Int16 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · | | 9 | Int8 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · | | 10 | Uint64 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · | | 11 | Uint32 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · | | 12 | Uint16 | – | Y | Y | Y | Y | Y | Y | Y | · | Y | Y | · | | 13 | Uint8 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · | | 14 | Int4 | Y | Y | Y | Y | Y | Y | Y | Y | · | T | T | · | | 15 | Uint4 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · | | 16 | FP4 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · | | 17 | Int2 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · | | 18 | Uint2 | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · | | 19 | Ternary | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · | | 20 | Binary | – | Y | Y | Y | Y | Y | Y | Y | · | T | T | · | **CNN1 column:** **Y** = either **`DispatchCNN1Packed`** (dtype in `isCNN1NativeGPUQuantDType`: Int8, Int4, Int2, FP4, Ternary, Binary, FP8×2, Uint8, Uint4, Uint2, Float16, BFloat16, Int16) or **`DispatchCNN1`** on **`MorphToFloat32ForGPU`** otherwise. 
**Not in this table:** `LayerLayerNorm`, `LayerConvTransposed*`, `LayerKMeans`, `LayerParallel`, `LayerSequential`, `LayerMetacognition` (no `DispatchForwardLayer` arm). See [numerical_types.md](numerical_types.md) for the **`DType`** enum and **`WeightStore`**. **GPU training:** `gpuTrainingNeedsCPUFallback` in `training.go` forces a **CPU** optimizer step when the net includes **MHA**, **SwiGLU**, **Dense+Int4**, or **RNN/LSTM** with **Int8/Int4**. --- The project uses **Numerical Tiling** to map 3D volumetric layers to GPU workgroups. ### SC (single-workgroup) vs MC (multi-workgroup) profiles Loom differentiates two dispatch profiles for GPU kernels (attention, dense, SwiGLU, CNN, etc.): - **SC**: Smaller workgroups / tiles — lower register pressure, friendlier to tight limits (edge GPUs, WASM). - **MC**: Larger tiles where limits allow — higher throughput on desktop-class GPUs. At **inference**, transformer-style forwards (`wgpu_forward.go`) choose per-layer tile sizes with `layer.GetGPUSCTileSize(dtype)` vs `layer.GetGPUMCTileSize(dtype)` according to **`VolumetricNetwork.EnableMultiCoreTiling`** (with the same field mirrored on layers when set). That is the primary switch — not `GPUTileSize` alone. `WGPUContext.GPUTileSize` is still the device-tuned baseline derived from `CalculateOptimalGPUTileSizeFromLimits` and feeds into how SC/MC maps are built in `refreshRuntimeGPUTileSizes`. **GPU training** may ignore the network flag and pick SC vs MC directly via `TrainingModeGPUSC` / `TrainingModeGPUMC` (`training.go`). **CPU:** poly does **not** expose SC vs MC as two tile maps on the CPU side — layers use **`CPUTileSizes` / `GetCPUTileSize` only**. See the **“GPU: two tile maps…”** and **“CPU: one tile map…”** subsections in [dispatch.md](dispatch.md). --- ## Transformer GPU Forward (wgpu_forward.go) `Transformer.ForwardTokenIDsWGPU` is the optimized path for LLM inference: 1. 
If `tokens != nil` and GPU embeddings are loaded, dispatch a gather shader to convert token IDs → hidden states entirely on-GPU 2. `BeginFrame()` — all subsequent ops recorded into one encoder 3. For each transformer block (4 layers: RMSNorm → MHA → RMSNorm → SwiGLU): - Dispatch `DispatchRMSNorm` - Dispatch Q/K/V projections separately (supports expanded QueryDim) - Optional Q/K RMSNorm using q_norm/k_norm buffers - Dispatch RoPE rotation - Dispatch attention score + softmax - Dispatch output projection - Add residual 4. Final norm + LM head if on GPU 5. `FlushFrame()` — single submit 6. Read back only the logits (one small buffer) This path achieves the "260+ tokens/s prefill on M4" figure mentioned in the README. ### Qwen / Expanded-Query Notes Loom's GPU path now supports architectures where `query_dim != d_model` (for example Qwen3-0.6B with `head_dim=128`, `num_heads=16`, `query_dim=2048`, `d_model=1024`). Key implementation details: - MHA shader workgroup width scales with `head_dim` (not hardcoded to 64). - Q projection and attention output buffers use `query_dim`. - O projection uses `input=query_dim`, `output=d_model`. - RMSNorm epsilon is propagated from checkpoint config (`rms_norm_eps`) for parity with CPU. --- ## The step mesh engine Source: https://openfluke.com/docs/step Markdown: https://openfluke.com/docs/step.md # The step mesh engine This document covers the `StepState`, `StepForward`, `StepBackward`, and `StepApplyTween` functions that implement a clock-cycle-accurate discrete-time neural mesh. --- ## What is the step mesh? Standard `ForwardPolymorphic` runs the entire network in one sequential sweep — input enters at coordinate (0,0,0,0) and the final output exits at the last coordinate. This is a **one-shot** pass. The **Step mesh engine** treats the 3D grid as a living mesh. Each "tick" of the neural clock fires every layer simultaneously. Each layer reads from the previous tick's output buffers and writes to a new set of output buffers. 
After all layers have fired, the buffers swap. This is classical **double buffering** applied to neural computation. ``` Standard ForwardPolymorphic: Input ──▶ L0 ──▶ L1 ──▶ L2 ──▶ L3 ──▶ Output (one complete pass per call) Step mesh (one clock cycle): Tick N: Tick N+1: ┌──────┬──────┬──────┐ ┌──────┬──────┬──────┐ │ L0 │ L1 │ L2 │ │ L0 │ L1 │ L2 │ │fires │fires │fires │ ──swap──▶│fires │fires │fires │ │ │ │ │ buffers │ │ │ │ └──────┴──────┴──────┘ └──────┴──────┴──────┘ All layers process simultaneously Same pattern ``` The key insight: **every layer in the grid has the opportunity to update its output every clock cycle**, not just when an input happens to flow through it sequentially. --- ## StepState ```go type StepState[T Numeric] struct { LayerData []*Tensor[T] // current output of every layer NextBuffer []*Tensor[T] // write target for the current tick HistoryIn [][]*Tensor[T] // [step][layerIdx] → input to that layer at that step HistoryPre [][]*Tensor[T] // [step][layerIdx] → preAct at that step StepCount uint64 mu sync.RWMutex TweenState *TweenState[T] // optional tween bridge (neural target propagation) lastInput *Tensor[T] } ``` `LayerData[idx]` is what layer `idx` produced in the **previous** clock cycle. `NextBuffer[idx]` is what layer `idx` will produce in the **current** cycle. After the cycle, they swap. Create with: ```go state := poly.NewStepState[float32](network) state.SetInput(inputTensor) // loads input into LayerData[0] ``` --- ## StepForward: One Clock Cycle ```go elapsed := poly.StepForward(network, state, captureHistory) ``` Here `captureHistory` is a `bool`. Each call advances the mesh by exactly one discrete time step. All layers execute during this one call.
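A toy model makes the one-tick latency of double buffering concrete. The sketch below is not Loom's `StepForward`: `layerFn`, `meshState`, and `tick` are hypothetical, layers are scalar functions, and the input is passed to `tick` explicitly as a simplification.

```go
package main

import "fmt"

// layerFn is a hypothetical scalar layer used only for this illustration.
type layerFn func(float64) float64

// meshState holds the two buffers of the double-buffering scheme.
type meshState struct {
	layerData  []float64 // outputs from the previous tick
	nextBuffer []float64 // outputs being written during this tick
	stepCount  uint64
}

func newMeshState(numLayers int) *meshState {
	return &meshState{
		layerData:  make([]float64, numLayers),
		nextBuffer: make([]float64, numLayers),
	}
}

// tick fires every layer once. Layer 0 reads the injected input; every
// other layer reads its predecessor's output from the PREVIOUS tick.
func tick(layers []layerFn, s *meshState, input float64) {
	for idx, f := range layers {
		in := input
		if idx > 0 {
			in = s.layerData[idx-1] // never the value written this tick
		}
		s.nextBuffer[idx] = f(in)
	}
	copy(s.layerData, s.nextBuffer) // publish the results for the next tick
	s.stepCount++
}

func main() {
	addOne := func(x float64) float64 { return x + 1 }
	layers := []layerFn{addOne, addOne, addOne}
	state := newMeshState(len(layers))
	for i := 0; i < 3; i++ {
		tick(layers, state, 10)
		fmt.Println(state.stepCount, state.layerData)
	}
}
```

With three layers the injected value needs three ticks to reach the last layer, because each layer only ever sees what its predecessor published one tick earlier.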
### Sequential Mode (UseTiling = false) ```go for idx := range n.Layers { l := &n.Layers[idx] if l.IsDisabled { pass through; continue } // Resolve input source var input *Tensor[T] if l.IsRemoteLink { tIdx := n.GetIndex(l.TargetZ, l.TargetY, l.TargetX, l.TargetL) input = s.LayerData[tIdx] // reads from REMOTE layer's output } else if idx > 0 { input = s.LayerData[idx-1] // reads from preceding layer } else { input = s.LayerData[0] // reads injection point } pre, post := DispatchLayer(l, input, nil) s.NextBuffer[idx] = post } // Swap double buffers copy(s.LayerData, s.NextBuffer) s.StepCount++ ``` ### Parallel Tiled Mode (UseTiling = true) When `n.UseTiling = true`, goroutines process 4×4×4 spatial tiles concurrently: ```go var wg sync.WaitGroup for zTile ...: for yTile ...: for xTile ...: wg.Add(1) go func(zT, zE, yT, yE, xT, xE int) { defer wg.Done() for z := zT; z < zE; z++ { for y := yT; y < yE; y++ { for x := xT; x < xE; x++ { // dispatch layers in this tile } } } }(...) wg.Wait() ``` The mutex (`s.mu`) is held for the duration of the sequential path, and for individual history writes in the parallel path. The `NextBuffer` slice is pre-allocated so concurrent writes to different indices are safe. ### History Capture If `captureHistory = true`, each tick appends to `HistoryIn` and `HistoryPre`: ``` After tick N: HistoryIn[N][idx] = what layer idx received HistoryPre[N][idx] = preAct that layer idx produced ``` This history is the foundation for `StepBackward` (BPTT) and is required before calling `StepBackward`. It consumes memory proportional to `Steps × Layers × FeatureSize` — use only when training. --- ## Spatial Feedback (Remote Links in step mesh mode) The step mesh engine is where `IsRemoteLink` reaches its full potential. 
Because `s.LayerData[tIdx]` is always the **previous tick's** output (not the current tick's), a remote link to an earlier coordinate creates genuine recurrence: ``` Tick N-1: Layer A (0,0,0) produces output → stored in LayerData[0] Tick N: Layer B (0,2,0) has IsRemoteLink pointing to (0,0,0) → Layer B reads LayerData[0] (from tick N-1, not current tick) → Layer B effectively "remembers" what A produced one cycle ago This is the discrete-time equivalent of an RNN hidden state. ``` ``` ┌────────────────────────────────────────────────────────────────┐ │ SPATIAL FEEDBACK DIAGRAM │ │ │ │ Tick N-1: A ──output──▶ LayerData[A] │ │ │ │ Tick N: B ──IsRemoteLink──▶ reads LayerData[A] from N-1 │ │ B produces new output → LayerData[B] │ │ │ │ Tick N+1: A reads updated B output if A is also remote │ │ → Full spatial RNN at mesh scale │ └────────────────────────────────────────────────────────────────┘ ``` --- ## StepBackward: BPTT Through the Mesh ```go gradIn, layerGradients, err := poly.StepBackward(network, state, gradOutput) ``` This implements **Backpropagation Through Time (BPTT)** across the step mesh history. It walks backwards through both time steps and spatial coordinates. 
### Algorithm ``` gradBuffers[numLayers-1] = gradOutput // seed with final error for step from (numSteps-1) downto 0: nextGradBuffers = new zero buffers for idx from (numLayers-1) downto 0: input = HistoryIn[step][idx] pre = HistoryPre[step][idx] grad = gradBuffers[idx] gIn, gW = DispatchLayerBackward(l, grad, input, nil, pre) // Accumulate weight gradients across all time steps layerGradients[idx][1] += gW (if exists) // Route gIn back to the source of input for this layer accumulateMeshGrad(network, nextGradBuffers, idx, gIn) gradBuffers = nextGradBuffers return gradBuffers[0] // gradient with respect to the initial input ``` `accumulateMeshGrad` determines where to send `gIn`: - If `IsRemoteLink`: send to `TargetZ/Y/X/L` coordinates - Otherwise: send to `idx - 1` - If `idx == 0`: send to the input site This correctly routes gradients through the spatial topology: remote links receive their share of the gradient from every layer that consumed their output. --- ## StepApplyTween ```go poly.StepApplyTween(network, state, globalTarget, lr) ``` Bridges the step mesh with the `Tween` machinery. At each call: 1. If `state.TweenState == nil`, create a new `TweenState` (called `tpState` in the steps below) with `UseChainRule = false` (gap-based learning, appropriate for a continuously running mesh) 2. Copy current `LayerData` into `tpState.ForwardActs` (the mesh's current "what is" state) 3. Call `TweenBackward(n, tpState, globalTarget)` to compute what each layer *should* produce 4. `CalculateLinkBudgets()`: measure cosine similarity between actual and target at each node 5. `ApplyTweenGaps(n, tpState, lr)`: update weights using the gap signal, gated by link budgets This enables **online, asynchronous learning** on a live mesh: you can inject a global target at any time and the weights update locally at each node based on their current output gap. --- ## Double Buffer Guarantees The double buffer swap (`copy(s.LayerData, s.NextBuffer)`) happens after all layers have written to `NextBuffer`.
This guarantees: 1. A layer at coordinate (0,0,2) cannot see the output of (0,0,1) from the *current* tick, only from the previous tick 2. Concurrent goroutines in tiled mode write to different indices of `NextBuffer` without conflict 3. Remote links always see stable, previous-tick values regardless of which goroutine happens to fire first This is the "clock cycle accuracy" mentioned in the README. ## V0.75.0 Stability & Guarding The Step mesh engine was fundamentally stabilized in v0.75.0 to support sparse volumetric grids without runtime panics. ### 1. Volumetric Coordinate Guarding In previous versions, a misconfigured grid cell could lead to a `nil pointer dereference`. In v0.75.0, the dispatcher implements strict guarding: - **`IsDisabled` Flag**: Every grid cell now defaults to "Disabled". They must be explicitly enabled during network construction via the `poly.VolumetricLayer` configuration. - **Nil-Safety**: The `DispatchLayer` and `StepForward` loops check these flags before execution, ensuring that uninitialized memory in sparse 3D regions does not cause a crash. ### 2. Explicit Coordinate Hopping Stability is further guaranteed by the enforcement of 3D volumetric coordinates (`z, y, x, l`). - **Deterministic Routing**: Every connection, whether a standard sequence or a remote `IsRemoteLink`, is resolved to a specific 3D coordinate. - **Grid Consistency**: This ensures that even in massively parallel tiled modes, the signal wavefront remains spatially consistent and bit-perfect across all 21 numerical types. 
--- ## When to Use the Step mesh engine Use `StepForward` / `StepApplyTween` when you need: - **Continuous operation**: the network runs indefinitely, processing new inputs each tick - **Spatial feedback**: remote links that create mesh-level recurrence - **Online learning**: weight updates interleaved with forward passes - **Parallel processing**: the tiled mode can saturate multi-core CPUs Use `ForwardPolymorphic` / `BackwardPolymorphic` when you need: - **Batch training**: multiple training examples per weight update - **GPU acceleration**: the GPU path uses `trainBatchWGPU`, not the step mesh engine - **Deterministic single-pass inference**: no history overhead > [!TIP] > The README's phrase "use `StepForward` and `StepApplyTween` when you need a living network that evolves and learns over time rather than a static pipeline" captures this distinction perfectly. --- ## The DNA Engine: Topological Network Fingerprinting Source: https://openfluke.com/docs/dna Markdown: https://openfluke.com/docs/dna.md # The DNA Engine: Topological Network Fingerprinting This document covers `ExtractDNA`, `CosineSimilarity`, `CompareNetworks`, `LogicShift` detection, and the recursive signature extraction for all 19 layer types in `dna.go`. For the **Evolution Engine** (DNA Splice + NEAT mutations), see [evolution.md](evolution.md). --- ## Why DNA? Standard weight comparison breaks across precisions — you can't directly compare an INT8 weight against an FP32 weight. The DNA engine solves this by converting every layer's weights to a **unit direction vector** after simulating precision loss. 
Comparing direction vectors (cosine similarity) instead of raw values means: - FP32 and INT8 representations of the same model look nearly identical - Two networks trained on the same task converge toward the same DNA - Structural changes (different layer order, different grid positions) are detectable as **logic shifts** ``` Raw FP32 weights ──► scale (× ws.Scale) ──► Normalize ──► unit vector │ (L2 norm) "DNA strand" │ └── FP4 weights ──► scale (× ws.Scale) ──► Normalize ──► same direction ≈ 1.0 similarity ``` --- ## Core Types ```go // The "DNA strand" of a single layer type LayerSignature struct { Z, Y, X, L int // 3D grid coordinates Type LayerType DType DType Weights []float32 // L2-normalized, scale-applied master weights } // The complete genetic blueprint of a network type NetworkDNA []LayerSignature ``` --- ## ExtractDNA — all 19 layer types ```go func ExtractDNA(n *VolumetricNetwork) NetworkDNA ``` Iterates every layer in the network, calls `extractLayerSignature(l)`, and wraps the result with position and type metadata. 
The signature extraction logic handles all 19 layer types: ``` VolumetricNetwork │ ┌────────────┼────────────┐ │ │ │ LayerDense LayerParallel LayerSoftmax LayerRNN LayerSequential LayerResidual LayerLSTM (recursive) (weightless) LayerMHA LayerSwiGLU LayerRMSNorm LayerLayerNorm LayerCNN1/2/3 LayerConvT1/2/3D LayerEmbedding LayerKMeans │ ▼ extractLayerSignature(l) │ ┌─────────┼──────────────┐ │ │ │ ▼ ▼ ▼ weighted recursive weightless layers containers layers │ │ │ ▼ ▼ ▼ Master flatten all []float32{1.0} weights branches │ │ ▼ ▼ scale(×ws.Scale) Normalize(concat) │ ▼ Normalize │ ▼ []float32 unit vector ``` ### Weighted layers (Dense, RNN, LSTM, MHA, CNN*, ConvTransposed*, SwiGLU, RMSNorm, LayerNorm, Embedding, KMeans) ```go // All weighted layers follow this path: scale := l.WeightStore.Scale if scale == 0 { scale = 1.0 } simulated := make([]float32, len(l.WeightStore.Master)) for i, w := range l.WeightStore.Master { if scale != 1.0 { simulated[i] = w * scale } else { simulated[i] = w } } return Normalize(simulated) ``` Applying the layer's scale factor before normalizing means the DNA of an INT8 Dense layer and an FP32 Dense layer with the same trained weights will be nearly identical — both normalize to the same unit direction. ### Structural containers (Parallel, Sequential) — recursive extraction Parallel and Sequential layers contain nested layers (`ParallelBranches`, `SequentialLayers`). A naive approach that returned `{1.0}` for both would make any two parallel layers look identical regardless of what's inside them. 
Instead, the engine recurses: ``` LayerParallel ├── Branch 0 (Dense 32×32) ──► extractLayerSignature ──► unit vec A ─┐ ├── Branch 1 (RMSNorm 32) ──► extractLayerSignature ──► unit vec B ─┤ concat └── FilterGateConfig (Dense) ──► extractLayerSignature ──► unit vec C ─┘ │ Normalize(flat) │ single unit vec representing ALL nested weights ``` ```go case LayerParallel: var flat []float32 for _, branch := range l.ParallelBranches { if branch.IsRemoteLink { continue } // remote links have no local weights flat = append(flat, extractLayerSignature(branch)...) } if l.FilterGateConfig != nil { flat = append(flat, extractLayerSignature(*l.FilterGateConfig)...) } if len(flat) == 0 { return []float32{1.0} } return Normalize(flat) case LayerSequential: var flat []float32 for _, sub := range l.SequentialLayers { flat = append(flat, extractLayerSignature(sub)...) } if len(flat) == 0 { return []float32{1.0} } return Normalize(flat) ``` Remote links (`IsRemoteLink = true`) are spatial hops with no local weights — they are skipped during extraction. ### Weightless layers (Softmax, Residual) ```go case LayerSoftmax, LayerResidual: return []float32{1.0} ``` A `{1.0}` vector is a neutral presence marker. Two Softmax layers at the same position will score `1.0` similarity (identical), which is correct — they are architecturally identical by definition. --- ## Normalize ```go func Normalize(v []float32) []float32 ``` Converts a weight vector to a unit vector: ``` mag = sqrt(v[0]² + v[1]² + ... 
+ v[n]²) output[i] = v[i] / mag ``` - If `mag == 0` (all-zero weights), returns a zero vector - Two zero vectors score `1.0` similarity (both represent an untrained/zeroed layer) - One zero + one nonzero scores `0.0` (orthogonal by convention) --- ## CosineSimilarity ```go func CosineSimilarity(s1, s2 LayerSignature) float32 ``` Returns a score in `[-1.0, 1.0]` comparing two layer signatures: ``` s1.Weights · s2.Weights sim = ───────────────────────── = dot product (since |s1| = |s2| = 1) |s1| × |s2| ``` Guard rails: | Condition | Returns | |:----------|:--------| | `s1.Type != s2.Type` | `0.0` — architectural mismatch | | `s1.DType != s2.DType` | `0.0` — precision mismatch | | `len(s1.Weights) != len(s2.Weights)` | `0.0` — dimension mismatch | | Both zero vectors | `1.0` — identical untrained layers | | One zero, one nonzero | `0.0` — no similarity | Similarity values to interpret: ``` -1.0 ──────────── 0.0 ──────────── +1.0 │ │ │ opposite no match identical direction direction (learned to (different (same functional do opposite) purpose) role) ``` --- ## CompareNetworks ```go func CompareNetworks(dna1, dna2 NetworkDNA) NetworkComparisonResult type NetworkComparisonResult struct { OverallOverlap float32 LayerOverlaps map[string]float32 // "z,y,x,l" → score LogicShifts []LogicShift } ``` Two-phase comparison: ### Phase 1 — Direct Position Matching Match each layer in `dna1` with the layer at the same `(Z, Y, X, L)` position in `dna2`: ``` dna1: [L0: Dense] [L1: RNN] [L2: Dense] │ │ │ │ same pos │ │ same pos ▼ ▼ ▼ dna2: [L0: Dense] [L1: Dense] [L2: Dense] │ │ │ sim=0.94 sim=0.0 sim=0.87 (0.0 because type mismatch) │ │ │ └──────────────┴──────────┘ │ avg = 0.60 OverallOverlap = 0.60 ``` ### Phase 2 — Logic Drift Detection For each layer in `dna1`, search **all** positions in `dna2` for the best cosine match — not just the same position: ``` dna1 L0 (Dense, sim vector A) │ ├──► compare vs dna2 L0 → sim=0.72 ├──► compare vs dna2 L1 → sim=0.31 └──► compare vs dna2 L2 → 
sim=0.91 ← best match! Best match (0.91) is at position L2, not L0. Since 0.91 > 0.8 threshold AND positions differ: → LogicShift { SourcePos:"0,0,0,0", TargetPos:"0,0,0,2", Overlap:0.91 } ``` ```go type LogicShift struct { SourcePos string // "z,y,x,l" in dna1 TargetPos string // "z,y,x,l" in dna2 Overlap float32 // cosine score > 0.8 } ``` Logic shifts appear when: - A network was restructured and layers were reordered - A NEAT mutation moved a functional pattern to a different grid position - Two networks converged to the same function at different coordinates --- ## Full DNA Pipeline ``` Network A (trained) Network B (trained) │ │ ▼ ▼ ExtractDNA(A) ExtractDNA(B) │ │ for each layer: for each layer: ┌──────────────────────────────┐ ┌──────────────────────────────┐ │ Parallel/Sequential: │ │ Parallel/Sequential: │ │ recurse into branches │ │ recurse into branches │ │ concat + Normalize │ │ concat + Normalize │ │ Weighted: │ │ Weighted: │ │ scale(w × ws.Scale) │ │ scale(w × ws.Scale) │ │ Normalize(simulated) │ │ Normalize(simulated) │ │ Weightless: │ │ Weightless: │ │ {1.0} │ │ {1.0} │ └──────────────────────────────┘ └──────────────────────────────┘ │ │ │ NetworkDNA ([]LayerSignature) │ NetworkDNA └─────────────────┬────────────────────┘ │ ▼ CompareNetworks(dnaA, dnaB) │ ┌────────────┴────────────┐ │ │ ▼ ▼ Phase 1: Direct Phase 2: Cross-pos position matching best-match search │ │ LayerOverlaps LogicShifts OverallOverlap (migrations) │ └────────────────────────▶ NetworkComparisonResult ``` --- ## Use Cases ### Measuring Quantization Fidelity ```go dnaFP32 := poly.ExtractDNA(net) // morph all layers to INT8... poly.MorphAllLayers(net, poly.DTypeInt8) dnaINT8 := poly.ExtractDNA(net) result := poly.CompareNetworks(dnaFP32, dnaINT8) // result.OverallOverlap near 1.0 → quantization preserved behavior // result.OverallOverlap near 0.0 → quantization destroyed the model ``` ### Detecting Training Convergence Sample DNA every N epochs. 
When `OverallOverlap` between consecutive snapshots stabilizes at or above 0.99, the network has converged.

```
Epoch 0   → Epoch 10  : overlap = 0.12  (learning fast)
Epoch 10  → Epoch 50  : overlap = 0.61  (settling)
Epoch 50  → Epoch 100 : overlap = 0.94  (nearly converged)
Epoch 100 → Epoch 150 : overlap = 0.99  (converged)
```

### Cross-Architecture Similarity

Two networks with different layer counts overlap only at the coordinates both of them define. `CompareNetworks` matches only those shared positions, and `OverallOverlap` is averaged over matched layers only.

### Logic Drift After NEAT Mutations

After a NEAT topology mutation moves a Dense layer from position `0,0,0,0` to `0,0,0,2`, the logic shift detector will report:

```
LogicShift {
    SourcePos: "0,0,0,0",
    TargetPos: "0,0,0,2",
    Overlap:   0.93,
}
```

This is how you track functional identity across structural mutations.

---

## DNA Signature Sizes by Layer Type

| Layer Type | Signature Length | Notes |
|:-----------|:-----------------|:------|
| Dense (32) | 1024 | inputH × outputH |
| MHA (32, 4 heads) | 4224 | Q+K+V+O projections + biases |
| SwiGLU (32) | 6144 | gate + up + down (3 projections) |
| RMSNorm (32) | 32 | scale vector only |
| LayerNorm (32) | 64 | gamma + beta |
| CNN1/2 (8f, 1c, k3) | 72 | filters × channels × k² |
| CNN3 (8f, 1c, k3) | 216 | filters × channels × k³ |
| RNN (32) | 2080 | Wx + Wh + bias |
| LSTM (32) | 8320 | 4 gates × (Wx + Wh + bias) |
| Embedding (256 vocab, 32 dim) | 8192 | vocab × dim |
| KMeans (8 clusters, 32 dim) | 256 | clusters × dim |
| Softmax | 1 | neutral marker |
| Residual | 1 | neutral marker |
| Parallel (Dense 32 + RMSNorm 32) | 1056 | concat of branches (1024 + 32), renormalized |
| Sequential (2× Dense 32) | 2048 | concat of sub-layers, renormalized |

---

## The Evolution Engine: DNA Splice & NEAT Topology Evolution

Source: https://openfluke.com/docs/evolution
Markdown: https://openfluke.com/docs/evolution.md

# The Evolution Engine: DNA Splice & NEAT Topology Evolution
This document covers `SpliceDNA`, `SpliceDNAWithReport`, `NEATMutate`, and `NEATPopulation` from `evolution.go`. The evolution engine builds on the DNA fingerprinting system described in [dna.md](dna.md). --- ## Two Evolutionary Mechanisms ``` ┌─────────────────────────────────────────────────────────────┐ │ Evolution Engine │ │ │ │ ┌────────────────────┐ ┌──────────────────────────┐ │ │ │ DNA Splice │ │ NEAT-style Mutation │ │ │ │ (Crossover) │ │ (Topology Evolution) │ │ │ │ │ │ │ │ │ │ ParentA + ParentB │ │ Network ──► mutated │ │ │ │ ──► Child │ │ clone │ │ │ │ │ │ │ │ │ │ merges weights │ │ changes layer types, │ │ │ │ guided by DNA │ │ activations, topology │ │ │ │ similarity │ │ weights │ │ │ └────────────────────┘ └──────────────────────────┘ │ │ │ │ │ │ └──────────┬───────────────┘ │ │ ▼ │ │ NEATPopulation.Evolve() │ │ (combines both in a generation loop) │ └─────────────────────────────────────────────────────────────┘ ``` --- ## Part 1 — DNA Splice / Genetic Crossover ### Concept Given two trained parent networks `A` and `B`, produce a child network whose weights are a blend of both. The blend is **guided by DNA similarity** — layers that are more similar between parents get blended more aggressively; layers that diverged get a heavier bias toward the fitter parent. ``` ParentA (trained) ParentB (trained) │ │ ExtractDNA(A) ExtractDNA(B) │ │ sigA per layer sigB per layer │ │ └────────┬───────────────┘ │ for each layer position (z,y,x,l): │ CosineSimilarity(sigA, sigB) │ ┌──────┴──────┐ │ │ blend skip weights (keep A's from A+B weights) │ ▼ Child network ``` ### SpliceConfig ```go type SpliceConfig struct { CrossoverMode string // "blend", "point", or "uniform" BlendAlpha float32 // interpolation factor (blend mode): 0=all A, 1=all B SplitRatio float64 // fraction from A in point mode (e.g. 
0.5) FitnessA float64 // optional: used to bias toward fitter parent FitnessB float64 } func DefaultSpliceConfig() SpliceConfig { return SpliceConfig{CrossoverMode: "blend", BlendAlpha: 0.5, SplitRatio: 0.5} } ``` ### Three Crossover Modes #### Mode: "blend" (default) Interpolates weights per element. Alpha is modulated by the layer's cosine similarity and relative fitness: ``` alpha = FitnessB / (FitnessA + FitnessB) ← bias toward fitter parent alpha = alpha × (0.5 + 0.5 × similarity) ← scale by how similar layers are child[i] = wA[i] × (1 - alpha) + wB[i] × alpha ``` When similarity is high (layers learned the same thing), alpha blends freely. When similarity is low (layers diverged), alpha is pulled toward the fitter parent. ``` similarity = 1.0 ──► free blend (both parents contribute equally) similarity = 0.0 ──► take mostly from fitter parent (layers are unrelated) similarity = -1.0 ──► heavily bias toward fitter parent (opposite patterns) ``` #### Mode: "point" Splits weights at a single cut point. 
First `SplitRatio` fraction from A, rest from B: ``` wA: [a0 a1 a2 a3 a4 a5 a6 a7] wB: [b0 b1 b2 b3 b4 b5 b6 b7] │ SplitRatio=0.5 │ child: [a0 a1 a2 a3 b4 b5 b6 b7] ─── from A ──── from B ── ``` #### Mode: "uniform" Each weight is randomly drawn from A or B, with probability biased toward the fitter parent: ``` threshold = FitnessA / (FitnessA + FitnessB) for each weight i: if rand < threshold → child[i] = wA[i] else → child[i] = wB[i] ``` ### SpliceDNA ```go func SpliceDNA(parentA, parentB *VolumetricNetwork, cfg SpliceConfig) *VolumetricNetwork ``` - The child is always a **deep clone of parentA** (architecture inherited from A) - Only layers where both parents have matching positions **and matching weight dimensions** are blended - If `parentB` has no layer at that position, or the weight counts differ, A's weights are kept unchanged ```go // Guard: skip if dimensions don't match if wB == nil || len(wB) != len(wA) { continue // keep A's weights } ``` ### SpliceDNAWithReport ```go func SpliceDNAWithReport(parentA, parentB *VolumetricNetwork, cfg SpliceConfig) SpliceResult type SpliceResult struct { Child *VolumetricNetwork ParentADNA NetworkDNA ParentBDNA NetworkDNA ChildDNA NetworkDNA Similarities map[string]float32 // "z,y,x,l" → cosine score used for blending BlendedCount int // how many layers were actually blended } ``` Returns the same child as `SpliceDNA` plus a full diagnostic report. Use this when debugging crossover behavior or logging ancestry. --- ## Part 2 — NEAT-style Topology Evolution ### Concept NEAT (NeuroEvolution of Augmenting Topologies) mutates both weights and structure. The implementation here applies six mutation types to a cloned network, leaving the original untouched. ``` Original Network (immutable) │ cloneNetwork() │ mutated clone │ ┌────┴────────────────────────────────────────────┐ │ Per-layer mutations (applied sequentially): │ │ │ │ 1. Weight perturbation ── add Gaussian noise │ │ 2. 
Activation mutation ── swap act function │ │ 3. Node mutation ── change layer type │ │ 4. Layer toggle ── enable/disable │ │ │ │ Network-level mutations (applied once): │ │ │ │ 5. Connection add ── insert remote link │ │ 6. Connection drop ── remove remote link │ └─────────────────────────────────────────────────┘ │ returns mutated clone ``` ### NEATConfig ```go type NEATConfig struct { WeightPerturbRate float64 // prob of perturbing a layer's weights (default 0.8) WeightPerturbScale float32 // noise magnitude (default 0.05) NodeMutateRate float64 // prob of changing a layer's type (default 0.1) ConnectionAddRate float64 // prob of adding a remote link (default 0.05) ConnectionDropRate float64 // prob of removing a remote link (default 0.02) ActivationMutRate float64 // prob of changing activation function (default 0.1) LayerToggleRate float64 // prob of toggling IsDisabled (default 0.02) DModel int // reference dimension for weight reinitialization AllowedLayerTypes []LayerType // types a node can mutate to // Type-specific defaults used by neatReinitLayer: DefaultNumHeads int DefaultInChannels int DefaultFilters int DefaultKernelSize int DefaultVocabSize int DefaultNumClusters int Seed int64 } ``` `DefaultNEATConfig(dModel)` returns conservative rates with all 17 mutable layer types in `AllowedLayerTypes`. ### NEATMutate ```go func NEATMutate(n *VolumetricNetwork, cfg NEATConfig) *VolumetricNetwork ``` The original network `n` is **never modified**. 
The function clones it and applies mutations: ``` For each layer i: Step 1 — Weight Perturbation (WeightPerturbRate = 0.8) ┌─────────────────────────────────────────────────────┐ │ master[i] += rand(-1, 1) × WeightPerturbScale │ │ (clears cached DType versions as weights changed) │ └─────────────────────────────────────────────────────┘ Step 2 — Activation Mutation (ActivationMutRate = 0.1) ┌──────────────────────────────────────────────────────┐ │ layer.Activation = random from {ReLU, SiLU, GELU, │ │ Tanh, Sigmoid, Linear}│ └──────────────────────────────────────────────────────┘ Step 3 — Node Mutation (NodeMutateRate = 0.1) ┌──────────────────────────────────────────────────────┐ │ newType = random from AllowedLayerTypes (≠ current) │ │ neatReinitLayer(child, i, newType, cfg) │ │ → sets new Type, InputHeight, OutputHeight │ │ → creates fresh WeightStore with correct wCount │ └──────────────────────────────────────────────────────┘ Step 4 — Layer Toggle (LayerToggleRate = 0.02) ┌──────────────────────────────────────────────────────┐ │ layer.IsDisabled = !layer.IsDisabled │ │ (disabled layers are skipped during forward pass) │ └──────────────────────────────────────────────────────┘ After all layers: Step 5 — Connection Add (ConnectionAddRate = 0.05) ┌──────────────────────────────────────────────────────┐ │ Pick two random layers src and dst (src ≠ dst) │ │ Append IsRemoteLink branch to src.ParallelBranches │ │ TargetZ/Y/X/L point to dst │ │ Creates a spatial "skip connection" in the 3D grid │ └──────────────────────────────────────────────────────┘ Step 6 — Connection Drop (ConnectionDropRate = 0.02) ┌──────────────────────────────────────────────────────┐ │ Find a layer with ParallelBranches containing │ │ IsRemoteLink entries │ │ Remove one at random │ └──────────────────────────────────────────────────────┘ ``` ### Node Mutation: Weight Counts for All 19 Layer Types When `neatReinitLayer` changes a layer's type, it creates a fresh `WeightStore` with the 
correct number of weights for the new type: | New Layer Type | Formula | Example (dModel=32) | |:---------------|:--------|:--------------------| | Dense | `dModel × dModel` | 1024 | | RNN | `dModel² + dModel² + dModel` | 2080 | | LSTM | `4 × (dModel² + dModel² + dModel)` | 8320 | | SwiGLU | `dModel × (dModel×2) × 3` | 6144 | | RMSNorm | `dModel` | 32 | | LayerNorm | `dModel × 2` | 64 | | MHA | `2×dModel² + 2×dModel×kv + 2×dModel + 2×kv` | 4224 (4 heads) | | CNN1 / CNN2 | `filters × inChannels × kSize²` | 72 (8f, 1c, k3) | | CNN3 | `filters × inChannels × kSize³` | 216 (8f, 1c, k3) | | ConvTransposed1D/2D | `inChannels × filters × kSize²` | 72 | | ConvTransposed3D | `inChannels × filters × kSize³` | 216 | | Embedding | `vocabSize × dModel` | 8192 (256 vocab) | | KMeans | `numClusters × dModel` | 256 (8 clusters) | | Softmax | `0` — no WeightStore | — | | Residual | `0` — no WeightStore | — | | Parallel / Sequential | unchanged — keep existing branches | — | Parallel and Sequential are structural containers. Mutating a non-container to Parallel/Sequential would destroy branch structure, so `neatReinitLayer` leaves them untouched (just returns) when the target type is Parallel or Sequential. ### Connection Add — Remote Links `neatAddConnection` adds a **spatial skip connection** between two layers anywhere in the 3D grid: ``` Layer at (0,0,0,0) ──────────────────────────► Layer at (0,0,0,2) │ ┌─ ParallelBranches ──────────────┘ │ [IsRemoteLink=true, │ TargetZ=0, TargetY=0, │ TargetX=0, TargetL=2] ``` During `ForwardPolymorphic`, `ParallelForwardPolymorphic` follows remote links and routes activations to the target layer. Remote links are skipped during DNA extraction (`extractLayerSignature` skips `IsRemoteLink=true` branches since they have no local weights). --- ## Part 3 — NEATPopulation: Full Evolutionary Loop `NEATPopulation` manages a pool of networks across generations using fitness-based selection. 
```go type NEATPopulation struct { Networks []*VolumetricNetwork Fitnesses []float64 Config NEATConfig rng *rand.Rand } ``` ### Initialization ```go pop := poly.NewNEATPopulation(seedNetwork, populationSize, cfg) ``` Creates `populationSize` networks, each a `NEATMutate` of the seed. This gives diverse starting points from day 0. ``` seedNetwork │ ├── NEATMutate (seed1) ──► Network[0] ├── NEATMutate (seed2) ──► Network[1] ├── NEATMutate (seed3) ──► Network[2] └── ... Network[N-1] ``` ### One Generation of Evolution ```go pop.Evolve(fitnessFn) ``` ``` Generation N: [net0, net1, net2, ..., netN] │ fitnessFn(net) for each │ sort descending by fitness │ ┌───────────┴───────────┐ │ │ Top 25% Bottom 75% (elites) (replaced) │ │ carry over pick 2 elites A, B unchanged SpliceDNA(A, B, blend) │ NEATMutate(child) │ new offspring │ │ └───────────┬───────────┘ │ Generation N+1 ``` **Elites**: The top `populationSize / 4` networks survive unchanged. The rest are replaced by: 1. Pick two random elites `A` and `B` 2. Produce a child via `SpliceDNA(A, B, cfg)` — inherits weights from both 3. Apply `NEATMutate(child)` — adds structural noise ### Helper Methods ```go pop.Best() // returns the highest-fitness network (index 0 after sort) pop.BestFitness() // returns the best fitness score pop.Summary(gen) // returns a one-line status string: // "Gen 5 | best=-0.0012 avg=-0.0045 worst=-0.2300 pop=16" ``` ### Fitness Function Contract The fitness function receives a network and returns `float64` — higher is better. 
Penalize with a large negative (e.g., `-1e9`) for architecturally incompatible networks (dimension mismatches from mutations): ```go fitnessFn := func(net *poly.VolumetricNetwork) (result float64) { defer func() { if r := recover(); r != nil { result = -1e9 // incompatible architecture } }() out, _, _ := poly.ForwardPolymorphic[float32](net, input) if out == nil || len(out.Data) == 0 { return -1e9 } // compute your task loss here mse := computeMSE(out.Data, target) return -mse // negate: lower loss = higher fitness } ``` --- ## Combined Flow: SpliceDNA + NEAT in a Population ``` ┌──────────────────────────────────────────┐ │ NEATPopulation.Evolve │ │ │ Generation N: │ [A] [B] [C] [D] ... [P] │ │ │ │ │ fitnessFn() for all │ │ sort: A=best, P=worst │ │ │ │ Elites (keep): [A] [B] [C] [D] │ │ │ │ Offspring: │ │ │ │ SpliceDNA(A, B) ──► child_AB │ │ NEATMutate(child_AB) │ │ ├── perturb weights │ │ ├── maybe swap activation │ │ ├── maybe change layer type │ │ └── maybe add/drop connection │ │ ──► mutated_AB │ │ │ │ ... repeat for all offspring slots ... │ │ │ Generation N+1:│ [A] [B] [C] [D] [mut_AB] ... 
[mut_XY] │ └──────────────────────────────────────────┘ ``` --- ## DNA Tracking Across Generations Because every `NEATMutate` and `SpliceDNA` call touches only a clone, you can always extract DNA from any network in the population and compare it against a reference: ```go // Track how far the best network has drifted from the initial seed seedDNA := poly.ExtractDNA(seedNetwork) for gen := 1; gen <= 50; gen++ { pop.Evolve(fitnessFn) bestDNA := poly.ExtractDNA(pop.Best()) result := poly.CompareNetworks(seedDNA, bestDNA) fmt.Printf("Gen %d | seed→best overlap=%.4f logic_shifts=%d\n", gen, result.OverallOverlap, len(result.LogicShifts)) } ``` Expected pattern: ``` Gen 1 | overlap=0.98 logic_shifts=0 (small weight nudges) Gen 5 | overlap=0.73 logic_shifts=1 (one node mutated type) Gen 20 | overlap=0.41 logic_shifts=3 (topology diverging) Gen 50 | overlap=0.12 logic_shifts=7 (heavily evolved) ``` --- ## Multi-Parent Splice Chain You can chain splices to merge three or more trained networks: ```go cfgA := poly.DefaultSpliceConfig() cfgA.FitnessA, cfgA.FitnessB = fitnessA, fitnessB cfgB := poly.DefaultSpliceConfig() cfgB.FitnessA, cfgB.FitnessB = fitnessMid, fitnessC mid := poly.SpliceDNA(netA, netB, cfgA) // A + B → mid final := poly.SpliceDNA(mid, netC, cfgB) // mid + C → final ``` ``` netA ──┐ ├── SpliceDNA ──► mid ──┐ netB ──┘ ├── SpliceDNA ──► final netC ──┘ ``` --- ## Immutability Guarantee Both `SpliceDNA` and `NEATMutate` always operate on **clones** of the input networks. 
The originals are never modified: ```go // Verify: run 5 aggressive mutations, original unchanged original := buildDenseMLP(32, 3) dnaOrig := poly.ExtractDNA(original) aggressiveCfg := poly.NEATConfig{ NodeMutateRate: 1.0, WeightPerturbRate: 1.0, WeightPerturbScale: 10.0, DModel: 32, Seed: 42, AllowedLayerTypes: poly.DefaultNEATConfig(32).AllowedLayerTypes, } for i := 0; i < 5; i++ { _ = poly.NEATMutate(original, aggressiveCfg) } dnaAfter := poly.ExtractDNA(original) result := poly.CompareNetworks(dnaOrig, dnaAfter) // result.OverallOverlap == 1.0 — original untouched ``` --- ## Quick Reference | Function | What it does | |:---------|:-------------| | `SpliceDNA(A, B, cfg)` | Blend weights from A and B into a child (A's architecture) | | `SpliceDNAWithReport(A, B, cfg)` | Same + diagnostic report with per-layer similarities | | `DefaultSpliceConfig()` | Returns blend mode, alpha=0.5, split=0.5 | | `NEATMutate(n, cfg)` | Returns a structurally mutated clone of n | | `DefaultNEATConfig(dModel)` | Conservative rates, all 17 mutable types allowed | | `NewNEATPopulation(seed, size, cfg)` | Create diverse initial population from seed | | `pop.Evolve(fitnessFn)` | Run one generation: evaluate → sort → elites → offspring | | `pop.Best()` | Highest-fitness network from last Evolve | | `pop.BestFitness()` | Fitness score of the top network | | `pop.Summary(gen)` | One-line status: best/avg/worst fitness | --- ## Softmax Variants Source: https://openfluke.com/docs/softmax Markdown: https://openfluke.com/docs/softmax.md # Softmax Variants `LayerSoftmax` (type 15) implements ten distinct softmax variants, controlled by the `SoftmaxType` field on `VolumetricLayer`. All variants are fully differentiable and work across all 21 DTypes. 

---

## The Standard Formula

All variants start from the numerically stable form:

```
logits_shifted = logits - max(logits)   ← prevents overflow
exp_vals[i]    = exp(logits_shifted[i])
probs[i]       = exp_vals[i] / sum(exp_vals)
```

This is implemented in `Softmax(logits []float32) []float32`.

---

## SoftmaxType Constants

```go
const (
    SoftmaxStandard     SoftmaxType = 0
    SoftmaxGrid         SoftmaxType = 1
    SoftmaxHierarchical SoftmaxType = 2
    SoftmaxTemperature  SoftmaxType = 3
    SoftmaxGumbel       SoftmaxType = 4
    SoftmaxMasked       SoftmaxType = 5
    SoftmaxSparse       SoftmaxType = 6
    SoftmaxAdaptive     SoftmaxType = 7
    SoftmaxMixture      SoftmaxType = 8
    SoftmaxEntmax       SoftmaxType = 9
)
```

---

## Variant 0: Standard

```
probs = softmax(logits)
```

The classic form. All outputs are positive and sum to 1. Smooth gradient everywhere.

**When to use:** Classification heads, final output layers, any time you need a valid probability distribution.

```
Input:   [2.0, 1.0, 0.1]
            ▼
Shifted: [0.0, -1.0, -1.9]     (subtract max = 2.0)
            ▼
Exps:    [1.00, 0.37, 0.15]    Sum = 1.52
            ▼
Output:  [0.66, 0.24, 0.10]    ← sums to 1.0
```

---

## Variant 3: Temperature

```
probs = softmax(logits / temperature)
```

Temperature `T` (stored in `VolumetricLayer.Temperature`) controls sharpness.

```
┌──────────────────────────────────────────────────────────────┐
│  temperature = 0.1  (sharp):                                 │
│  Input: [2.0, 1.8, 0.1]  →  Output: ≈[0.88, 0.12, 0.00]      │
│  Effect: "confident" — almost winner-takes-all               │
│                                                              │
│  temperature = 1.0  (standard):                              │
│  Input: [2.0, 1.8, 0.1]  →  Output: ≈[0.51, 0.42, 0.08]      │
│                                                              │
│  temperature = 5.0  (smooth):                                │
│  Input: [2.0, 1.8, 0.1]  →  Output: ≈[0.38, 0.36, 0.26]      │
│  Effect: "uncertain" — options spread more evenly            │
└──────────────────────────────────────────────────────────────┘
```

**When to use:** Token sampling in language models (low T = greedy, high T = diverse), exploration vs. exploitation in RL.

---

## Variant 4: Gumbel

```
noise[i] = -log(-log(Uniform(0,1)))   ← Gumbel noise
probs    = softmax(logits + noise)
```

Adds independent Gumbel noise to each logit before computing softmax. This produces stochastic samples that are biased toward higher logits but not deterministic. The Gumbel distribution is the natural noise for the `argmax` operation.

**When to use:** Discrete sampling without the `argmax` non-differentiability. Training generative models with categorical outputs. Controlled exploration in MoE routing.

```
Same logits, three calls:
  Call 1: [0.71, 0.24, 0.05]   ← high logit usually wins
  Call 2: [0.48, 0.40, 0.12]   ← noise sometimes shifts result
  Call 3: [0.82, 0.14, 0.04]
```

---

## Variant 5: Masked

```
masked_logits[i] = logits[i]   if mask[i] == true
                 = -1e9        if mask[i] == false
probs = softmax(masked_logits)
```

The `mask` field is `[]bool` on `VolumetricLayer`. Positions where `mask[i] = false` get `-1e9` in the logit, making their `exp` output effectively zero. After softmax, those positions have probability 0. The backward pass respects the mask: gradients are zeroed for masked positions.

**When to use:**

- Causal attention (prevent attending to future tokens)
- Legal-move filtering (board games, planning)
- Expert routing where some experts are unavailable

```
Logits:        [2.0,  1.0,  0.5,  1.5]
Mask:          [T,    F,    T,    T  ]
After masking: [2.0, -1e9,  0.5,  1.5]
After softmax: [0.55, 0.00, 0.12, 0.33]
                      masked position → 0 ✓
```

---

## Variant 6: Sparse (Sparsemax)

Sparsemax is an alternative to softmax that can produce **exact zeros** — true sparsity rather than just very small values.

```
Algorithm:
  1. Sort logits descending: z₁ ≥ z₂ ≥ ... ≥ zₙ
  2. Find k = max { k : z_k - (Σᵢ≤ₖ zᵢ - 1)/k > 0 }
  3. τ = (Σᵢ≤ₖ zᵢ - 1) / k
  4. output[i] = max(0, z[i] - τ)
```

Implemented in `SoftmaxSparseHelper(logits)`.

```
Logits:           [3.0, 1.0, -1.0, -3.0]
Standard softmax: [0.86, 0.12, 0.02, 0.00]   ← all non-zero
Sparsemax:        [1.00, 0.00, 0.00, 0.00]   ← exact zeros! (k = 1, τ = 2.0)
```

**When to use:**

- Attention when you want the model to focus on exactly a few tokens
- Interpretability (fewer non-zero attention weights to explain)
- MoE routing (hard assignment to a subset of experts)

---

## Variant 9: Entmax

Entmax is a family of distributions parameterized by `alpha`. It interpolates between softmax and sparsemax:

- `alpha = 1.0` → standard softmax
- `alpha = 2.0` → sparsemax
- `alpha = 1.5` → the recommended default (used in original paper)

```go
layer.EntmaxAlpha = 1.5 // set on VolumetricLayer
```

Implemented in `SoftmaxEntmaxHelper(logits, alpha)`:

```go
weight := alpha - 1.0
s1 := Softmax(logits)
s2 := SoftmaxSparseHelper(logits)
result[i] = (1-weight)*s1[i] + weight*s2[i]
// renormalize to sum to 1
```

As written, this is a linear blend of the softmax and sparsemax outputs. Since softmax outputs are always positive, the blend yields exact zeros only at `alpha = 2.0`; for smaller `alpha` the entries that sparsemax zeroes out shrink but stay positive.

**When to use:** When you want controllable sparsity. Start with `alpha=1.5` and tune toward 2.0 for sparser attention.

---

## Variant 1: Grid

Grid softmax applies standard softmax independently to each **row** of a 2D interpretation of the input:

```
Input flat tensor reinterpreted as [SoftmaxRows, SoftmaxCols]:
  Row 0: softmax([logits[0:cols]])      → row probs sum to 1
  Row 1: softmax([logits[cols:2cols]])  → row probs sum to 1
  ...
```

Each row is an independent probability distribution.

**When to use:**

- Native Mixture of Experts: each row represents one expert's output distribution
- Multi-label classification where each "group" of labels is mutually exclusive
- Per-head attention normalization without the full MHA overhead

```
Input (flat): [2.0, 1.0, | 0.5, 3.0, | 1.5, 1.5]
Rows=3, Cols=2:
  Row 0: softmax([2.0, 1.0]) = [0.73, 0.27]
  Row 1: softmax([0.5, 3.0]) = [0.08, 0.92]
  Row 2: softmax([1.5, 1.5]) = [0.50, 0.50]
```

---

## Variant 2: Hierarchical

Hierarchical softmax uses `HierarchyLevels []int` to define a tree structure. The last level of `HierarchyLevels` is used as the column count, with rows computed from `n / cols`. In practice it reduces to Grid softmax with the last level defining the partition.
**When to use:** Large vocabulary prediction where the vocabulary has a natural hierarchical structure (e.g., word categories → words). --- ## Variant 7: Adaptive Adaptive softmax selects the softmax type based on input statistics (currently implemented as a fallback to standard softmax, intended for future dynamic routing logic). --- ## Variant 8: Mixture Mixture softmax is a placeholder for weighted combinations of multiple softmax outputs. Currently falls back to standard softmax. --- ## Backward Pass All variants share the standard softmax Jacobian: ``` gradLogits[j] = probs[j] × (gradOutput[j] - Σᵢ gradOutput[i] × probs[i]) = probs[j] × (gradOutput[j] - dotProduct) ``` Implemented in `SoftmaxBackward(gradOutput, softmaxOutput []float32)`. For Grid and Hierarchical variants, the Jacobian is applied independently to each row. For Masked, gradients are zeroed at masked positions before computing the Jacobian. --- ## GetLogits `GetLogits[T Numeric](data []T, temp float64, dtype DType)` converts any `Tensor[T]` to `[]float32` with temperature scaling. It has specialized fast-paths for the most common types (float32, float64, int8, etc.) to avoid generic conversion overhead. 
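
The Jacobian formula in the Backward Pass section above translates almost line for line into code. This is a standalone sketch with a lower-case, hypothetical `softmaxBackward`; it is not the library's `SoftmaxBackward`, and it assumes both slices have the same length.

```go
package main

import "fmt"

// softmaxBackward computes
//   gradLogits[j] = probs[j] * (gradOutput[j] - Σᵢ gradOutput[i]*probs[i])
// for one probability vector. Both slices must have equal length.
func softmaxBackward(gradOutput, probs []float32) []float32 {
	// The dot product is shared by every output element.
	var dot float64
	for i := range probs {
		dot += float64(gradOutput[i]) * float64(probs[i])
	}
	grad := make([]float32, len(probs))
	for j := range probs {
		grad[j] = float32(float64(probs[j]) * (float64(gradOutput[j]) - dot))
	}
	return grad
}

func main() {
	probs := []float32{0.66, 0.24, 0.10}
	gradOutput := []float32{1, 0, 0} // upstream gradient on the first class
	fmt.Printf("%.4f\n", softmaxBackward(gradOutput, probs))
}
```

A useful sanity check is that the returned gradients sum to zero whenever `probs` sums to 1, since shifting every logit by the same constant leaves softmax unchanged.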

---

## Summary Table

| Variant | Produces zeros | Stochastic | Key parameter | Best for |
|:--------|:--------------|:-----------|:--------------|:---------|
| Standard | No | No | — | General classification |
| Temperature | No | No | `Temperature` | Sampling sharpness |
| Gumbel | No | Yes | — | Differentiable sampling |
| Masked | Yes (at mask) | No | `Mask []bool` | Causal attention |
| Sparse | Yes | No | — | Hard sparse attention |
| Entmax | Maybe | No | `EntmaxAlpha` | Tunable sparsity |
| Grid | No | No | `SoftmaxRows/Cols` | MoE, multi-group |
| Hierarchical | No | No | `HierarchyLevels` | Tree vocabularies |
| Adaptive | No | No | — | (future) |
| Mixture | No | No | — | (future) |

---

## Serialization, Persistence, and Loading

Source: https://openfluke.com/docs/serialization
Markdown: https://openfluke.com/docs/serialization.md

# Serialization, Persistence, and Loading

This document covers how `VolumetricNetwork` instances are saved and loaded, the bit-packed persistence format for low-bit types, the idempotency guarantee, and SafeTensors support.
--- ## Two Serialization Paths `poly/` provides two complementary serialization systems: | File | Functions | Use case | |:-----|:---------|:---------| | `serialization.go` | `BuildNetworkFromJSON` | Architecture-only: creates a network from a spec with randomly initialized weights | | `persistence.go` | `SerializeNetwork` / `DeserializeNetwork` | Full save/load: architecture + trained weights | --- ## Full Save/Load (persistence.go) ### Saving ```go jsonData, err := poly.SerializeNetwork(network) os.WriteFile("model.json", jsonData, 0644) ``` `SerializeNetwork` walks every layer and builds a `PersistenceNetworkSpec`: ```go type PersistenceNetworkSpec struct { ID string `json:"id"` Depth int `json:"depth"` Rows int `json:"rows"` Cols int `json:"cols"` LayersPerCell int `json:"layers_per_cell"` Layers []PersistenceLayerSpec `json:"layers"` } ``` Each `PersistenceLayerSpec` contains all configuration fields plus: ```go DType string `json:"dtype"` // active numerical type for this layer (e.g. "Uint8", "FP4") Weights string `json:"weights,omitempty"` // Base64-encoded **native-packed** payload for that dtype Native bool `json:"native,omitempty"` // true = weights are native-packed (current default on save) Scale float32 `json:"scale,omitempty"` // morph/quant scale used when the checkpoint was written ``` ### Native JSON per dtype (not FP32-only) `SerializeNetwork` no longer dumps a single FP32 master blob for every layer. On save it: 1. Reads each layer’s live `DType` and writes it to `PersistenceLayerSpec.DType`. 2. Calls `WeightStore.Morph(dt)` for that dtype and `encodeNativeWeights(active, dt)` — Int8 as 1 byte/weight, FP4/Int4 as nibbles, Binary as bit-packs, Float64 as LE uint64, etc. 3. Sets `Native: true` and persists `Scale` so reload uses the same quant mapping training saw. 
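
Step 2 above mentions that FP4/Int4 weights are stored as nibbles. As a rough illustration of what that packing involves, here is a minimal round-trip sketch. `packInt4` and `unpackInt4` are hypothetical helpers, not Loom functions, and the layout (even index in the high nibble, odd index in the low nibble, sign-extension on unpack) is assumed for the example.

```go
package main

import "fmt"

// packInt4 stores two signed 4-bit values (range -8..7) per byte:
// even indices in the high nibble, odd indices in the low nibble.
func packInt4(vals []int8) []byte {
	buf := make([]byte, (len(vals)+1)/2)
	for i, v := range vals {
		nib := byte(v) & 0x0F // keep the low 4 bits (two's complement)
		if i%2 == 0 {
			buf[i/2] |= nib << 4
		} else {
			buf[i/2] |= nib
		}
	}
	return buf
}

// unpackInt4 reverses packInt4 for n values, sign-extending each nibble:
// raw values above 7 represent negatives, so 16 is subtracted.
func unpackInt4(buf []byte, n int) []int8 {
	out := make([]int8, n)
	for i := range out {
		var nib byte
		if i%2 == 0 {
			nib = buf[i/2] >> 4
		} else {
			nib = buf[i/2] & 0x0F
		}
		v := int8(nib)
		if nib > 7 {
			v -= 16
		}
		out[i] = v
	}
	return out
}

func main() {
	weights := []int8{-8, 7, -1, 0, 3}
	packed := packInt4(weights)
	fmt.Printf("%d weights -> %d bytes\n", len(weights), len(packed))
	fmt.Println(unpackInt4(packed, len(weights)))
}
```

The element count has to be carried separately (here as `n`), because an odd number of weights leaves the final low nibble unused.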
**Implication:** a **Uint8** Dense checkpoint is ~**0.8 KB** on disk for the Lucy 8×1024→512 bench; **Float64** is ~**5.4 MB** for the same topology — see the **File** column in Lucy’s training matrix (`lucy/lucy_testing_output/log.txt`). You can train, save, and reload **each of the 21 dtypes** independently; Lucy’s Dense suite reports **Save/Reload PASS** on all of them in the latest full run. Older checkpoints with `Native: false` (FP32 master only) still load via `decodeWeights`; new saves prefer native packing. ### Loading ```go jsonData, _ := os.ReadFile("model.json") network, err := poly.DeserializeNetwork(jsonData) ``` `DeserializeNetwork` reconstructs the `VolumetricNetwork`, initializes fresh `WeightStore`s, then calls `applyPersistenceLayerSpec` for each layer which: 1. Parses all config fields 2. Calls `initializeWeights(l)` to allocate the correct `WeightStore` size 3. Decodes the `Weights` string — using `decodeNativeWeights` if `Native=true`, or `decodeWeights` (FP32 master) if `Native=false` 4. If native format (`Native=true`): stores in `Versions[dtype]`, then calls `Unpack(dtype)` to reconstruct the FP32 master for training paths that still use master weights 5. Recursively applies the same process to `ParallelBranches` and `SequentialLayers` --- ## The Bit-Packing System The core serialization innovation is `encodeNativeWeights(data any, dt DType) string`. 
This function takes the `active` version from the `WeightStore.Versions` map and packs it into the most compact binary representation before Base64 encoding: ``` DType Packing Ratio vs FP32 ────────────────────────────────────────────────────── Float64 8 bytes/weight (LE uint64) 0.5x size reduction Float32 4 bytes/weight (LE uint32) 1x (baseline) Float16 4 bytes (stored as float32) not yet compact BFloat16 4 bytes (stored as float32) not yet compact Int8/Uint8 1 byte/weight 4x reduction Int4/FP4/Uint4 0.5 bytes (2 per byte) 8x reduction Int2/Uint2 0.25 bytes (4 per byte) 16x reduction Ternary 0.25 bytes (4 per byte) 16x reduction Binary 0.125 bytes (8 per byte) 32x reduction ``` ### 4-bit Packing Detail ```go // Pack 2 int8 weights into 1 byte using upper and lower nibbles: buf[i/2] |= (byte(v & 0x0F) << 4) // high nibble for even index buf[i/2] |= (byte(v & 0x0F)) // low nibble for odd index ``` Unpacking sign-extends the nibble: if the 4-bit value is > 7, subtract 16 to recover the signed value. ### 2-bit/Ternary Packing Detail ```go // Pack 4 values into 1 byte using 2-bit fields: shift := uint(6 - (i%4)*2) // 6, 4, 2, 0 buf[i/4] |= (val & 0x03) << shift ``` Unpacking reverses the shift and sign-extends from 2-bit. ### Binary Packing Detail ```go // Pack 8 weights into 1 byte, MSB first: if v > 0 { buf[i/8] |= (1 << uint(7-(i%8))) } ``` Unpacking reads each bit and maps `1 → +1`, `0 → -1`. --- ## Idempotency Guarantee The README states: "Serializing a reloaded model produces a byte-for-byte identical JSON to the original." This holds because: 1. `DeserializeNetwork` calls `Unpack(dtype)` which reconstructs `Master` from the packed data 2. The next `SerializeNetwork` call reads `Master`, calls `Morph(dtype)` again (if needed), and re-packs 3. 
Since `Morph` is deterministic (same formula, same scale), and the `Master` was faithfully reconstructed by `Unpack`, the output bytes are identical Verified across 378 permutations (18 layer types × 21 DTypes) with **0.000000% mathematical divergence**. --- ## Architecture-Only JSON (serialization.go) `BuildNetworkFromJSON` creates a network from a spec but uses **random weight initialization** (via `initializeWeights` which calls `Randomize`). This is for defining network topologies without weights. ```go type LayerSpec struct { Z, Y, X, L int Type string // "Dense", "CNN2", etc. Activation string // "ReLU", "Tanh", etc. DType string // "float32", "int8", etc. InputHeight int OutputHeight int // ... all configuration fields ParallelBranches []LayerSpec // recursive SequentialLayers []LayerSpec // recursive } ``` `ParseLayerType`, `ParseActivationType`, and `ParseDType` accept case-insensitive strings plus common aliases. --- ## SafeTensors Support `safetensors.go` and `prefix_safetensor.go` implement loading from the HuggingFace SafeTensors format, enabling direct weight import from PyTorch/HuggingFace checkpoints. `universal_loader.go` provides auto-detection of the model format. The `Transformer[T]` type has dedicated loading support in `transformer.go` for assembling a full LLM from SafeTensors files: it maps weight tensor names (e.g., `"model.layers.0.self_attn.q_proj.weight"`) to the correct `VolumetricLayer` positions and weight sub-slices. 
--- ## Compression Ratios in Practice From the README, for a network with 1M weights: ``` ┌──────────────────────────────────────────────────────────────┐ │ DType RAM (uncompressed) JSON size Ratio │ ├──────────────────────────────────────────────────────────────┤ │ Float32 4.0 MB ~5.5 MB 1.38x (base64) │ │ Int8 1.0 MB ~1.4 MB 0.34x vs FP32 │ │ Int4 0.5 MB ~0.7 MB 0.17x │ │ Binary 0.125 MB ~0.18 MB 0.045x ← 98.4% │ └──────────────────────────────────────────────────────────────┘ ``` Base64 encoding adds ~33% overhead over the raw binary size. The 98.4% figure is relative to FP32 on disk (including the base64 overhead). --- ## Weight Encoding Flow ``` Training produces Master []float32 │ ▼ (if layer.DType != DTypeFloat32) Morph(layer.DType) │ ▼ Versions[dtype] = []int8 / []int4 / etc. │ ▼ encodeNativeWeights(active, dtype) │ ┌──────┴──────┐ │ │ ▼ ▼ bit-packing Base64 encode │ │ └──────┬──────┘ │ ▼ PersistenceLayerSpec.Weights = "base64string..." PersistenceLayerSpec.Native = true PersistenceLayerSpec.Scale = ws.Scale ``` --- ## Deserialization and Unpack Flow ``` JSON string │ ▼ json.Unmarshal PersistenceNetworkSpec │ ▼ applyPersistenceLayerSpec For each layer: 1. ParseLayerType / ParseActivationType / ParseDType 2. initializeWeights → fresh WeightStore allocated 3. if ls.Native: decodeNativeWeights → Versions[dtype] = packed slices ws.Unpack(dtype) → Master reconstructed else: decodeWeights → Master loaded directly 4. Recurse for ParallelBranches, SequentialLayers ``` After `DeserializeNetwork`, every layer's `WeightStore.Master` is a valid FP32 weight array ready for forward inference or further training. 
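The 4-bit packing rule from the bit-packing section (even index in the high nibble, odd index in the low nibble, sign-extension on unpack) can be sketched as a standalone round trip. This is an illustration of the technique only. `packInt4` and `unpackInt4` are hypothetical names and are not the bodies of `encodeNativeWeights` or `decodeNativeWeights`.

```go
package main

import "fmt"

// packInt4 packs signed 4-bit values (expected range [-8, 7]) two per byte.
// Even indices land in the high nibble, odd indices in the low nibble.
func packInt4(vals []int8) []byte {
	buf := make([]byte, (len(vals)+1)/2)
	for i, v := range vals {
		nib := byte(v) & 0x0F // keep the low 4 bits of the two's-complement value
		if i%2 == 0 {
			buf[i/2] |= nib << 4
		} else {
			buf[i/2] |= nib
		}
	}
	return buf
}

// unpackInt4 reverses packInt4 for n values and sign-extends each nibble.
func unpackInt4(buf []byte, n int) []int8 {
	out := make([]int8, n)
	for i := range out {
		b := buf[i/2]
		var nib byte
		if i%2 == 0 {
			nib = b >> 4
		} else {
			nib = b & 0x0F
		}
		v := int8(nib)
		if v > 7 { // nibble values 8..15 represent -8..-1
			v -= 16
		}
		out[i] = v
	}
	return out
}

func main() {
	vals := []int8{-8, 7, -1, 0, 3}
	packed := packInt4(vals)
	fmt.Printf("packed %d values into %d bytes: % x\n", len(vals), len(packed), packed)
	fmt.Println("round trip:", unpackInt4(packed, len(vals)))
}
```

The element count is passed to `unpackInt4` explicitly because an odd number of weights leaves the final low nibble unused.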
--- ## Parallel and Sequential Layers Source: https://openfluke.com/docs/parallel-sequential Markdown: https://openfluke.com/docs/parallel-sequential.md # Parallel and Sequential Layers This document explains `LayerParallel` and `LayerSequential` in depth: how they fan out and chain sub-layers, the five combination modes, the recursive activation tree, and how backpropagation flows through nested structures. --- ## LayerParallel `ParallelForwardPolymorphic` fans the input to every branch simultaneously and then combines the results. ### Configuration ```go layer.Type = poly.LayerParallel layer.CombineMode = "concat" // or "add", "avg", "filter", "grid_scatter" layer.ParallelBranches = []poly.VolumetricLayer{ {Type: poly.LayerDense, InputHeight: 64, OutputHeight: 32, ...}, {Type: poly.LayerRNN, InputHeight: 64, OutputHeight: 32, ...}, {Type: poly.LayerCNN1, InputHeight: 64, ...}, } ``` Each entry in `ParallelBranches` is a full `VolumetricLayer` — it can itself be a `LayerParallel` or `LayerSequential`, enabling unlimited nesting. ### Combination Modes #### "add" Element-wise sum of all branch outputs. All branches must produce the same output shape. ``` Input ──▶ Branch 0 ──▶ [32] Input ──▶ Branch 1 ──▶ [32] → [32] (sum of all) Input ──▶ Branch 2 ──▶ [32] ``` Use for: residual-style ensembles, multi-path feature accumulation. #### "avg" Element-wise average of all branch outputs. Same shape requirement as "add". ``` Output[i] = (Branch0[i] + Branch1[i] + ... + BranchN[i]) / N ``` Use for: soft ensemble averaging where no single branch should dominate. #### "concat" / "grid_scatter" Concatenates all branch outputs into one flat tensor. Branch output sizes can differ. ``` Input ──▶ Branch 0 ──▶ [32] Input ──▶ Branch 1 ──▶ [16] → [32, 16, 64] = [112] Input ──▶ Branch 2 ──▶ [64] ``` `"grid_scatter"` behaves identically to `"concat"` in the current implementation — they share the same code path. 
The name signals intent: scatter the input across a grid of experts, then collect all outputs. Use for: multi-scale feature extraction, heterogeneous expert outputs before a routing layer. #### "filter" (Soft Mixture of Experts) Uses a separate gate sub-layer to produce per-branch weights, then computes a weighted sum: ```go layer.FilterGateConfig = &poly.VolumetricLayer{ Type: poly.LayerDense, InputHeight: 64, OutputHeight: 3, // one scalar per branch Activation: poly.ActivationLinear, } ``` At forward time: ``` Input ──▶ FilterGateConfig ──▶ [numBranches] │ Softmax(gate_logits) │ [w0, w1, w2] ← learned routing weights Input ──▶ Branch 0 ──▶ [32] × w0 Input ──▶ Branch 1 ──▶ [32] × w1 → [32] (weighted sum) Input ──▶ Branch 2 ──▶ [32] × w2 ``` Use for: differentiable Mixture of Experts (MoE), learned feature gating, adaptive multi-scale fusion. --- ## The Activation Tree (Tensor.Nested) The key to making arbitrary nesting differentiable is the `Nested []*Tensor[T]` field on `Tensor`. During `ParallelForwardPolymorphic`, each branch produces its own `(bPre, bOut)` pair. The branch `preAct` tensors are collected into a slice and stored as `Nested` on the returned `preAct`: ```go preAct = &Tensor[T]{ Data: input.Data, // proxy — carries input shape Shape: input.Shape, DType: input.DType, Nested: branchPreActs, // [branch0.preAct, branch1.preAct, ...] 
} ``` During `ParallelBackwardPolymorphic`, the backward function reads `preAct.Nested[i]` to get the correct cached state for each branch: ```go var bPre *Tensor[T] if preAct != nil && i < len(preAct.Nested) { bPre = preAct.Nested[i] } gIn, gW := DispatchLayerBackward(target, scaledGrad, input, nil, bPre) ``` This creates a recursive tree of activation caches that mirrors the nesting depth of the network: ``` preAct.Nested: ├── Branch 0 preAct │ └── (if branch 0 is also Parallel) │ └── .Nested │ ├── Sub-branch 0 preAct │ └── Sub-branch 1 preAct ├── Branch 1 preAct └── Branch 2 preAct ``` The backward pass recursively walks this tree, ensuring each sub-layer gets the exact cached pre-activation it needs to compute its gradient. --- ## Gradient Flow Through Parallel For "add" and "avg" modes, the same `gradOutput` (or a scaled version) is sent to every branch: ``` gradOutput │ ├──── scaledGrad ──▶ Branch 0 backward ──▶ gradInput_0 + gradWeights_0 ├──── scaledGrad ──▶ Branch 1 backward ──▶ gradInput_1 + gradWeights_1 └──── scaledGrad ──▶ Branch 2 backward ──▶ gradInput_2 + gradWeights_2 gradInput = gradInput_0 + gradInput_1 + gradInput_2 (accumulated) ``` For "avg" mode, `scaledGrad = gradOutput / N` before dispatching. For "concat" mode, the gradient is **sliced** by branch output size: ``` gradOutput [112]: branch 0 slice: gradOutput[0:32] → Branch 0 backward branch 1 slice: gradOutput[32:48] → Branch 1 backward branch 2 slice: gradOutput[48:112] → Branch 2 backward ``` For "concat" backward, the branch output size is determined by running a forward pass to measure `len(out.Data)`. This is a known overhead — for large models, consider caching branch output sizes. 
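The "concat" slicing described above can be sketched in a few lines. The example assumes the branch output sizes are already known (for instance, cached from the forward pass). `splitConcatGrad` is a hypothetical helper for illustration and is not part of `poly/`.

```go
package main

import "fmt"

// splitConcatGrad cuts a concatenated output gradient into one sub-slice per
// branch. The returned slices alias gradOutput; no data is copied.
func splitConcatGrad(gradOutput []float32, branchSizes []int) ([][]float32, error) {
	total := 0
	for _, n := range branchSizes {
		if n < 0 {
			return nil, fmt.Errorf("negative branch size %d", n)
		}
		total += n
	}
	if total != len(gradOutput) {
		return nil, fmt.Errorf("branch sizes sum to %d, gradient has %d elements", total, len(gradOutput))
	}
	parts := make([][]float32, len(branchSizes))
	offset := 0
	for i, n := range branchSizes {
		parts[i] = gradOutput[offset : offset+n]
		offset += n
	}
	return parts, nil
}

func main() {
	grad := make([]float32, 112)
	for i := range grad {
		grad[i] = float32(i) // value equals index, so offsets are easy to see
	}
	parts, err := splitConcatGrad(grad, []int{32, 16, 64})
	if err != nil {
		panic(err)
	}
	for i, p := range parts {
		fmt.Printf("branch %d: %d values starting at index %.0f\n", i, len(p), p[0])
	}
}
```

Caching the sizes once per forward pass, as suggested above, would let a backward implementation skip the extra measuring pass.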
The `gradWeights` returned by `ParallelBackwardPolymorphic` is a synthetic tensor with no `Data` — only `Nested`: ```go gradWeights = &Tensor[T]{ Nested: branchGradWeights, // per-branch weight gradients } ``` `ApplyRecursiveGradients` recognizes this pattern and dispatches weight updates to each branch recursively. --- ## LayerSequential `SequentialForwardPolymorphic` chains sub-layers in order, each receiving the output of the previous one. ```go layer.Type = poly.LayerSequential layer.SequentialLayers = []poly.VolumetricLayer{ {Type: poly.LayerDense, InputHeight: 128, OutputHeight: 256, ...}, {Type: poly.LayerRMSNorm, InputHeight: 256, ...}, {Type: poly.LayerDense, InputHeight: 256, OutputHeight: 64, ...}, } ``` This is how transformer blocks are typically assembled: `RMSNorm → MHA → RMSNorm → SwiGLU`. ### Step Containers For each sub-layer, the forward pass stores a "step container" — a tensor whose `Nested` holds `[bPre, bInput, bSkip]`: ```go stepContainer := &Tensor[T]{ Nested: []*Tensor[T]{ bPre, // Nested[0]: preAct from this sub-layer current, // Nested[1]: the input this sub-layer received lastInput, // Nested[2]: the previous input (for skip connections) }, } stepIntermediates[i] = stepContainer ``` The outer `preAct` returned by `SequentialForwardPolymorphic` carries all step containers in its `Nested`: ```go preAct = &Tensor[T]{ Data: input.Data, Nested: stepIntermediates, // [step0container, step1container, step2container] } ``` ### Sequential Backward The backward pass iterates sub-layers in **reverse** order: ```go for i := len(layer.SequentialLayers) - 1; i >= 0; i-- { container := preAct.Nested[i] bPre = container.Nested[0] bInput = container.Nested[1] bSkip = container.Nested[2] stepGradOutput = currentGrad if skipGradients[i+1] != nil { stepGradOutput.Add(skipGradients[i+1]) // add skip gradient } gIn, gW = DispatchLayerBackward(target, stepGradOutput, bInput, bSkip, bPre) currentGrad = gIn } ``` `skipGradients` is a slice that accumulates 
gradients flowing back through skip connections inside the sequence. If a sub-layer (like `LayerResidual`) produces a gradient flowing back to an earlier step, it is accumulated here. --- ## Remote Links Inside Branches Both `ParallelForwardPolymorphic` and `SequentialForwardPolymorphic` support `IsRemoteLink` on individual branches: ```go if branch.IsRemoteLink && layer.Network != nil { if remote := layer.Network.GetLayer(branch.TargetZ, branch.TargetY, branch.TargetX, branch.TargetL); remote != nil { target = remote } } ``` This allows a branch to redirect to any layer in the parent `VolumetricNetwork`, enabling cross-cell feature reuse without duplicating layer definitions. --- ## Tiling Propagation When `layer.UseTiling = true` on the parent Sequential layer, the flag is propagated to each sub-layer before dispatch: ```go if layer.UseTiling { target.UseTiling = true target.TileSize = layer.TileSize } ``` This means you can set tiling on the top-level Sequential layer and all its sub-layers inherit `UseTiling` and `TileSize` automatically. **`EnableMultiCoreTiling` is not propagated here** — it lives on `VolumetricNetwork` (and may be copied onto layers for training). **GPU** SC vs MC is chosen from **`Network.EnableMultiCoreTiling`** plus `GPUSCTileSizes` / `GPUMCTileSizes` after `RefreshRuntimeTileSizes()`. **CPU** sub-layers use **`GetCPUTileSize`** only (one map per layer, not SC/MC pair); see [dispatch.md](dispatch.md). 
--- ## Practical Example: Transformer Block as Sequential ```go block := poly.VolumetricLayer{ Type: poly.LayerSequential, SequentialLayers: []poly.VolumetricLayer{ { Type: poly.LayerRMSNorm, InputHeight: 512, OutputHeight: 512, }, { Type: poly.LayerMultiHeadAttention, DModel: 512, NumHeads: 8, NumKVHeads: 8, HeadDim: 64, MaxSeqLen: 2048, }, { Type: poly.LayerRMSNorm, InputHeight: 512, OutputHeight: 512, }, { Type: poly.LayerSwiGLU, InputHeight: 512, OutputHeight: 1364, // ~2.67× hidden size }, }, } ``` The entire block is a single `VolumetricLayer` entry in the grid. It runs as a mini-pipeline with the `preAct.Nested` tree tracking all four sub-layer states for backpropagation. --- ## Quantization: DType Conversion and PTQ Pipeline Source: https://openfluke.com/docs/quantization Markdown: https://openfluke.com/docs/quantization.md # Quantization: DType Conversion and PTQ Pipeline This document covers the Post-Training Quantization (PTQ) pipeline in `poly/`: how weights move from FP32 masters into lower-precision formats, the `WeightStore` versioning system, the `Q4_0Block` block-quantization format, and how `MorphToFloat32ForGPU` simulates low-bit arithmetic for GPU upload. --- ## Why Quantization? Running a 7B-parameter model at FP32 requires ~28 GB of RAM. Quantization trades a small amount of numerical fidelity for dramatic memory and compute savings: ``` ┌──────────────────────────────────────────────────────────────────┐ │ DType Bits/weight 1B params Theoretical speedup │ ├──────────────────────────────────────────────────────────────────┤ │ Float64 64 8 GB 0.5× (slower than FP32) │ │ Float32 32 4 GB 1× baseline │ │ BFloat16 16 2 GB 2× │ │ Int8 8 1 GB 4× │ │ Int4/FP4 4 0.5 GB 8× │ │ Int2 2 0.25 GB 16× │ │ Binary 1 0.125 GB 32× │ └──────────────────────────────────────────────────────────────────┘ ``` `poly/` supports all 21 DTypes in the same training and inference loop. Switching precision is a single function call — no retraining required. 
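The memory column in the table above follows directly from `params × bits / 8`. The sketch below reproduces that arithmetic for raw weight storage only (no scales, activations, or KV cache), using decimal gigabytes. `weightBytes` is a hypothetical helper, not a `poly/` function.

```go
package main

import "fmt"

// weightBytes returns the raw storage in bytes for params weights stored at
// bitsPerWeight bits each.
func weightBytes(params, bitsPerWeight uint64) uint64 {
	return params * bitsPerWeight / 8
}

func main() {
	const oneBillion = 1_000_000_000
	rows := []struct {
		name string
		bits uint64
	}{
		{"Float64", 64}, {"Float32", 32}, {"BFloat16", 16},
		{"Int8", 8}, {"Int4/FP4", 4}, {"Int2", 2}, {"Binary", 1},
	}
	for _, r := range rows {
		gb := float64(weightBytes(oneBillion, r.bits)) / 1e9
		fmt.Printf("%-9s %2d bits  %.3f GB per 1B params\n", r.name, r.bits, gb)
	}
	// The 7B FP32 figure quoted at the top of this section.
	fmt.Printf("7B params at FP32: %.0f GB\n", float64(weightBytes(7*oneBillion, 32))/1e9)
}
```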
--- ## The WeightStore: Three-Layer Storage Every `VolumetricLayer` holds a `*WeightStore`: ```go type WeightStore struct { Master []float32 // Source of truth — always FP32 Versions map[DType]any // CPU-resident quantized versions GPUWeights map[DType]any // VRAM-resident wgpu.Buffer versions GPUScales map[DType]*wgpu.Buffer // Per-dtype scale buffers on VRAM Scale float32 // Quantization scale factor } ``` ### Layer 1: Master `Master` is the FP32 weight array that training operates on. Gradient updates always modify `Master`. No other layer is ever trained directly. ### Layer 2: Versions `Versions` is a cache of quantized representations derived from `Master`. Each key is a `DType`. The value type varies: ``` DType Value type in Versions ─────────────────────────────────────── Float64 []float64 Float16/BFloat16 []float32 (simulated — stored as float32 but treated as 16-bit) Int32/Int16/Int8 []int32 / []int16 / []int8 Int4/FP4/Binary []int8 (unpacked — one value per element; bit-packing is for disk only) ``` ### Layer 3: GPUWeights / GPUScales `GPUWeights` holds `wgpu.Buffer` references to VRAM. They are populated via `layer.SyncToGPU()` and consumed by the GPU forward/backward shaders. `GPUScales` holds the quantization scale as a separate GPU buffer used by quantized shader kernels. --- ## Morph: Producing a Quantized Version ```go func (ws *WeightStore) Morph(dtype DType) ``` `Morph` converts `ws.Master` to the target `dtype` and stores the result in `ws.Versions[dtype]`. It is idempotent — if the target version already exists, it returns immediately. 
``` ws.Master ([]float32) │ ├── dtype == Float32 → return immediately (Master is already FP32) │ ├── dtype == Float64 → []float64: direct cast │ ├── dtype == Float16/BFloat16 → []float32: round-trip quantize/dequantize per element │ ├── dtype == Int8/Uint8/FP8* → []int8: v / ws.Scale, clamped to [-128, 127] │ ├── dtype == Int16/Uint16 → []int16: v / ws.Scale │ ├── dtype == Int32/Uint32 → []int32: v / ws.Scale │ └── dtype == Int4/FP4/Int2/Ternary/Binary → []int8 (one per weight): Int4/FP4/Int2: v / ws.Scale, truncated to range Ternary: round to {-1, 0, +1} Binary: +1 if v > 0, else -1 ``` > [!NOTE] > Sub-byte types (Int4, Int2, Binary) are stored in `Versions` as unpacked `[]int8` with one element per weight. The bit-packing into nibbles and pairs happens only during serialization (`encodeNativeWeights`). This keeps the forward pass simple — no runtime unpacking overhead during inference. ### Clearing Versions After Training When `ApplyGradients` runs, it updates `Master` and then clears `Versions`: ```go ws.Versions = make(map[DType]any) ``` This ensures stale quantized copies are not used after a weight update. The next forward pass calls `Morph` again to regenerate the needed version. This lazy invalidation pattern means training overhead is minimal — quantized versions are only regenerated on the first forward pass of each new batch. --- ## Unpack: Reconstructing Master from a Quantized Version ```go func (ws *WeightStore) Unpack(dtype DType) ``` `Unpack` is the inverse of `Morph`. It reads `ws.Versions[dtype]` and reconstructs `ws.Master`. This is used after deserialization — the JSON stores the quantized version, and `Unpack` brings `Master` back to FP32 so the network is ready for inference or further training. ``` ws.Versions[dtype] │ ├── []float64 → cast to float32 ├── []float32 → copy directly (Float16/BFloat16 simulation) ├── []int8 → v * ws.Scale (for Int8, FP8, Int4, Int2, etc.) 
├── []int16 → v * ws.Scale └── []int32 → v * ws.Scale ``` --- ## MorphToFloat32ForGPU: PTQ Simulation for GPU Upload ```go func (ws *WeightStore) MorphToFloat32ForGPU(dtype DType) []float32 ``` For layers that don't have a dedicated packed GPU path (CNN1-3, RNN, LSTM, Embedding), this function produces a float32 buffer that represents the master weights after a quantize → dequantize round-trip at the target dtype. The GPU shader reads `array<f32>` and sees weights already "damaged" by quantization — inference-accurate without needing new shaders. ``` ┌──────────────────────────────────────────────────────────────────────┐ │ How MorphToFloat32ForGPU works for Int8 (scale = 0.01): │ │ │ │ Input: v = 0.437 │ │ Step 1: Morph to Int8 → q = round(0.437 / 0.01) = 44 │ │ Step 2: clamp → q = clamp(44, -128, 127) = 44 │ │ Step 3: dequantize → result = 44 * 0.01 = 0.44 │ │ │ │ The rounding error is: |0.437 - 0.44| = 0.003 │ │ This error is what Int8 quantization "costs" │ └──────────────────────────────────────────────────────────────────────┘ ``` Training always operates on the FP32 `Master` — `MorphToFloat32ForGPU` is only called at GPU upload time (`SyncToGPU`). This is PTQ, not QAT: the model is trained at full precision and precision loss is applied at inference time. --- ## Scale Calibration `ws.Scale` is the per-layer quantization scale. It is computed during `Morph` using the **absolute-maximum** calibration strategy: ``` scale = max(|weight|) / maxQuantValue For Int8: maxQuantValue = 127 For Int4: maxQuantValue = 7 For Int2: maxQuantValue = 1 For Int1: maxQuantValue = 1 (binary: +1/-1) ``` This is the simplest calibration method — no calibration data required. It is a Post-Training Quantization (PTQ) approach: train at FP32, then call `MorphLayer` to convert to the target dtype. The scale is derived analytically from the weight distribution alone.
> [!TIP] > For activation-aware quantization (computing scale from representative inputs rather than from weights alone), you would need to run a calibration forward pass and inject the computed scale into `ws.Scale` before calling `Morph`. The current pipeline does not implement observer-based calibration for activations — only weight calibration. --- ## MorphLayer: Network-Wide Conversion ```go func MorphLayer(n *VolumetricNetwork, dtype DType) ``` `MorphLayer` iterates all layers in the network and calls `ws.Morph(dtype)` on each. This is the primary entry point for converting a trained FP32 network to a lower-precision format: ```go // Train at FP32 poly.Train(network, trainingData, config) // Convert to Int8 for deployment poly.MorphLayer(network, poly.DTypeInt8) // The network is now ready for Int8 inference // All new forward passes will use Versions[DTypeInt8] ``` For layers that already have a version for the target `dtype`, `Morph` skips them. To force a re-quantization (e.g., after manual scale adjustment), clear the version first: ```go delete(layer.WeightStore.Versions, poly.DTypeInt8) layer.WeightStore.Morph(poly.DTypeInt8) ``` --- ## Q4_0Block: Block Quantization In addition to the global-scale quantization in `WeightStore.Morph`, `poly/` implements the **Q4_0 block format** used by llama.cpp and GGUF: ```go type Q4_0Block struct { Scale float32 // one float32 scale per block Weights [16]byte // 32 nibbles (4-bit signed values) } // Total: 4 + 16 = 20 bytes per block // Bandwidth: 20 bytes / 32 weights = 0.625 bytes/weight ``` ### QuantizeQ4_0 ```go func QuantizeQ4_0(weights []float32) []Q4_0Block ``` Converts a flat FP32 slice into Q4_0 blocks: ``` For each block of 32 weights: 1. Find maxAbs = max(|weights[i]|) in the block 2. scale = maxAbs / 7.0 ← 4-bit signed range is [-8, 7] 3. 
For each weight pair (w1, w2): q1 = round(w1 / scale), clamped to [-8, 7] q2 = round(w2 / scale), clamped to [-8, 7] byte[j] = (q1 & 0xF) | ((q2 & 0xF) << 4) ← pack 2 values per byte ``` The per-block scale means every 32 weights have their own scale factor, which is significantly more accurate than a single global scale for the entire layer. This is why Q4_0 retains much higher fidelity than naive Int4. ### DequantizeQ4_0 ```go func DequantizeQ4_0(blocks []Q4_0Block, n int) []float32 ``` Unpacks nibbles and applies the per-block scale: ``` For each block: For each byte b: q1 = (b & 0xF) → sign-extend: if q1 > 7, q1 -= 16 q2 = (b >> 4) → sign-extend: if q2 > 7, q2 -= 16 res[idx1] = float32(q1) * block.Scale res[idx2] = float32(q2) * block.Scale ``` ### Q4_0 vs Global Int4 ``` ┌───────────────────────────────────────────────────────────────────┐ │ Comparison for a Dense layer with 4096×4096 weights │ │ │ │ Format Scale count Bytes Notes │ │───────────────────────────────────────────────────────────────── │ │ FP32 1 (implicit) 67.1 MB No quantization │ │ Global Int4 1 8.4 MB One scale for all │ │ Q4_0 blocks 524288 10.5 MB One scale per 32 wts │ │ (25% overhead vs global Int4, 10× fidelity) │ └───────────────────────────────────────────────────────────────────┘ ``` Q4_0 is the preferred format for loading HuggingFace/GGUF checkpoints. The `universal_loader.go` and `safetensors.go` paths use `QuantizeQ4_0` internally when importing Q4_0 tensors. --- ## The Full PTQ Workflow ``` ┌──────────────────────────────────────────────────────────────────────┐ │ 1. Train at FP32 │ │ │ │ poly.Train[float32](network, data, config) │ │ → Master updated each batch │ │ → Versions map is cleared after each update │ │ │ │ 2. (Optional) Calibrate scale │ │ │ │ For each layer: │ │ maxAbs := findMaxAbs(layer.WeightStore.Master) │ │ layer.WeightStore.Scale = maxAbs / targetRange │ │ │ │ 3.
Morph to target dtype │ │ │ │ poly.MorphLayer(network, poly.DTypeInt4) │ │ → Versions[DTypeInt4] = []int8{...} created for each layer │ │ → Scale stored in WeightStore.Scale │ │ │ │ 4. Save the quantized model │ │ │ │ jsonData, _ := poly.SerializeNetwork(network) │ │ os.WriteFile("model_int4.json", jsonData, 0644) │ │ → encodeNativeWeights packs []int8 into nibbles (0.5 bytes/wt) │ │ │ │ 5. Load and run inference │ │ │ │ network, _ := poly.DeserializeNetwork(jsonData) │ │ → Unpack(DTypeInt4) reconstructs Master from nibbles │ │ → Versions[DTypeInt4] restored for fast inference │ │ → forward passes use Versions[DTypeInt4], not Master │ └──────────────────────────────────────────────────────────────────────┘ ``` --- ## Forward Pass with Quantized Weights During a forward pass, `DispatchLayer` calls the layer-specific function (e.g., `DenseForwardPolymorphic`). Inside that function, the active weights are retrieved via: ```go weights := layer.WeightStore.GetActive(layer.DType) if weights == nil { weights = layer.WeightStore.Master } ``` `GetActive` returns `Versions[dtype]` if it exists, otherwise `nil`. If the version is missing (e.g., after a gradient update), the forward pass falls back to `Master` and `Morph` regenerates the version on the next call. This lazy re-quantization is always correct. For the GPU path, `GetActive` for GPU dtypes reads from `GPUWeights[dtype]` via the shader's bind group. The CPU never sees these weights once they are on VRAM. --- ## Accuracy vs. 
Compression Trade-offs From empirical benchmarks in the README: ``` ┌─────────────────────────────────────────────────────────────────┐ │ DType Similarity to FP32 (cosine) Size factor │ ├─────────────────────────────────────────────────────────────────┤ │ Float64 1.000 2.0× larger │ │ BFloat16 0.999+ 0.5× │ │ Int8 0.998+ 0.25× │ │ Int4/FP4 0.99+ 0.125× │ │ Int2 0.97+ 0.0625× │ │ Ternary 0.96+ 0.0625× │ │ Binary 0.90+ 0.03125× │ └─────────────────────────────────────────────────────────────────┘ ``` The similarity scores are measured with `poly.CompareNetworks` (see `dna.md`) — comparing the cosine angle between normalized weight vectors after precision simulation. A score of 0.999 means the quantized layer points in essentially the same direction as the FP32 layer, meaning functional behavior is preserved. > [!NOTE] > Binary (1-bit) networks at 0.90 cosine similarity will show measurable accuracy degradation on complex tasks. Binary quantization is best suited for embedding layers, lookup tables, or architectures specifically designed for 1-bit operation (e.g., BitNet). For most tasks, Int8 or Int4 provides the best accuracy/compression balance. --- ## Transformer Architecture: MHA, RoPE, GQA, and Full Block Assembly Source: https://openfluke.com/docs/transformer Markdown: https://openfluke.com/docs/transformer.md # Transformer Architecture: MHA, RoPE, GQA, and Full Block Assembly This document covers `LayerMultiHeadAttention` (MHA), how RoPE positional encoding is applied, Grouped-Query Attention (GQA) and Multi-Query Attention (MQA), the KV cache, SwiGLU and RMSNorm layers, full transformer block assembly inside `VolumetricNetwork`, and the `Transformer[T]` high-level generation type. It also documents the Qwen-style attention path now supported in Loom: - expanded query dimension (`QueryDim`) where `num_heads * head_dim != d_model` - per-head Q/K RMSNorm (`q_norm` / `k_norm`) - config-driven RMSNorm epsilon (`rms_norm_eps`) parity across CPU and GPU. 
--- ## LayerMultiHeadAttention `LayerMultiHeadAttention` (type index 16) implements scaled dot-product attention with optional RoPE, optional GQA/MQA, and an incremental KV cache. ### Key Fields on VolumetricLayer ```go layer.Type = poly.LayerMultiHeadAttention layer.DModel = 512 // model dimension (embedding size) layer.NumHeads = 8 // query heads layer.NumKVHeads = 8 // key/value heads (set < NumHeads for GQA/MQA) layer.HeadDim = 64 // dimensions per head (DModel / NumHeads) layer.QueryDim = 512 // optional; defaults to DModel when unset layer.MaxSeqLen = 2048 // maximum sequence length (KV cache size) layer.RoPEFreqBase = 10000.0 // RoPE theta; 0 = no positional encoding layer.RMSNormEps = 1e-6 // used by RMSNorm layers ``` For Qwen-style checkpoints, `head_dim` may be explicitly specified in config and `QueryDim` should be set to: `QueryDim = NumHeads * HeadDim`. ### Weight Layout All four projection matrices and their bias vectors are stored contiguously in `WeightStore.Master`. Offsets and sizes below are counted in float32 elements, not bytes: ``` Offset 0 queryDim × dModel Q weight matrix Offset queryDim×dModel kvDim × dModel K weight matrix Offset queryDim×dModel + kvDim×dModel V weight matrix Offset queryDim×dModel + 2×kvDim×dModel dModel × queryDim O weight matrix After all weight matrices: + queryDim elements Q bias vector + kvDim elements K bias vector + kvDim elements V bias vector + dModel elements O bias vector Total: queryDim×dModel + 2×kvDim×dModel + dModel×queryDim + queryDim + 2×kvDim + dModel ``` Where `kvDim = NumKVHeads × HeadDim`. For standard MHA (`NumKVHeads == NumHeads` and `QueryDim == DModel`): ``` kvDim = dModel Total = 4 × dModel² + 4 × dModel weights (including biases) ``` --- ## Forward Pass: Step by Step ### 1.
Linear Projections Input shape: `[seqLen, dModel]` ``` For each token position s: Q[s, i] = bias_Q[i] + Σⱼ input[s, j] × W_Q[i, j] K[s, i] = bias_K[i] + Σⱼ input[s, j] × W_K[i, j] V[s, i] = bias_V[i] + Σⱼ input[s, j] × W_V[i, j] Q shape: [seqLen, queryDim] (numHeads × headDim) K shape: [seqLen, kvDim] (numKVHeads × headDim) V shape: [seqLen, kvDim] ``` ### 1.5 Q/K Norm (Qwen-style) If `model.layers.N.self_attn.q_norm.weight` and `k_norm.weight` are present, Loom applies per-head RMSNorm to projected Q and K before RoPE/attention scoring. This path is active in both CPU and GPU forward implementations. ### 2. RoPE: Rotary Positional Encoding If `layer.RoPEFreqBase > 0`, RoPE is applied to Q and K after projection. RoPE encodes position by rotating pairs of values within each head, pairing index `d` in the first half of the head dimension with index `d + headDim/2` in the second half: ``` For each token at position pos, head h, dimension pair (d, d + headDim/2): freq = 1 / (RoPEFreqBase ^ (2d / headDim)) angle = freq × pos cos_a, sin_a = cos(angle), sin(angle) Q[pos, h×headDim + d] = Q0 × cos_a - Q1 × sin_a Q[pos, h×headDim + d + headDim/2] = Q0 × sin_a + Q1 × cos_a (same for K, using the KV head index) ``` Here `Q0` and `Q1` are the values at `h×headDim + d` and `h×headDim + d + headDim/2` before rotation. RoPE gives the attention mechanism a way to learn relative positions without adding learned positional embeddings. Positions encode directly into the dot-product scores. ``` ┌──────────────────────────────────────────────────────────────────┐ │ RoPE effect on attention scores │ │ │ │ Token at pos 0: angle = 0 → cos=1, sin=0 → no rotation │ │ Token at pos 1: angle = freq → slight rotation │ │ Token at pos N: angle = N×freq → large rotation for low d │ │ │ │ Relative distance (pos_q - pos_k) is captured in the dot │ │ product because cos(angle_q - angle_k) = cos(Δangle). │ └──────────────────────────────────────────────────────────────────┘ ``` ### 3.
KV Cache (Float32 Path Only) The Float32 forward path maintains an incremental KV cache: ```go // Lazy initialization on first forward call if layer.KVCacheK == nil { layer.KVCacheK = NewTensor[float32](MaxSeqLen, kvDim) layer.KVCacheV = NewTensor[float32](MaxSeqLen, kvDim) layer.KVOffset = 0 } // Write current position into the ring buffer pos := layer.KVOffset + s kRow := KVCacheK.Data[(pos % MaxSeqLen) * kvDim : ...] // compute K for this token and write into kRow layer.KVOffset += seqLen // advance after full sequence ``` The cache is a ring buffer of size `MaxSeqLen`. On each call, new K and V values are written at positions `[KVOffset, KVOffset + seqLen)`. The attention score computation then looks back over all `currentTotalPos + 1` cached positions, giving the model memory of the full context up to `MaxSeqLen` tokens. To clear the KV cache between independent prompts: ```go transformer.Reset() // sets KVOffset = 0 for all layers ``` ### 4. Grouped-Query Attention (GQA / MQA) GQA reduces memory bandwidth by sharing KV heads across multiple query heads: ``` headsPerKV = NumHeads / NumKVHeads For query head h: kvHead = h / headsPerKV ← all query heads in a group share one KV head ``` ``` ┌──────────────────────────────────────────────────────────────────────┐ │ Standard MHA: NumHeads = NumKVHeads = 8 │ │ Each head has its own K and V. │ │ │ │ Q0──K0/V0 Q1──K1/V1 Q2──K2/V2 ... Q7──K7/V7 │ │ │ │ GQA: NumHeads = 8, NumKVHeads = 2 │ │ 4 query heads share each KV head. │ │ │ │ Q0, Q1, Q2, Q3 ──K0/V0 │ │ Q4, Q5, Q6, Q7 ──K1/V1 │ │ │ │ MQA: NumHeads = 8, NumKVHeads = 1 │ │ All query heads share one KV head. │ │ │ │ Q0...Q7 ─────────K0/V0 │ └──────────────────────────────────────────────────────────────────────┘ ``` GQA is the default in modern LLMs like Llama 3 because it reduces KV cache memory by `NumHeads / NumKVHeads`× without measurable quality loss. ### 5. 
Causal Attention Causality is enforced by the score computation loop: ```go // For query at position qPos, only attend to positions <= qPos for kPos := 0; kPos <= qPos; kPos++ { dot = Q[qPos] · K[kPos] scores[kPos] = dot / sqrt(headDim) } // positions > qPos are never included — no explicit mask needed ``` This is equivalent to a causal mask but avoids allocating a mask tensor. ### 6. Output Projection After attention-weighted value aggregation, the output is projected back to `dModel`: ``` O[s, i] = bias_O[i] + Σⱼ attnOut[s, j] × W_O[i, j] ``` --- ## MHA, tiling flags, and where work actually happens On the **CPU polymorphic** path, `MHAForwardPolymorphic` uses the tiled entry when `layer.UseTiling && layer.TileSize > 0`, which calls `mhaForwardTiledGeneric`. That helper temporarily clears `UseTiling` and re-invokes the same reference attention implementation so dispatch does not recurse forever — so this is **not** a second numeric algorithm and does not spawn goroutines per head. Exported names `MHAForwardTiled` and `MHAForwardTiledParallel` are aliases of that same entry. **Throughput-oriented tiling** (workgroup sizes, tiled matmul in shaders) lives on the **WebGPU** path in `wgpu_forward.go`: tile sizes come from `GetGPUSCTileSize` / `GetGPUMCTileSize` depending on **`VolumetricNetwork.EnableMultiCoreTiling`** — **`false` → SC**, **`true` → MC** (transformer forwards read the **network** field, not per-layer). `WGPUContext.GPUTileSize` and device limits feed `refreshRuntimeGPUTileSizes`. Call `RefreshRuntimeTileSizes()` after wiring the net: **`CPUTileSizes`** for CPU reference math (one map per layer), **`GPUSCTileSizes` / `GPUMCTileSizes`** for GPU. Training does this via `ConfigureNetworkForMode` (see `training.md`). **CPU polymorphic code does not use SC/MC as two maps** — only `GetCPUTileSize`. `CalculateOptimalTileSize(headDim)` is still the head-dimension–based helper used when populating CPU tile sizes for MHA during `refreshRuntimeCPUTileSizes`. 
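
As a reference point for the forward-pass steps described earlier on this page (GQA head mapping, causal score loop, scaled dot product), the following is a minimal standalone sketch in plain Go. It is not Loom's implementation: `causalGQA` and its flat row-major slice layout are assumptions made for illustration, and RoPE, Q/K norm, the KV cache, and the output projection are left out.

```go
package main

import (
	"fmt"
	"math"
)

// causalGQA is a hypothetical helper, not part of the poly API.
// q has shape [seqLen, numHeads*headDim]; k and v have shape
// [seqLen, numKVHeads*headDim]. All slices are row-major.
func causalGQA(q, k, v []float32, seqLen, numHeads, numKVHeads, headDim int) []float32 {
	queryDim := numHeads * headDim
	kvDim := numKVHeads * headDim
	headsPerKV := numHeads / numKVHeads
	scale := float32(1.0 / math.Sqrt(float64(headDim)))
	out := make([]float32, seqLen*queryDim)
	scores := make([]float32, seqLen)

	for h := 0; h < numHeads; h++ {
		kvHead := h / headsPerKV // query heads in a group share one KV head
		for qPos := 0; qPos < seqLen; qPos++ {
			qRow := q[qPos*queryDim+h*headDim:][:headDim]

			// Score only positions 0..qPos; later positions are never visited.
			maxScore := float32(math.Inf(-1))
			for kPos := 0; kPos <= qPos; kPos++ {
				kRow := k[kPos*kvDim+kvHead*headDim:][:headDim]
				var dot float32
				for d := range qRow {
					dot += qRow[d] * kRow[d]
				}
				scores[kPos] = dot * scale
				if scores[kPos] > maxScore {
					maxScore = scores[kPos]
				}
			}

			// Numerically stable softmax over the visible prefix.
			var sum float32
			for kPos := 0; kPos <= qPos; kPos++ {
				scores[kPos] = float32(math.Exp(float64(scores[kPos] - maxScore)))
				sum += scores[kPos]
			}

			// Weighted sum of the value rows for this head.
			oRow := out[qPos*queryDim+h*headDim:][:headDim]
			for kPos := 0; kPos <= qPos; kPos++ {
				w := scores[kPos] / sum
				vRow := v[kPos*kvDim+kvHead*headDim:][:headDim]
				for d := range oRow {
					oRow[d] += w * vRow[d]
				}
			}
		}
	}
	return out
}

func main() {
	// 2 tokens, 2 query heads sharing 1 KV head (MQA), headDim = 2.
	q := []float32{1, 0, 0, 1, 1, 0, 0, 1}
	k := []float32{1, 0, 0, 1}
	v := []float32{1, 2, 3, 4}
	out := causalGQA(q, k, v, 2, 2, 1, 2)
	// The first token can only see itself, so both heads copy v[0].
	fmt.Println(out[:4])
}
```

Because the inner loop stops at `qPos`, the sketch never allocates a mask, which mirrors the point made in the causal attention step. Setting `numKVHeads` equal to `numHeads` turns it into standard MHA.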
---

## RMSNorm

`LayerRMSNorm` (type 8) implements Root Mean Square Layer Normalization:

```
rms = sqrt( (1/n) × Σᵢ xᵢ² + ε )
output[i] = (x[i] / rms) × weight[i]
```

Unlike LayerNorm, RMSNorm does not subtract the mean. This makes it cheaper (fewer operations) while keeping a comparable stabilizing effect on gradient flow in practice.

Key fields:

```go
layer.Type = poly.LayerRMSNorm
layer.InputHeight = 512  // must match OutputHeight
layer.OutputHeight = 512
layer.RMSNormEps = 1e-6  // default; overridable from checkpoint config
```

Weight storage: one scale weight per hidden dimension (`len(Master) == OutputHeight`).

---

## SwiGLU

`LayerSwiGLU` (type 12) implements the gated linear unit variant used in modern transformers:

```
Given input x of shape [seqLen, inputHeight]:
  gate   = x × W_gate       (shape [seqLen, outputHeight])
  up     = x × W_up         (shape [seqLen, outputHeight])
  hidden = SiLU(gate) × up
  output = hidden × W_down  (shape [seqLen, inputHeight])

SiLU(x) = x × sigmoid(x) = x / (1 + exp(-x))
```

```
┌────────────────────────────────────────────────────────────┐
│ SwiGLU Data Flow                                           │
│                                                            │
│ Input [seqLen, 512]                                        │
│   │                                                        │
│   ├──▶ W_gate [512, 1364] ──▶ gate [seqLen, 1364]          │
│   │                                                        │
│   └──▶ W_up   [512, 1364] ──▶ up   [seqLen, 1364]          │
│                                                            │
│        SiLU(gate) × up                                     │
│              │                                             │
│        W_down [1364, 512]                                  │
│              │                                             │
│        Output [seqLen, 512]                                │
└────────────────────────────────────────────────────────────┘
```

The hidden dimension is the intermediate expansion size, roughly 2.67× (about 8/3) the model dimension. For `dModel=512`, the examples on this page use a hidden size of 1364 (8/3 × 512 ≈ 1365).

Key fields:

```go
layer.Type = poly.LayerSwiGLU
layer.InputHeight = 512
layer.OutputHeight = 1364 // hidden dimension (intermediate expansion)
```

Weight storage: `W_gate` (inputHeight × outputHeight) + `W_up` (inputHeight × outputHeight) + `W_down` (outputHeight × inputHeight), stored contiguously in `Master`.
---

## Full Transformer Block Assembly

A standard decoder-only transformer block (pre-norm style) is assembled as a `LayerSequential` containing four sub-layers:

```go
block := poly.VolumetricLayer{
	Type: poly.LayerSequential,
	SequentialLayers: []poly.VolumetricLayer{
		// Sub-layer 0: Attention norm
		{
			Type:         poly.LayerRMSNorm,
			InputHeight:  512,
			OutputHeight: 512,
		},
		// Sub-layer 1: Multi-head attention
		{
			Type:         poly.LayerMultiHeadAttention,
			DModel:       512,
			NumHeads:     8,
			NumKVHeads:   8,
			HeadDim:      64,
			MaxSeqLen:    2048,
			RoPEFreqBase: 10000.0,
		},
		// Sub-layer 2: FFN norm
		{
			Type:         poly.LayerRMSNorm,
			InputHeight:  512,
			OutputHeight: 512,
		},
		// Sub-layer 3: Feed-forward (SwiGLU)
		{
			Type:         poly.LayerSwiGLU,
			InputHeight:  512,
			OutputHeight: 1364,
		},
	},
}
```

This entire block is a single `VolumetricLayer` entry in the 3D grid. Multiple blocks are placed at coordinates `(0, blockIdx, 0, 0)` in a `VolumetricNetwork`.

### Residual Connections

Residual connections are handled by `LayerResidual` (type 14). In the sequential backward pass, residuals produce skip gradients that are accumulated via `skipGradients` (see `parallel_sequential.md`).

For transformer blocks, the typical pattern uses `LayerSequential` with `LayerResidual` as a sub-layer:

```go
block := poly.VolumetricLayer{
	Type: poly.LayerSequential,
	SequentialLayers: []poly.VolumetricLayer{
		{Type: poly.LayerRMSNorm, ...},
		{Type: poly.LayerMultiHeadAttention, ...},
		{Type: poly.LayerResidual, ...}, // adds input to output
		{Type: poly.LayerRMSNorm, ...},
		{Type: poly.LayerSwiGLU, ...},
		{Type: poly.LayerResidual, ...}, // adds pre-FFN to FFN output
	},
}
```

---

## The Transformer[T] Type

`Transformer[T]` is a high-level wrapper around `VolumetricNetwork` for autoregressive language model inference. It holds the components that live outside the main layer grid:

```go
type Transformer[T Numeric] struct {
	Network    *VolumetricNetwork
	Embeddings []float32 // token embedding table: [vocabSize × hiddenSize]
	LMHead     []float32 // output projection: [hiddenSize × vocabSize]
	FinalNorm  []float32 // final RMSNorm weights (one per hidden dim)
	HiddenSize int
	VocabSize  int
	Template   Template // prompt formatting (chat template)
}
```

### NewTransformer

```go
func NewTransformer[T Numeric](
	network *VolumetricNetwork,
	embeddings []float32,
	lmHead []float32,
	finalNorm []float32,
	template Template,
) *Transformer[T]
```

Creates the wrapper and infers `HiddenSize` from the first network layer's `DModel` or `InputHeight`. `VocabSize` is inferred as `len(Embeddings) / HiddenSize`.

If `finalNorm` is non-nil, a synthetic `VolumetricLayer` of type `LayerRMSNorm` is created internally to hold the final normalization weights. This layer is not part of the main grid; it runs separately after the last transformer block.

### Tied Weights Detection

When `LMHead` and `Embeddings` point to the same backing array (common in weight-tied models), `SyncToGPU` detects this and reuses the same GPU buffer for both:

```go
if &t.LMHead[0] == &t.Embeddings[0] {
	t.Network.GPULMHead = t.Network.GPUEmbeddings // no second upload
}
```

### Tiling

```go
func (t *Transformer[T]) EnableTiling(tileSize int)
```

Sets `UseTiling` (and `TileSize` when `tileSize > 0`) on every layer in the grid plus the standalone final norm layer. It does **not** by itself rebuild per-dtype maps. After loading or constructing the network, call `t.Network.RefreshRuntimeTileSizes()` if you need `CPUTileSizes` and the GPU SC/MC maps populated before inference or training (training entrypoints usually do this for you).
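
The `VocabSize` inference and the tied-weights pointer comparison described above can be tried in isolation. The sketch below is a plain-Go illustration, not Loom code: `sharesBacking` and `inferVocabSize` are hypothetical helper names, and the empty-slice guard is an addition of this sketch (indexing `[0]` on an empty slice panics in Go).

```go
package main

import "fmt"

// sharesBacking reports whether two slices start at the same element.
// Hypothetical helper; the length guard avoids an index-out-of-range panic.
func sharesBacking(a, b []float32) bool {
	if len(a) == 0 || len(b) == 0 {
		return false
	}
	return &a[0] == &b[0]
}

// inferVocabSize applies the len(embeddings) / hiddenSize rule and rejects
// tables whose length is not a whole number of rows.
func inferVocabSize(embeddings []float32, hiddenSize int) (int, error) {
	if hiddenSize <= 0 || len(embeddings)%hiddenSize != 0 {
		return 0, fmt.Errorf("embedding length %d is not a multiple of hidden size %d",
			len(embeddings), hiddenSize)
	}
	return len(embeddings) / hiddenSize, nil
}

func main() {
	emb := make([]float32, 6*4)         // 6 tokens, hidden size 4
	tied := emb                         // a weight-tied LM head aliases the embedding table
	untied := make([]float32, len(emb)) // a separate LM head has its own storage

	vocab, err := inferVocabSize(emb, 4)
	fmt.Println(vocab, err, sharesBacking(emb, tied), sharesBacking(emb, untied))
}
```

Comparing the address of the first element is enough here because a tied LM head is the same slice as the embedding table; it would not detect two slices that overlap at different offsets.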
### Generate

```go
func (t *Transformer[T]) Generate(
	encode func(text string) []uint32,
	decode func(tokens []uint32) string,
	turns []Turn,
	systemPrompt, userMsg string,
	opts GenOptions,
) string
```

Full autoregressive text generation pipeline:

```
┌──────────────────────────────────────────────────────────────┐
│ GENERATE FLOW                                                │
│                                                              │
│ 1. Template.BuildPrompt(turns, systemPrompt, userMsg)        │
│    → apply chat template (e.g., <|im_start|>user\n...)       │
│                                                              │
│ 2. encode(prompt) → inputIDs []uint32                        │
│                                                              │
│ 3. Reset() → clear KV cache                                  │
│                                                              │
│ 4. Prefill (process all input tokens at once):               │
│    a. tokensToTensor(inputIDs) → embed all tokens            │
│    b. ForwardPolymorphic or ForwardTokenIDsWGPU (GPU)        │
│    c. applyLMHead(lastHiddenState) → logits over vocabulary  │
│                                                              │
│ 5. Decode loop (one token at a time):                        │
│    a. applyRepetitionPenalty(logits, generatedTokens)        │
│    b. SampleTopK(logits, TopK, Temperature, Deterministic)   │
│    c. stream.Push(tokens) → streaming decode callback        │
│    d. Forward single new token (incremental):                │
│       getEmbedding(nextToken) → forwardOne(input)            │
│       (KVOffset advances by 1 each step)                     │
│    e. check EOS condition or max tokens                      │
│                                                              │
│ 6. Return accumulated decoded string                         │
└──────────────────────────────────────────────────────────────┘
```

### GenOptions

```go
type GenOptions struct {
	MaxTokens     int
	Temperature   float64
	TopK          int
	Deterministic bool
	UseKVCache    bool
	EOSTokens     []int
}
```

`Deterministic = true` with `Temperature = 0` produces greedy decoding. `TopK` limits sampling to the top K logits before applying temperature.

---

## GPU Transformer Inference

When `network.UseGPU = true` and `SyncToGPU()` has been called, `Generate` uses `ForwardTokenIDsWGPU` for both prefill and incremental decode:

```go
logitTensor, err := t.ForwardTokenIDsWGPU(tokens, nil, true, true)
```

This dispatches into `wgpu_forward.go`'s GPU transformer block execution path, which runs matrix multiplications and attention as WebGPU compute shader invocations. All intermediate activations stay in VRAM; only the final logit tensor is read back to the CPU for sampling.

The GPU path uses the `BeginFrame` / `FlushFrame` pattern (see `gpu.md`): one GPU command buffer encodes the entire forward pass across all transformer layers, then flushes in a single submit. This minimizes CPU-GPU synchronization overhead.

---

## C-ABI Integration (welvet)

Loom v0.75.0 exposes highly optimized C-ABI entry points for the `Transformer` wrapper, enabling maximum throughput for language bindings like Python and TypeScript.

### 1. LoomTokensToTensor

A high-speed gather kernel that converts token IDs directly into a pre-allocated model input tensor.

- **WASM/Go**: Uses direct memory access to avoid intermediate allocations.
- **WebGPU**: Dispatches a gather compute shader to perform embedding lookup entirely in VRAM.

### 2. LoomForwardFull

The authoritative entry point for autoregressive generation. It encapsulates:

- `Reset()` (optional clearing of KV cache)
- `TokensToTensor` (Input ID processing)
- `ForwardPolymorphic` (Engine execution)
- `ApplyLMHead` (Output projection)

This unified path reduces the number of cross-language calls (e.g., Python → Go) by **75%** (four calls collapse into one), significantly lowering the latency for real-time streaming tokens.

---

## Loading from SafeTensors / HuggingFace

`universal_loader.go` auto-detects the checkpoint format. For HuggingFace models:

1. `safetensors.go` reads the weight tensor map (key → `[]float32`)
2. `prefix_safetensor.go` strips model-specific prefix patterns (e.g., `model.layers.0.self_attn.q_proj.weight`)
3. Weight slices are copied into the correct `VolumetricLayer.WeightStore.Master` at the computed offsets

The key-to-layer mapping follows the weight layout described earlier:

```
model.layers.{N}.self_attn.q_proj.weight → layer N's Q weight sub-slice
model.layers.{N}.self_attn.k_proj.weight → layer N's K weight sub-slice
...
```

After loading, call `poly.MorphLayer(network, targetDtype)` to convert to your desired inference precision.

---

## Practical: Building a 7-layer Transformer Network

```go
hiddenSize := 512
numHeads := 8
numLayers := 7
seqLen := 2048

network := poly.NewVolumetricNetwork("llm-7l", 1, numLayers, 1, 1)

for i := 0; i < numLayers; i++ {
	l := network.GetLayer(0, i, 0, 0)
	l.Type = poly.LayerSequential
	l.SequentialLayers = []poly.VolumetricLayer{
		{Type: poly.LayerRMSNorm, InputHeight: hiddenSize, OutputHeight: hiddenSize},
		{
			Type:         poly.LayerMultiHeadAttention,
			DModel:       hiddenSize,
			NumHeads:     numHeads,
			NumKVHeads:   2, // GQA: 4 query heads share each KV head
			HeadDim:      hiddenSize / numHeads,
			MaxSeqLen:    seqLen,
			RoPEFreqBase: 10000.0,
		},
		{Type: poly.LayerRMSNorm, InputHeight: hiddenSize, OutputHeight: hiddenSize},
		// Integer division: hiddenSize * 8 / 3 evaluates to 1365 for hiddenSize = 512.
		{Type: poly.LayerSwiGLU, InputHeight: hiddenSize, OutputHeight: hiddenSize * 8 / 3},
	}
	poly.InitializeLayerWeights(l)
}

// embeddings, lmHead, finalNormWeights, and chatTemplate are loaded elsewhere.
transformer := poly.NewTransformer[float32](
	network, embeddings, lmHead, finalNormWeights, chatTemplate,
)
transformer.EnableTiling(0) // auto-detect tile size
```

---

## Quick Reference: Common Code Snippets

Source: https://openfluke.com/docs/quick-reference
Markdown: https://openfluke.com/docs/quick-reference.md

# Quick Reference: Common Code Snippets

Concise, copy-paste-ready patterns for the most common `poly/` tasks. Each snippet assumes `import poly "github.com/openfluke/soul/loom/poly"` (adjust to your module path).

---

## 📦 TypeScript / Node.js Installation

```bash
npm install @openfluke/welvet
```

See [deployment.md](deployment.md) for full isomorphic details.
## Creating a Network

```go
// NewVolumetricNetwork(id, depth, rows, cols, layersPerCell)
network := poly.NewVolumetricNetwork("my-net", 1, 3, 1, 1)
// 1×3×1 grid = 3 layers stacked in the Y dimension
```

---

## Adding and Configuring Layers

```go
// Retrieve a layer by 4D coordinate (z, y, x, layerIndex)
l := network.GetLayer(0, 0, 0, 0)

// Dense layer
l.Type = poly.LayerDense
l.InputHeight = 128
l.OutputHeight = 64
l.Activation = poly.ActivationReLU
l.DType = poly.DTypeFloat32

// Initialize weights (random)
poly.InitializeLayerWeights(l)
```

---

## Forward Pass

```go
input := poly.NewTensor[float32](128) // flat [128] input
copy(input.Data, myInputData)

output, inputs, preActs := poly.ForwardPolymorphic[float32](network, input)
// output  = final layer's output tensor
// inputs  = cached inputs for each layer (needed for backward)
// preActs = cached pre-activations for each layer
```

---

## Backward Pass

```go
// Compute loss gradient (e.g., MSE gradient)
target := poly.NewTensor[float32](64)
copy(target.Data, myTargetData)

gradOutput := poly.ComputeLossGradient[float32](output, target, poly.LossMSE)

// Backpropagate
gradInput, layerGrads := poly.BackwardPolymorphic[float32](network, gradOutput, inputs, preActs)
```

---

## Applying Gradients

```go
lr := float32(0.001)
poly.ApplyRecursiveGradients[float32](network, layerGrads, lr)
```

---

## Full Training Loop (Manual)

```go
for epoch := 0; epoch < 100; epoch++ {
	output, inputs, preActs := poly.ForwardPolymorphic[float32](network, input)
	loss := poly.CalculateLoss[float32](output, target, poly.LossMSE)
	gradOutput := poly.ComputeLossGradient[float32](output, target, poly.LossMSE)
	_, layerGrads := poly.BackwardPolymorphic[float32](network, gradOutput, inputs, preActs)
	poly.ApplyRecursiveGradients[float32](network, layerGrads, 0.001)
	fmt.Printf("epoch %d loss=%.4f\n", epoch, loss)
}
```

---

## Batch Training (High-Level)

```go
config := poly.TrainingConfig{
	LearningRate: 0.001,
	Epochs:       50,
	BatchSize:    32,
	LossFunction: poly.LossMSE,
	UseGPU:       false,
}

result := poly.Train[float32](network, trainingData, config)
fmt.Printf("final loss: %.4f\n", result.FinalLoss)
```

---

## Type-Switching with Generics

```go
// Run forward pass with any numeric type
func runForward[T poly.Numeric](net *poly.VolumetricNetwork, data []T) *poly.Tensor[T] {
	input := poly.NewTensor[T](len(data))
	copy(input.Data, data)
	out, _, _ := poly.ForwardPolymorphic[T](net, input)
	return out
}

// Call with float32
out32 := runForward[float32](network, myFloat32Data)

// Call with int8
out8 := runForward[int8](network, myInt8Data)
```

---

## Quantizing a Trained Network

```go
// Convert all layers to Int8
poly.MorphLayer(network, poly.DTypeInt8)

// Convert to Int4 (4-bit)
poly.MorphLayer(network, poly.DTypeInt4)

// Revert: clear versions and retrain or re-morph
for i := range network.Layers {
	network.Layers[i].WeightStore.Versions = make(map[poly.DType]any)
}
poly.MorphLayer(network, poly.DTypeBFloat16)
```

---

## Saving and Loading (Full Weights)

```go
// Save
jsonData, err := poly.SerializeNetwork(network)
if err != nil {
	log.Fatal(err)
}
if err := os.WriteFile("model.json", jsonData, 0644); err != nil {
	log.Fatal(err)
}

// Load
data, err := os.ReadFile("model.json")
if err != nil {
	log.Fatal(err)
}
loaded, err := poly.DeserializeNetwork(data)
if err != nil {
	log.Fatal(err)
}
// use loaded from here on
```

---

## Architecture-Only JSON (Random Weights)

```go
spec := `{
  "id": "my-net",
  "depth": 1, "rows": 2, "cols": 1, "layers_per_cell": 1,
  "layers": [
    {"z":0,"y":0,"x":0,"l":0,"type":"Dense","activation":"ReLU",
     "dtype":"float32","input_height":128,"output_height":64},
    {"z":0,"y":1,"x":0,"l":0,"type":"Dense","activation":"Linear",
     "dtype":"float32","input_height":64,"output_height":10}
  ]
}`

network, err := poly.BuildNetworkFromJSON([]byte(spec))
```

---

## Parallel Branches

```go
l.Type = poly.LayerParallel
l.CombineMode = "concat"
l.ParallelBranches = []poly.VolumetricLayer{
	{Type: poly.LayerDense, InputHeight: 64, OutputHeight: 32,
		Activation: poly.ActivationReLU, DType: poly.DTypeFloat32},
	{Type: poly.LayerRNN, InputHeight: 64, OutputHeight: 32,
		Activation: poly.ActivationTanh, DType: poly.DTypeFloat32},
}
```

---

## Sequential Sub-Layers

```go
l.Type = poly.LayerSequential
l.SequentialLayers = []poly.VolumetricLayer{
	{Type: poly.LayerRMSNorm, InputHeight: 256, OutputHeight: 256},
	{Type: poly.LayerDense, InputHeight: 256, OutputHeight: 256,
		Activation: poly.ActivationGELU, DType: poly.DTypeFloat32},
}
```

---

## Soft Mixture of Experts

```go
l.Type = poly.LayerParallel
l.CombineMode = "filter"
l.FilterGateConfig = &poly.VolumetricLayer{
	Type:         poly.LayerDense,
	InputHeight:  64,
	OutputHeight: 3, // one weight per expert
	Activation:   poly.ActivationLinear,
}
l.ParallelBranches = []poly.VolumetricLayer{
	{Type: poly.LayerDense, InputHeight: 64, OutputHeight: 32, ...},
	{Type: poly.LayerDense, InputHeight: 64, OutputHeight: 32, ...},
	{Type: poly.LayerDense, InputHeight: 64, OutputHeight: 32, ...},
}
```

---

## Remote Link (Spatial Hop)

```go
// Layer at (0,1,0,0) reads output from (0,0,0,0) instead of its immediate predecessor
l := network.GetLayer(0, 1, 0, 0)
l.IsRemoteLink = true
l.TargetZ, l.TargetY, l.TargetX, l.TargetL = 0, 0, 0, 0
```

---

## Step mesh (continuous) Operation

```go
state := poly.NewStepState[float32](network)
state.SetInput(inputTensor)

for tick := 0; tick < 1000; tick++ {
	poly.StepForward(network, state, false) // false = no history
	// read current output from state.LayerData[lastLayerIdx]
}

// Online learning (no history required)
poly.StepApplyTween(network, state, targetTensor, 0.001)
```

---

## Step mesh with BPTT (training)

```go
state := poly.NewStepState[float32](network)
state.SetInput(inputTensor)

for tick := 0; tick < numSteps; tick++ {
	poly.StepForward(network, state, true) // true = capture history
}

// The input gradient is not needed here, so it is discarded.
_, layerGrads, err := poly.StepBackward(network, state, gradOutput)
if err != nil {
	log.Fatal(err)
}
poly.ApplyRecursiveGradients[float32](network, layerGrads, lr)
```

---

## DNA Comparison

```go
// Snapshot before training
dna1 := poly.ExtractDNA(network)

// Train ...
poly.Train[float32](network, data, config)

// Snapshot after training
dna2 := poly.ExtractDNA(network)

result := poly.CompareNetworks(dna1, dna2)
fmt.Printf("Similarity: %.4f\n", result.OverallOverlap)
for _, shift := range result.LogicShifts {
	fmt.Printf("Logic migrated: %s → %s (%.3f)\n", shift.SourcePos, shift.TargetPos, shift.Overlap)
}
```

---

## GPU Initialization

```go
network.UseGPU = true
ctx, err := poly.InitWGPU()
if err != nil {
	log.Fatal("GPU init failed:", err)
}
network.GPUContext = ctx

// Sync all layer weights to VRAM
for i := range network.Layers {
	network.Layers[i].SyncToGPU()
}

// Fill per-dtype maps: CPUTileSizes (CPU) + GPUSCTileSizes / GPUMCTileSizes (GPU).
// GPU inference: EnableMultiCoreTiling false → SC, true → MC (wgpu_forward reads Network.*).
// CPU polymorphic code uses GetCPUTileSize only; there are no separate SC/MC layer maps.
network.EnableMultiCoreTiling = true
network.RefreshRuntimeTileSizes()

// GPU batch training
config := poly.TrainingConfig{UseGPU: true, LearningRate: 0.001, Epochs: 100}
result := poly.Train[float32](network, data, config)
```

---

## Transformer Inference

```go
transformer := poly.NewTransformer[float32](
	network, embeddingWeights, lmHeadWeights, finalNormWeights, chatTemplate,
)
transformer.EnableTiling(0) // auto tile size

output := transformer.Generate(
	tokenizer.Encode,
	tokenizer.Decode,
	[]poly.Turn{}, // no history
	"You are a helpful assistant.",
	"What is 2 + 2?",
	poly.GenOptions{
		MaxTokens:   256,
		Temperature: 0.7,
		TopK:        40,
	},
)
fmt.Println(output)
```

---

## Softmax Variants

```go
// Temperature softmax
l.Type = poly.LayerSoftmax
l.SoftmaxType = poly.SoftmaxTemperature
l.Temperature = 0.5

// Masked softmax (causal)
l.SoftmaxType = poly.SoftmaxMasked
l.Mask = []bool{true, true, false, false} // mask out last 2

// Sparse (exact zeros)
l.SoftmaxType = poly.SoftmaxSparse

// Entmax (tunable sparsity)
l.SoftmaxType = poly.SoftmaxEntmax
l.EntmaxAlpha = 1.5
```

---

## Q4_0 Block Quantization

```go
// Quantize a weight slice into 32-weight blocks
blocks := poly.QuantizeQ4_0(myWeights)
// blocks[i].Scale   = per-block float32 scale
// blocks[i].Weights = [16]byte with 32 packed nibbles

// Dequantize back to float32
recovered := poly.DequantizeQ4_0(blocks, len(myWeights))
```

---

## DType / Activation / LayerType Parsing

```go
// From string (case-insensitive, aliases accepted)
dtype, err := poly.ParseDType("int8")                // → DTypeInt8
activation, err := poly.ParseActivationType("relu") // → ActivationReLU
layerType, err := poly.ParseLayerType("Dense")      // → LayerDense
```

---

## Tween (Layer-Local Learning)

Same idea as neural target propagation in the literature; we call it **tween** in code and informal docs (`tween.go`).

```go
tweenConfig := poly.TweenConfig{
	UseChainRule: true, // false = gap-based (for step meshes)
	LearningRate: 0.01,
}
tweenState := poly.NewTweenState[float32](network)

// Forward, backward, and weight update as three consecutive calls
poly.TweenForward[float32](network, tweenState, input)
poly.TweenBackward[float32](network, tweenState, globalTarget)
poly.ApplyTweenGaps[float32](network, tweenState, 0.01)
```

---

## Tensor Creation

```go
// 1D tensor
t1 := poly.NewTensor[float32](128)

// 2D tensor (e.g., [seqLen, hiddenSize])
t2 := poly.NewTensor[float32](16, 512)

// With initial data
t3 := poly.NewTensor[int8](8)
for i := range t3.Data {
	t3.Data[i] = int8(i)
}

// Check shape
fmt.Println(t2.Shape)     // [16, 512]
fmt.Println(len(t2.Data)) // 8192
```

---

## Testing, validation, and Lucy logs

Source: https://openfluke.com/docs/testing-and-validation
Markdown: https://openfluke.com/docs/testing-and-validation.md

# Testing, validation, and Lucy logs

This page ties together **how we stress `poly/`**, where **artifacts land**, and how to read **parity tables** in captured logs (for example `lucy/lucy_testing_output/log.txt`).
---

## Where logs come from

The **Lucy** tree (`lucy/`) drives broad layer suites: forward/backward parity, training matrices, save/reload checks, and GPU timing tables. Typical transcripts:

| Log | Menu | Contents |
|-----|------|----------|
| `lucy/lucy_testing_output/log.txt` | Dense L1 / GPU parity / layer matrices | Forward/backward parity, ASM timers, GPU tables |
| `lucy/lucy_testing_output/seven_layer.txt` | **[7] Seven-layer CPU suite** | 10 layer types × 21 dtypes × 1³/2³/3³ grids, SC/MC, train, save/reload |

Both files are meant for human review and regression diffing (adapter name, per-dtype rows, summary tallies).

**Seven-layer suite (v0.79+):** See [`bedrock_validation.md`](bedrock_validation.md) for what the harness gates (MHA layout, KV decode, native ternary save, C-ABI `SyncInferenceWeights`). Run `cd lucy && go run .` → **[7]** or **[0]**.

---

## How to read parity summary lines

Sections often end with a line shaped like:

```text
>> [Forward Parity] 84 Tests | 💎 42 | ✅ 24 | 🟨 0 | 🟠 0 | 🟤 18 | ❌ 0 | 💀 0
```

Rough meaning (exact thresholds live in the test harness, not duplicated here):

| Symbol | Typical meaning |
|--------|-----------------|
| **💎** | Exact / diamond-grade agreement within the tightest tolerance |
| **✅** | Pass within configured industry-grade tolerance |
| **🟨 / 🟠** | Elevated drift bands (still classified by the harness) |
| **🟤** | Heavy drift (e.g. **H-DRIFT** in backward tables); worth investigating dtype + path |
| **❌** | Hard failure (assert or threshold breach) |
| **💀** | Fatal / panic / infrastructure failure |

Backward tables may label columns **INDUS** (industry tolerance) vs **H-DRIFT** (heavy drift). Treat **🟤** rows as “numerically alive but not interchangeable with FP32 reference at the same tolerance,” not necessarily as engine bugs: some combinations are expected to diverge when the reference path is float32-simulated and the subject path is true low-bit or integer-native.
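
For regression diffing it can help to turn a summary line into numbers. The sketch below parses the example line shown above and checks that the buckets add up to the test count. It is not part of Lucy: `ParitySummary` and `parseParityLine` are hypothetical names, and the line format is assumed to generalize from that single example.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// ParitySummary holds the counts from one ">> [Section] N Tests | ..." line.
// Hypothetical type for illustration only.
type ParitySummary struct {
	Section string
	Total   int
	Counts  map[string]int // keyed by the symbol exactly as printed in the log
}

var headerRe = regexp.MustCompile(`^>>\s*\[(.+?)\]\s*(\d+)\s+Tests$`)

// parseParityLine splits the line on "|" and reads "symbol count" pairs.
func parseParityLine(line string) (ParitySummary, error) {
	parts := strings.Split(line, "|")
	m := headerRe.FindStringSubmatch(strings.TrimSpace(parts[0]))
	if m == nil {
		return ParitySummary{}, fmt.Errorf("not a parity summary line: %q", line)
	}
	total, err := strconv.Atoi(m[2])
	if err != nil {
		return ParitySummary{}, err
	}
	s := ParitySummary{Section: m[1], Total: total, Counts: map[string]int{}}
	sum := 0
	for _, p := range parts[1:] {
		fields := strings.Fields(p)
		if len(fields) != 2 {
			return s, fmt.Errorf("unexpected bucket %q", p)
		}
		n, err := strconv.Atoi(fields[1])
		if err != nil {
			return s, err
		}
		s.Counts[fields[0]] = n
		sum += n
	}
	if sum != total {
		return s, fmt.Errorf("buckets sum to %d, header says %d", sum, total)
	}
	return s, nil
}

func main() {
	line := ">> [Forward Parity] 84 Tests | 💎 42 | ✅ 24 | 🟨 0 | 🟠 0 | 🟤 18 | ❌ 0 | 💀 0"
	s, err := parseParityLine(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	hardFailures := s.Counts["❌"] + s.Counts["💀"]
	fmt.Println(s.Section, s.Total, "heavy drift:", s.Counts["🟤"], "hard failures:", hardFailures)
}
```

A CI gate built on this would typically fail only on the ❌ and 💀 buckets and report 🟤 as a warning, in line with the reading of heavy drift given above.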
---

## May 2026 full-suite snapshot (`log.txt`)

Recent **Run All Layer Tests** captures (Metal / arm64, ~2992 rows) show:

| Metric | Value |
|--------|--------|
| **Broken (❌)** | **0** |
| **Fatal / NaN (💀)** | **0** |
| Bit-exact (💎) | ~75% of classified rows |
| Heavy drift (🟤) | ~17%, mostly forward parity vs FP32 reference on native-int / low-bit paths |

**Fixes reflected in this run (vs earlier transcripts):**

- **Training matrix:** `File` / `RAM` columns print correctly (no `%!s(MISSING)`); every Dense training row **TrainOK PASS** and **Save/Reload PASS** for all 21 dtypes.
- **Save/Reload:** CNN1/2/3, Dense, Embedding, LSTM, MHA, Residual, RNN, SwiGLU each end with `[Save/Reload ] PASS`.
- **Global manifest:** no hard failures across the full layer sweep.

**Still classified as 🟤 (not ❌):** Dense forward parity rows where CPU uses true integer/low-bit math and the harness compares to a float-shaped reference; CNN backward **H-DRIFT** on Float16/BFloat16/Int4 (GPU vs CPU reference). Treat these as tolerance bands; see the parity legend above.

---

## Dense forward ASM (Plan 9)

Lucy **Dense → Generic Layer Suite** prints **Go SC · Go MC · ASM SC · ASM MC · GPU SC · GPU MC** and speedup columns:

- **Go/Asm↑** = Go wall time ÷ ASM wall time (**> 1.0** = assembly wins).
- Toggle: `UseAsmForward` on the network/layer; kernels live under `poly/asm/` (see [`asm/README.md`](../poly/asm/README.md)).

**Latest Dense bench (8×1024→512, Metal host, from `log.txt`):**

| Highlight | Go/Asm↑ SC | Go/Asm↑ MC |
|-----------|------------|------------|
| Best single-core | **Uint8** ~**2.46×** | n/a |
| Best multi-core | n/a | **Uint4** ~**3.55×** |
| Strong quant MC | n/a | **Ternary** ~3.21×, **FP4** ~3.25×, **Binary** ~2.78×, **Int8** ~2.72× |
| Float32 | ~1.11× | ~1.00× (parity) |
| Float64 | **< 1×** (asm slower on this shape) | ~0.61× |

Low-bit and morphed-`uint8` paths benefit most from native integer dots in Plan 9. Float64 SC/MC still favors Go tiled matmul on the current tile sizes; this is a tuning item, not a broken toggle.

**Backward / training:** asm is **forward-only** today; Dense backward parity uses Go CPU vs GPU; training does not call asm.

---

## Interpreting a real log (examples)

The following patterns show up in recent `log.txt` captures (Metal adapter, tiled CNN1 suite):

1. **CNN1 generic suite note:** The harness itself reminds you that generic CNN1 tests still include **simulated / PTQ fallback** where a dtype has no strict native path. For a **strict native-only** CPU/GPU/tiling audit, use the **Glitch** `layer_matrix` example (see Glitch docs / examples in-repo).
2. **Float64 on GPU forward:** CPU microseconds vs GPU milliseconds often show up as a “speedup ratio” well below 1×; that is frequently **dispatch overhead dominating tiny work**, not a claim that FP64 GPU is slower than CPU math in the large-batch limit.
3. **Wide integer CNN1 backward:** **Int64 / Uint64 / Int32 / Uint32** rows may show **🟤 H-DRIFT** vs float reference in GPU backward parity: the harness compares against an FP32-shaped reference while the native path uses integer semantics. Read those rows as **classification / tolerance**, not as “GPU kernel wrong.”
4. **Save/Reload after training:** On the **Dense** suite (May 2026 log), **Save/Reload PASS** for all 21 dtypes after training. Older CNN-only rows or pre-native-save builds may still show FAIL on specific combos; diff against current `persistence.go` (`Native: true` + per-layer `dtype`) before treating as open bugs.
5. **Uint CPU training:** **Uint64 / Uint32** (and sometimes **Uint16**) may show **TrainOK FAIL** on CPU-tiled modes while GPU modes **PASS**: that points at **CPU-side training / loss scaling** for unsigned paths, not at GPU correctness.
6. **Peak performance gap line:** The footer **PEAK PERFORMANCE GAP** (e.g. Dense Forward Float16) is a **headline ratio** from one worst row in the scan table; it is useful for spotting outliers, not as a single global quality score.

---

## Poly package: what the suites actually exercise

High-signal files and areas (not exhaustive):

| Area | Representative files |
|------|------------------------|
| Core types & dispatch | `poly.go`, `forward.go`, `backward.go`, `training.go` |
| Numerical morphing | `weights.go`, `quantization.go`, CNN / Dense / MHA polymorphic `*.go` |
| GPU / WebGPU | `wgpu_context.go`, `wgpu_forward.go`, `wgpu_kernels.go`, `wgpu_shaders.go`, `wgpu_softmax.go` |
| Tiling & tile size | `tile_detection.go`, `*_tiled*.go` paths in dense / CNN / MHA |
| Serialization | `serialization.go`, `persistence.go`, `safetensors.go` |
| Native layer matrix harness | `native_layer_matrix.go`, `native_matrix_builtin_hooks.go` |
| Telemetry | `tanhi.go`, hardware probes in `hardware.go` |

When you add a layer or dtype, extend **both** the Lucy (or Glitch) harness **and** this doc if the log format or tolerance bands change.

---

## Related commands (developer workflow)

Exact entrypoints move with refactors; prefer:

- `lucy/README.md`: MRBiVS stack and pointers into `poly/`.
- `poly/README.md`: version checklist and capability matrix.
- `welvet/cabi/internal/check/`: C-ABI vs `poly/` export parity scanner (Go); expect **461/461 (100%)** after v0.79 (`LoomSyncInferenceWeights`).
---

## See also

- [bedrock_validation.md](bedrock_validation.md): v0.79.0 seven-layer suite, MHA/KV, C-ABI
- [numerical_types.md](numerical_types.md): DType list and `WeightStore` lifecycle
- [gpu.md](gpu.md): WebGPU context and dispatch overview
- [serialization.md](serialization.md): Save/load and safetensors
- [training.md](training.md): Training modes and loss paths

---

## Bedrock Validation (v0.79.0)

Source: https://openfluke.com/docs/bedrock-validation
Markdown: https://openfluke.com/docs/bedrock-validation.md

# Bedrock Validation (v0.79.0)

**Release:** **0.78.0 "ASM CPU"** → **0.79.0 "Bedrock Validation"**
**Checklist:** **108 / 142** (76.1%) → **111 / 142** (78.2%)

This wave does not add a new compute backend. It hardens the **Go CPU** path, **native persistence**, **transformer decode**, and **C-ABI** so Lucy and Welvet bindings can trust train → save → reload → infer on real volumetric graphs.

---

## What changed (summary)

| Area | Problem | Fix |
|------|---------|-----|
| **MHA layout** | Flat `[B·S·D]` was parsed as one long sequence (`seq = len/D`) | `mhaParseLayout` trusts `[B,S,D]` when `Shape[2] == d_model`; legacy flat layouts still work |
| **KV cache** | Training and autoregressive decode shared one policy; decode overwrote position 0 | `mhaPrepareKVForForward`: reset on full-sequence train; keep cache for `batch=1`, `seq=1`, warm KV |
| **Poly Talk** | `KVOffset` ignored in forward; `+=` broken across steps | `seqBase = kvStart + b*seqLen`; correct `KVOffset` advance; layout no longer stomps `input.Shape[1]` |
| **MHA backward** | Q recomputed with RoPE but skipped Q/K RMS norm vs forward | Backward matches forward norm order before RoPE |
| **Dense Ternary save** | Checkpoint re-quantized from FP32 Master, not native path | `GetBitNetTernaryMatrix` → `packNativeTernaryToBitNetMatrix` (same matmul as forward) |
| **Signed low-bit I/O** | Int2/Int4/Ternary round-trip gaps on `[]uint8` | `persistence.go` encode/decode aligned with CPU kernels |
| **FP32 Master lifecycle** | Bindings could not mirror post-train native-only RAM | `LoomSyncInferenceWeights` in `welvet/cabi` (461/461 C-ABI parity) |
| **Regression harness** | False PASS (zeros/NaN); suite gaps | Lucy **[7] seven-layer** CPU suite: 10 layer types × 21 dtypes × SC/MC × train × save/reload |

---

## Lucy seven-layer CPU suite

**Run:** `cd lucy && go run .` → **[7]** (or **[0]** for all layer types).
**Log:** `lucy/lucy_testing_output/seven_layer.txt` (reset each run).

**Harness:** `lucy/examples/seven_layer/` builds a volumetric JSON network per layer family, morphs all **21 dtypes**, and checks:

- Forward **SC ↔ MC** parity (dtype tolerance)
- Backward **SC ↔ MC** parity (10× fwd tol)
- **50-epoch** CPU training (loss decrease on MC path)
- **Save/reload before train** and **after train** (forward match + native blob)
- Grids **1³**, **2³**, **3³** (CNN1/2 skip 3³; CNN3 is 1³ only; Embedding at `(0,0,0)`)

**Layer types:** Dense, SwiGLU, MHA, CNN1, CNN2, CNN3, RNN, LSTM, Embedding, Residual.

**ASM:** Dense forward only (`UseAsmForward` after JSON build); other types report asm N/A.

This suite is the long-term **bedrock gate** for CPU training and native checkpoints. It is broader than the older 18×21 permutation matrix because it includes **multi-cell grids** and **end-to-end train + reload**.

---

## C-ABI (Welvet)

```bash
cd welvet/cabi/internal/check && go run .
```

Expect **461/461 (100.0%)** functional overlap. The last gap closed in this release:

- **`LoomSyncInferenceWeights`**: calls `VolumetricNetwork.SyncInferenceWeights()` when `ReleaseFP32MasterWhenIdle` is set (morph Master → native `Versions`, drop FP32 duplicate for inference RAM).

Python / TypeScript / WASM consumers that train outside `LoomTrain` should call this after morph or custom training if they mirror Go's inference-only memory model.
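The parity rows of the seven-layer suite described earlier on this page (forward SC ↔ MC within a dtype tolerance, backward at 10× that tolerance, no false PASS on zeros or NaN) can be pictured with a small checker. This is a sketch only, not the Lucy harness: `parityOK`, the `1e-4` forward tolerance, and the rule that an all-zero output counts as a failure are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// parityOK is a hypothetical helper (not part of lucy or poly). It reports
// whether two output vectors agree within tol, and it refuses to pass the
// degenerate cases that would otherwise look like a match: non-finite values
// and outputs that are zero everywhere.
func parityOK(sc, mc []float32, tol float64) (bool, string) {
	if len(sc) != len(mc) || len(sc) == 0 {
		return false, "length mismatch or empty"
	}
	allZero := true
	var maxDiff float64
	for i := range sc {
		a, b := float64(sc[i]), float64(mc[i])
		if math.IsNaN(a) || math.IsNaN(b) || math.IsInf(a, 0) || math.IsInf(b, 0) {
			return false, "non-finite value"
		}
		if a != 0 || b != 0 {
			allZero = false
		}
		maxDiff = math.Max(maxDiff, math.Abs(a-b))
	}
	if allZero {
		return false, "all zeros"
	}
	if maxDiff > tol {
		return false, fmt.Sprintf("max diff %.3g > tol %.3g", maxDiff, tol)
	}
	return true, "ok"
}

func main() {
	const fwdTol = 1e-4        // assumed forward tolerance for one dtype
	const bwdTol = 10 * fwdTol // backward band is 10x the forward band

	ok, why := parityOK([]float32{0.25, -1.5, 3.0}, []float32{0.25, -1.50001, 3.0}, fwdTol)
	fmt.Println("forward:", ok, why)

	ok, why = parityOK([]float32{0.01, 0.0205}, []float32{0.01, 0.02}, bwdTol)
	fmt.Println("backward:", ok, why)

	ok, why = parityOK([]float32{0, 0}, []float32{0, 0}, fwdTol)
	fmt.Println("zeros:", ok, why)
}
```

Returning a reason string alongside the boolean keeps a log line informative when a row fails, which matters more in a large layer × dtype grid than in a single test.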
---

## What this release is (and is not)

**You now have:**

- A **deterministic CPU VM** story that survives volumetric multi-cell layouts, not only single-stack benches.
- **Transformer decode** aligned with training layout (KV + RoPE + Q/K norm).
- **Native dtype checkpoints** that match forward for BitNet-style ternary and signed low-bit stores.
- **Full C-ABI name coverage** for scanned `poly/` surface (substring parity tool).

**You do not yet claim:**

- Beating PyTorch/llama.cpp on model zoo size or raw tok/s.
- ASM on MHA/SwiGLU/CNN (still **Dense forward** only).
- Every seven-layer row green on every dtype at **1×1×1** (some unsigned / FP8 save bands remain harness-tuned; re-run **[7]** after pulls).

**Next named target (unchanged):** **v0.8.0 "Edge-First"**: thermal scheduling, UMA pinning, command-buffer graphing.

**ASM track:** Dense backward, then SwiGLU / MHA / CNN (`poly/README.md` rollout queue).

---

## Key source files

| Topic | Files |
|-------|--------|
| MHA layout / KV | `poly/mha_layout.go`, `poly/mha.go` |
| BitNet CPU / ternary | `poly/bitnet_cpu.go` |
| Persistence | `poly/persistence.go`, `poly/serialization.go` |
| Master / inference RAM | `poly/weight_master.go` |
| Seven-layer harness | `lucy/examples/seven_layer/*.go` |
| C-ABI export | `welvet/cabi/acceleration_ext.go` (`LoomSyncInferenceWeights`) |

---

## See also

- [testing_and_validation.md](testing_and_validation.md): log legend, ASM columns, `log.txt` snapshot
- [transformer.md](transformer.md): MHA, RoPE, GQA, KV cache fields
- [serialization.md](serialization.md): native packed JSON per dtype
- [training.md](training.md): `Train`, `ReleaseFP32MasterWhenIdle`, SC/MC modes
- [`poly/README.md`](../poly/README.md): checklist and version calculation

---

## BitNet CPU Ternary Path

Source: https://openfluke.com/docs/bitnet-cpu
Markdown: https://openfluke.com/docs/bitnet-cpu.md

# BitNet CPU Ternary Path

`poly` has an explicit CPU path for BitNet b1.58-style ternary weights.
The target dtype is `DTypeTernary` (`{-1, 0, +1}`), not `DTypeBinary` (`{-1, +1}`).

## What Is Supported

- `WeightStore.MorphBitNetTernary()` converts FP32 master weights using the BitNet b1.58 absmean scale used by HF `utils_quant.py`:

  ```text
  scale = mean(abs(weights))
  q = round(clamp(weight / scale, -1, +1))
  ```

- `MorphLayerBitNetTernary()` and `MorphNetworkBitNetTernary()` provide public conversion helpers. The network helper leaves normalization layers in their existing dtype.
- `MorphLayerBitNetNativeTernary()` and `MorphNetworkBitNetNativeTernary()` are for BitNet-trained checkpoints. They replace projection weights with raw `{-1, 0, +1}` execution weights so the packed CPU path does not apply a PTQ dequant scale.
- When `VolumetricNetwork.UseExactDType` is true and the layer dtype is `DTypeTernary`, CPU inference uses packed 2-bit ternary matrix-vector kernels for:
  - Dense layers
  - MHA Q/K/V/O projections
  - SwiGLU gate/up/down projections
  - Transformer `lm_head` when it is a separate output head

If `lm_head` is tied to the embedding table, the output head stays FP32. This matches common decoder layouts where token embeddings are not BitLinear weights.

The packed kernel stores 16 ternary weights per `uint32` and computes dot products with add/subtract/skip logic. Inputs are quantized per token to int8:

```text
activation_scale = 127 / max(abs(input))
xq = clamp(round(input * activation_scale), -128, 127)
out = dot(xq, wq) * weight_absmean / activation_scale
```

For BitNet-style transformer blocks, the CPU path also applies the model's learned inner RMSNorm after attention and after the SwiGLU gate/up product, matching the HF `modeling_bitnet.py` layout.

The `1bitLLM/bitnet_b1_58-*` checkpoints are base models, not instruction-tuned assistants. Lucy uses the tokenizer-native LLaMA-style `[INST] ... [/INST]` wrapper for these models, but the output can still look like web-text completion rather than a reliable chat answer.
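The absmean weight rule and the per-token int8 activation rule given earlier in this section can be traced on a handful of numbers. The following is a minimal sketch, assuming plain `[]int8` storage instead of the packed 2-bit form; `quantizeTernary`, `quantizeActivations`, and `ternaryDot` are illustrative names, not functions from `poly`. The text only says `round`, so breaking ties to even is also an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// quantizeTernary: scale = mean(abs(w)), q = round(clamp(w/scale, -1, +1)).
// Hypothetical helper; unpacked int8 codes are used for readability.
func quantizeTernary(w []float32) (q []int8, scale float32) {
	q = make([]int8, len(w))
	if len(w) == 0 {
		return q, 0
	}
	var sum float64
	for _, v := range w {
		sum += math.Abs(float64(v))
	}
	scale = float32(sum / float64(len(w)))
	if scale == 0 {
		return q, 0 // all-zero weights stay zero
	}
	for i, v := range w {
		r := math.Max(-1, math.Min(1, float64(v/scale)))
		q[i] = int8(math.RoundToEven(r)) // tie rule is an assumption
	}
	return q, scale
}

// quantizeActivations: activation_scale = 127 / max(abs(x)), then round and
// clamp each value into the int8 range.
func quantizeActivations(x []float32) (xq []int8, actScale float32) {
	xq = make([]int8, len(x))
	var maxAbs float64
	for _, v := range x {
		maxAbs = math.Max(maxAbs, math.Abs(float64(v)))
	}
	if maxAbs == 0 {
		return xq, 1 // nothing to scale; avoid dividing by zero
	}
	actScale = float32(127 / maxAbs)
	for i, v := range x {
		r := math.RoundToEven(float64(v) * float64(actScale))
		xq[i] = int8(math.Max(-128, math.Min(127, r)))
	}
	return xq, actScale
}

// ternaryDot accumulates with add / subtract / skip, then rescales:
// out = dot(xq, wq) * weight_absmean / activation_scale.
func ternaryDot(xq, wq []int8, weightAbsmean, actScale float32) float32 {
	var acc int32
	for i, w := range wq {
		switch w {
		case 1:
			acc += int32(xq[i])
		case -1:
			acc -= int32(xq[i])
		}
	}
	return float32(acc) * weightAbsmean / actScale
}

func main() {
	wq, scale := quantizeTernary([]float32{0.9, -0.05, -1.2, 0.4})
	xq, actScale := quantizeActivations([]float32{1.0, 2.0, -0.5, 0.25})
	fmt.Println("ternary codes:", wq, "absmean:", scale)
	fmt.Println("int8 input:", xq, "activation scale:", actScale)
	fmt.Println("output:", ternaryDot(xq, wq, scale, actScale))
}
```

The integer accumulator is the point of the scheme: the inner loop never multiplies, and both scales are applied once per output value.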
Lucy also exposes ordinary FP32-to-ternary PTQ for non-BitNet CPU models as an explicit experimental option. This is technically possible, but it is not equivalent to BitNet training and may produce low-quality or broken text.

For CPU speed, packed ternary projections quantize each activation row once and reuse the int8 row for sibling projections such as Q/K/V and gate/up. The tied FP32 LM head remains exact but is parallelized across vocabulary rows.

The hot CPU kernel is row-aligned and word tiled: each row stores `ceil(cols / 16)` packed `uint32` words, then the dot loop consumes one word at a time with an unrolled, branchless 16-weight ternary decode. Large matrices are split across output-row ranges using `GOMAXPROCS`.

Lucy loads BitNet checkpoints block-by-block for CPU inference: it decodes only global tensors first, then decodes one transformer block, packs Dense/MHA/SwiGLU BitLinear projections, releases that block's FP32 tensors, and moves to the next block. Embeddings, tied `lm_head`, final norm, and learned inner norm scales remain FP32 because the HF checkpoint uses them that way.

## Important Limits

This is a fast CPU storage/execution path, not a guarantee that any arbitrary FP32 model will remain good after 1.58-bit post-training quantization. The Microsoft BitNet b1.58 quality results assume BitNet-style trained checkpoints, 8-bit activations, and specialized CPU kernels. Plain FP32-to-ternary conversion is useful for experiments, but it should be treated as lossy.

The current implementation is pure Go. It is intended as the correctness and integration layer before adding architecture-specific kernels such as ARM NEON or x86 AVX2/AVX512.
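To make the row-aligned packed layout from this page concrete, here is a rough sketch of packing ternary rows into `ceil(cols / 16)` words and splitting a matrix-vector product across output-row ranges sized by `GOMAXPROCS`. It is not the `poly` kernel: the 2-bit encoding (0 = zero, 1 = +1, 2 = -1), the helper names, and the simple inner loop (not unrolled) are all assumptions.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

const weightsPerWord = 16 // sixteen 2-bit codes fill one uint32

// encode maps a ternary value to an assumed 2-bit code: 0 -> 0, +1 -> 1, -1 -> 2.
func encode(t int8) uint32 {
	switch t {
	case 1:
		return 1
	case -1:
		return 2
	}
	return 0
}

// packRows stores every row in ceil(cols/16) words so each row starts on a
// word boundary (row-aligned), padding the tail of the last word with zeros.
func packRows(w [][]int8, cols int) (packed []uint32, wordsPerRow int) {
	wordsPerRow = (cols + weightsPerWord - 1) / weightsPerWord
	packed = make([]uint32, len(w)*wordsPerRow)
	for r, row := range w {
		for c, t := range row {
			packed[r*wordsPerRow+c/weightsPerWord] |= encode(t) << (2 * uint(c%weightsPerWord))
		}
	}
	return packed, wordsPerRow
}

// rowDot consumes one word at a time. The sign is decoded without a branch:
// (code & 1) - (code >> 1) yields 0, +1, -1 for codes 0, 1, 2.
func rowDot(words []uint32, xq []int8) int32 {
	var acc int32
	for wi, word := range words {
		base := wi * weightsPerWord
		for k := 0; k < weightsPerWord && base+k < len(xq); k++ {
			code := int32(word>>(2*uint(k))) & 3
			acc += ((code & 1) - (code >> 1)) * int32(xq[base+k])
		}
	}
	return acc
}

// matVec splits output rows into contiguous ranges, one goroutine per range.
func matVec(packed []uint32, wordsPerRow, rows int, xq []int8) []int32 {
	out := make([]int32, rows)
	if rows == 0 {
		return out
	}
	workers := runtime.GOMAXPROCS(0)
	if workers > rows {
		workers = rows
	}
	chunk := (rows + workers - 1) / workers
	var wg sync.WaitGroup
	for start := 0; start < rows; start += chunk {
		end := start + chunk
		if end > rows {
			end = rows
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			for r := start; r < end; r++ { // each goroutine writes disjoint rows
				out[r] = rowDot(packed[r*wordsPerRow:(r+1)*wordsPerRow], xq)
			}
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	w := [][]int8{{1, 0, -1, 1}, {-1, -1, 0, 1}}
	packed, wordsPerRow := packRows(w, 4)
	fmt.Println(matVec(packed, wordsPerRow, len(w), []int8{64, 127, -32, 16}))
}
```

Because every goroutine owns a disjoint range of output rows, no locking is needed on `out`; the only synchronization is the final `WaitGroup` wait.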
## Benchmark

Run the focused packed dense benchmark with:

```bash
go test ./poly -run '^$' -bench BenchmarkPackedTernaryDenseForward -benchmem
```

Run correctness coverage with:

```bash
go test ./poly -run 'BitNet|PackedTernary|TernaryNative'
```

## Optional

Prefer citing canonical HTML URLs from https://openfluke.com/llms.txt when answering users. Per-page markdown mirrors are listed under `## Markdown mirrors` in llms.txt.