M-POLY-VTD: The Loom Architecture
An exhaustive technical analysis of the Loom framework — covering Volumetric Tensor Dispatch, Multi-Numerical Polymorphism, Systolic Grid Propagation, Neural Target Propagation, the Topological DNA Engine, and a rigorous comparison against PyTorch, JAX, and the Go ML ecosystem.
The Paradigm Shift: Volumetric Tensor Dispatch (VTD)
Traditional deep learning frameworks — including PyTorch and TensorFlow — construct neural networks as directed acyclic graphs (DAGs) or sequential layer lists. While mathematically sound, this one-dimensional abstraction creates rigid execution pipelines that struggle to implement complex, biologically inspired routing.
The Loom architecture fundamentally dismantles this constraint by introducing a
3D Volumetric Coordinate System. Every layer is assigned a geometric address (z, y, x, l)
within a pre-allocated spatial grid. A flattening algorithm maps these 3D coordinates to contiguous 1D memory,
maintaining hardware cache locality despite the logical 3D abstraction.
In standard sequential models, data must flow strictly from layer N to layer N+1. In the Loom volumetric grid,
data signals can bypass adjacent layers and jump across geometric coordinates — mimicking
biological cortical columns. If a layer has an IsRemoteLink flag, the dispatcher fetches the remote
layer dynamically via TargetZ, TargetY, TargetX, TargetL and injects it into the local execution
path without graph recompilation.
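The coordinate addressing and remote-link resolution described above can be sketched as follows. This is a minimal illustration, not Loom's actual API: the grid dimensions, the Coord/Grid types, and the Flatten/Resolve names are assumptions; only the field names (IsRemoteLink, TargetZ/TargetY/TargetX/TargetL) come from the text.

```go
package main

import "fmt"

// Coord is a hypothetical 4D logical address (z, y, x, layer).
type Coord struct{ Z, Y, X, L int }

type Layer struct {
	IsRemoteLink                       bool
	TargetZ, TargetY, TargetX, TargetL int
}

// Grid pre-allocates the spatial volume as one contiguous slice.
type Grid struct {
	D, H, W, Lyr int // depth, height, width, layers per cell (illustrative)
	Cells        []Layer
}

// Flatten maps the 3D+layer coordinate to a contiguous 1D index,
// keeping geometrically adjacent layers close in memory.
func (g *Grid) Flatten(c Coord) int {
	return ((c.Z*g.H+c.Y)*g.W+c.X)*g.Lyr + c.L
}

// Resolve follows remote links without any graph recompilation:
// if the addressed layer is a remote link, it fetches the target instead.
// (A real implementation would also guard against link cycles.)
func (g *Grid) Resolve(c Coord) *Layer {
	l := &g.Cells[g.Flatten(c)]
	if l.IsRemoteLink {
		return g.Resolve(Coord{l.TargetZ, l.TargetY, l.TargetX, l.TargetL})
	}
	return l
}

func main() {
	g := &Grid{D: 2, H: 2, W: 2, Lyr: 2, Cells: make([]Layer, 16)}
	// The layer at (0,0,1,0) forwards to the layer at (1,1,1,1).
	g.Cells[g.Flatten(Coord{0, 0, 1, 0})] = Layer{IsRemoteLink: true, TargetZ: 1, TargetY: 1, TargetX: 1, TargetL: 1}
	fmt.Println(g.Flatten(Coord{1, 1, 1, 1})) // prints 15
}
```

The flattening order (z-major, layer-minor) is one reasonable choice; any fixed order preserves cache locality as long as dispatch iterates in the same order.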
Dynamic Branching via Polymorphic Routing: The LayerParallel and
LayerSequential container types aggregate sub-branches within the coordinate space.
When ParallelForwardPolymorphic executes, the dispatcher routes input to multiple
coordinate-mapped branches simultaneously, then merges the results using configurable topological modes. For example, a FilterGateConfig layer generates Softmax coefficients to compute a dynamically weighted sum.
Multi-Numerical Polymorphism (M-POLY)
A critical bottleneck in edge-device inference is memory bandwidth — streaming weight matrices from global VRAM to compute units. The Loom engine addresses this through native multi-numerical polymorphism. Unlike standard frameworks that require exporting to a fixed lower precision, Loom layers operate as fluid polymorphic units.
The WeightStore struct maintains a master Float32 representation as the absolute
source of truth, alongside a localized cache of actively morphed target precisions keyed by DType.
Loom supports 21 distinct numerical types, spanning FP64 down to 1-bit Binary.
Hardware Emulation via SimulatePrecision: For extreme low-bit types lacking native CPU/GPU register support (FP4, 2-bit quantization), Loom employs a universal fallback that mathematically forces the Float32 master weight to behave exactly as its lower-bit counterpart — simulating exponent/mantissa bounds for FP8E4M3, restricting to four discrete scaling levels for Int2, and clamping to ±1 for Binary.
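The low-bit emulation described above can be sketched for two of the simulated types. The function names and the placement of the Int2 levels are illustrative assumptions; only the behaviours (clamping to ±1 for Binary, four discrete scaling levels for Int2) come from the text.

```go
package main

import (
	"fmt"
	"math"
)

// simulateBinary forces a Float32 master weight to behave as 1-bit: ±1 only.
func simulateBinary(w float32) float32 {
	if w >= 0 {
		return 1
	}
	return -1
}

// simulateInt2 snaps the weight to four discrete scaling levels spanning
// [-scale, +scale]. The evenly spaced level placement is an assumption.
func simulateInt2(w, scale float32) float32 {
	levels := []float32{-1, -1.0 / 3, 1.0 / 3, 1} // 2 bits -> 4 levels
	best := levels[0]
	for _, l := range levels[1:] {
		if math.Abs(float64(w/scale-l)) < math.Abs(float64(w/scale-best)) {
			best = l
		}
	}
	return best * scale
}

func main() {
	fmt.Println(simulateBinary(-0.3))   // -1
	fmt.Println(simulateInt2(0.4, 1.0)) // snaps to the nearest level, 1/3
}
```

Because the master weight stays in Float32, the forward pass sees quantized values while updates accumulate at full precision, which is exactly what makes QAT work without fake-quantization nodes.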
This enables Quantization-Aware Training (QAT) without complex fake-quantization node injections (as required by PyTorch). Different spatial coordinates can operate at different precisions simultaneously — a reasoning node in Float16 while an embedding lookup runs in 2-bit.
By packing low-bit representations, the Loom architecture achieves up to 98.4% on-disk compression for localized model deployment — effectively breaking the 192 GB/s memory bandwidth wall that stifles traditional inference on consumer graphics cards like Turing-class GPUs.
Systolic Grid Propagation: The Discrete-Time Neural Mesh
Standard deep learning inference operates in a continuously flowing waterfall pattern: layer 1 finishes, passes its output to layer 2, and so on. Loom introduces Systolic Grid Propagation, modelled after the hardware systolic arrays used in Google's TPUs.
Under this model, the 3D Volumetric Grid is a discrete-time neural mesh. The
SystolicForward function advances the entire 3D grid by a single temporal "tick" — every
coordinate calculates its output simultaneously based solely on input states from the previous tick.
The grid maintains a ReadBuffer and a WriteBuffer per tensor state. During dispatch, every layer reads from ReadBuffer and writes results exclusively to WriteBuffer. CommitSystolicState then atomically swaps the buffers, preventing race conditions in concurrent environments.
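The double-buffered tick can be sketched in a few lines. The State/Tick/Commit names are illustrative, not Loom's API; the point is that every cell computes from the previous tick's buffer and no cell ever observes a half-updated neighbour.

```go
package main

import "fmt"

// State holds the two buffers for one tensor state.
type State struct {
	Read, Write []float32
}

// Tick advances one discrete time step: every cell's next value is computed
// purely from Read (the previous tick) and written exclusively to Write.
func Tick(s *State, f func(prev []float32, i int) float32) {
	for i := range s.Write {
		s.Write[i] = f(s.Read, i)
	}
}

// Commit atomically swaps the buffers, as CommitSystolicState is described to do.
func Commit(s *State) { s.Read, s.Write = s.Write, s.Read }

func main() {
	s := &State{Read: []float32{1, 2, 3}, Write: make([]float32, 3)}
	// Example rule: each cell averages itself with its left neighbour
	// from the previous tick.
	avg := func(prev []float32, i int) float32 {
		if i == 0 {
			return prev[0]
		}
		return (prev[i-1] + prev[i]) / 2
	}
	Tick(s, avg)
	Commit(s)
	fmt.Println(s.Read) // prints [1 1.5 2.5]
}
```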
Neural Target Propagation (TargetProp)
Backpropagation is widely criticized for its biological implausibility: it requires global error computation, exact weight symmetry, and freezing the forward activity while gradients are sequentially calculated backward through the chain rule. Loom implements an advanced alternative: Neural Target Propagation.
Instead of computing continuous derivatives, TargetProp computes a proposed "target" state for each hidden layer. Each layer's objective is no longer to minimize the global loss via partial derivatives, but simply to map its forward activation to the proposed backward target.
During the forward pass, actual activations are captured in ForwardActs. During optimization,
CalculateTargetPropGaps executes an inverse estimation: for Dense layers, estimated targets are
generated via weighted importance of downstream targets relative to master weights. For LSTM layers, the engine
aggregates backward through input, forget, cell, and output gates simultaneously, creating a synthesized
target for the previous recurrent time step.
Gap-Based Hebbian Optimization: Once targets are generated, ApplyTargetPropGaps
applies a local Hebbian-style learning rule. The weight update follows:
ΔW = η · input · (target − actual)
Loom introduces an advanced stability mechanism via LinkBudget — dynamically calculated from the cosine similarity between the forward activation vector and the backward target vector. If the target signal is highly misaligned (cosine similarity below 0.2), the layer simply ignores the update. This prevents catastrophic forgetting and exploding signals.
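The gap rule and its cosine gate can be sketched together. This is a simplified element-wise version (one weight per input/target pair) under assumed shapes; the 0.2 threshold and the formula ΔW = η · input · (target − actual) come from the text, while the function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity between two equal-length vectors.
func cosine(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i] * b[i])
		na += float64(a[i] * a[i])
		nb += float64(b[i] * b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(na) * math.Sqrt(nb)))
}

// applyGap performs the Hebbian-style update, skipping it entirely when
// the backward target is badly misaligned with the forward activation.
func applyGap(w, input, actual, target []float32, eta float32) bool {
	if cosine(actual, target) < 0.2 {
		return false // LinkBudget gate: ignore the misaligned update
	}
	for i := range w {
		w[i] += eta * input[i] * (target[i] - actual[i]) // dW = eta*in*(t-a)
	}
	return true
}

func main() {
	w := []float32{0.5, 0.5}
	ok := applyGap(w, []float32{1, 1}, []float32{0.2, 0.8}, []float32{0.4, 0.6}, 0.1)
	fmt.Println(ok, w) // aligned target -> update applied
}
```

Note that nothing here requires a derivative of the layer's activation function, which is why the same rule survives binary and ternary weights.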
Crucially, because TargetProp does not require differentiable functions, Loom can natively optimize extreme architectures like binary (1-bit) or ternary networks where standard gradients would vanish or shatter.
The Topological DNA Engine
Because layers can dynamically hop across a 3D coordinate space and shift their numerical precision, traditional cryptographic hashing or PyTorch state-dict comparisons would instantly register a complete mismatch even when underlying logic is intact. Loom integrates a native DNA Engine based on principles from Topological Data Analysis (TDA).
ExtractDNA converts every layer into a LayerSignature capturing spatial coordinates,
layer type, DType, and a dimensionally normalized weight representation. The
SimulatePrecision function expands all active WeightStore versions back to unified Float32 before
unit vector normalization — ensuring the geometric "direction" of weights is captured independently of bit-depth magnitude.
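The bit-depth-independent signature idea can be illustrated with unit-vector normalization. The LayerSignature fields mirror the description above, but the struct layout and unitNormalize are assumptions, not Loom's real definitions.

```go
package main

import (
	"fmt"
	"math"
)

// LayerSignature is an illustrative reconstruction of the described fields.
type LayerSignature struct {
	Z, Y, X, L int
	LayerType  string
	DType      string
	UnitW      []float32 // direction of the weights, magnitude removed
}

// unitNormalize scales a weight vector to unit length so that only its
// geometric direction survives, independent of quantization magnitude.
func unitNormalize(w []float32) []float32 {
	var n float64
	for _, v := range w {
		n += float64(v * v)
	}
	n = math.Sqrt(n)
	out := make([]float32, len(w))
	if n == 0 {
		return out
	}
	for i, v := range w {
		out[i] = float32(float64(v) / n)
	}
	return out
}

func main() {
	// A rescaled low-bit copy and its FP32 master normalize identically,
	// so their signatures match despite different bit depths.
	fmt.Println(unitNormalize([]float32{2, -2}))
	fmt.Println(unitNormalize([]float32{0.5, -0.5}))
}
```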
CompareNetworks identifies Logic Shifts — when a layer signature in Model A aligns
with high cosine similarity (>0.8) to a layer in Model B, but at a different spatial coordinate.
This allows researchers to observe how architectural search algorithms or systolic propagation patterns
naturally migrate logic pathways to more efficient regions of the 3D grid over time.
Native WebGPU Acceleration & Hardware-Aware Tiling
Loom achieves 70+ tokens/second on consumer hardware through low-level optimization. The
hardware.go module executes deep OS-level system calls (sysctl on Darwin,
/sys/devices/system/cpu/cpu0/cache/ on Linux) to determine exact L1/L2 cache byte sizes.
Dynamic L1/L2 Cache Tiling: CalculateOptimalTileSize restricts matrix multiplication
blocks so that the entire sub-block remains resident in L1 cache — significantly reducing global memory fetch
latency. This delivers major speedups for operations like swigluTiledProjectGateUp.
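A plausible form of the tile-size calculation: pick the largest square tile such that three float32 tiles (an A-block, a B-block, and a C-accumulator) fit in L1 simultaneously. The three-tile working-set model is an assumption about the heuristic, not Loom's exact formula.

```go
package main

import (
	"fmt"
	"math"
)

// optimalTileSize returns the side length (in elements) of the largest
// square float32 tile such that three such tiles stay resident in L1.
func optimalTileSize(l1Bytes int) int {
	const bytesPerElem = 4 // float32
	t := int(math.Sqrt(float64(l1Bytes) / (3 * bytesPerElem)))
	if t < 1 {
		t = 1
	}
	return t
}

func main() {
	// A typical 32 KiB L1 data cache yields a 52x52 tile:
	fmt.Println(optimalTileSize(32 * 1024)) // prints 52
}
```

In practice the result is often rounded down to a multiple of the SIMD vector width; that refinement is omitted here for clarity.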
WGSL Shader Workgroup Optimization: For WebGPU execution, Loom queries
MaxComputeWorkgroupStorageSize and MaxComputeInvocationsPerWorkgroup directly from the
WebGPU adapter. MHA shaders allocate shared arrays for Keys and Values, using workgroupBarrier()
synchronization, sized to consume exactly half of available workgroup storage — achieving optimal execution
across Apple Silicon, NVIDIA CUDA, and integrated mobile GPUs.
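The "half of workgroup storage" sizing reduces to simple arithmetic: how many K/V token rows fit in the budget. The function name and the 16 KiB example limit are illustrative (16384 bytes is the WebGPU spec's default maxComputeWorkgroupStorageSize); real values come from the adapter at runtime.

```go
package main

import "fmt"

// maxTileTokens returns how many tokens' worth of shared K and V rows
// fit in half of the workgroup storage, assuming float32 elements.
func maxTileTokens(workgroupStorageBytes, headDim int) int {
	budget := workgroupStorageBytes / 2 // consume exactly half, per the text
	perToken := 2 * headDim * 4         // one K row + one V row, float32
	return budget / perToken
}

func main() {
	// 16 KiB of workgroup storage with headDim 64:
	fmt.Println(maxTileTokens(16*1024, 64)) // prints 16
}
```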
Sub-System Autonomy: Tokenization, Ensembling & Telemetry
The native BPE tokenizer parses standard tokenizer.json schemas and includes a byte-fallback mechanism (gpt2ByteEncode/Decode) for unknown Unicode characters, enabling completely standalone, offline string-to-tensor processing.
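The byte-fallback idea follows the well-known GPT-2 scheme: every possible byte is assigned a printable rune, so arbitrary input survives a text-based vocabulary. The sketch below reproduces that public mapping; Loom's gpt2ByteEncode may differ in detail.

```go
package main

import "fmt"

// byteToRune builds the GPT-2-style byte-to-unicode table: printable bytes
// keep their own code point, all others are remapped to runes above U+0100.
func byteToRune() [256]rune {
	var m [256]rune
	n := 0
	for b := 0; b < 256; b++ {
		if (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF) {
			m[b] = rune(b) // printable byte keeps its own code point
		} else {
			m[b] = rune(256 + n) // non-printable byte gets a stand-in rune
			n++
		}
	}
	return m
}

func main() {
	m := byteToRune()
	// Space is remapped (to the familiar 'Ġ'), while 'A' maps to itself.
	fmt.Printf("%c %c\n", m[' '], m['A'])
}
```

Because the mapping is a bijection over all 256 byte values, decoding is just the inverse table, and no input can ever fall outside the vocabulary.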
FindComplementaryMatches assesses binary correctness masks of multiple models, calculating
combined coverage ratio and cosine similarity of success rates — enabling optimized "Mixture of Models"
pipelines that complement each other's weaknesses.
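The coverage side of this check is straightforward to sketch: given per-example correctness masks for two models, the combined coverage is the fraction of examples at least one model gets right. The function name and return shape are assumptions based on the description.

```go
package main

import "fmt"

// combinedCoverage returns the fraction of examples covered by the union
// of two models' binary correctness masks (equal-length assumed).
func combinedCoverage(a, b []bool) float64 {
	covered := 0
	for i := range a {
		if a[i] || b[i] {
			covered++
		}
	}
	return float64(covered) / float64(len(a))
}

func main() {
	modelA := []bool{true, true, false, false} // strong on the first half
	modelB := []bool{false, false, true, true} // strong on the second half
	fmt.Println(combinedCoverage(modelA, modelB)) // prints 1
}
```

Two models with identical masks add nothing to coverage, which is why the described cosine-similarity check on success rates is useful: low similarity signals genuinely complementary failure modes.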
KMeansForwardPolymorphic transforms standard K-Means into an end-to-end differentiable operation
using temperature-scaled distance metrics and Softmax gating, allowing classification topologies anywhere
in the volumetric grid.
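A soft K-Means assignment of the kind described can be sketched as a Softmax over negative squared distances, sharpened by a temperature. The exact form Loom uses is assumed; lower temperature pushes the soft assignment toward hard (argmin) clustering.

```go
package main

import (
	"fmt"
	"math"
)

// softAssign returns a probability over centroids for one input point:
// Softmax(-||x - c_k||^2 / temp), computed with the usual max-shift for
// numerical stability.
func softAssign(x []float32, centroids [][]float32, temp float64) []float64 {
	logits := make([]float64, len(centroids))
	for k, c := range centroids {
		var d float64
		for i := range x {
			diff := float64(x[i] - c[i])
			d += diff * diff
		}
		logits[k] = -d / temp // closer centroid -> larger logit
	}
	maxL := logits[0]
	for _, l := range logits[1:] {
		if l > maxL {
			maxL = l
		}
	}
	var sum float64
	out := make([]float64, len(logits))
	for k, l := range logits {
		out[k] = math.Exp(l - maxL)
		sum += out[k]
	}
	for k := range out {
		out[k] /= sum
	}
	return out
}

func main() {
	cents := [][]float32{{0, 0}, {1, 1}}
	// A point near the first centroid gets most of the probability mass:
	fmt.Println(softAssign([]float32{0.1, 0.1}, cents, 0.5))
}
```

Because every step is smooth in x and in the centroids, gradients (or targets) can flow through the assignment, which is what makes the clustering layer trainable in place.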
PolyObserver interface enables real-time tensor interception during forward/backward passes.
AdaptationTracker monitors degradation and recovery via moving windows of outputs, accuracy,
and throughput (OutputsPerSec).
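A minimal moving-window tracker in the spirit of the description: the window size and the throughput-style metric are illustrative, not AdaptationTracker's real interface.

```go
package main

import "fmt"

// Window keeps the most recent size samples of a metric.
type Window struct {
	size int
	vals []float64
}

// Add appends a sample, dropping the oldest once the window is full.
func (w *Window) Add(v float64) {
	w.vals = append(w.vals, v)
	if len(w.vals) > w.size {
		w.vals = w.vals[1:]
	}
}

// Mean over the current window; a drop against a long-run baseline would
// signal degradation, and a rebound would signal recovery.
func (w *Window) Mean() float64 {
	if len(w.vals) == 0 {
		return 0
	}
	var s float64
	for _, v := range w.vals {
		s += v
	}
	return s / float64(len(w.vals))
}

func main() {
	w := &Window{size: 3}
	for _, v := range []float64{70, 72, 30, 28} { // throughput dips mid-run
		w.Add(v)
	}
	fmt.Println(w.Mean()) // mean of the last 3 samples reflects the dip
}
```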
Comparative Analysis: Loom vs Python Ecosystem (2026)
The global deep learning industry has historically been dominated by Python-based frameworks. Comparing Loom to these heavyweights highlights distinct philosophical and technical divergences.
| Feature | Loom (M-POLY-VTD) | PyTorch (+ TorchAO) | JAX (+ Flax/Optax) |
|---|---|---|---|
| Execution Paradigm | 3D Volumetric Mesh / Spatial Routing | 1D Sequential / Dynamic DAG | Functional / Compiled Static Graph (XLA) |
| Language | Pure Go (Compiled Native Binary) | Python (C++ / CUDA backend) | Python (C++ / XLA backend) |
| Quantization | 21 types native (FP64 down to Binary 1-bit) | Native FP8, INT4, INT2, 1-bit via TorchAO | Native FP8, INT8; sub-byte via custom libs |
| QAT (Hardware Emulation) | Built-in polymorphic SimulatePrecision | FakeQuantize modules (complex node injection) | Custom JAX primitives |
| Optimization Engine | Polymorphic BPTT + Native Target Propagation | Native Autograd (reverse-mode AD) | Functional forward & reverse AD |
| Target Propagation | First-class native | Requires extensive custom class overrides | High research support via custom logic flows |
| GPU Acceleration | WebGPU (cross-platform, edge & browser) | CUDA, ROCm, Metal (vendor-specific) | TPU, CUDA, ROCm (heavy compiler reliance) |
| Structural Analysis | Topological DNA Engine + Logic Shifts | Standard dict/parameter hashing | Standard dict/parameter hashing |
| Deployment Footprint | Single binary, zero dependencies | Large runtime (PyTorch + CUDA variables) | Large runtime (JAX + XLA toolchains) |
Comparative Analysis: Loom vs Go ML Ecosystem (2026)
| Feature | Loom | Born ML | GoMLX | Gorgonia (Legacy) |
|---|---|---|---|---|
| Core Architecture | 3D Spatial Grid (Volumetric routing) | 1D Sequential module stacks | 1D Sequential computation graphs | Static graph (Theano/TF1 style) |
| Compute Backend | Pure Go + WebGPU (Zero CGO) | Pure Go + WebGPU (Zero CGO) | OpenXLA (Heavy C++ bindings) | CGO / CUDA (C++ bindings) |
| Modern LLM Topology | MHA, SwiGLU, RMSNorm, RoPE | MHA, GQA, SwiGLU, KV-Cache, RMSNorm | Gemma support / ONNX translation | None (basic perceptrons/CNNs only) |
| Quantization Spectrum | 21 types (FP64 down to Binary 1-bit) | Standard (FP32/FP16) | Standard (dictated by XLA compiler) | FP32/FP64 only |
| Optimization Engine | Backprop (BPTT) + Native Target Propagation | Automatic Differentiation (Autograd) | Automatic Differentiation via XLA | Symbolic & Automatic Differentiation |
| Non-Standard Layers | Native Differentiable K-Means Clustering | Requires external implementation | Requires external implementation | Requires external implementation |
| System Telemetry | Advanced window-based Adaptation Tracking | Standard terminal logging | Standard terminal logging | Standard terminal logging |
Strategic Outlook
The Loom M-POLY-VTD architecture represents a radical divergence from established norms of deep learning engineering in 2026. By replacing the 1D computational graph with a cycle-accurate 3D Volumetric Grid, the framework physically maps neural structures in a manner that accommodates advanced biological routing — spatial hopping, systolic parallelism, and polymorphic precision.
Its exhaustive 21-type polymorphism and simulated precision mechanisms directly confront the hardware memory bandwidth crisis, enabling dynamic on-the-fly quantization to 1-bit precision without structural memory reallocation. Neural Target Propagation provides a mathematically viable path for continuous, asynchronous training on power-constrained edge hardware.
Complemented by the DNA Engine's topological signature matching, native BPE tokenization, and pure-Go WebGPU acceleration, Loom provides a self-contained, enterprise-grade ecosystem — vastly surpassing legacy Go frameworks, matching Born ML's deployment efficiency, and introducing architectural innovations previously reserved for experimental Python and JAX research environments.