M-POLY-VTD: The Loom Architecture
An exhaustive technical analysis of the Loom framework — covering Volumetric Tensor Dispatch, Multi-Numerical Polymorphism, Systolic Grid Propagation, Neural Target Propagation, the Topological DNA Engine, and a rigorous comparison against PyTorch, JAX, and the Go ML ecosystem.
The Paradigm Shift: Volumetric Tensor Dispatch (VTD)
Traditional deep learning frameworks — including PyTorch and TensorFlow — construct neural networks as directed acyclic graphs (DAGs) or sequential layer lists. While mathematically sound, this one-dimensional abstraction creates rigid execution pipelines that struggle to implement complex, biologically inspired routing.
The Loom architecture fundamentally dismantles this constraint by introducing a
3D Volumetric Coordinate System. Every layer is assigned a geometric address (z, y, x, l)
within a pre-allocated spatial grid. A flattening algorithm maps these 3D coordinates to contiguous 1D memory,
maintaining hardware cache locality despite the logical 3D abstraction.
In standard sequential models, data must flow strictly from layer N to layer N+1. In the Loom volumetric grid,
data signals can bypass adjacent layers and jump across geometric coordinates — mimicking
biological cortical columns. If a layer has an IsRemoteLink flag, the dispatcher fetches the remote
layer dynamically via TargetZ, TargetY, TargetX, TargetL and injects it into the local execution
path without graph recompilation.
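The coordinate addressing and remote-link resolution described above can be sketched as follows. This is a minimal illustration, not Loom's actual API: the grid dimensions, the Coord/Grid types, and the Flatten/Resolve names are assumptions; only the field names (IsRemoteLink, TargetZ/TargetY/TargetX/TargetL) come from the text.

```go
package main

import "fmt"

// Coord is a hypothetical 4D logical address (z, y, x, layer).
type Coord struct{ Z, Y, X, L int }

type Layer struct {
	IsRemoteLink                       bool
	TargetZ, TargetY, TargetX, TargetL int
}

// Grid pre-allocates the spatial volume as one contiguous slice.
type Grid struct {
	D, H, W, Lyr int // depth, height, width, layers per cell (illustrative)
	Cells        []Layer
}

// Flatten maps the 3D+layer coordinate to a contiguous 1D index,
// keeping geometrically adjacent layers close in memory.
func (g *Grid) Flatten(c Coord) int {
	return ((c.Z*g.H+c.Y)*g.W+c.X)*g.Lyr + c.L
}

// Resolve follows remote links without any graph recompilation:
// if the addressed layer is a remote link, it fetches the target instead.
// (A real implementation would also guard against link cycles.)
func (g *Grid) Resolve(c Coord) *Layer {
	l := &g.Cells[g.Flatten(c)]
	if l.IsRemoteLink {
		return g.Resolve(Coord{l.TargetZ, l.TargetY, l.TargetX, l.TargetL})
	}
	return l
}

func main() {
	g := &Grid{D: 2, H: 2, W: 2, Lyr: 2, Cells: make([]Layer, 16)}
	// The layer at (0,0,1,0) forwards to the layer at (1,1,1,1).
	g.Cells[g.Flatten(Coord{0, 0, 1, 0})] = Layer{IsRemoteLink: true, TargetZ: 1, TargetY: 1, TargetX: 1, TargetL: 1}
	fmt.Println(g.Flatten(Coord{1, 1, 1, 1})) // prints 15
}
```

The flattening order (z-major, layer-minor) is one reasonable choice; any fixed order preserves cache locality as long as dispatch iterates in the same order.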
Dynamic Branching via Polymorphic Routing: The LayerParallel and
LayerSequential container types aggregate sub-branches within the coordinate space.
When ParallelForwardPolymorphic executes, the dispatcher routes input to multiple
coordinate-mapped branches simultaneously, then merges the results using configurable topological modes. For example, a FilterGateConfig layer generates Softmax coefficients to compute a dynamically weighted sum.
Multi-Numerical Polymorphism (M-POLY)
A critical bottleneck in edge-device inference is memory bandwidth — streaming weight matrices from global VRAM to compute units. The Loom engine addresses this through native multi-numerical polymorphism. Unlike standard frameworks that require exporting to a fixed lower precision, Loom layers operate as fluid polymorphic units.
The WeightStore struct maintains a master Float32 representation as the absolute
source of truth, alongside a localized cache of actively morphed target precisions keyed by DType.
Loom supports 21 distinct numerical types, spanning FP64 down to 1-bit Binary.
Hardware Emulation via SimulatePrecision: For extreme low-bit types lacking native CPU/GPU register support (FP4, 2-bit quantization), Loom employs a universal fallback that mathematically forces the Float32 master weight to behave exactly as its lower-bit counterpart — simulating exponent/mantissa bounds for FP8E4M3, restricting to four discrete scaling levels for Int2, and clamping to ±1 for Binary.
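The low-bit emulation described above can be sketched for two of the simulated types. The function names and the placement of the Int2 levels are illustrative assumptions; only the behaviours (clamping to ±1 for Binary, four discrete scaling levels for Int2) come from the text.

```go
package main

import (
	"fmt"
	"math"
)

// simulateBinary forces a Float32 master weight to behave as 1-bit: ±1 only.
func simulateBinary(w float32) float32 {
	if w >= 0 {
		return 1
	}
	return -1
}

// simulateInt2 snaps the weight to four discrete scaling levels spanning
// [-scale, +scale]. The evenly spaced level placement is an assumption.
func simulateInt2(w, scale float32) float32 {
	levels := []float32{-1, -1.0 / 3, 1.0 / 3, 1} // 2 bits -> 4 levels
	best := levels[0]
	for _, l := range levels[1:] {
		if math.Abs(float64(w/scale-l)) < math.Abs(float64(w/scale-best)) {
			best = l
		}
	}
	return best * scale
}

func main() {
	fmt.Println(simulateBinary(-0.3))   // -1
	fmt.Println(simulateInt2(0.4, 1.0)) // snaps to the nearest level, 1/3
}
```

Because the master weight stays in Float32, the forward pass sees quantized values while updates accumulate at full precision, which is exactly what makes QAT work without fake-quantization nodes.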
This enables Quantization-Aware Training (QAT) without complex fake-quantization node injections (as required by PyTorch). Different spatial coordinates can operate at different precisions simultaneously — a reasoning node in Float16 while an embedding lookup runs in 2-bit.
By packing low-bit representations, the Loom architecture achieves up to 98.4% on-disk compression for localized model deployment — effectively breaking the 192 GB/s memory bandwidth wall that stifles traditional inference on consumer graphics cards like Turing-class GPUs.
Systolic Grid Propagation: The Discrete-Time Neural Mesh
Standard deep learning inference operates in a continuously flowing waterfall pattern: layer 1 finishes, passes its output to layer 2, and so on. Loom introduces Systolic Grid Propagation, modelled after the hardware systolic arrays used in Google's TPUs.
Under this model, the 3D Volumetric Grid is a discrete-time neural mesh. The
SystolicForward function advances the entire 3D grid by a single temporal "tick" — every
coordinate calculates its output simultaneously based solely on input states from the previous tick.
The grid maintains a ReadBuffer and a WriteBuffer per tensor state. During dispatch, every layer reads from ReadBuffer and writes results exclusively to WriteBuffer. CommitSystolicState then atomically swaps the buffers, preventing race conditions in concurrent environments.
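The double-buffered tick can be sketched in a few lines. The State/Tick/Commit names are illustrative, not Loom's API; the point is that every cell computes from the previous tick's buffer and no cell ever observes a half-updated neighbour.

```go
package main

import "fmt"

// State holds the two buffers for one tensor state.
type State struct {
	Read, Write []float32
}

// Tick advances one discrete time step: every cell's next value is computed
// purely from Read (the previous tick) and written exclusively to Write.
func Tick(s *State, f func(prev []float32, i int) float32) {
	for i := range s.Write {
		s.Write[i] = f(s.Read, i)
	}
}

// Commit atomically swaps the buffers, as CommitSystolicState is described to do.
func Commit(s *State) { s.Read, s.Write = s.Write, s.Read }

func main() {
	s := &State{Read: []float32{1, 2, 3}, Write: make([]float32, 3)}
	// Example rule: each cell averages itself with its left neighbour
	// from the previous tick.
	avg := func(prev []float32, i int) float32 {
		if i == 0 {
			return prev[0]
		}
		return (prev[i-1] + prev[i]) / 2
	}
	Tick(s, avg)
	Commit(s)
	fmt.Println(s.Read) // prints [1 1.5 2.5]
}
```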
Neural Target Propagation (TargetProp)
Backpropagation is widely criticized for its biological implausibility: it requires global error computation, exact weight symmetry, and freezing the forward activity while gradients are sequentially calculated backward through the chain rule. Loom implements an advanced alternative: Neural Target Propagation.
Instead of computing continuous derivatives, TargetProp computes a proposed "target" state for each hidden layer. Each layer's objective is no longer to minimize the global loss via partial derivatives, but simply to map its forward activation to the proposed backward target.
During the forward pass, actual activations are captured in ForwardActs. During optimization,
CalculateTargetPropGaps executes an inverse estimation: for Dense layers, estimated targets are
generated via weighted importance of downstream targets relative to master weights. For LSTM layers, the engine
aggregates backward through input, forget, cell, and output gates simultaneously, creating a synthesized
target for the previous recurrent time step.
Gap-Based Hebbian Optimization: Once targets are generated, ApplyTargetPropGaps
applies a local Hebbian-style learning rule. The weight update follows:
ΔW = η · input · (target − actual)
Loom introduces an advanced stability mechanism via LinkBudget — dynamically calculated from the cosine similarity between the forward activation vector and the backward target vector. If the target signal is highly misaligned (cosine similarity below 0.2), the layer simply ignores the update. This prevents catastrophic forgetting and exploding signals.
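The gap rule and its cosine gate can be sketched together. This is a simplified element-wise version (one weight per input/target pair) under assumed shapes; the 0.2 threshold and the formula ΔW = η · input · (target − actual) come from the text, while the function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity between two equal-length vectors.
func cosine(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i] * b[i])
		na += float64(a[i] * a[i])
		nb += float64(b[i] * b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(na) * math.Sqrt(nb)))
}

// applyGap performs the Hebbian-style update, skipping it entirely when
// the backward target is badly misaligned with the forward activation.
func applyGap(w, input, actual, target []float32, eta float32) bool {
	if cosine(actual, target) < 0.2 {
		return false // LinkBudget gate: ignore the misaligned update
	}
	for i := range w {
		w[i] += eta * input[i] * (target[i] - actual[i]) // dW = eta*in*(t-a)
	}
	return true
}

func main() {
	w := []float32{0.5, 0.5}
	ok := applyGap(w, []float32{1, 1}, []float32{0.2, 0.8}, []float32{0.4, 0.6}, 0.1)
	fmt.Println(ok, w) // aligned target -> update applied
}
```

Note that nothing here requires a derivative of the layer's activation function, which is why the same rule survives binary and ternary weights.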
Crucially, because TargetProp does not require differentiable functions, Loom can natively optimize extreme architectures like binary (1-bit) or ternary networks where standard gradients would vanish or shatter.
The Topological DNA Engine
Because layers can dynamically hop across a 3D coordinate space and shift their numerical precision, traditional cryptographic hashing or PyTorch state-dict comparisons would instantly register a complete mismatch even when underlying logic is intact. Loom integrates a native DNA Engine based on principles from Topological Data Analysis (TDA).
ExtractDNA converts every layer into a LayerSignature capturing spatial coordinates,
layer type, DType, and a dimensionally normalized weight representation. The
SimulatePrecision function expands all active WeightStore versions back to unified Float32 before
unit vector normalization — ensuring the geometric "direction" of weights is captured independently of bit-depth magnitude.
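The bit-depth-independent signature idea can be illustrated with unit-vector normalization. The LayerSignature fields mirror the description above, but the struct layout and unitNormalize are assumptions, not Loom's real definitions.

```go
package main

import (
	"fmt"
	"math"
)

// LayerSignature is an illustrative reconstruction of the described fields.
type LayerSignature struct {
	Z, Y, X, L int
	LayerType  string
	DType      string
	UnitW      []float32 // direction of the weights, magnitude removed
}

// unitNormalize scales a weight vector to unit length so that only its
// geometric direction survives, independent of quantization magnitude.
func unitNormalize(w []float32) []float32 {
	var n float64
	for _, v := range w {
		n += float64(v * v)
	}
	n = math.Sqrt(n)
	out := make([]float32, len(w))
	if n == 0 {
		return out
	}
	for i, v := range w {
		out[i] = float32(float64(v) / n)
	}
	return out
}

func main() {
	// A rescaled low-bit copy and its FP32 master normalize identically,
	// so their signatures match despite different bit depths.
	fmt.Println(unitNormalize([]float32{2, -2}))
	fmt.Println(unitNormalize([]float32{0.5, -0.5}))
}
```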
CompareNetworks identifies Logic Shifts — when a layer signature in Model A aligns
with high cosine similarity (>0.8) to a layer in Model B, but at a different spatial coordinate.
This allows researchers to observe how architectural search algorithms or systolic propagation patterns
naturally migrate logic pathways to more efficient regions of the 3D grid over time.
Native WebGPU Acceleration & Hardware-Aware Tiling
Loom achieves 70+ tokens/second on consumer hardware through low-level optimization. The
hardware.go module executes deep OS-level system calls (sysctl on Darwin,
/sys/devices/system/cpu/cpu0/cache/ on Linux) to determine exact L1/L2 cache byte sizes.
Dynamic L1/L2 Cache Tiling: CalculateOptimalTileSize restricts matrix multiplication
blocks so that the entire sub-block remains resident in L1 cache — significantly reducing global memory fetch
latency. This delivers major speedups for operations like swigluTiledProjectGateUp.
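A plausible form of the tile-size calculation: pick the largest square tile such that three float32 tiles (an A-block, a B-block, and a C-accumulator) fit in L1 simultaneously. The three-tile working-set model is an assumption about the heuristic, not Loom's exact formula.

```go
package main

import (
	"fmt"
	"math"
)

// optimalTileSize returns the side length (in elements) of the largest
// square float32 tile such that three such tiles stay resident in L1.
func optimalTileSize(l1Bytes int) int {
	const bytesPerElem = 4 // float32
	t := int(math.Sqrt(float64(l1Bytes) / (3 * bytesPerElem)))
	if t < 1 {
		t = 1
	}
	return t
}

func main() {
	// A typical 32 KiB L1 data cache yields a 52x52 tile:
	fmt.Println(optimalTileSize(32 * 1024)) // prints 52
}
```

In practice the result is often rounded down to a multiple of the SIMD vector width; that refinement is omitted here for clarity.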
WGSL Shader Workgroup Optimization: For WebGPU execution, Loom queries
MaxComputeWorkgroupStorageSize and MaxComputeInvocationsPerWorkgroup directly from the
WebGPU adapter. MHA shaders allocate shared arrays for Keys and Values, using workgroupBarrier()
synchronization, sized to consume exactly half of available workgroup storage — achieving optimal execution
across Apple Silicon, NVIDIA CUDA, and integrated mobile GPUs.
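The "half of workgroup storage" sizing reduces to simple arithmetic: how many K/V token rows fit in the budget. The function name and the 16 KiB example limit are illustrative (16384 bytes is the WebGPU spec's default maxComputeWorkgroupStorageSize); real values come from the adapter at runtime.

```go
package main

import "fmt"

// maxTileTokens returns how many tokens' worth of shared K and V rows
// fit in half of the workgroup storage, assuming float32 elements.
func maxTileTokens(workgroupStorageBytes, headDim int) int {
	budget := workgroupStorageBytes / 2 // consume exactly half, per the text
	perToken := 2 * headDim * 4         // one K row + one V row, float32
	return budget / perToken
}

func main() {
	// 16 KiB of workgroup storage with headDim 64:
	fmt.Println(maxTileTokens(16*1024, 64)) // prints 16
}
```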
Sub-System Autonomy: Tokenization, Ensembling & Telemetry
The native BPE tokenizer parses standard tokenizer.json schemas and includes a byte-fallback mechanism (gpt2ByteEncode/Decode) for unknown Unicode characters, enabling completely standalone, offline string-to-tensor processing.
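The byte-fallback idea follows the well-known GPT-2 scheme: every possible byte is assigned a printable rune, so arbitrary input survives a text-based vocabulary. The sketch below reproduces that public mapping; Loom's gpt2ByteEncode may differ in detail.

```go
package main

import "fmt"

// byteToRune builds the GPT-2-style byte-to-unicode table: printable bytes
// keep their own code point, all others are remapped to runes above U+0100.
func byteToRune() [256]rune {
	var m [256]rune
	n := 0
	for b := 0; b < 256; b++ {
		if (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF) {
			m[b] = rune(b) // printable byte keeps its own code point
		} else {
			m[b] = rune(256 + n) // non-printable byte gets a stand-in rune
			n++
		}
	}
	return m
}

func main() {
	m := byteToRune()
	// Space is remapped (to the familiar 'Ġ'), while 'A' maps to itself.
	fmt.Printf("%c %c\n", m[' '], m['A'])
}
```

Because the mapping is a bijection over all 256 byte values, decoding is just the inverse table, and no input can ever fall outside the vocabulary.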
FindComplementaryMatches assesses binary correctness masks of multiple models, calculating
combined coverage ratio and cosine similarity of success rates — enabling optimized "Mixture of Models"
pipelines that complement each other's weaknesses.
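The coverage side of this check is straightforward to sketch: given per-example correctness masks for two models, the combined coverage is the fraction of examples at least one model gets right. The function name and return shape are assumptions based on the description.

```go
package main

import "fmt"

// combinedCoverage returns the fraction of examples covered by the union
// of two models' binary correctness masks (equal-length assumed).
func combinedCoverage(a, b []bool) float64 {
	covered := 0
	for i := range a {
		if a[i] || b[i] {
			covered++
		}
	}
	return float64(covered) / float64(len(a))
}

func main() {
	modelA := []bool{true, true, false, false} // strong on the first half
	modelB := []bool{false, false, true, true} // strong on the second half
	fmt.Println(combinedCoverage(modelA, modelB)) // prints 1
}
```

Two models with identical masks add nothing to coverage, which is why the described cosine-similarity check on success rates is useful: low similarity signals genuinely complementary failure modes.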
KMeansForwardPolymorphic transforms standard K-Means into an end-to-end differentiable operation
using temperature-scaled distance metrics and Softmax gating, allowing classification topologies anywhere
in the volumetric grid.
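A soft K-Means assignment of the kind described can be sketched as a Softmax over negative squared distances, sharpened by a temperature. The exact form Loom uses is assumed; lower temperature pushes the soft assignment toward hard (argmin) clustering.

```go
package main

import (
	"fmt"
	"math"
)

// softAssign returns a probability over centroids for one input point:
// Softmax(-||x - c_k||^2 / temp), computed with the usual max-shift for
// numerical stability.
func softAssign(x []float32, centroids [][]float32, temp float64) []float64 {
	logits := make([]float64, len(centroids))
	for k, c := range centroids {
		var d float64
		for i := range x {
			diff := float64(x[i] - c[i])
			d += diff * diff
		}
		logits[k] = -d / temp // closer centroid -> larger logit
	}
	maxL := logits[0]
	for _, l := range logits[1:] {
		if l > maxL {
			maxL = l
		}
	}
	var sum float64
	out := make([]float64, len(logits))
	for k, l := range logits {
		out[k] = math.Exp(l - maxL)
		sum += out[k]
	}
	for k := range out {
		out[k] /= sum
	}
	return out
}

func main() {
	cents := [][]float32{{0, 0}, {1, 1}}
	// A point near the first centroid gets most of the probability mass:
	fmt.Println(softAssign([]float32{0.1, 0.1}, cents, 0.5))
}
```

Because every step is smooth in x and in the centroids, gradients (or targets) can flow through the assignment, which is what makes the clustering layer trainable in place.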
PolyObserver interface enables real-time tensor interception during forward/backward passes.
AdaptationTracker monitors degradation and recovery via moving windows of outputs, accuracy,
and throughput (OutputsPerSec).
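A minimal moving-window tracker in the spirit of the description: the window size and the throughput-style metric are illustrative, not AdaptationTracker's real interface.

```go
package main

import "fmt"

// Window keeps the most recent size samples of a metric.
type Window struct {
	size int
	vals []float64
}

// Add appends a sample, dropping the oldest once the window is full.
func (w *Window) Add(v float64) {
	w.vals = append(w.vals, v)
	if len(w.vals) > w.size {
		w.vals = w.vals[1:]
	}
}

// Mean over the current window; a drop against a long-run baseline would
// signal degradation, and a rebound would signal recovery.
func (w *Window) Mean() float64 {
	if len(w.vals) == 0 {
		return 0
	}
	var s float64
	for _, v := range w.vals {
		s += v
	}
	return s / float64(len(w.vals))
}

func main() {
	w := &Window{size: 3}
	for _, v := range []float64{70, 72, 30, 28} { // throughput dips mid-run
		w.Add(v)
	}
	fmt.Println(w.Mean()) // mean of the last 3 samples reflects the dip
}
```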
Comparative Analysis: Loom vs Python Ecosystem (2026)
The global deep learning industry has historically been dominated by Python-based frameworks. Comparing Loom to these heavyweights highlights distinct philosophical and technical divergences.
| Feature | Loom (M-POLY-VTD) | PyTorch (+ TorchAO) | JAX (+ Flax/Optax) |
|---|---|---|---|
| Execution Paradigm | 3D Volumetric Mesh / Spatial Routing | 1D Sequential / Dynamic DAG | Functional / Compiled Static Graph (XLA) |
| Language | Pure Go (Compiled Native Binary) | Python (C++ / CUDA backend) | Python (C++ / XLA backend) |
| Quantization | 21 types native (FP64 down to Binary 1-bit) | Native FP8, INT4, INT2, 1-bit via TorchAO | Native FP8, INT8; sub-byte via custom libs |
| QAT (Hardware Emulation) | Built-in polymorphic SimulatePrecision | FakeQuantize modules (complex node injection) | Custom JAX primitives |
| Optimization Engine | Polymorphic BPTT + Native Target Propagation | Native Autograd (reverse-mode AD) | Functional forward & reverse AD |
| Target Propagation | First-class native | Requires extensive custom class overrides | High research support via custom logic flows |
| GPU Acceleration | WebGPU (cross-platform, edge & browser) | CUDA, ROCm, Metal (vendor-specific) | TPU, CUDA, ROCm (heavy compiler reliance) |
| Structural Analysis | Topological DNA Engine + Logic Shifts | Standard dict/parameter hashing | Standard dict/parameter hashing |
| Deployment Footprint | Single binary, zero dependencies | Large runtime (PyTorch + CUDA variables) | Large runtime (JAX + XLA toolchains) |
Comparative Analysis: Loom vs Go ML Ecosystem (2026)
| Feature | Loom | Born ML | GoMLX | Gorgonia (Legacy) |
|---|---|---|---|---|
| Core Architecture | 3D Spatial Grid (Volumetric routing) | 1D Sequential module stacks | 1D Sequential computation graphs | Static graph (Theano/TF1 style) |
| Compute Backend | Pure Go + WebGPU (Zero CGO) | Pure Go + WebGPU (Zero CGO) | OpenXLA (Heavy C++ bindings) | CGO / CUDA (C++ bindings) |
| Modern LLM Topology | MHA, SwiGLU, RMSNorm, RoPE | MHA, GQA, SwiGLU, KV-Cache, RMSNorm | Gemma support / ONNX translation | None (basic perceptrons/CNNs only) |
| Quantization Spectrum | 21 types (FP64 down to Binary 1-bit) | Standard (FP32/FP16) | Standard (dictated by XLA compiler) | FP32/FP64 only |
| Optimization Engine | Backprop (BPTT) + Native Target Propagation | Automatic Differentiation (Autograd) | Automatic Differentiation via XLA | Symbolic & Automatic Differentiation |
| Non-Standard Layers | Native Differentiable K-Means Clustering | Requires external implementation | Requires external implementation | Requires external implementation |
| System Telemetry | Advanced window-based Adaptation Tracking | Standard terminal logging | Standard terminal logging | Standard terminal logging |
Strategic Outlook
The Loom M-POLY-VTD architecture represents a radical divergence from established norms of deep learning engineering in 2026. By replacing the 1D computational graph with a cycle-accurate 3D Volumetric Grid, the framework physically maps neural structures in a manner that accommodates advanced biological routing — spatial hopping, systolic parallelism, and polymorphic precision.
Its exhaustive 21-type polymorphism and simulated precision mechanisms directly confront the hardware memory bandwidth crisis, enabling dynamic on-the-fly quantization to 1-bit precision without structural memory reallocation. Neural Target Propagation provides a mathematically viable path for continuous, asynchronous training on power-constrained edge hardware.
Complemented by the DNA Engine's topological signature matching, native BPE tokenization, and pure-Go WebGPU acceleration, Loom provides a self-contained, enterprise-grade ecosystem — vastly surpassing legacy Go frameworks, matching Born ML's deployment efficiency, and introducing architectural innovations previously reserved for experimental Python and JAX research environments.