The Dispatcher Pattern and 3D Coordinate System

This document explains how DispatchLayer and DispatchLayerBackward work as runtime jump tables, how the 3D coordinate system maps to VolumetricLayer positions, and how IsRemoteLink enables spatial hopping across the grid.

Why a Dispatcher?

A naive implementation of a polymorphic neural network would embed a large switch inside the forward loop:

// Naive — thread-divergence on GPU, hard to fuse
for _, layer := range layers {
    switch layer.Type {
    case LayerDense:   output = denseForward(layer, input)
    case LayerCNN2:    output = cnn2Forward(layer, input)
    // ...
    }
}

M-POLY-VTD separates concerns: the traversal loop iterates coordinates, and the dispatcher makes the type-specific call. This decoupling is what makes GPU kernel fusion possible in the future — the driver can inspect a group of same-type layers and launch a single batched shader rather than 19 separate ones.

DispatchLayer

func DispatchLayer[T Numeric](
    layer *VolumetricLayer,
    input, skip *Tensor[T],
) (preAct, postAct *Tensor[T])

This is a generic function. The type parameter T is inferred from input. Every call returns two tensors:

preAct — the layer's internal state before the final activation. For Parallel/Sequential layers this carries the nested activation tree in preAct.Nested.
postAct — the result of applying the activation function to preAct. This is what flows to the next layer.

The full routing table:

layer.Type ──switch──▶ function called
───────────────────────────────────────────────────────────────
LayerResidual          ResidualForwardPolymorphic(layer, input, skip)
LayerDense             DenseForwardPolymorphic(layer, input)
LayerCNN1              CNN1ForwardPolymorphic(layer, input)
LayerCNN2              CNN2ForwardPolymorphic(layer, input)
LayerCNN3              CNN3ForwardPolymorphic(layer, input)
LayerRNN               RNNForwardPolymorphic(layer, input)
LayerLSTM              LSTMForwardPolymorphic(layer, input)
LayerMultiHeadAttention MHAForwardPolymorphic(layer, input)
LayerSwiGLU            SwiGLUForwardPolymorphic(layer, input)
LayerRMSNorm           RMSNormForwardPolymorphic(layer, input)
LayerLayerNorm         LayerNormForwardPolymorphic(layer, input)
LayerConvTransposed1D  ConvTransposed1DForwardPolymorphic(layer, input)
LayerConvTransposed2D  ConvTransposed2DForwardPolymorphic(layer, input)
LayerConvTransposed3D  ConvTransposed3DForwardPolymorphic(layer, input)
LayerEmbedding         EmbeddingForwardPolymorphic(layer, input)
LayerKMeans            KMeansForwardPolymorphic(layer, input)
LayerSoftmax           SoftmaxForwardPolymorphic(layer, input)
LayerParallel          ParallelForwardPolymorphic(layer, input)
LayerSequential        SequentialForwardPolymorphic(layer, input)
default                DenseForwardPolymorphic(layer, input)
───────────────────────────────────────────────────────────────
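The routing table boils down to a single switch on layer.Type. The following is a minimal sketch of that shape, not the library's implementation: Numeric, Tensor, VolumetricLayer, and the two toy forward functions are simplified stand-ins assumed here for illustration, and only two of the table's rows are reproduced.

```go
package main

import "fmt"

// Numeric is an assumed constraint; the library's own definition may differ.
type Numeric interface {
	~float32 | ~float64
}

// Tensor is a simplified stand-in for the library's Tensor[T].
type Tensor[T Numeric] struct {
	Data []T
}

type LayerType int

const (
	LayerDense LayerType = iota
	LayerResidual
)

// VolumetricLayer is reduced to the one field the dispatcher inspects.
type VolumetricLayer struct {
	Type LayerType
}

// denseForward is a toy stand-in: it copies the input and returns the copy
// as both preAct and postAct.
func denseForward[T Numeric](_ *VolumetricLayer, in *Tensor[T]) (pre, post *Tensor[T]) {
	out := &Tensor[T]{Data: append([]T(nil), in.Data...)}
	return out, out
}

// residualForward is a toy stand-in that adds skip to the input element-wise.
func residualForward[T Numeric](_ *VolumetricLayer, in, skip *Tensor[T]) (pre, post *Tensor[T]) {
	out := &Tensor[T]{Data: append([]T(nil), in.Data...)}
	if skip != nil {
		for i := range out.Data {
			if i < len(skip.Data) {
				out.Data[i] += skip.Data[i]
			}
		}
	}
	return out, out
}

// dispatch routes on layer.Type. Unknown types fall back to the dense path,
// matching the default row of the table.
func dispatch[T Numeric](layer *VolumetricLayer, in, skip *Tensor[T]) (pre, post *Tensor[T]) {
	switch layer.Type {
	case LayerResidual:
		return residualForward(layer, in, skip)
	case LayerDense:
		return denseForward(layer, in)
	default:
		return denseForward(layer, in)
	}
}

func main() {
	in := &Tensor[float32]{Data: []float32{1, 2}}
	skip := &Tensor[float32]{Data: []float32{10, 20}}
	_, post := dispatch(&VolumetricLayer{Type: LayerResidual}, in, skip)
	fmt.Println(post.Data) // prints [11 22]
}
```

Only the residual case forwards skip, so callers can pass nil for every other layer type, as the traversal loop does.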

DispatchLayerBackward

func DispatchLayerBackward[T Numeric](
    layer *VolumetricLayer,
    gradOutput, input, skip, preAct *Tensor[T],
) (gradInput, gradWeights *Tensor[T])

DispatchLayerBackward mirrors DispatchLayer. It returns:

gradInput — the gradient to pass to the layer that produced input (propagates error upstream)
gradWeights — the gradient for this layer's own weights (used to update WeightStore.Master)

The routing table mirrors the forward table: each layer type is routed to its backward counterpart. The skip argument is used only by ResidualBackwardPolymorphic.

The 3D Grid Traversal

ForwardPolymorphic[T] iterates the grid in reading order:

for z := 0; z < n.Depth; z++ {
    for y := 0; y < n.Rows; y++ {
        for x := 0; x < n.Cols; x++ {
            for l := 0; l < n.LayersPerCell; l++ {
                idx := n.GetIndex(z, y, x, l)
                layer := &n.Layers[idx]
                // ...
                _, post := DispatchLayer(layer, currentTensor, nil)
                currentTensor = post
            }
        }
    }
}

The flattened index formula:

idx = z * (Rows * Cols * LayersPerCell)
    + y * (Cols * LayersPerCell)
    + x * (LayersPerCell)
    + l

Visually, for a (Depth=1, Rows=2, Cols=3, LayersPerCell=1) network:

z=0:
  ┌─────────────┬─────────────┬─────────────┐
  │ (0, 0, 0,0) │ (0, 0, 1,0) │ (0, 0, 2,0) │  ← idx 0,1,2
  │   idx=0     │   idx=1     │   idx=2     │
  ├─────────────┼─────────────┼─────────────┤
  │ (0, 1, 0,0) │ (0, 1, 1,0) │ (0, 1, 2,0) │  ← idx 3,4,5
  │   idx=3     │   idx=4     │   idx=5     │
  └─────────────┴─────────────┴─────────────┘

Data flows: idx=0 ──▶ idx=1 ──▶ idx=2 ──▶ idx=3 ──▶ idx=4 ──▶ idx=5
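The index arithmetic can be checked with a small sketch. Grid is a simplified stand-in for VolumetricNetwork that holds only the four dimensions. Index follows the formula above, while Coords is a hypothetical inverse added here for illustration and is not part of the documented API.

```go
package main

import "fmt"

// Grid holds only the dimensions needed for index arithmetic.
type Grid struct {
	Depth, Rows, Cols, LayersPerCell int
}

// Index flattens (z, y, x, l) using the formula given above.
func (g Grid) Index(z, y, x, l int) int {
	return z*(g.Rows*g.Cols*g.LayersPerCell) +
		y*(g.Cols*g.LayersPerCell) +
		x*g.LayersPerCell +
		l
}

// Coords inverts Index by peeling off the fastest-varying axis first.
func (g Grid) Coords(idx int) (z, y, x, l int) {
	l = idx % g.LayersPerCell
	idx /= g.LayersPerCell
	x = idx % g.Cols
	idx /= g.Cols
	y = idx % g.Rows
	z = idx / g.Rows
	return
}

func main() {
	g := Grid{Depth: 1, Rows: 2, Cols: 3, LayersPerCell: 1}
	fmt.Println(g.Index(0, 1, 1, 0)) // prints 4, matching the diagram
	fmt.Println(g.Coords(5))         // prints 0 1 2 0
}
```

Because l varies fastest and z slowest, iterating idx from 0 upward reproduces the reading order of the nested loops.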

BackwardPolymorphic walks in reverse (z, y, x, l all reversed), using cached inputs[idx] and preActs[idx] from the forward pass.
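To illustrate why the backward walk needs the cached inputs, here is a deliberately tiny sketch. It assumes a flat chain of scalar layers computing y = w*x, with no activation, in place of the real tensors and grid. The function names and the scalar model are assumptions made for this example only.

```go
package main

import "fmt"

// forward runs a chain of scalar layers y = w*x and caches each layer's input,
// playing the role of inputs[idx] in the text.
func forward(weights []float64, x float64) (out float64, inputs []float64) {
	inputs = make([]float64, len(weights))
	for i, w := range weights {
		inputs[i] = x
		x = w * x
	}
	return x, inputs
}

// backward walks the chain in reverse. For y = w*x, dL/dw = grad*x and
// dL/dx = grad*w, so each step needs the input cached by forward.
func backward(weights, inputs []float64, gradOut float64) (gradIn float64, gradW []float64) {
	gradW = make([]float64, len(weights))
	g := gradOut
	for i := len(weights) - 1; i >= 0; i-- {
		gradW[i] = g * inputs[i]
		g *= weights[i]
	}
	return g, gradW
}

func main() {
	weights := []float64{2, 3}
	out, inputs := forward(weights, 5)
	gradIn, gradW := backward(weights, inputs, 1)
	fmt.Println(out, gradIn, gradW) // prints 30 6 [15 10]
}
```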

Tiled Traversal

When n.UseTiling = true, ForwardPolymorphic uses a blocked spatial traversal with tile size 4:

for zTile := 0; zTile < Depth; zTile += 4 {
  for yTile := 0; yTile < Rows; yTile += 4 {
    for xTile := 0; xTile < Cols; xTile += 4 {
      // Process 4×4×4 tile of cells
    }
  }
}

This is the CPU-side analogue of the GPU workgroup tile strategy. The intent is to improve data locality: all layers in a 4×4×4 spatial neighborhood execute together, keeping their weight data warm in L2/L3 cache.
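A blocked traversal of this kind can be sketched as follows. This is an illustration under stated assumptions, not the library's loop: visitTiled and its callback are hypothetical names, and clamping tile edges to the grid bounds is one way to handle dimensions that are not multiples of 4.

```go
package main

import "fmt"

// tileSize matches the tile size of 4 described above.
const tileSize = 4

// minInt avoids relying on the min builtin added in Go 1.21.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// visitTiled calls visit once per cell, tile by tile. Tile edges are clamped
// to the grid bounds so that partial tiles at the border are still covered.
func visitTiled(depth, rows, cols int, visit func(z, y, x int)) {
	for zt := 0; zt < depth; zt += tileSize {
		for yt := 0; yt < rows; yt += tileSize {
			for xt := 0; xt < cols; xt += tileSize {
				zEnd := minInt(zt+tileSize, depth)
				yEnd := minInt(yt+tileSize, rows)
				xEnd := minInt(xt+tileSize, cols)
				for z := zt; z < zEnd; z++ {
					for y := yt; y < yEnd; y++ {
						for x := xt; x < xEnd; x++ {
							visit(z, y, x)
						}
					}
				}
			}
		}
	}
}

func main() {
	count := 0
	visitTiled(1, 2, 6, func(z, y, x int) { count++ })
	fmt.Println(count) // prints 12: every cell is visited exactly once
}
```

Note that once a dimension exceeds the tile size, the visiting order in this sketch differs from plain reading order, since a whole tile is finished before the next one starts.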

SC (single-workgroup) vs MC (multi-workgroup) tiling

There are two different “tiling” knobs in poly:

VolumetricNetwork.UseTiling (see Tiled Traversal above) — spatial blocking of the 3D grid in ForwardPolymorphic (4×4×4 cells). Unrelated to transformer matmul tiles.
Per-layer matmul / GPU workgroup tiling — RefreshRuntimeTileSizes() fills per-dtype maps from layer geometry and (for GPU) WGPUContext limits.

GPU: two tile maps, configurable SC vs MC

On GPU, each layer gets two per-dtype maps, GPUSCTileSizes and GPUMCTileSizes (see refreshRuntimeGPUTileSizes in tile_detection.go). At dispatch, VolumetricNetwork.EnableMultiCoreTiling chooses between them: when true, GetGPUMCTileSize(dtype) is used (larger, higher-throughput tiles where limits allow); when false, GetGPUSCTileSize(dtype) is used (smaller workgroups, friendlier to tight limits). MC vs SC on GPU is therefore a real switch, and you are not stuck in one profile. Set EnableMultiCoreTiling directly, or use TrainingModeGPUSC / TrainingModeGPUMC in trainBatchWGPU, which pick tile sizes the same way.

Transformer-style forwards in wgpu_forward.go read network.EnableMultiCoreTiling (not per-layer) for that choice. WGPUContext.GPUTileSize is the device-tuned baseline that feeds how those SC/MC maps are built, not the only number used at dispatch.
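The selection logic described above can be pictured with a short sketch. The layerTiles struct, the DType constant, and the gpuTileSize helper are hypothetical stand-ins, and the tile values are illustrative numbers only, not sizes the library actually computes.

```go
package main

import "fmt"

// DType is a simplified stand-in for the library's dtype enum.
type DType int

const DTypeFloat32 DType = iota

// layerTiles keeps only the two GPU tile maps named in the text.
type layerTiles struct {
	GPUSCTileSizes map[DType]int
	GPUMCTileSizes map[DType]int
}

// gpuTileSize reads from the MC map when multiCore is set, else from the SC map.
func gpuTileSize(l *layerTiles, dtype DType, multiCore bool) int {
	if multiCore {
		return l.GPUMCTileSizes[dtype]
	}
	return l.GPUSCTileSizes[dtype]
}

func main() {
	l := &layerTiles{
		GPUSCTileSizes: map[DType]int{DTypeFloat32: 8},  // illustrative value
		GPUMCTileSizes: map[DType]int{DTypeFloat32: 16}, // illustrative value
	}
	fmt.Println(gpuTileSize(l, DTypeFloat32, false), gpuTileSize(l, DTypeFloat32, true))
}
```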

CPU: one tile map (not an SC/MC pair on the layer)

On CPU, each layer has a single per-dtype map, CPUTileSizes, via GetCPUTileSize — there is no CPUSCTileSizes / CPUMCTileSizes pair. Tiled matmul-style loops (Dense, SwiGLU, CNN, etc.) all use that one size.

TrainingModeCPUSC and TrainingModeCPUMC exist in the enum (and show up in benchmarks), but ConfigureNetworkForMode applies the same wiring to all CPU modes (UseTiling, EnableMultiCoreTiling, RefreshRuntimeTileSizes), and executeBatchCPU does not receive the mode. As a result, the layer maps currently have no separate “CPU SC tile path” vs “CPU MC tile path”. EnableMultiCoreTiling is set on CPU for consistency with GPU-bound nets and training tooling; it does not switch between two CPU tile sizes, because only one map exists.

WGPUContext.GPUTileSize is the auto-detected base hint (from limits); concrete SC/MC sizes per layer type on GPU live in the two GPU maps, not in that single int alone.

VolumetricLayer: The Coordinate Record

Every VolumetricLayer contains its own position:

type VolumetricLayer struct {
    Network     *VolumetricNetwork  // back-pointer
    Type        LayerType
    Activation  ActivationType
    DType       DType
    WeightStore *WeightStore

    Z int  // Depth coordinate
    Y int  // Row coordinate
    X int  // Col coordinate
    L int  // Layer index within cell

    // Spatial Routing
    IsRemoteLink bool
    TargetZ, TargetY, TargetX, TargetL int

    // ... configuration fields
}

The (Z, Y, X, L) fields are set during NewVolumetricNetwork and are the canonical address. GetLayer(z, y, x, l) returns a pointer into the flat Layers slice using GetIndex.
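A minimal sketch of how the coordinate record and the flat slice fit together is shown below. Layer and Network are pared-down stand-ins for VolumetricLayer and VolumetricNetwork, NewNetwork is a hypothetical constructor, and bounds checking is omitted for brevity.

```go
package main

import "fmt"

// Layer keeps only the coordinate fields of the record shown above.
type Layer struct {
	Z, Y, X, L int
}

// Network stores its layers in one flat slice.
type Network struct {
	Depth, Rows, Cols, LayersPerCell int
	Layers                           []Layer
}

// GetIndex flattens a 4-tuple address into a slice index.
func (n *Network) GetIndex(z, y, x, l int) int {
	return z*(n.Rows*n.Cols*n.LayersPerCell) +
		y*(n.Cols*n.LayersPerCell) +
		x*n.LayersPerCell +
		l
}

// GetLayer returns a pointer into the flat slice, not a copy.
func (n *Network) GetLayer(z, y, x, l int) *Layer {
	return &n.Layers[n.GetIndex(z, y, x, l)]
}

// NewNetwork allocates the slice and stamps each layer with its own address.
func NewNetwork(depth, rows, cols, lpc int) *Network {
	n := &Network{Depth: depth, Rows: rows, Cols: cols, LayersPerCell: lpc}
	n.Layers = make([]Layer, depth*rows*cols*lpc)
	for z := 0; z < depth; z++ {
		for y := 0; y < rows; y++ {
			for x := 0; x < cols; x++ {
				for l := 0; l < lpc; l++ {
					n.Layers[n.GetIndex(z, y, x, l)] = Layer{Z: z, Y: y, X: x, L: l}
				}
			}
		}
	}
	return n
}

func main() {
	n := NewNetwork(1, 2, 3, 1)
	fmt.Println(*n.GetLayer(0, 1, 2, 0)) // prints {0 1 2 0}
}
```

Returning a pointer into the slice means that changes made through GetLayer are visible to the traversal loop, which indexes the same slice.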

IsRemoteLink: Spatial Hopping

A layer with IsRemoteLink = true does not receive its input from the previous layer in reading order. Instead, it reads from the output of whatever layer lives at (TargetZ, TargetY, TargetX, TargetL).

This enables:

Skip connections — hop over several layers in the grid
Feedback loops — target a layer at an earlier coordinate (biological recurrence)
Parallel expert routing — multiple layers at different positions all reading the same source
Cross-depth signals — connect depth=0 outputs to depth=2 inputs

Standard flow:             Remote link (skip):

 (0,0,0) → (0,0,1)          (0,0,0) ────────────────────┐
              │               (0,0,1) → (0,0,2) → ...    │
           (0,0,2)                                        │
              │               (0,2,0) ←── IsRemoteLink ──┘
           (0,0,3)             └── reads output of (0,0,0)

Feedback loop:

 (0,0,0)
    │
 (0,0,1)
    │
 (0,0,2) ─── IsRemoteLink ──▶ TargetZ=0, TargetY=0, TargetX=0
                                (reads from cycle N-1's output
                                 of layer (0,0,0) — step mesh only)

In ForwardPolymorphic, a remote-linked layer simply receives currentTensor like any other layer. Remote-link semantics are fully honored only by StepForward, which maintains per-layer output buffers across time steps.

In ParallelForwardPolymorphic and SequentialForwardPolymorphic, remote links are resolved by calling layer.Network.GetLayer(branch.TargetZ, ...) and dispatching the resolved layer pointer.
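The feedback-loop behaviour can be pictured with a toy step mesh. This sketch is an assumption-laden simplification and does not reproduce StepForward: layers are addressed by a flat index rather than (Z, Y, X, L), each layer just adds 1 to its input, and the step function and Target field are hypothetical names.

```go
package main

import "fmt"

// Layer keeps only the routing fields. Target is the flat index of the source
// layer and stands in for (TargetZ, TargetY, TargetX, TargetL).
type Layer struct {
	IsRemoteLink bool
	Target       int
}

// step computes one tick. Every layer reads its source's output from the
// previous tick (prev) and writes its own output into a fresh buffer.
func step(layers []Layer, prev []float64, input float64) []float64 {
	next := make([]float64, len(layers))
	for i, l := range layers {
		var in float64
		switch {
		case l.IsRemoteLink:
			in = prev[l.Target] // hop to the target's buffered output
		case i == 0:
			in = input
		default:
			in = prev[i-1] // standard flow: previous layer in reading order
		}
		next[i] = in + 1 // toy layer
	}
	return next
}

func main() {
	layers := []Layer{{}, {}, {IsRemoteLink: true, Target: 0}}
	buf := make([]float64, len(layers))
	buf = step(layers, buf, 10)
	buf = step(layers, buf, 10)
	fmt.Println(buf) // prints [11 12 12]
}
```

In the second tick the remote-linked layer reads layer 0's output from the first tick (11) instead of layer 1's (1), which is the cycle N-1 behaviour shown in the feedback-loop diagram.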

The GPU Dispatch Path

When n.UseGPU = true, the training loop calls ctx.DispatchForwardLayer(l, batchSize, curBuf, preBuf) instead of DispatchLayer. This function is in wgpu_forward.go and routes to the appropriate WGSL compute shader based on l.Type.

The same dispatcher philosophy applies: one function, one switch, explicit routing. The difference is that inputs and outputs are *wgpu.Buffer handles in VRAM rather than *Tensor[T] in RAM.

trainBatchWGPU:

  BeginFrame()  ← create shared CommandEncoder
     │
     ├── for each layer forward:
     │   └── ctx.DispatchForwardLayer(l, ...) ← records into encoder
     │
     ├── DispatchMSEGradPartialLoss(...)       ← records into encoder
     │
     ├── for each layer backward (reverse):
     │   ├── ctx.DispatchActivationBackward(...)
     │   ├── ctx.DispatchBackwardLayer(l, ...)
     │   └── ctx.DispatchApplyGradients(...)
     │
  FlushFrame()  ← ONE submit for entire forward + backward + weight update
     │
  ReadBuffer(partialsBuf) ← only reads back tiny loss scalars

This single-submission design reduces Go-to-GPU driver overhead from roughly 150 or more round trips per batch to a single submit.
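The record-then-submit-once shape of the diagram can be sketched without any GPU at all. The frame type below is a hypothetical stand-in for the shared command encoder and is not the wgpu API. It only counts recorded commands and submissions.

```go
package main

import "fmt"

// frame is a hypothetical stand-in for a shared command encoder.
type frame struct {
	commands []string
	submits  int
}

// record appends a command without submitting anything.
func (f *frame) record(cmd string) {
	f.commands = append(f.commands, cmd)
}

// flush submits everything recorded so far in one go and reports how many
// commands were in the batch.
func (f *frame) flush() int {
	n := len(f.commands)
	f.commands = f.commands[:0]
	f.submits++
	return n
}

// trainBatch records forward, loss, and backward work, then submits once.
func trainBatch(f *frame, numLayers int) int {
	for i := 0; i < numLayers; i++ {
		f.record(fmt.Sprintf("forward %d", i))
	}
	f.record("loss")
	for i := numLayers - 1; i >= 0; i-- {
		f.record(fmt.Sprintf("activation-backward %d", i))
		f.record(fmt.Sprintf("backward %d", i))
		f.record(fmt.Sprintf("apply-gradients %d", i))
	}
	return f.flush()
}

func main() {
	f := &frame{}
	n := trainBatch(f, 3)
	fmt.Println(n, f.submits) // prints 13 1
}
```

The number of recorded commands grows with the layer count, while the number of submissions stays at one per batch.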

Disabled Layers

Setting layer.IsDisabled = true causes both ForwardPolymorphic and StepForward to skip the layer entirely. In StepForward, a disabled layer passes its input buffer through to NextBuffer unchanged. This is the mechanism for implementing sparse MoE expert activation — gate layers can conditionally disable branches.
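The skip behaviour amounts to a guard at the top of the traversal loop. The sketch below assumes scalar layers that multiply by a Scale value. Both the layer struct and Scale are toy stand-ins for illustration, not the library's fields.

```go
package main

import "fmt"

// layer is a toy stand-in: an enabled layer multiplies its input by Scale.
type layer struct {
	IsDisabled bool
	Scale      float64
}

// forward applies each enabled layer in order. A disabled layer is skipped,
// so the current value passes through it unchanged.
func forward(layers []layer, x float64) float64 {
	for _, l := range layers {
		if l.IsDisabled {
			continue
		}
		x *= l.Scale
	}
	return x
}

func main() {
	layers := []layer{
		{Scale: 2},
		{Scale: 100, IsDisabled: true}, // acts as an identity
		{Scale: 3},
	}
	fmt.Println(forward(layers, 1)) // prints 6
}
```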