
Parallel and Sequential Layers

This document explains LayerParallel and LayerSequential in depth: how they fan out and chain sub-layers, the five combination modes, the recursive activation tree, and how backpropagation flows through nested structures.


LayerParallel

ParallelForwardPolymorphic fans the input out to every branch simultaneously, then combines the branch results.

Configuration

layer.Type        = poly.LayerParallel
layer.CombineMode = "concat"   // or "add", "avg", "filter", "grid_scatter"
layer.ParallelBranches = []poly.VolumetricLayer{
    {Type: poly.LayerDense, InputHeight: 64, OutputHeight: 32, ...},
    {Type: poly.LayerRNN,   InputHeight: 64, OutputHeight: 32, ...},
    {Type: poly.LayerCNN1,  InputHeight: 64, ...},
}

Each entry in ParallelBranches is a full VolumetricLayer — it can itself be a LayerParallel or LayerSequential, enabling unlimited nesting.

Combination Modes

"add"

Element-wise sum of all branch outputs. All branches must produce the same output shape.

Input ──▶ Branch 0 ──▶ [32]
Input ──▶ Branch 1 ──▶ [32]   →   [32] (sum of all)
Input ──▶ Branch 2 ──▶ [32]

Use for: residual-style ensembles, multi-path feature accumulation.

"avg"

Element-wise average of all branch outputs. Same shape requirement as "add".

Output[i] = (Branch0[i] + Branch1[i] + ... + BranchN[i]) / N

Use for: soft ensemble averaging where no single branch should dominate.
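
The "add" and "avg" rules can be sketched as a single helper. This is an illustrative stand-in (the `combine` function is hypothetical), not the library's internal implementation — ParallelForwardPolymorphic performs the equivalent work on Tensor values.

```go
package main

import "fmt"

// combine sums branch outputs element-wise; when average is true it
// divides by the branch count, matching the "avg" formula above.
// All branches must share the same output length.
func combine(branches [][]float32, average bool) []float32 {
	out := make([]float32, len(branches[0]))
	for _, b := range branches {
		for i, v := range b {
			out[i] += v
		}
	}
	if average {
		n := float32(len(branches))
		for i := range out {
			out[i] /= n
		}
	}
	return out
}

func main() {
	branches := [][]float32{{1, 2}, {3, 4}, {5, 6}}
	fmt.Println(combine(branches, false)) // "add": [9 12]
	fmt.Println(combine(branches, true))  // "avg": [3 4]
}
```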

"concat" / "grid_scatter"

Concatenates all branch outputs into one flat tensor. Branch output sizes can differ.

Input ──▶ Branch 0 ──▶ [32]
Input ──▶ Branch 1 ──▶ [16]   →   [32, 16, 64] = [112]
Input ──▶ Branch 2 ──▶ [64]

"grid_scatter" behaves identically to "concat" in the current implementation — they share the same code path. The name signals intent: scatter the input across a grid of experts, then collect all outputs.

Use for: multi-scale feature extraction, heterogeneous expert outputs before a routing layer.
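
The flattening step amounts to appending each branch's output in order. A minimal sketch (the `concatBranches` helper is hypothetical) using the branch sizes from the diagram above:

```go
package main

import "fmt"

// concatBranches flattens branch outputs of differing sizes into one
// flat tensor, mirroring the "concat" / "grid_scatter" mode.
func concatBranches(branches [][]float32) []float32 {
	var out []float32
	for _, b := range branches {
		out = append(out, b...)
	}
	return out
}

func main() {
	b0 := make([]float32, 32)
	b1 := make([]float32, 16)
	b2 := make([]float32, 64)
	out := concatBranches([][]float32{b0, b1, b2})
	fmt.Println(len(out)) // 112
}
```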

"filter" (Soft Mixture of Experts)

Uses a separate gate sub-layer to produce per-branch weights, then computes a weighted sum:

layer.FilterGateConfig = &poly.VolumetricLayer{
    Type:         poly.LayerDense,
    InputHeight:  64,
    OutputHeight: 3,   // one scalar per branch
    Activation:   poly.ActivationLinear,
}

At forward time:

Input ──▶ FilterGateConfig ──▶ [numBranches]
                │
         Softmax(gate_logits)
                │
          [w0, w1, w2]  ← learned routing weights

Input ──▶ Branch 0 ──▶ [32] × w0
Input ──▶ Branch 1 ──▶ [32] × w1  →  [32] (weighted sum)
Input ──▶ Branch 2 ──▶ [32] × w2

Use for: differentiable Mixture of Experts (MoE), learned feature gating, adaptive multi-scale fusion.


The Activation Tree (Tensor.Nested)

The key to making arbitrary nesting differentiable is the Nested []*Tensor[T] field on Tensor.

During ParallelForwardPolymorphic, each branch produces its own (bPre, bOut) pair. The branch preAct tensors are collected into a slice and stored as Nested on the returned preAct:

preAct = &Tensor[T]{
    Data:   input.Data,     // proxy — carries input shape
    Shape:  input.Shape,
    DType:  input.DType,
    Nested: branchPreActs,  // [branch0.preAct, branch1.preAct, ...]
}

During ParallelBackwardPolymorphic, the backward function reads preAct.Nested[i] to get the correct cached state for each branch:

var bPre *Tensor[T]
if preAct != nil && i < len(preAct.Nested) {
    bPre = preAct.Nested[i]
}
gIn, gW := DispatchLayerBackward(target, scaledGrad, input, nil, bPre)

This creates a recursive tree of activation caches that mirrors the nesting depth of the network:

preAct.Nested:
├── Branch 0 preAct
│     └── (if branch 0 is also Parallel)
│           └── .Nested
│                 ├── Sub-branch 0 preAct
│                 └── Sub-branch 1 preAct
├── Branch 1 preAct
└── Branch 2 preAct

The backward pass recursively walks this tree, ensuring each sub-layer gets the exact cached pre-activation it needs to compute its gradient.
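
A tiny sketch of that recursive walk, using a stripped-down stand-in for the library's Tensor type (only the fields this sketch needs), counting leaf pre-activations in the tree drawn above:

```go
package main

import "fmt"

// Tensor is a stripped-down stand-in for the library's Tensor[T].
type Tensor struct {
	Data   []float32
	Nested []*Tensor
}

// countLeaves walks the activation tree the same way the backward
// pass does: recurse into Nested until a leaf preAct is reached.
func countLeaves(t *Tensor) int {
	if len(t.Nested) == 0 {
		return 1
	}
	n := 0
	for _, child := range t.Nested {
		n += countLeaves(child)
	}
	return n
}

func main() {
	// Branch 0 is itself parallel, with two sub-branches.
	tree := &Tensor{Nested: []*Tensor{
		{Nested: []*Tensor{{}, {}}}, // nested parallel branch
		{},                          // plain branch
		{},                          // plain branch
	}}
	fmt.Println(countLeaves(tree)) // 4 leaf preActs
}
```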


Gradient Flow Through Parallel

For "add" and "avg" modes, the same gradOutput (or a scaled version) is sent to every branch:

gradOutput
    │
    ├──── scaledGrad ──▶ Branch 0 backward ──▶ gradInput_0 + gradWeights_0
    ├──── scaledGrad ──▶ Branch 1 backward ──▶ gradInput_1 + gradWeights_1
    └──── scaledGrad ──▶ Branch 2 backward ──▶ gradInput_2 + gradWeights_2

gradInput = gradInput_0 + gradInput_1 + gradInput_2  (accumulated)

For "avg" mode, scaledGrad = gradOutput / N before dispatching.
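
The scaling and accumulation rules can be sketched on plain slices (the `scaleGrad` and `accumulate` helpers are hypothetical; the library does this inside ParallelBackwardPolymorphic):

```go
package main

import "fmt"

// scaleGrad returns gradOutput unchanged for "add" or divided by the
// branch count for "avg", per the rule above.
func scaleGrad(gradOutput []float32, mode string, numBranches int) []float32 {
	out := make([]float32, len(gradOutput))
	copy(out, gradOutput)
	if mode == "avg" {
		n := float32(numBranches)
		for i := range out {
			out[i] /= n
		}
	}
	return out
}

// accumulate sums the per-branch input gradients into one gradInput.
func accumulate(gradInputs [][]float32) []float32 {
	out := make([]float32, len(gradInputs[0]))
	for _, g := range gradInputs {
		for i, v := range g {
			out[i] += v
		}
	}
	return out
}

func main() {
	fmt.Println(scaleGrad([]float32{3, 6}, "avg", 3))    // [1 2]
	fmt.Println(accumulate([][]float32{{1, 1}, {2, 2}})) // [3 3]
}
```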

For "concat" mode, the gradient is sliced by branch output size:

gradOutput [112]:
  branch 0 slice: gradOutput[0:32]   → Branch 0 backward
  branch 1 slice: gradOutput[32:48]  → Branch 1 backward
  branch 2 slice: gradOutput[48:112] → Branch 2 backward

For "concat" backward, the branch output size is determined by running a forward pass to measure len(out.Data). This is a known overhead — for large models, consider caching branch output sizes.
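
The slicing itself is straightforward once the branch sizes are known. A minimal sketch (the `sliceGrad` helper is hypothetical) reproducing the offsets from the example above:

```go
package main

import "fmt"

// sliceGrad splits a flat concat gradient back into per-branch views
// using the branch output sizes. The returned slices share the
// underlying array with gradOutput, so no copying occurs.
func sliceGrad(gradOutput []float32, sizes []int) [][]float32 {
	slices := make([][]float32, len(sizes))
	offset := 0
	for i, n := range sizes {
		slices[i] = gradOutput[offset : offset+n]
		offset += n
	}
	return slices
}

func main() {
	grad := make([]float32, 112)
	parts := sliceGrad(grad, []int{32, 16, 64})
	fmt.Println(len(parts[0]), len(parts[1]), len(parts[2])) // 32 16 64
}
```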

The gradWeights returned by ParallelBackwardPolymorphic is a synthetic tensor with no Data — only Nested:

gradWeights = &Tensor[T]{
    Nested: branchGradWeights,  // per-branch weight gradients
}

ApplyRecursiveGradients recognizes this pattern and dispatches weight updates to each branch recursively.


LayerSequential

SequentialForwardPolymorphic chains sub-layers in order, each receiving the output of the previous one.

layer.Type = poly.LayerSequential
layer.SequentialLayers = []poly.VolumetricLayer{
    {Type: poly.LayerDense,   InputHeight: 128, OutputHeight: 256, ...},
    {Type: poly.LayerRMSNorm, InputHeight: 256, ...},
    {Type: poly.LayerDense,   InputHeight: 256, OutputHeight: 64, ...},
}

This is how transformer blocks are typically assembled: RMSNorm → MHA → RMSNorm → SwiGLU.
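
The chaining rule itself is simple: each sub-layer consumes the previous sub-layer's output. A minimal sketch, with forward functions standing in for sub-layers and all caching omitted (the `Layer` type and `runSequential` helper are hypothetical):

```go
package main

import "fmt"

// Layer stands in for a sub-layer's forward function.
type Layer func(in []float32) []float32

// runSequential threads the input through each sub-layer in order,
// mirroring SequentialForwardPolymorphic's chaining.
func runSequential(layers []Layer, input []float32) []float32 {
	current := input
	for _, l := range layers {
		current = l(current)
	}
	return current
}

func main() {
	double := func(in []float32) []float32 {
		out := make([]float32, len(in))
		for i, v := range in {
			out[i] = 2 * v
		}
		return out
	}
	out := runSequential([]Layer{double, double, double}, []float32{1})
	fmt.Println(out) // [8]
}
```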

Step Containers

For each sub-layer, the forward pass stores a "step container" — a tensor whose Nested holds [bPre, bInput, bSkip]:

stepContainer := &Tensor[T]{
    Nested: []*Tensor[T]{
        bPre,      // Nested[0]: preAct from this sub-layer
        current,   // Nested[1]: the input this sub-layer received
        lastInput, // Nested[2]: the previous input (for skip connections)
    },
}
stepIntermediates[i] = stepContainer

The outer preAct returned by SequentialForwardPolymorphic carries all step containers in its Nested:

preAct = &Tensor[T]{
    Data:   input.Data,
    Nested: stepIntermediates,  // [step0container, step1container, step2container]
}

Sequential Backward

The backward pass iterates sub-layers in reverse order:

for i := len(layer.SequentialLayers) - 1; i >= 0; i-- {
    container := preAct.Nested[i]
    bPre   = container.Nested[0]
    bInput = container.Nested[1]
    bSkip  = container.Nested[2]

    stepGradOutput = currentGrad
    if skipGradients[i+1] != nil {
        stepGradOutput.Add(skipGradients[i+1])  // add skip gradient
    }

    gIn, gW = DispatchLayerBackward(target, stepGradOutput, bInput, bSkip, bPre)
    currentGrad = gIn
}

skipGradients is a slice that accumulates gradients flowing back through skip connections inside the sequence. If a sub-layer (like LayerResidual) produces a gradient flowing back to an earlier step, it is accumulated here.


Remote Links

Both ParallelForwardPolymorphic and SequentialForwardPolymorphic support IsRemoteLink on individual branches:

if branch.IsRemoteLink && layer.Network != nil {
    if remote := layer.Network.GetLayer(branch.TargetZ, branch.TargetY, branch.TargetX, branch.TargetL); remote != nil {
        target = remote
    }
}

This allows a branch to redirect to any layer in the parent VolumetricNetwork, enabling cross-cell feature reuse without duplicating layer definitions.


Tiling Propagation

When layer.UseTiling = true on the parent Sequential layer, the flag is propagated to each sub-layer before dispatch:

if layer.UseTiling {
    target.UseTiling = true
    target.TileSize  = layer.TileSize
}

This means you can set tiling on the top-level Sequential layer and all its sub-layers inherit it automatically.


Practical Example: Transformer Block as Sequential

block := poly.VolumetricLayer{
    Type: poly.LayerSequential,
    SequentialLayers: []poly.VolumetricLayer{
        {
            Type:         poly.LayerRMSNorm,
            InputHeight:  512,
            OutputHeight: 512,
        },
        {
            Type:       poly.LayerMultiHeadAttention,
            DModel:     512,
            NumHeads:   8,
            NumKVHeads: 8,
            HeadDim:    64,
            MaxSeqLen:  2048,
        },
        {
            Type:         poly.LayerRMSNorm,
            InputHeight:  512,
            OutputHeight: 512,
        },
        {
            Type:         poly.LayerSwiGLU,
            InputHeight:  512,
            OutputHeight: 1364,  // ~2.67× hidden size
        },
    },
}

The entire block is a single VolumetricLayer entry in the grid. It runs as a mini-pipeline with the preAct.Nested tree tracking all four sub-layer states for backpropagation.