GPU-Accelerated Neural Network Layers
This guide covers all GPU-accelerated layer types in Loom, showing how to build neural networks that leverage GPU acceleration for both training and inference.
Enabling GPU Acceleration
GPU acceleration is enabled per network in two steps:
network, _ := nn.BuildNetworkFromJSON(config)
network.GPU = true // Enable GPU mode
err := network.WeightsToGPU() // Transfer weights to GPU memory
if err != nil {
log.Fatal("GPU not available:", err)
}
// Forward automatically routes to GPU
output, duration := network.Forward(input)
// When done, release GPU resources
network.ReleaseGPUWeights()
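If a GPU is not present you can fall back to the CPU path instead of exiting; a minimal sketch of that pattern, using only the calls shown above:
// Try the GPU first, but keep running on the CPU if the transfer fails.
network.GPU = true
if err := network.WeightsToGPU(); err != nil {
    log.Println("GPU unavailable, falling back to CPU:", err)
    network.GPU = false // Forward now takes the CPU path
}
output, duration := network.Forward(input)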
All layer types below support both CPU and GPU execution with automatic parity checking.
Supported Layers Status
| Layer | Forward | Backward | Notes |
|---|---|---|---|
| Dense | ✅ Stable | ✅ Stable | Best speedup (up to 20x). |
| Conv2D | ✅ Stable | ✅ Stable | Good for large batches/kernels. |
| Conv1D | ✅ Stable | ⚠️ Experimental | Accuracy under review. |
| RNN / LSTM | ✅ Stable | ⚠️ Experimental | Verified parity, BPTT limited. |
| SwiGLU | ✅ Stable | ⚠️ Experimental | Forward stable; backward still under validation. |
| Norms | ✅ Stable | ⚠️ Experimental | LayerNorm and RMSNorm supported. |
| MHA | ✅ Stable | ⚠️ Experimental | Multi-Head Attention supported. |
Dense Layer
The Dense (fully-connected) layer is the fundamental building block. Every input connects to every output through learned weights.
What It Does
Inputs (2048) Outputs (2048)
│ │
├── w₀₀, w₀₁, ... ──────────────▶ o₀
├── w₁₀, w₁₁, ... ──────────────▶ o₁
│ ... ...
└── w₂₀₄₇,₀, ... ──────────────▶ o₂₀₄₇
Total: 2048 × 2048 = 4,194,304 weights
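Each output is a weighted sum of every input plus a bias, passed through the activation. A minimal CPU reference sketch of that computation (illustrative only, not Loom's internal kernel):
// denseForward computes y[j] = act(Σᵢ W[j][i]·x[i] + b[j]) for each output j.
func denseForward(x []float32, W [][]float32, b []float32, act func(float32) float32) []float32 {
    y := make([]float32, len(b))
    for j := range y {
        sum := b[j]
        for i, xi := range x {
            sum += W[j][i] * xi
        }
        y[j] = act(sum)
    }
    return y
}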
JSON Configuration
{
"id": "dense_network",
"batch_size": 1,
"grid_rows": 1,
"grid_cols": 1,
"layers_per_cell": 5,
"layers": [
{"type": "dense", "activation": "leaky_relu", "input_height": 2048, "output_height": 2048},
{"type": "dense", "activation": "leaky_relu", "input_height": 2048, "output_height": 2048},
{"type": "dense", "activation": "leaky_relu", "input_height": 2048, "output_height": 2048},
{"type": "dense", "activation": "leaky_relu", "input_height": 2048, "output_height": 2048},
{"type": "dense", "activation": "sigmoid", "input_height": 2048, "output_height": 2}
]
}
Go Code Example
// Create a dense layer programmatically
layer := nn.InitDenseLayer(2048, 2048, nn.ActivationLeakyReLU)
// Or use nn.NewNetwork and SetLayer
network := nn.NewNetwork(2048, 1, 1, 3) // 2048 inputs, 1×1 grid, 3 layers per cell
network.SetLayer(0, 0, 0, nn.InitDenseLayer(2048, 1024, nn.ActivationLeakyReLU))
network.SetLayer(0, 0, 1, nn.InitDenseLayer(1024, 512, nn.ActivationLeakyReLU))
network.SetLayer(0, 0, 2, nn.InitDenseLayer(512, 10, nn.ActivationSigmoid))
Parameters
| Parameter | Description |
|---|---|
| input_height | Number of input features |
| output_height | Number of output features |
| activation | Activation function: relu, leaky_relu, sigmoid, tanh, linear |
LayerNorm Layer
Layer Normalization normalizes activations across the feature dimension, stabilizing training by preventing value drift.
What It Does
For each sample in a batch:
1. Compute mean: μ = mean(features)
2. Compute variance: σ² = var(features)
3. Normalize: x̂ = (x - μ) / √(σ² + ε)
4. Scale and shift: y = γ × x̂ + β
Where γ (gamma) and β (beta) are learnable parameters.
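A minimal CPU sketch of those four steps (illustrative only; uses the standard math package):
import "math"

// layerNorm normalizes x across its features, then applies the learned scale and shift.
func layerNorm(x, gamma, beta []float32, eps float32) []float32 {
    n := float32(len(x))
    var mean float32
    for _, v := range x {
        mean += v
    }
    mean /= n
    var variance float32
    for _, v := range x {
        d := v - mean
        variance += d * d
    }
    variance /= n
    invStd := 1 / float32(math.Sqrt(float64(variance+eps)))
    out := make([]float32, len(x))
    for i, v := range x {
        out[i] = gamma[i]*(v-mean)*invStd + beta[i]
    }
    return out
}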
JSON Configuration
{
"layers": [
{"type": "dense", "activation": "leaky_relu", "input_height": 2048, "output_height": 2048},
{"type": "layer_norm", "norm_size": 2048, "epsilon": 1e-5},
{"type": "layer_norm", "norm_size": 2048, "epsilon": 1e-5},
{"type": "layer_norm", "norm_size": 2048, "epsilon": 1e-5},
{"type": "dense", "activation": "sigmoid", "input_height": 2048, "output_height": 2}
]
}
Parameters
| Parameter | Description |
|---|---|
| norm_size | Size of the feature dimension to normalize |
| epsilon | Small constant for numerical stability (typically 1e-5) |
RMSNorm Layer
RMS Normalization is a simplified variant of LayerNorm used in modern LLMs such as Llama. It normalizes by the root mean square alone, with no mean subtraction.
What It Does
rms = √(mean(x²) + ε)
output = (x / rms) × γ
Simpler than LayerNorm:
- No mean computation
- No beta parameter (just gamma)
- Slightly faster, works well empirically
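The same idea as a minimal Go sketch (illustrative only; uses the standard math package):
import "math"

// rmsNorm divides x by its root mean square, then scales by gamma.
func rmsNorm(x, gamma []float32, eps float32) []float32 {
    var sumSq float32
    for _, v := range x {
        sumSq += v * v
    }
    rms := float32(math.Sqrt(float64(sumSq/float32(len(x)) + eps)))
    out := make([]float32, len(x))
    for i, v := range x {
        out[i] = (v / rms) * gamma[i]
    }
    return out
}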
JSON Configuration
{
"layers": [
{"type": "dense", "activation": "leaky_relu", "input_height": 2048, "output_height": 2048},
{"type": "rms_norm", "norm_size": 2048, "epsilon": 1e-5},
{"type": "rms_norm", "norm_size": 2048, "epsilon": 1e-5},
{"type": "rms_norm", "norm_size": 2048, "epsilon": 1e-5},
{"type": "dense", "activation": "sigmoid", "input_height": 2048, "output_height": 2}
]
}
Parameters
| Parameter | Description |
|---|---|
| norm_size | Size of the feature dimension to normalize |
| epsilon | Small constant for numerical stability (typically 1e-5 or 1e-6) |
Softmax Layer
Softmax converts arbitrary values into a probability distribution that sums to 1.
What It Does
Input logits: [2.0, 1.0, 0.1]
│
▼ exp(each value)
[7.39, 2.72, 1.11]
│
▼ divide by sum (11.22)
Output probs: [0.66, 0.24, 0.10]
─────────────────
sums to 1.0 ✓
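Implementations usually subtract the maximum logit before exponentiating so the exponentials cannot overflow; a minimal sketch with temperature scaling (illustrative only, not Loom's kernel):
import "math"

// softmax turns logits into probabilities; temperature < 1 sharpens, > 1 smooths.
func softmax(logits []float32, temperature float32) []float32 {
    maxLogit := logits[0]
    for _, v := range logits {
        if v > maxLogit {
            maxLogit = v
        }
    }
    probs := make([]float32, len(logits))
    var sum float32
    for i, v := range logits {
        e := float32(math.Exp(float64((v - maxLogit) / temperature)))
        probs[i] = e
        sum += e
    }
    for i := range probs {
        probs[i] /= sum
    }
    return probs
}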
JSON Configuration
{
"layers": [
{"type": "dense", "activation": "leaky_relu", "input_height": 2048, "output_height": 2048},
{"type": "softmax", "temperature": 1.0},
{"type": "softmax", "temperature": 1.0},
{"type": "softmax", "temperature": 1.0},
{"type": "dense", "activation": "sigmoid", "input_height": 2048, "output_height": 2}
]
}
Go Code Example
// Standard softmax
standard := nn.InitSoftmaxLayer()
// Temperature-scaled (lower = sharper, higher = smoother)
scaled := nn.InitTemperatureSoftmaxLayer(0.5)
// Grid softmax for multi-agent (each row sums to 1)
grid := nn.InitGridSoftmaxLayer(4, 8) // 4 agents, 8 actions each
// Masked softmax (for legal moves in games)
masked := nn.InitMaskedSoftmaxLayer(10)
masked.Mask = []bool{true, true, false, true, ...} // false = illegal
Parameters
| Parameter | Description |
|---|---|
| temperature | Controls sharpness: 0.1 = confident, 1.0 = normal, 5.0 = smooth |
Conv1D Layer
1D Convolution slides a kernel over sequential data, detecting local patterns.
What It Does
Input: [batch][channels][sequence]
Kernel (3 elements) slides across sequence:
[a, b, c] slides over [x₀, x₁, x₂, x₃, x₄, x₅, ...]
Position 0: a×x₀ + b×x₁ + c×x₂ → output[0]
Position 1: a×x₁ + b×x₂ + c×x₃ → output[1]
...
Output: [batch][filters][output_length]
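The output length follows standard convolution arithmetic: output_length = (sequence + 2×padding − kernel) / stride + 1. A small helper for checking shapes (illustrative):
// convOutLen returns (length + 2*padding - kernel)/stride + 1.
func convOutLen(length, kernel, stride, padding int) int {
    return (length+2*padding-kernel)/stride + 1
}
// e.g. convOutLen(32, 3, 1, 1) == 32, matching the 32 × 64 = 2048 layout used below.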
JSON Configuration
{
"layers": [
{"type": "dense", "activation": "leaky_relu", "input_height": 2048, "output_height": 2048},
{"type": "conv1d", "conv1d_in_channels": 64, "conv1d_filters": 64,
"conv1d_kernel_size": 3, "conv1d_stride": 1, "conv1d_padding": 1},
{"type": "conv1d", "conv1d_in_channels": 64, "conv1d_filters": 64,
"conv1d_kernel_size": 3, "conv1d_stride": 1, "conv1d_padding": 1},
{"type": "conv1d", "conv1d_in_channels": 64, "conv1d_filters": 64,
"conv1d_kernel_size": 3, "conv1d_stride": 1, "conv1d_padding": 1},
{"type": "dense", "activation": "sigmoid", "input_height": 2048, "output_height": 2}
]
}
The input size 2048 = 32 (sequence length) × 64 (channels). With kernel=3, padding=1, and stride=1, the output length is (32 + 2×1 − 3)/1 + 1 = 32, so the output stays 32 × 64 = 2048.
Go Code Example
// Conv1D: 32 seq length, 64 input channels, kernel=3, stride=1, padding=1, 64 filters
layer := nn.InitConv1DLayer(32, 64, 3, 1, 1, 64, nn.ActivationReLU)
Parameters
| Parameter | Description |
|---|---|
| conv1d_in_channels | Number of input channels |
| conv1d_filters | Number of output filters |
| conv1d_kernel_size | Size of the convolution kernel |
| conv1d_stride | Step size for kernel movement |
| conv1d_padding | Zero-padding added to input edges |
Conv2D Layer
2D Convolution slides a kernel over spatial data (images), detecting local patterns like edges and textures.
What It Does
Input: [batch][channels][height][width]
3×3 Kernel slides across 2D image:
┌───┬───┬───┐
│ a │ b │ c │ Convolves at each spatial position
├───┼───┼───┤ to produce one output value
│ d │ e │ f │
├───┼───┼───┤
│ g │ h │ i │
└───┴───┴───┘
Output: [batch][filters][out_height][out_width]
JSON Configuration
{
"layers": [
{"type": "dense", "activation": "leaky_relu", "input_height": 2048, "output_height": 2048},
{"type": "conv2d", "input_channels": 8, "filters": 8, "kernel_size": 3,
"stride": 1, "padding": 1, "input_height": 16, "input_width": 16},
{"type": "conv2d", "input_channels": 8, "filters": 8, "kernel_size": 3,
"stride": 1, "padding": 1, "input_height": 16, "input_width": 16},
{"type": "conv2d", "input_channels": 8, "filters": 8, "kernel_size": 3,
"stride": 1, "padding": 1, "input_height": 16, "input_width": 16},
{"type": "dense", "activation": "sigmoid", "input_height": 2048, "output_height": 2}
]
}
The input 2048 = 16 × 16 × 8 (height × width × channels). With kernel=3, padding=1, and stride=1, each spatial dimension stays (16 + 2×1 − 3)/1 + 1 = 16, so the output remains 16 × 16 × 8 = 2048.
Go Code Example
// Conv2D: 16×16 image, 8 input channels, 3×3 kernel, stride 1, padding 1, 8 filters
layer := nn.InitConv2DLayer(16, 16, 8, 3, 1, 1, 8, nn.ActivationReLU)
Parameters
| Parameter | Description |
|---|---|
| input_height, input_width | Spatial dimensions of input |
| input_channels | Number of input channels |
| filters | Number of output filters/channels |
| kernel_size | Size of the square kernel (e.g., 3 for 3×3) |
| stride | Step size for kernel movement |
| padding | Zero-padding added to input edges |
SwiGLU Layer
SwiGLU is a gated activation used in modern LLMs (Llama, Mistral, etc.). It combines three projections with a gating mechanism.
What It Does
SwiGLU(x) = down_proj(silu(gate_proj(x)) × up_proj(x))
Where:
- gate_proj: Linear projection to intermediate size
- up_proj: Another linear projection to intermediate size
- silu(x) = x × sigmoid(x) (Swish activation)
- down_proj: Project back to input size
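A minimal vector-level sketch of the formula above (illustrative only; matVec stands in for the three learned projections):
import "math"

// swiGLU gates up_proj(x) with silu(gate_proj(x)), then applies down_proj.
func swiGLU(x []float32, gateW, upW, downW [][]float32) []float32 {
    gate := matVec(gateW, x) // gate_proj(x)
    up := matVec(upW, x)     // up_proj(x)
    for i := range gate {
        silu := gate[i] / (1 + float32(math.Exp(float64(-gate[i])))) // x × sigmoid(x)
        gate[i] = silu * up[i]
    }
    return matVec(downW, gate) // down_proj(...)
}

// matVec is a plain matrix-vector product used by the sketch above.
func matVec(W [][]float32, x []float32) []float32 {
    y := make([]float32, len(W))
    for j, row := range W {
        var sum float32
        for i, w := range row {
            sum += w * x[i]
        }
        y[j] = sum
    }
    return y
}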
JSON Configuration
{
"layers": [
{"type": "dense", "activation": "leaky_relu", "input_height": 2048, "output_height": 2048},
{"type": "swiglu", "input_height": 2048, "output_height": 2048},
{"type": "swiglu", "input_height": 2048, "output_height": 2048},
{"type": "swiglu", "input_height": 2048, "output_height": 2048},
{"type": "dense", "activation": "sigmoid", "input_height": 2048, "output_height": 2}
]
}
Parameters
| Parameter | Description |
|---|---|
| input_height | Input feature size |
| output_height | Output feature size (typically same as input) |
RNN Layer
Recurrent Neural Networks process sequences by maintaining a hidden state that carries information through time.
What It Does
Sequence: [x₀, x₁, x₂, x₃, ...]
x₀ x₁ x₂ x₃
│ │ │ │
▼ ▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
h₀→│ RNN │→│ RNN │→│ RNN │→│ RNN │→h₄
│ Cell │ │ Cell │ │ Cell │ │ Cell │
└──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘
│ │ │ │
▼ ▼ ▼ ▼
y₀ y₁ y₂ y₃
Hidden state h carries context forward through time.
Same weights used at every step (weight sharing).
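A single time step of this update, sketched as a standard Elman-style cell for illustration (uses the standard math package):
import "math"

// rnnStep computes h_t = tanh(Wxh·x_t + Whh·h_prev + b) for one time step.
func rnnStep(x, hPrev []float32, Wxh, Whh [][]float32, b []float32) []float32 {
    h := make([]float32, len(b))
    for j := range h {
        sum := b[j]
        for i, xi := range x {
            sum += Wxh[j][i] * xi
        }
        for i, hi := range hPrev {
            sum += Whh[j][i] * hi
        }
        h[j] = float32(math.Tanh(float64(sum)))
    }
    return h
}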
JSON Configuration
{
"layers": [
{"type": "dense", "activation": "leaky_relu", "input_height": 512, "output_height": 512},
{"type": "rnn", "input_size": 64, "hidden_size": 64, "seq_length": 8},
{"type": "dense", "activation": "sigmoid", "input_height": 512, "output_height": 2}
]
}
The input size 512 = 8 (sequence length) × 64 (features); the output is likewise 8 × 64 = 512.
Go Code Example
// RNN: 64 input features, 64 hidden size, batch size 1, sequence length 8
layer := nn.InitRNNLayer(64, 64, 1, 8)
Parameters
| Parameter | Description |
|---|---|
| input_size | Size of input features at each time step |
| hidden_size | Size of the hidden state |
| seq_length | Length of input sequences |
Complete GPU Test Example
This example creates a network with each layer type and tests CPU vs GPU parity:
package main
import (
    "fmt"
    "log"
    "math/rand"

    "github.com/openfluke/loom/nn"
)
func main() {
// Define network with Dense layers
config := `{
"id": "gpu_test",
"batch_size": 1,
"grid_rows": 1,
"grid_cols": 1,
"layers_per_cell": 3,
"layers": [
{"type": "dense", "activation": "leaky_relu", "input_height": 2048, "output_height": 2048},
{"type": "layer_norm", "norm_size": 2048, "epsilon": 1e-5},
{"type": "dense", "activation": "sigmoid", "input_height": 2048, "output_height": 10}
]
}`
network, err := nn.BuildNetworkFromJSON(config)
if err != nil {
    log.Fatal("failed to build network:", err)
}
network.BatchSize = 1
network.InitializeWeights()
// Create random input
input := make([]float32, 2048)
for i := range input {
input[i] = rand.Float32()*2 - 1
}
// CPU forward pass
network.GPU = false
cpuOutput, cpuTime := network.Forward(input)
fmt.Printf("CPU: %v\n", cpuTime)
// GPU forward pass
network.GPU = true
if err := network.WeightsToGPU(); err != nil {
    log.Fatal("GPU not available:", err)
}
gpuOutput, gpuTime := network.Forward(input) // Same API, uses GPU
fmt.Printf("GPU: %v (%.2fx speedup)\n", gpuTime, float64(cpuTime)/float64(gpuTime))
// Verify parity
maxError := 0.0
for i := range cpuOutput {
if diff := abs(cpuOutput[i] - gpuOutput[i]); diff > maxError {
maxError = diff
}
}
fmt.Printf("Max error: %e\n", maxError)
network.ReleaseGPUWeights()
}
func abs(x float32) float64 {
if x < 0 { return float64(-x) }
return float64(x)
}
Summary
| Layer | Use Case | GPU Benefit |
|---|---|---|
| Dense | General transformations | High (matrix multiply) |
| LayerNorm | Training stability | Medium |
| RMSNorm | LLM normalization | Medium |
| Softmax | Probabilities | Medium |
| Conv1D | Sequence patterns | High |
| Conv2D | Image patterns | Very High |
| SwiGLU | LLM activations | High |
| RNN | Sequential memory | Medium |
Every layer provides a backward pass for training; GPU gradient computation is stable for Dense and Conv2D and experimental for the remaining layers (see the status table above).