Understanding Optimizers and Learning Rate Schedules
This guide explains how optimizers actually update weights—not just the formulas, but the intuition behind why they work and when to use each one.
The Basic Problem: How Do We Improve?
After computing gradients, we know which direction would increase the loss. We want to move opposite to that direction. But how far should we step?
Current weights → Loss = 1.5
│
│ gradient = -0.2 (pointing toward higher loss)
▼
Step 1: new_weights = weights - learning_rate × gradient
                    = weights - 0.01 × (-0.2)
                    = weights + 0.002
│
│ We moved opposite to the gradient
▼
New weights → Loss = 1.4 (improved!)
Simple gradient descent works, but it has problems:

1. Same step size everywhere: Flat regions need big steps, steep regions need small steps
2. Oscillation: Can bounce back and forth in "valleys"
3. Getting stuck: Can get trapped in local minima or saddle points
Optimizers solve these problems.
Visualizing the Loss Landscape
Think of training as navigating a hilly landscape, trying to find the lowest valley:
╱╲ The Loss Landscape
╱ ╲ (2D slice for visualization)
╱ ╲
╱ ╲ ╱╲
╱ ╲ ╱ ╲
╱ ╲ ╱ ╲ ╱╲
╲ ╱ ╲ ╱ ╲
╲╱ ╲____╱ ╲____
↑
Global minimum
(best solution)
Challenges:

- Steep cliffs: Gradient is huge, might overshoot
- Flat plateaus: Gradient is tiny, progress is slow
- Narrow valleys: Oscillate between walls
SGD: The Foundation
Stochastic Gradient Descent is the simplest optimizer:
optimizer := nn.NewSGDOptimizer(momentum)
Basic SGD (No Momentum)
weight = weight - learning_rate × gradient
That's it! Move opposite to the gradient, scaled by learning rate.
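To make that concrete, here is a minimal sketch in plain Go of what one SGD step does to a slice of weights (the function name is illustrative, not the nn package's code):

// sgdStep applies one plain gradient-descent update in place.
func sgdStep(weights, grads []float64, lr float64) {
    for i := range weights {
        weights[i] -= lr * grads[i] // move opposite to the gradient
    }
}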
The Problem: Oscillation
When the loss surface is like a valley (steep in one direction, shallow in another):
Top view of a valley-shaped loss landscape:
Steep walls
↓
────────────────────
│ │
│ ╱╲ ╱╲ ╱╲ │ ← Path without momentum
│ ╱ ╲╱ ╲╱ ╲ │ (bouncing between walls)
│ ╱ ╲ │
│ │
────────●──────────
↑
Goal (minimum)
The gradient points toward the walls, so we bounce back and forth instead of moving toward the minimum.
Solution: Momentum
Momentum remembers which direction we've been moving and continues in that direction:
velocity = momentum × old_velocity + gradient
weight = weight - learning_rate × velocity
Think of it like a ball rolling downhill:
Without momentum: With momentum:
╱╲ ╱╲ ╱╲ ╲
╱ ╲╱ ╲╱ ╲ ╲
╱ ╲ ╲─────────●
Goal!
Bounces between walls Builds speed in consistent direction,
dampens oscillation
Momentum Values
momentum = 0.0: Pure gradient descent (no memory)
momentum = 0.9: Standard choice (remembers ~10 past gradients)
momentum = 0.99: Strong momentum (smoother, slower to change direction)
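As a sketch of the update rule above in plain Go (the velocity slice is state kept between steps; names are illustrative, not the nn implementation):

// sgdMomentumStep blends the new gradient into a running velocity,
// then steps along the velocity instead of the raw gradient.
func sgdMomentumStep(weights, grads, velocity []float64, lr, momentum float64) {
    for i := range weights {
        velocity[i] = momentum*velocity[i] + grads[i]
        weights[i] -= lr * velocity[i]
    }
}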
Advanced: Nesterov Momentum
Nesterov looks ahead before computing the gradient:
Standard momentum:
Look at current position → Compute gradient → Update velocity
Nesterov momentum:
Estimate where we're going → Compute gradient THERE → Update velocity
Current → Where we're headed
●────────────────────────▶ ●
│
│ Compute gradient here!
▼
This often leads to better convergence because we're planning ahead.
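In code, Nesterov is usually written as a small rearrangement of the momentum update rather than a literal look-ahead. A common equivalent form, sketched in plain Go (illustrative, not the nn implementation):

// nesterovStep: same velocity update as plain momentum, but the applied
// step also peeks one momentum step ahead (grad + momentum*velocity).
func nesterovStep(weights, grads, velocity []float64, lr, momentum float64) {
    for i := range weights {
        velocity[i] = momentum*velocity[i] + grads[i]
        weights[i] -= lr * (grads[i] + momentum*velocity[i])
    }
}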
AdamW: The Modern Default
Adam (Adaptive Moment Estimation) + Weight Decay. This is what you should use unless you have a specific reason not to.
optimizer := nn.NewAdamWOptimizer(
    0.9,   // beta1 - momentum decay (like SGD momentum)
    0.999, // beta2 - variance decay (for adaptive LR)
    1e-8,  // epsilon - prevents division by zero
    0.01,  // weight decay - decoupled weight decay (the "W" in AdamW)
)
What Adam Does Differently
Adam tracks TWO exponential averages:

1. First moment (m): Average of gradients → like momentum
2. Second moment (v): Average of squared gradients → measures variance
m = β₁ × m + (1 - β₁) × gradient ← Direction (which way to go)
v = β₂ × v + (1 - β₂) × gradient² ← Scale (how noisy is this param?)
update = m / (√v + ε) ← Divide momentum by standard deviation
weight -= learning_rate × update
Why This Works
Different parameters need different learning rates:
Parameter with consistent gradients (low variance):
gradient over time: [0.1, 0.12, 0.09, 0.11, 0.10]
m (momentum): 0.1 (pointing clearly in one direction)
v (variance): 0.01 (very stable)
update = 0.1 / √0.01 = 1.0 ← Strong update!
Parameter with noisy gradients (high variance):
gradient over time: [0.5, -0.3, 0.8, -0.6, 0.2]
m (momentum): ~0 (averaging out)
v (variance): 0.25 (very noisy)
update = 0 / √0.25 = 0.0 ← No update (noise cancels out)
Bias Correction
At the start of training, m and v are initialized to zero. This biases them toward zero. Adam corrects:
m̂ = m / (1 - β₁ᵗ) ← Corrects first moment
v̂ = v / (1 - β₂ᵗ) ← Corrects second moment
At step 1:
Uncorrected m = 0.1 × gradient (mostly zero)
Corrected m̂ = 0.1 × gradient / (1 - 0.9) = gradient (proper scale)
Weight Decay (The "W" in AdamW)
Original Adam had a problem: weight decay interacted badly with adaptive learning rates. AdamW fixes this:
Original Adam (wrong):
Add weight decay to gradient → Affects adaptive scaling
AdamW (correct):
weight = weight - learning_rate × (update + weight_decay × weight)
↑
Decoupled! Doesn't affect Adam's adaptive behavior
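Putting the pieces together, one AdamW step could be sketched like this in plain Go (m, v, and the step counter t are persistent per-parameter state; the names and layout are illustrative, not the nn implementation):

import "math"

// adamWStep: adaptive Adam update plus decoupled weight decay.
// t is the 1-based step count, used for bias correction.
func adamWStep(w, g, m, v []float64, lr, beta1, beta2, eps, wd float64, t int) {
    for i := range w {
        m[i] = beta1*m[i] + (1-beta1)*g[i]      // first moment (direction)
        v[i] = beta2*v[i] + (1-beta2)*g[i]*g[i] // second moment (scale)
        mHat := m[i] / (1 - math.Pow(beta1, float64(t))) // bias-corrected
        vHat := v[i] / (1 - math.Pow(beta2, float64(t)))
        update := mHat / (math.Sqrt(vHat) + eps)
        w[i] -= lr * (update + wd*w[i]) // decay is decoupled from the adaptive step
    }
}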
When to Use AdamW
- Transformers: Almost always AdamW
- LLMs: AdamW is standard
- Most deep learning: Safe default choice
- Fast convergence needed: Adam finds good solutions quickly
RMSprop: For Recurrent Networks
RMSprop was one of the first adaptive learning rate methods, predating Adam.
optimizer := nn.NewRMSpropOptimizer(
    0.99, // alpha - decay rate for variance estimate
    1e-8, // epsilon - numerical stability
    0.0,  // momentum - optional momentum term
)
How It Works
v = α × v + (1 - α) × gradient² ← Running average of squared gradients
weight -= learning_rate × gradient / (√v + ε)
This is like Adam's second moment, but without the first moment (momentum).
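A sketch of one RMSprop step in plain Go, without the optional momentum term (illustrative names, not the nn implementation):

import "math"

// rmspropStep scales each gradient by a running RMS of its recent history.
func rmspropStep(w, g, v []float64, lr, alpha, eps float64) {
    for i := range w {
        v[i] = alpha*v[i] + (1-alpha)*g[i]*g[i]
        w[i] -= lr * g[i] / (math.Sqrt(v[i]) + eps)
    }
}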
When to Use RMSprop
- RNNs/LSTMs: Gradients can vary wildly between time steps
- Non-stationary problems: Loss landscape changes during training
- When Adam overfits: Sometimes simpler is better
Comparing Optimizers
|                 | SGD  | SGD + Momentum | Adam       | AdamW                    | RMSprop |
|-----------------|------|----------------|------------|--------------------------|---------|
| Convergence     | Slow | Medium         | Fast       | Fast                     | Medium  |
| Generalization  | Best | Good           | Good       | Good                     | Good    |
| Memory          | O(n) | O(2n)          | O(3n)      | O(3n)                    | O(2n)   |
| Hyperparameters | LR   | LR, momentum   | LR, β1, β2 | LR, β1, β2, weight decay | LR, α   |
| Use case        | CNNs | CNNs           | Most tasks | LLMs                     | RNNs    |
Learning Rate Schedules
The optimal learning rate changes during training. At first, you want to explore quickly. Later, you want to fine-tune carefully.
Why Schedules Matter
Fixed high LR: Fixed low LR:
│ │
Loss │ ╱╲ ╱╲ ╱╲ │╲
│ ╱ ╲╱ ╲╱ │ ╲
│╱ │ ╲
└──────────────── │ ╲
└────────────────────────
Oscillates near minimum, Takes forever to reach
never converges precisely the minimum
Scheduled LR:
│
Loss │╲
│ ╲
│ ╲
│ ╲
│ ╲──────
└────────────
Fast initial progress,
then precise convergence
Constant (No Schedule)
scheduler := nn.NewConstantScheduler(0.001)
LR │━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
│
└──────────────────────────────────────▶ Step
Simple. Use for debugging or very short training runs.
Linear Decay
scheduler := nn.NewLinearDecayScheduler(
    0.001,  // start LR
    0.0001, // end LR
    10000,  // total steps
)
LR │╲
│ ╲
│ ╲
│ ╲
│ ╲
│ ╲
│ ╲━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
└──────────────────────────────────────▶ Step
Linear decrease. Simple and effective.
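The arithmetic behind it is a straight interpolation; a sketch in plain Go (assumed behaviour, not the scheduler's code):

// linearDecayLR interpolates from startLR to endLR over totalSteps,
// then holds endLR.
func linearDecayLR(startLR, endLR float64, step, totalSteps int) float64 {
    if step >= totalSteps {
        return endLR
    }
    frac := float64(step) / float64(totalSteps)
    return startLR + (endLR-startLR)*frac
}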
Cosine Annealing
scheduler := nn.NewCosineAnnealingScheduler(
    0.001,  // max LR
    0.0001, // min LR
    10000,  // total steps
)
LR │━━━╮
│ │
│ ╲
│ ╲
│ ╲
│ ╲
│ ╲
│ ╲________━━━━━━━━━━━━━━━━━━
└──────────────────────────────────────▶ Step
Follows cosine curve. Slow decay at start and end,
faster in the middle. Very popular for transformers.
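The usual formula maps half a cosine wave onto the LR range; a sketch in plain Go (assumed behaviour, not the scheduler's code):

import "math"

// cosineAnnealLR follows half a cosine from maxLR down to minLR.
func cosineAnnealLR(maxLR, minLR float64, step, totalSteps int) float64 {
    if step >= totalSteps {
        return minLR
    }
    frac := float64(step) / float64(totalSteps)
    return minLR + 0.5*(maxLR-minLR)*(1+math.Cos(math.Pi*frac))
}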
Cosine with Warm Restarts
scheduler := nn.NewCosineAnnealingWarmRestartsScheduler(
    0.001,  // max LR
    0.0001, // min LR
    1000,   // T_0 - first period
    2,      // T_mult - multiply period after each restart
)
LR │╮ ╭───╮ ╭───────────╮
│ ╲╱ ╲ ╱ ╲
│ ╲╱ ╲
│ ╲
│ ╲ ╱───────────
│ ╲__╱
└──────────────────────────────────────▶ Step
        │              │                         │
     Restart        Restart                   Restart
  (step 1000)    (step 3000)               (step 7000)
Periodically "reheats" to escape local minima.
Each cycle is longer than the last.
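The restart bookkeeping can be sketched like this, assuming standard SGDR-style behaviour (restart when the current cycle ends, then multiply the cycle length by T_mult); illustrative, not the scheduler's code:

import "math"

// warmRestartLR cosine-anneals within each cycle; cycle lengths grow by tMult.
func warmRestartLR(maxLR, minLR float64, step, t0, tMult int) float64 {
    period, start := t0, 0
    for step >= start+period { // find the cycle this step falls in
        start += period
        period *= tMult
    }
    frac := float64(step-start) / float64(period)
    return minLR + 0.5*(maxLR-minLR)*(1+math.Cos(math.Pi*frac))
}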
Warmup
scheduler := nn.NewWarmupScheduler(
    0.001, // target LR
    1000,  // warmup steps
)
LR │ ━━━━━━━━━━━━━━━━━━━━━━━━━━
│ ╱
│ ╱
│ ╱
│ ╱
│ ╱
│╱
└──────────────────────────────────────▶ Step
│
Warmup ends
Essential for large models. Prevents unstable, oversized updates at the start
of training, when weights are random and the optimizer's statistics are still noisy.
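A sketch of a linear ramp in plain Go (assumed behaviour, not the scheduler's code):

// warmupLR ramps linearly from 0 to targetLR over warmupSteps, then holds.
func warmupLR(targetLR float64, step, warmupSteps int) float64 {
    if step >= warmupSteps {
        return targetLR
    }
    return targetLR * float64(step) / float64(warmupSteps)
}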
Warmup + Cosine (The Standard)
Combine warmup with another schedule:
warmup := nn.NewWarmupScheduler(0.001, 1000)
cosine := nn.NewCosineAnnealingScheduler(0.001, 0.0001, 9000)
scheduler := nn.NewCompositeScheduler(warmup, cosine, 1000)
LR │ ╭───╮
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│╱ ╲___________
└──────────────────────────────────────▶ Step
│
Warmup ends, cosine begins
The most common schedule for training LLMs.
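Conceptually, the composite scheduler just switches phases at the warmup boundary. A sketch that reuses the warmupLR and cosineAnnealLR helpers from the earlier sketches (assumed behaviour, not nn.NewCompositeScheduler itself):

// warmupCosineLR: linear warmup, then cosine annealing over the remaining steps.
func warmupCosineLR(maxLR, minLR float64, step, warmupSteps, totalSteps int) float64 {
    if step < warmupSteps {
        return warmupLR(maxLR, step, warmupSteps)
    }
    return cosineAnnealLR(maxLR, minLR, step-warmupSteps, totalSteps-warmupSteps)
}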
Step Decay
scheduler := nn.NewStepDecayScheduler(
    0.01, // initial LR
    0.1,  // gamma (decay factor)
    3000, // step size
)
LR │━━━━━━━━━━━━╮
│ │
│ └━━━━━━━━━━━━╮
│ │
│ └━━━━━━━━━━━━
│
└──────────────────────────────────────▶ Step
│ │
LR × 0.1 LR × 0.1
Classic approach for CNNs. "Drop LR every N epochs."
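The drop logic is a single integer division; a sketch in plain Go (assumed behaviour, not the scheduler's code):

import "math"

// stepDecayLR multiplies the LR by gamma once every stepSize steps.
func stepDecayLR(initialLR, gamma float64, step, stepSize int) float64 {
    drops := step / stepSize // integer division: number of drops so far
    return initialLR * math.Pow(gamma, float64(drops))
}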
Polynomial Decay
scheduler := nn.NewPolynomialDecayScheduler(
    0.001,  // initial LR
    0.0001, // final LR
    10000,  // total steps
    2.0,    // power
)
power = 1.0: Linear (same as linear decay)
power = 2.0: Quadratic (decays quickly at first, then levels off)
power = 0.5: Square root (decays slowly at first, then drops sharply near the end)
Power = 2.0:
LR │╲
│ ╲
│ ╲
│ ╲
│ ╲
│ ╲
│ ╲━━━━━━━━━━━━━━━━━━━━━━━━
└──────────────────────────────────────▶ Step
Adjustable curve shape.
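A sketch in plain Go, assuming the common polynomial-decay formula lr = (initial - final) × (1 - step/total)^power + final (this may differ from the scheduler's exact formula):

import "math"

// polynomialDecayLR shapes the decay curve with the power exponent.
func polynomialDecayLR(initialLR, finalLR float64, step, totalSteps int, power float64) float64 {
    if step >= totalSteps {
        return finalLR
    }
    remaining := 1 - float64(step)/float64(totalSteps)
    return (initialLR-finalLR)*math.Pow(remaining, power) + finalLR
}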
Putting It Together
A typical training setup:
// Create network
network := nn.NewNetwork(...)

// Set up optimizer
optimizer := nn.NewAdamWOptimizer(0.9, 0.999, 1e-8, 0.01)
network.SetOptimizer(optimizer)

// Set up schedule
warmupSteps := 1000
totalSteps := 100000
warmup := nn.NewWarmupScheduler(0.001, warmupSteps)
cosine := nn.NewCosineAnnealingScheduler(0.001, 1e-5, totalSteps-warmupSteps)
scheduler := nn.NewCompositeScheduler(warmup, cosine, warmupSteps)

// Training loop
for step := 0; step < totalSteps; step++ {
    lr := scheduler.GetLR(step)

    output, _ := network.ForwardCPU(input)
    loss, grad := nn.CrossEntropyLossGrad(output, target)
    _ = loss // track or log the loss here

    network.BackwardCPU(grad)
    network.ClipGradients(1.0)
    network.ApplyGradients(lr) // Optimizer handles the rest
}
Quick Reference: Which to Use?
| Situation | Optimizer | Schedule |
|---|---|---|
| Transformers / LLMs | AdamW | Warmup + Cosine |
| CNNs (ImageNet) | SGD + Momentum | Step decay or Cosine |
| RNNs / LSTMs | RMSprop or Adam | Exponential decay |
| Fine-tuning | AdamW | Constant (low LR) |
| Quick experiments | Adam | Constant |
| Maximum accuracy | SGD + Momentum | Warmup + Cosine |
Default values that usually work:
AdamW: LR=0.001, β1=0.9, β2=0.999, weight_decay=0.01
SGD: LR=0.1, momentum=0.9
Warmup: 1% to 10% of total steps