Momentum Optimizer

#146 · Deep Learning · Easy

Problem

Implement the Momentum optimizer. Momentum accelerates SGD by accumulating an exponentially decaying sum of past gradients in a velocity term, which helps it traverse ravines and reduces oscillation.

Solution

import numpy as np

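class MomentumOptimizer:
    """SGD with classical momentum; keeps one velocity array between calls."""

    def __init__(self, learning_rate: float = 0.01, momentum: float = 0.9):
        self.lr = learning_rate
        self.momentum = momentum
        self.velocity = None  # allocated lazily on the first update

    def update(self, params: np.ndarray, grads: np.ndarray) -> np.ndarray:
        # Start from zero velocity, matching the shape of the parameters.
        if self.velocity is None:
            self.velocity = np.zeros_like(params)

        # Decay the previous velocity, then add the scaled negative gradient.
        self.velocity = self.momentum * self.velocity - self.lr * grads
        # Return a new array; the caller's params are not modified in place.
        params = params + self.velocity
        return params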

def momentum_update(params: np.ndarray, grads: np.ndarray, velocity: np.ndarray,
                    lr: float = 0.01, momentum: float = 0.9) -> tuple[np.ndarray, np.ndarray]:
    """Stateless variant: the caller passes in and stores the velocity."""
    velocity = momentum * velocity - lr * grads
    params = params + velocity
    return params, velocity
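
Usage

A minimal usage sketch, not part of the reference solution. It assumes a hypothetical two-dimensional quadratic "ravine" (the `scales` array, `grad_f`, the starting point, and the step count are all invented for illustration) and repeats the update rule in a helper named `momentum_step` so the snippet runs on its own. It compares the final loss with plain SGD at the same learning rate.

```python
import numpy as np

def momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    # Same rule as momentum_update, repeated so this snippet is standalone.
    velocity = momentum * velocity - lr * grads
    return params + velocity, velocity

# Hypothetical test problem: f(x) = 0.5 * sum(scales * x**2).
# The second coordinate is 50x steeper than the first (a narrow ravine).
scales = np.array([1.0, 50.0])

def loss_f(x):
    return 0.5 * np.sum(scales * x ** 2)

def grad_f(x):
    return scales * x

start = np.array([5.0, 1.0])

# Momentum: carry the velocity from one step to the next.
x_mom = start.copy()
v = np.zeros_like(x_mom)
for _ in range(500):
    x_mom, v = momentum_step(x_mom, grad_f(x_mom), v, lr=0.01, momentum=0.9)

# Plain SGD with the same learning rate, for comparison.
x_sgd = start.copy()
for _ in range(500):
    x_sgd = x_sgd - 0.01 * grad_f(x_sgd)

momentum_loss = loss_f(x_mom)
sgd_loss = loss_f(x_sgd)
print("momentum:", momentum_loss, "sgd:", sgd_loss)
```

On this particular problem the momentum run should finish much closer to the minimum, because plain SGD makes slow progress along the shallow coordinate. The size of the gap depends on the chosen scales, learning rate, and step count.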

Explanation

Maintain a velocity vector initialized to zero.
Each step, update velocity: v = momentum * v - lr * gradient.
Update parameters: params = params + v.
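The momentum term (typically 0.9) keeps the optimizer moving in directions where gradients consistently point, and dampens oscillations where successive gradients change sign and partly cancel. With a constant gradient g, the velocity converges to -lr * g / (1 - momentum), so momentum = 0.9 gives up to a 10x larger effective step.
This is analogous to a ball rolling downhill with friction: it builds up speed in consistent directions.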
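
The two effects described above can be checked numerically. The sketch below is illustrative only: the gradient values and the step count are assumptions chosen for the demonstration. It feeds the velocity update a constant gradient, then a gradient whose sign flips every step, and compares the results with the closed-form limits of the recurrence.

```python
import numpy as np

lr, momentum = 0.01, 0.9

# Case 1: a constant gradient (hypothetical values).
g = np.array([1.0, -2.0])
v = np.zeros_like(g)
for _ in range(200):
    v = momentum * v - lr * g

# Limit of the geometric series: -lr * g / (1 - momentum).
steady_state = -lr * g / (1.0 - momentum)

# Case 2: a scalar gradient that alternates between +1 and -1.
v_osc = 0.0
for t in range(200):
    g_t = 1.0 if t % 2 == 0 else -1.0
    v_osc = momentum * v_osc - lr * g_t

# The velocity magnitude settles at lr / (1 + momentum), below the plain
# SGD step size lr * |g_t| = lr.
osc_limit = lr / (1.0 + momentum)

print("constant gradient velocity:", v, "expected:", steady_state)
print("oscillating gradient |velocity|:", abs(v_osc), "expected:", osc_limit)
```

With momentum = 0.9, the consistent direction ends up with a step about 10 times larger than lr * g, while the oscillating direction ends up with a step roughly half the size of a plain SGD step.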

Complexity

Time: O(P) per update where P = number of parameters
Space: O(P) for the velocity vector
