Adagrad Optimizer

#145 · Deep Learning · Easy

Problem

Implement the Adagrad (Adaptive Gradient) optimizer. Adagrad adapts the learning rate for each parameter based on the historical sum of that parameter's squared gradients: parameters that are updated infrequently (small accumulated gradients) take larger steps, while frequently updated parameters take smaller ones.
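For reference, the rule described above can be written out as a worked equation. The symbols here are notational assumptions rather than part of the original problem statement: θ for the parameters, g_t for the gradient at step t, G_t for the running sum of squared gradients, η for the learning rate, and ε for the small stabilizing constant. All operations are element-wise, and ε is placed outside the square root to match the solution code.

```latex
\begin{aligned}
G_t &= G_{t-1} + g_t^2, \qquad G_0 = 0 \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t
\end{aligned}
```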

Solution

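import numpy as np

class Adagrad:
    """Stateful Adagrad: keeps the running sum of squared gradients internally."""

    def __init__(self, learning_rate: float = 0.01, epsilon: float = 1e-8):
        self.lr = learning_rate
        self.epsilon = epsilon  # added to the denominator to avoid division by zero
        self.accumulated = None  # allocated lazily on the first update

    def update(self, params: np.ndarray, grads: np.ndarray) -> np.ndarray:
        if self.accumulated is None:
            # Float accumulator so integer-typed params do not break the in-place add.
            self.accumulated = np.zeros_like(params, dtype=float)

        # Accumulate squared gradients element-wise, then scale the step by their root.
        self.accumulated += grads ** 2
        params = params - self.lr * grads / (np.sqrt(self.accumulated) + self.epsilon)
        return params

def adagrad_update(params: np.ndarray, grads: np.ndarray, accumulated_sq_grads: np.ndarray,
                   lr: float = 0.01, epsilon: float = 1e-8) -> tuple[np.ndarray, np.ndarray]:
    """Stateless variant: the caller passes in the accumulator and gets the updated one back."""
    accumulated_sq_grads = accumulated_sq_grads + grads ** 2
    adjusted_grads = grads / (np.sqrt(accumulated_sq_grads) + epsilon)
    params = params - lr * adjusted_grads
    return params, accumulated_sq_grads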
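
A minimal usage sketch of the functional version may help show how the accumulator is threaded through a training loop. The objective f(w) = sum(w ** 2), the starting point, the learning rate of 0.5, and the 200 iterations are all illustrative assumptions, not part of the original problem. The function body is repeated only so that the snippet runs on its own.

```python
import numpy as np

def adagrad_update(params, grads, accumulated_sq_grads, lr=0.01, epsilon=1e-8):
    # Same arithmetic as the functional solution, repeated so this runs standalone.
    accumulated_sq_grads = accumulated_sq_grads + grads ** 2
    params = params - lr * grads / (np.sqrt(accumulated_sq_grads) + epsilon)
    return params, accumulated_sq_grads

# Hypothetical objective: f(w) = sum(w ** 2), whose gradient is 2 * w.
w = np.array([1.0, -3.0])
acc = np.zeros_like(w)  # the caller owns the accumulator state
start_loss = float(np.sum(w ** 2))

for _ in range(200):
    grads = 2.0 * w
    # Pass the accumulator in and keep the updated copy for the next step.
    w, acc = adagrad_update(w, grads, acc, lr=0.5)

end_loss = float(np.sum(w ** 2))
print("loss before:", start_loss, "loss after:", end_loss)
```

Because the function returns the new accumulator rather than mutating its argument, forgetting to reassign `acc` would silently reset the history on every step; the class-based version avoids that by holding the state itself.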

Explanation

  1. Maintain a running sum of squared gradients for each parameter.
  2. Each update divides the gradient by the square root of the accumulated sum; epsilon is added to that square root (not inside it) to avoid division by zero.
  3. Parameters with large historical gradients get smaller effective learning rates.
  4. Parameters with small historical gradients get larger effective learning rates.
  5. The main limitation is that the accumulated sum never decreases, so the effective learning rate shrinks throughout training and can eventually become vanishingly small.
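
The points above can be checked numerically with a small sketch. It assumes two hypothetical parameters whose gradients stay constant at 10.0 and 0.1, and a learning rate of 0.1; these values are chosen only for illustration. With a constant gradient, the accumulated sum after t steps is t times the squared gradient, so the step size is roughly lr / sqrt(t) regardless of the gradient's magnitude.

```python
import numpy as np

lr, epsilon = 0.1, 1e-8
# Two hypothetical parameters whose gradients differ by a factor of 100.
grads = np.array([10.0, 0.1])
accumulated = np.zeros_like(grads)

step_sizes = []
for _ in range(100):
    accumulated += grads ** 2
    # Size of the parameter change at this step, per parameter.
    step_sizes.append(lr * grads / (np.sqrt(accumulated) + epsilon))

step_sizes = np.array(step_sizes)  # shape (100, 2)
print("first step:", step_sizes[0])
print("100th step:", step_sizes[99])
```

Both parameters take nearly the same first step (about `lr`) even though their raw gradients differ by 100x, which is the per-parameter scaling at work. By the 100th step both have shrunk to about a tenth of that, and they keep shrinking because the accumulator only grows.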

Complexity

  • Time: O(P) per update where P = number of parameters
  • Space: O(P) for the accumulated squared gradients