Adamax Optimizer

#148 · Deep Learning · Easy

Problem

Implement the Adamax optimizer, a variant of Adam based on the infinity norm. Instead of the second moment (an exponential moving average of squared gradients), Adamax tracks an exponentially decayed running maximum of absolute gradients.

Solution

import numpy as np

class Adamax:
    def __init__(self, learning_rate: float = 0.002, beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-8):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # first moment
        self.u = None  # infinity norm
        self.t = 0

    def update(self, params: np.ndarray, grads: np.ndarray) -> np.ndarray:
        if self.m is None:
            self.m = np.zeros_like(params)
            self.u = np.zeros_like(params)

        self.t += 1

        # Update biased first moment
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads

        # Update infinity norm (exponentially weighted)
        self.u = np.maximum(self.beta2 * self.u, np.abs(grads))

        # Bias correction for first moment
        m_hat = self.m / (1 - self.beta1 ** self.t)

        # Update params
        params = params - self.lr * m_hat / (self.u + self.epsilon)
        return params

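def adamax_update(params: np.ndarray, grads: np.ndarray, m: np.ndarray, u: np.ndarray,
                  t: int, lr: float = 0.002, beta1: float = 0.9, beta2: float = 0.999,
                  epsilon: float = 1e-8) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    # Stateless variant: the caller stores m and u and passes the 1-based step count t
    # (t = 0 would make the bias-correction denominator zero).
    m = beta1 * m + (1 - beta1) * grads       # biased first moment
    u = np.maximum(beta2 * u, np.abs(grads))  # exponentially weighted infinity norm
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    params = params - lr * m_hat / (u + epsilon)
    return params, m, u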
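
As a rough usage sketch, the functional form can be driven in a loop on a toy quadratic loss. The loss, the `target` vector, the step count, and the larger-than-default `lr=0.05` are assumptions chosen for illustration; `adamax_update` is repeated (without type hints) only so the snippet runs on its own.

```python
import numpy as np

def adamax_update(params, grads, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Same arithmetic as the solution's adamax_update, repeated for self-containment.
    m = beta1 * m + (1 - beta1) * grads
    u = np.maximum(beta2 * u, np.abs(grads))
    m_hat = m / (1 - beta1 ** t)
    params = params - lr * m_hat / (u + epsilon)
    return params, m, u

# Hypothetical toy problem: minimize sum((params - target)^2).
target = np.array([1.0, -2.0, 0.5])
params = np.zeros(3)
m = np.zeros_like(params)   # first-moment state, owned by the caller
u = np.zeros_like(params)   # infinity-norm state, owned by the caller

initial_loss = np.sum((params - target) ** 2)
for t in range(1, 501):                 # t starts at 1 so the bias correction is defined
    grads = 2.0 * (params - target)     # analytic gradient of the toy loss
    params, m, u = adamax_update(params, grads, m, u, t, lr=0.05)
final_loss = np.sum((params - target) ** 2)

print(initial_loss, final_loss)
```

The caller threads `m`, `u`, and `t` through every call; the `Adamax` class does the same bookkeeping internally.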

Explanation

  1. First moment (m): Exponential moving average of gradients, same as Adam.
  2. Infinity norm (u): Instead of the second moment, track max(beta2 * u, |grad|). This is the L-infinity norm version.
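  3. Bias correction: Only the first moment needs correction. m is an average initialized at zero, so its early values are biased toward zero; u is a running max, so after the first step it already equals |grad|.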
  4. Update rule: params -= lr * m_hat / (u + epsilon).
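  5. Because u follows the largest recent gradient magnitude (decayed by beta2 each step), a sudden large gradient raises the denominator immediately, and with the default betas each parameter's step stays roughly on the order of lr. u is already on the same scale as the gradient, so no square root is needed in the denominator.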
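
The points above can be checked by hand on the very first step (t = 1). The gradient values below are illustrative assumptions spanning six orders of magnitude; the hyperparameters are the solution's defaults.

```python
import numpy as np

lr, beta1, beta2, epsilon = 0.002, 0.9, 0.999, 1e-8
grads = np.array([1e-3, -1.0, 1e3])   # illustrative gradients, not from the source
m = np.zeros(3)
u = np.zeros(3)
t = 1

m = beta1 * m + (1 - beta1) * grads        # m = 0.1 * grads (biased toward zero)
u = np.maximum(beta2 * u, np.abs(grads))   # u = |grads| right away, no correction needed
m_hat = m / (1 - beta1 ** t)               # dividing by 0.1 recovers grads
step = lr * m_hat / (u + epsilon)          # approximately lr * sign(grads)

print(step)
```

On this first step every parameter moves by about lr in magnitude, regardless of how large or small its gradient is, because m_hat and u have the same scale.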

Complexity

  • Time: O(P) per update where P = number of parameters
  • Space: O(P) for the moment and infinity norm vectors