Implement Layer Normalization for Sequence Data

#109 · Machine Learning · Medium

Problem

Implement Layer Normalization for sequence data. Given an input tensor of shape (batch_size, seq_len, d_model), normalize over the last dimension (features) for each position independently. Optionally apply learnable scale (gamma) and shift (beta) parameters.
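Written out, the operation described above could look like the following. The index names (b for batch, t for position, i for feature, d for d_model) and the symbols mu, sigma^2 and epsilon are notation chosen here for illustration, not taken from the problem statement. The variance is the population variance (divide by d).

```latex
\mu_{b,t} = \frac{1}{d}\sum_{i=1}^{d} x_{b,t,i}, \qquad
\sigma^2_{b,t} = \frac{1}{d}\sum_{i=1}^{d} \left(x_{b,t,i} - \mu_{b,t}\right)^2, \qquad
y_{b,t,i} = \gamma_i \, \frac{x_{b,t,i} - \mu_{b,t}}{\sqrt{\sigma^2_{b,t} + \epsilon}} + \beta_i
```

When gamma and beta are omitted, this reduces to the case gamma_i = 1 and beta_i = 0.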

Solution

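import numpy as np
from typing import Optional

def layer_normalization(
    x: np.ndarray,
    gamma: Optional[np.ndarray] = None,
    beta: Optional[np.ndarray] = None,
    eps: float = 1e-5,
) -> np.ndarray:
    """Normalize x of shape (batch_size, seq_len, d_model) over its last axis."""
    # Per-position statistics over the feature axis; keepdims=True keeps a
    # trailing axis of size 1 so the results broadcast against x.
    mean = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)  # population variance (ddof=0)

    # eps keeps the denominator away from zero when all features are equal.
    x_norm = (x - mean) / np.sqrt(var + eps)

    # gamma and beta, if given, are typically of shape (d_model,) and broadcast.
    if gamma is not None:
        x_norm = x_norm * gamma
    if beta is not None:
        x_norm = x_norm + beta

    return x_norm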
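As a quick sanity check, the sketch below runs the function on random data and confirms that each position ends up with roughly zero mean and unit variance. The function body is repeated in compact form so the snippet runs on its own. The input shape, seed, and the gamma/beta values are arbitrary choices made for illustration.

```python
import numpy as np

def layer_normalization(x, gamma=None, beta=None, eps=1e-5):
    # Same logic as the solution, repeated so this snippet is self-contained.
    mean = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    out = (x - mean) / np.sqrt(var + eps)
    if gamma is not None:
        out = out * gamma
    if beta is not None:
        out = out + beta
    return out

rng = np.random.default_rng(0)  # arbitrary seed
# Illustrative shape: batch_size=2, seq_len=4, d_model=8
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 8))

y = layer_normalization(x)
# Each (batch, position) slice should have mean ~0 and variance ~1.
# The variance sits slightly below 1 because eps is added to the denominator.
mean_ok = np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)
var_ok = np.allclose(y.var(axis=-1), 1.0, atol=1e-3)
print(mean_ok, var_ok)

# gamma and beta of shape (d_model,) broadcast across batch and sequence axes.
gamma = np.full(8, 2.0)
beta = np.full(8, 0.5)
z = layer_normalization(x, gamma, beta)
affine_ok = np.allclose(z, 2.0 * y + 0.5)
print(z.shape, affine_ok)
```

With constant gamma and beta, the output is just the normalized tensor scaled and shifted, which is what the last check compares against.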

Explanation

Mean and variance: Compute the mean and variance over the feature dimension (last axis) for each token position independently. This differs from batch normalization, which computes statistics for each feature across the batch.
Normalize: Subtract mean and divide by standard deviation (with epsilon for numerical stability).
Scale and shift: Optionally apply learnable parameters gamma (scale) and beta (shift), so the network can rescale and re-center the normalized features rather than being locked to zero mean and unit variance.
Key advantage: Layer norm is independent of batch size, making it suitable for sequence models where batch sizes vary and for inference with single samples.
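The batch-size independence noted above can be illustrated with a small experiment: normalize one sample on its own and as part of a larger batch, then compare. This is a minimal sketch. `layer_norm` and `batch_stat_norm` are hypothetical helper names, and `batch_stat_norm` is a simplified stand-in for batch normalization (training-mode statistics only, no running averages or learnable parameters).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the last axis: one mean/variance per token position.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_stat_norm(x, eps=1e-5):
    # Simplified batch-norm-style statistics: one mean/variance per feature,
    # computed across the batch and sequence axes.
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)   # arbitrary seed
x = rng.normal(size=(4, 3, 6))   # illustrative (batch_size, seq_len, d_model)
single = x[:1]                   # the first sample as a batch of size 1

# Layer norm: the sample's output is the same alone or inside the batch.
ln_same = np.allclose(layer_norm(x)[:1], layer_norm(single))
# Batch-style statistics: the output depends on the other samples in the batch.
bn_same = np.allclose(batch_stat_norm(x)[:1], batch_stat_norm(single))
print(ln_same, bn_same)
```

The layer-norm comparison matches because no statistic crosses the batch axis, while the batch-style result changes once the other samples are removed.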

Complexity

Time: O(n * d), where n is batch_size * seq_len and d is d_model
Space: O(n * d) for the normalized output
