Implement MuonClip (qk-clip) for Stabilizing Attention

#177 · Deep Learning · Medium

Problem

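Implement MuonClip (qk-clip) for stabilizing attention in transformers. Write a function muon_clip_attention(Q, K, V, clip_value, temperature) that computes scaled dot-product attention, but clips each query-key score to the range [-clip_value, clip_value] before the softmax. Bounding the scores keeps extreme attention logits from destabilizing training. If temperature is not given, use sqrt(d_k), where d_k is the last dimension of Q.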

Solution

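from typing import Optional

import numpy as np

def muon_clip_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                        clip_value: float = 1.0,
                        temperature: Optional[float] = None) -> np.ndarray:
    # Default to the standard scaled dot-product temperature sqrt(d_k).
    d_k = Q.shape[-1]
    if temperature is None:
        temperature = np.sqrt(d_k)

    # Transpose only the last two axes so batched (..., n, d) inputs also work;
    # K.T would reverse every axis.
    scores = Q @ np.swapaxes(K, -1, -2) / temperature
    # Bound every logit to [-clip_value, clip_value] before the softmax.
    scores = np.clip(scores, -clip_value, clip_value)

    # Numerically stable softmax over the key axis.
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

    return attention_weights @ V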

Explanation

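Compute the scaled dot-product attention scores QK^T / temperature, where temperature defaults to sqrt(d_k).
Clip every score to the range [-clip_value, clip_value] so that no logit can become extreme.
Apply a numerically stable softmax over the keys (subtract each row's maximum before exponentiating).
Multiply the attention weights by the values V.
Because every clipped logit lies in [-clip_value, clip_value], the ratio between any two attention weights in a row is at most exp(2 * clip_value), so the softmax cannot saturate into a one-hot distribution. Bounding the logits this way helps training stability, especially with the Muon optimizer.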
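
The effect of the clipping step can be checked on a single row of logits. The sketch below is illustrative only: the logit values and the standalone softmax helper are assumptions made for the example, not part of the problem statement. It compares the softmax of a row containing one large outlier with the softmax of the same row after clipping, and checks that the largest ratio between two clipped weights does not exceed exp(2 * clip_value).

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

# Hypothetical logits for one query over four keys, with one large outlier.
logits = np.array([[50.0, 1.0, 0.0, -3.0]])
clip_value = 1.0  # illustrative threshold, same as the solution's default

unclipped = softmax(logits)
clipped = softmax(np.clip(logits, -clip_value, clip_value))

# Largest ratio between any two attention weights in the clipped row.
max_ratio = clipped.max() / clipped.min()
bound = np.exp(2 * clip_value)

print("unclipped:", np.round(unclipped, 4))
print("clipped:  ", np.round(clipped, 4))
print("max ratio:", max_ratio, "bound:", bound)
```

Without clipping, nearly all of the attention mass lands on the outlier key. With clipping, the weights stay spread across the keys, which also means a small clip_value limits how sharply attention can focus.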

Complexity

Time: O(n^2 * d) where n is the sequence length and d is the head dimension
Space: O(n^2) for the attention score matrix
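
As a rough illustration of these bounds, the sketch below counts the multiply-adds in the two matrix products and the entries in the score matrix. The helper attention_cost is hypothetical and assumes a single head where Q, K, and V all have shape (n, d). The clipping and softmax steps are elementwise over the n x n matrix, so they do not change the leading term.

```python
def attention_cost(n: int, d: int) -> dict:
    """Rough operation and memory counts for single-head attention (illustrative)."""
    score_macs = n * n * d    # Q (n x d) @ K^T (d x n)
    output_macs = n * n * d   # weights (n x n) @ V (n x d)
    score_elements = n * n    # entries in the score / weight matrix
    return {
        "multiply_adds": score_macs + output_macs,
        "score_elements": score_elements,
    }

base = attention_cost(n=1024, d=64)
doubled = attention_cost(n=2048, d=64)

# Doubling the sequence length multiplies both counts by four.
print(doubled["multiply_adds"] / base["multiply_adds"])
print(doubled["score_elements"] / base["score_elements"])
```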
