Implement MuonClip (qk-clip) for Stabilizing Attention

#177 · Deep Learning · Medium

Problem

Implement MuonClip (qk-clip) for stabilizing attention in transformers. The technique clips the scaled query-key dot products (the attention logits) to a fixed range before the softmax, so that extreme attention scores cannot destabilize training.

Solution

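import numpy as np
from typing import Optional

def muon_clip_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                        clip_value: float = 1.0,
                        temperature: Optional[float] = None) -> np.ndarray:
    """Single-head attention with logits clipped to [-clip_value, clip_value].

    Q is (n_q, d_k), K is (n_k, d_k), V is (n_k, d_v); returns (n_q, d_v).
    """
    d_k = Q.shape[-1]
    if temperature is None:
        # Default to the standard scaled dot-product factor sqrt(d_k).
        temperature = np.sqrt(d_k)

    # Scaled logits, then clamp each one to the allowed range.
    scores = Q @ K.T / temperature
    scores = np.clip(scores, -clip_value, clip_value)

    # Numerically stable softmax over the key axis.
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

    # Weighted sum of the value vectors.
    return attention_weights @ V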
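
The solution uses K.T, which only transposes correctly for 2-D inputs. As a hedged sketch, the variant below swaps the last two axes instead so that the same steps also work on inputs shaped (batch, heads, n, d). The name muon_clip_attention_batched and the example shapes are assumptions made for illustration; they are not part of the original problem.

```python
import numpy as np

def muon_clip_attention_batched(Q, K, V, clip_value=1.0, temperature=None):
    """Hypothetical batched variant: attention is applied over the last two axes."""
    d_k = Q.shape[-1]
    if temperature is None:
        temperature = np.sqrt(d_k)

    # swapaxes transposes only the last two axes, unlike K.T on N-D arrays.
    scores = Q @ np.swapaxes(K, -1, -2) / temperature
    scores = np.clip(scores, -clip_value, clip_value)

    # Numerically stable softmax over the key axis.
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    return weights @ V

# Illustrative shapes: batch=2, heads=4, sequence length=5, head dimension=8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 5, 8))
K = rng.standard_normal((2, 4, 5, 8))
V = rng.standard_normal((2, 4, 5, 8))

out = muon_clip_attention_batched(Q, K, V)
print(out.shape)  # (2, 4, 5, 8)
```

For 2-D inputs the variant reduces to the same computation as the solution above, since swapping the last two axes of a matrix is an ordinary transpose.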

Explanation

  1. Compute scaled dot-product attention scores: QK^T / temperature, where temperature defaults to sqrt(d_k).
  2. Clip the scores to the range [-clip_value, clip_value] to prevent extreme values.
  3. Apply softmax with numerical stability (subtract max before exp).
  4. Multiply attention weights by values V.
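  5. Clipping bounds every attention logit, so within a row the ratio between the largest and smallest attention weight is at most exp(2 * clip_value). This keeps the softmax from saturating, which helps training stability, especially with the Muon optimizer.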
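
To make the effect of the clipping step concrete, here is a small sketch on one row of logits. The values in raw_scores and the helper name softmax are chosen for illustration only. Without clipping, the softmax is nearly one-hot; with clip_value = 1.0, the largest and smallest weights differ by a factor of exp(2 * clip_value).

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

# One query's already-scaled logits against three keys (illustrative values).
raw_scores = np.array([10.0, 0.0, -10.0])
clip_value = 1.0

unclipped_weights = softmax(raw_scores)
clipped_weights = softmax(np.clip(raw_scores, -clip_value, clip_value))

# With logits confined to [-c, c], max weight / min weight <= exp(2c).
ratio = clipped_weights.max() / clipped_weights.min()
bound = np.exp(2 * clip_value)

print(unclipped_weights)  # almost all mass on the first key
print(clipped_weights)    # approximately [0.665, 0.245, 0.090]
print(ratio, bound)       # both approximately 7.389
```

The bound is reached here because the clipped logits hit both ends of the range; rows whose logits already lie inside [-clip_value, clip_value] are left unchanged.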

Complexity

  • Time: O(n^2 * d) where n is the sequence length and d is the head dimension
  • Space: O(n^2) for the attention score matrix
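
As a rough illustration of these bounds, the sketch below builds the score matrix for assumed sizes n = 512 and d = 64 and reports its shape and memory footprint. The sizes and the multiply-add count are illustrative assumptions, not measurements from the original problem.

```python
import numpy as np

n, d = 512, 64  # assumed sequence length and head dimension
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

# The score matrix has one entry per (query, key) pair.
scores = Q @ K.T

# Each of the n * n entries needs d multiplies and d adds (roughly 2 * d operations).
score_ops = 2 * n * n * d

print(scores.shape)   # (512, 512)
print(scores.nbytes)  # n * n * 8 bytes for float64
print(Q.nbytes)       # n * d * 8 bytes, for comparison
print(score_ops)
```

Doubling n quadruples both the score matrix size and the operation count, while doubling d only doubles the operation count.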