← back

Implement Self-Attention Mechanism

#53 · Deep Learning · Medium

⊣ Solve on deep-ml.com

Problem

Implement a self-attention mechanism from scratch. Given query, key, and value matrices, compute scaled dot-product attention: Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V.

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import numpy as np

def self_attention(Q, K, V):
    Q = np.array(Q, dtype=np.float64)
    K = np.array(K, dtype=np.float64)
    V = np.array(V, dtype=np.float64)

    d_k = Q.shape[-1]

    # Scaled dot-product scores
    scores = Q @ K.T / np.sqrt(d_k)

    # Softmax along last axis
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

    # Weighted sum of values
    output = attention_weights @ V

    return output.tolist()

Explanation

  1. Compute attention scores by taking the dot product of Q and K transposed.
  2. Scale by 1 / sqrt(d_k) to prevent large dot products that push softmax into saturated regions.
  3. Apply softmax row-wise using the numerically stable approach (subtract max before exponentiating).
  4. Multiply the attention weights by V to produce the output, which is a weighted combination of value vectors.

Complexity

  • Time: O(n^2 * d) where n is the sequence length and d is the dimension
  • Space: O(n^2) for the attention weight matrix