Implement Self-Attention Mechanism

#53 · Deep Learning · Medium

Problem

Implement a self-attention mechanism from scratch. Given query, key, and value matrices, compute scaled dot-product attention: Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V.

Solution

import numpy as np

def self_attention(Q, K, V):
    Q = np.array(Q, dtype=np.float64)
    K = np.array(K, dtype=np.float64)
    V = np.array(V, dtype=np.float64)

    d_k = Q.shape[-1]

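    # Scaled dot-product scores; swapping only the last two axes of K
    # matches K.T for 2-D inputs and also handles a leading batch axis
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)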

    # Softmax along last axis
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

    # Weighted sum of values
    output = attention_weights @ V

    return output.tolist()
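
As a quick check of the solution, the sketch below runs the same computation on a small 2x2 input. The helper name attention_with_weights and the input values are illustrative assumptions, not part of the original problem; the helper also returns the attention weights so that the row sums can be inspected.

```python
import numpy as np

def attention_with_weights(Q, K, V):
    # Hypothetical helper: same steps as self_attention, but also returns the weights.
    Q, K, V = (np.asarray(x, dtype=np.float64) for x in (Q, K, V))
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    return weights, weights @ V

# Illustrative inputs: two queries, two keys, two value vectors.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

weights, output = attention_with_weights(Q, K, V)
print(weights.sum(axis=-1))  # each row of weights sums to 1
print(output)                # each output row is a convex combination of rows of V
```

Each query matches its own key most strongly, so the first output row lands closer to V[0] and the second closer to V[1], but neither equals a value vector exactly because softmax never assigns a weight of zero.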

Explanation

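Compute attention scores by multiplying Q by the transpose of K, so that entry (i, j) is the dot product of query i and key j.
Scale by 1 / sqrt(d_k) so that the scores do not grow with the key dimension d_k; large scores push softmax into saturated regions where one weight dominates and gradients become very small.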
Apply softmax row-wise using the numerically stable approach (subtract max before exponentiating).
Multiply the attention weights by V to produce the output, which is a weighted combination of value vectors.
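
The scaling and max-subtraction steps can be illustrated with a small experiment. This is a sketch under stated assumptions: the random inputs, the choice d_k = 512, and the standalone softmax helper are illustrative and not part of the original solution.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis (subtract max first).
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)   # fixed seed, illustrative only
d_k = 512                        # assumed key dimension
q = rng.standard_normal(d_k)     # one query vector
K = rng.standard_normal((8, d_k))  # eight key vectors

raw = K @ q                      # unscaled scores, spread grows with d_k
scaled = raw / np.sqrt(d_k)      # scaled scores

unscaled_weights = softmax(raw)
scaled_weights = softmax(scaled)
print(unscaled_weights.max(), scaled_weights.max())

# Why subtract the max: exp() overflows float64 for large inputs.
big = np.array([1000.0, 1001.0, 1002.0])
stable = softmax(big)
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(big) / np.exp(big).sum()  # inf / inf gives nan
print(stable, naive)
```

Dividing the scores by a constant greater than 1 can only flatten the softmax distribution, so the largest weight after scaling is never larger than the largest weight before scaling. Subtracting the row maximum leaves the softmax result unchanged mathematically while keeping every exponent at or below zero.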

Complexity

Time: O(n^2 * d) where n is the sequence length and d is the dimension
Space: O(n^2) for the attention weight matrix
