
Diffusion Model U-Net Time Embedding

#399 · Deep Learning · Medium


Problem

Implement the sinusoidal time embedding used in diffusion model U-Nets. The timestep is encoded into a high-dimensional vector using sinusoidal position encodings (similar to Transformer position embeddings), then projected through an MLP.
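
Written out, the convention followed by the solution on this page looks roughly like the block below. This is a sketch of one common variant, not the only one: other implementations divide the exponent by h - 1 or interleave the sin and cos terms. The symbols d (embedding dimension), h, and ω are notation introduced here for illustration.

```latex
\[
h = \left\lfloor d / 2 \right\rfloor, \qquad
\omega_i = 10000^{-i/h}, \quad i = 0, \dots, h - 1
\]
\[
\mathrm{emb}(t) = \bigl[\sin(t\,\omega_0), \dots, \sin(t\,\omega_{h-1}),\;
                        \cos(t\,\omega_0), \dots, \cos(t\,\omega_{h-1})\bigr]
\]
\[
\mathrm{out}(t) = \mathrm{SiLU}\bigl(\mathrm{emb}(t)\, W_1 + b_1\bigr)\, W_2 + b_2,
\qquad \mathrm{SiLU}(x) = \frac{x}{1 + e^{-x}}
\]
```

When d is odd, a single trailing zero is appended to emb(t) so that it has exactly d entries.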

Solution

import numpy as np

def sinusoidal_time_embedding(timesteps: np.ndarray, embedding_dim: int) -> np.ndarray:
    # timesteps: (batch_size,)
    half_dim = embedding_dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half_dim) / half_dim)

    # Outer product: (batch_size, half_dim)
    args = timesteps[:, np.newaxis] * freqs[np.newaxis, :]

    embedding = np.concatenate([np.sin(args), np.cos(args)], axis=-1)

    # If embedding_dim is odd, pad with zero
    if embedding_dim % 2 == 1:
        embedding = np.concatenate([embedding, np.zeros_like(embedding[:, :1])], axis=-1)

    return embedding


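def time_mlp(timesteps: np.ndarray, embedding_dim: int, hidden_dim: int, W1: np.ndarray, b1: np.ndarray, W2: np.ndarray, b2: np.ndarray) -> np.ndarray:
    # Expected shapes: W1 (embedding_dim, hidden_dim), b1 (hidden_dim,),
    # W2 (hidden_dim, output width), b2 (output width,).
    # hidden_dim is implied by W1's shape; it stays in the signature but is not used in the body.
    emb = sinusoidal_time_embedding(timesteps, embedding_dim)  # (batch_size, embedding_dim)
    # MLP: Linear -> SiLU -> Linear
    h = emb @ W1 + b1
    h = h * (1.0 / (1.0 + np.exp(-h)))  # SiLU activation: h * sigmoid(h)
    return h @ W2 + b2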
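
A quick way to sanity-check the solution is to call it with small random weights and look at the shapes. The sketch below repeats the two functions in compact form so that it runs on its own. The weight initialisation, the sizes, and the name `out_dim` are assumptions made for illustration and are not part of the problem statement.

```python
import numpy as np

def sinusoidal_time_embedding(timesteps, embedding_dim):
    # Same logic as the solution above, repeated so this snippet is self-contained.
    half_dim = embedding_dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half_dim) / half_dim)
    args = timesteps[:, np.newaxis] * freqs[np.newaxis, :]
    emb = np.concatenate([np.sin(args), np.cos(args)], axis=-1)
    if embedding_dim % 2 == 1:
        emb = np.concatenate([emb, np.zeros_like(emb[:, :1])], axis=-1)
    return emb

def time_mlp(timesteps, embedding_dim, hidden_dim, W1, b1, W2, b2):
    emb = sinusoidal_time_embedding(timesteps, embedding_dim)
    h = emb @ W1 + b1
    h = h * (1.0 / (1.0 + np.exp(-h)))  # SiLU
    return h @ W2 + b2

rng = np.random.default_rng(0)
embedding_dim, hidden_dim, out_dim = 8, 16, 16  # out_dim is a hypothetical name
timesteps = np.array([0, 10, 500])

# Small random weights stand in for trained parameters (an assumption).
W1 = rng.standard_normal((embedding_dim, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
W2 = rng.standard_normal((hidden_dim, out_dim)) * 0.1
b2 = np.zeros(out_dim)

emb = sinusoidal_time_embedding(timesteps, embedding_dim)
out = time_mlp(timesteps, embedding_dim, hidden_dim, W1, b1, W2, b2)

print(emb.shape)  # (3, 8)
print(out.shape)  # (3, 16)
print(emb[0])     # t = 0: the sin half is all 0 and the cos half is all 1

# Odd embedding_dim: the last column is the zero padding.
odd = sinusoidal_time_embedding(timesteps, 7)
print(odd.shape)  # (3, 7)
```

The row for t = 0 is a convenient check because sin(0) = 0 and cos(0) = 1 regardless of the frequencies.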

Explanation

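  1. Compute half_dim = embedding_dim // 2 frequencies as a geometric series that starts at 1 and decays toward 1/10000. Because the exponent is i / half_dim, the last frequency is 10000^(-(half_dim - 1) / half_dim), which is slightly above 1/10000.
  2. Multiply each timestep by each frequency (an outer product) to get a (batch_size, half_dim) array of phase arguments.
  3. Apply sin and cos to the arguments and concatenate the two halves along the last axis, sin first and then cos. If embedding_dim is odd, a single zero column is appended so the output has exactly embedding_dim columns.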
  4. The resulting embedding is passed through an MLP with SiLU (Swish) activation to get a learned time representation.
  5. This embedding is then added to or used to modulate feature maps in the U-Net.
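
Step 5 can be sketched with plain NumPy broadcasting. The example below uses random stand-ins for a feature map and for the MLP output, and it assumes the time embedding has already been projected to a width of 2 * channels. Real U-Nets usually apply a learned per-block projection first, which is omitted here. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, channels, height, width = 2, 4, 8, 8

# Stand-ins (assumptions): a feature map from a U-Net block and one
# time-embedding vector per sample, already sized to 2 * channels.
features = rng.standard_normal((batch, channels, height, width))
time_emb = rng.standard_normal((batch, 2 * channels))

# Option 1: additive conditioning.
# Use the first `channels` entries as a per-channel bias, broadcast over H and W.
bias = time_emb[:, :channels, np.newaxis, np.newaxis]  # (B, C, 1, 1)
added = features + bias

# Option 2: scale-and-shift modulation.
# Split the vector into a per-channel scale and a per-channel shift.
scale, shift = np.split(time_emb, 2, axis=1)           # each (B, C)
modulated = (features * (1.0 + scale[:, :, np.newaxis, np.newaxis])
             + shift[:, :, np.newaxis, np.newaxis])

print(added.shape)      # (2, 4, 8, 8)
print(modulated.shape)  # (2, 4, 8, 8)
```

In both options every spatial position in a channel receives the same timestep-dependent adjustment, so the conditioning tells the block which timestep it is processing without altering the spatial layout of the features.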

Complexity

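  • Time: O(B * d) for the sinusoidal embedding, where B is batch size and d is embedding dimension. The MLP's two matrix multiplications add O(B * d * h) and O(B * h * o), where h is the hidden dimension and o is the output width of W2.
  • Space: O(B * d) for the embedding matrix, plus O(B * h) for the hidden activations of the MLP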