← back

Implement SwiGLU activation function

#156 · Deep Learning · Easy

⊣ Solve on deep-ml.com

Problem

Implement the SwiGLU activation function used in modern transformer architectures (e.g., LLaMA). SwiGLU is defined as SwiGLU(x, W, V, b, c) = Swish(xW + b) * (xV + c) where Swish(t) = t * sigmoid(t).

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
import numpy as np

def swiglu(x: np.ndarray, W: np.ndarray, V: np.ndarray,
           b: np.ndarray, c: np.ndarray) -> np.ndarray:
    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def swish(t):
        return t * sigmoid(t)

    gate = swish(x @ W + b)
    linear = x @ V + c
    return gate * linear

Explanation

  1. Compute the gate branch: xW + b, then apply Swish activation (t * sigmoid(t)).
  2. Compute the linear branch: xV + c.
  3. Element-wise multiply the gate and linear branches.
  4. SwiGLU is a gated linear unit variant that uses the Swish function as the gating mechanism, providing smoother gradients than ReLU-based GLU.

Complexity

  • Time: O(n * d) where n is the input dimension and d is the output dimension (dominated by matrix multiplications)
  • Space: O(n * d) for intermediate activations