Implement SwiGLU activation function

#156 · Deep Learning · Easy

Problem

Implement the SwiGLU activation function used in modern transformer architectures (e.g., LLaMA). SwiGLU is defined as SwiGLU(x, W, V, b, c) = Swish(xW + b) * (xV + c) where Swish(t) = t * sigmoid(t).

Solution

import numpy as np

def swiglu(x: np.ndarray, W: np.ndarray, V: np.ndarray,
           b: np.ndarray, c: np.ndarray) -> np.ndarray:
    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def swish(t):
        return t * sigmoid(t)

    gate = swish(x @ W + b)
    linear = x @ V + c
    return gate * linear

Explanation

Compute the gate branch: xW + b, then apply Swish activation (t * sigmoid(t)).
Compute the linear branch: xV + c.
Element-wise multiply the gate and linear branches.
SwiGLU is a gated linear unit variant that uses the Swish function as the gating mechanism, providing smoother gradients than ReLU-based GLU.

Complexity

Time: O(n * d) where n is the input dimension and d is the output dimension (dominated by matrix multiplications)
Space: O(n * d) for intermediate activations

← #155 #157 →