GeLU Activation Function

#147 · Deep Learning · Easy

Problem

Implement the GeLU (Gaussian Error Linear Unit) activation function. GeLU weights each input x by Phi(x), the cumulative distribution function of the standard normal distribution evaluated at x, so GeLU(x) = x * Phi(x) = 0.5 * x * (1 + erf(x / sqrt(2))). It is widely used in transformers such as BERT and GPT.

Solution

import numpy as np

def erf_approx(x):
    # Abramowitz and Stegun polynomial approximation of erf
    # (absolute error up to about 1.5e-7). Works elementwise on arrays.
    sign = np.sign(x)
    x = np.abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))))
    return sign * (1.0 - poly * np.exp(-x * x))

def gelu(x: np.ndarray) -> np.ndarray:
    # x * Phi(x), where Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF.
    # erf comes from erf_approx, so this is near-exact rather than exact.
    # erf_approx is already vectorized, so np.vectorize is not needed.
    return 0.5 * x * (1.0 + erf_approx(x / np.sqrt(2)))

def gelu_approx(x: np.ndarray) -> np.ndarray:
    # Tanh approximation (used in practice)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x: np.ndarray) -> np.ndarray:
    # Sigmoid approximation: x * sigmoid(1.702 * x)
    return x * (1.0 / (1.0 + np.exp(-1.702 * x)))

Explanation

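  1. Exact GeLU: x * Phi(x), where Phi is the standard normal CDF, Phi(x) = 0.5 * (1 + erf(x / sqrt(2))). The solution evaluates erf with the Abramowitz and Stegun polynomial approximation (absolute error up to about 1.5e-7), so the result is near-exact rather than exact.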
  2. Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))) -- the most common approximation used in practice.
  3. Sigmoid approximation: x * sigmoid(1.702 * x) -- a simpler but less accurate variant.
  4. Unlike ReLU, which is a hard gate (the output is either 0 or x), GeLU transitions smoothly: negative inputs produce small negative outputs instead of exactly 0, and the output approaches x for large positive inputs.
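
The points above can be checked numerically. The following is a minimal sketch, not part of the original solution: it assumes Python's standard-library math.erf as the reference for Phi, and the names gelu_reference, gelu_tanh, and relu are hypothetical helpers introduced only for this comparison. It measures the largest absolute deviation of the tanh and sigmoid variants from the reference on a grid, and shows how ReLU and GeLU treat a negative input.

```python
import math
import numpy as np

def gelu_reference(x):
    # Reference x * Phi(x), with erf taken from the standard library
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation, same formula as item 2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x):
    # Sigmoid approximation, same formula as item 3
    return x / (1.0 + np.exp(-1.702 * x))

def relu(x):
    # Hard gate: 0 for negative inputs, x otherwise
    return np.maximum(x, 0.0)

# Evaluate all variants on an evenly spaced grid (the range is an arbitrary choice)
xs = np.linspace(-6.0, 6.0, 2001)
ref = gelu_reference(xs)

tanh_err = float(np.max(np.abs(gelu_tanh(xs) - ref)))
sigmoid_err = float(np.max(np.abs(gelu_sigmoid(xs) - ref)))
print("max |tanh - reference|   :", tanh_err)
print("max |sigmoid - reference|:", sigmoid_err)

# A negative input: ReLU returns 0, GeLU returns a small negative value
neg = np.array([-1.0])
print("relu(-1):", relu(neg)[0], " gelu(-1):", gelu_reference(neg)[0])
```

On this grid the tanh variant should track the reference more closely than the sigmoid variant, which matches the "simpler but less accurate" description in item 3.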

Complexity

  • Time: O(n) where n is the number of elements
  • Space: O(n) for the output