Implement Position-wise Feed-Forward Block with Residual and Dropout

#178 · Deep Learning · Medium

Problem

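Implement a position-wise feed-forward block with a residual connection and dropout, as used in Transformer architectures. The block applies two linear transformations with a ReLU activation in between. During training, dropout is applied after the ReLU and again after the second linear transformation. The input is then added back to the result through the residual connection, so the second transformation must project back to the input dimension.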

Solution

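import numpy as np

def feed_forward_block(x: np.ndarray, W1: np.ndarray, b1: np.ndarray,
                       W2: np.ndarray, b2: np.ndarray,
                       dropout_rate: float = 0.1,
                       training: bool = True) -> np.ndarray:
    """Position-wise feed-forward block with dropout and a residual connection.

    Expected shapes: x (n, d), W1 (d, d_ff), b1 (d_ff,), W2 (d_ff, d), b2 (d,).
    """
    # A rate of 1 would divide by zero in the rescaling step below.
    if not 0.0 <= dropout_rate < 1.0:
        raise ValueError("dropout_rate must be in [0, 1)")

    # First linear transformation followed by ReLU
    hidden = x @ W1 + b1
    hidden = np.maximum(0, hidden)

    # Inverted dropout on the hidden activations (training only)
    if training and dropout_rate > 0:
        mask = (np.random.rand(*hidden.shape) > dropout_rate).astype(float)
        hidden = hidden * mask / (1 - dropout_rate)

    # Second linear transformation projects back to the model dimension
    output = hidden @ W2 + b2

    # Inverted dropout on the block output (training only), with a fresh mask
    if training and dropout_rate > 0:
        mask = (np.random.rand(*output.shape) > dropout_rate).astype(float)
        output = output * mask / (1 - dropout_rate)

    # Residual connection: output must have the same shape as x
    return x + output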
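
As a quick sanity check on shapes, the sketch below runs only the inference-mode path (dropout disabled) on random data. The helper name `ffn_eval` and the sizes n = 4, d = 8, d_ff = 32 are illustrative assumptions, not part of the original problem. Zeroing the second layer makes the block reduce to the identity, which is one way to see the residual path in isolation.

```python
import numpy as np

def ffn_eval(x, W1, b1, W2, b2):
    # Inference-mode path only: linear -> ReLU -> linear -> residual add.
    hidden = np.maximum(0, x @ W1 + b1)
    return x + (hidden @ W2 + b2)

rng = np.random.default_rng(0)   # seed chosen arbitrarily for repeatability
n, d, d_ff = 4, 8, 32            # hypothetical sizes

x = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d)) * 0.1
b2 = np.zeros(d)

y = ffn_eval(x, W1, b1, W2, b2)
print(y.shape)  # same shape as x, which the residual addition requires

# With a zeroed second layer the block contributes nothing, so y equals x.
y_identity = ffn_eval(x, W1, b1, np.zeros((d_ff, d)), np.zeros(d))
print(np.allclose(y_identity, x))
```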

Explanation

Apply the first linear transformation: hidden = x @ W1 + b1.
Apply ReLU activation: max(0, hidden).
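Apply dropout to the hidden layer (if training): zero each element with probability p = dropout_rate and scale the survivors by 1/(1-p), so the expected activation is unchanged.
Apply the second linear transformation: output = hidden @ W2 + b2.
Apply dropout to the output (if training), using a fresh mask and the same 1/(1-p) scaling.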
Add the residual connection: return x + output. This allows gradients to flow directly through the skip connection.
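
The 1/(1-p) scaling in the dropout steps above can be checked numerically. The following is a minimal sketch, assuming a standalone helper named `inverted_dropout` (hypothetical, not part of the solution) and NumPy's `default_rng` generator instead of the global `np.random` state used in the solution.

```python
import numpy as np

def inverted_dropout(a, p, rng):
    # Keep each element with probability 1 - p, then rescale the survivors
    # by 1 / (1 - p) so the expected value matches the input.
    mask = (rng.random(a.shape) > p).astype(a.dtype)
    return a * mask / (1.0 - p)

rng = np.random.default_rng(0)  # arbitrary seed
a = np.ones((1000, 100))
dropped = inverted_dropout(a, p=0.5, rng=rng)

print(np.unique(dropped))  # dropped entries are 0.0, survivors are scaled to 2.0
print(dropped.mean())      # close to 1.0, the mean of the input
```

Because the rescaling happens at training time, the inference path needs no compensation, which is why the solution simply skips dropout when training is False.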

Complexity

Time: O(n * d * d_ff), where n is the sequence length, d is the model dimension, and d_ff is the feed-forward dimension
Space: O(n * d_ff) for the hidden activations
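
To make these bounds concrete, here is a rough sketch that counts multiply-adds for the two matrix products and the number of hidden activations. The function name `ffn_cost` and the example sizes are assumptions chosen for illustration. Bias additions, ReLU, and dropout are ignored because they are lower-order terms.

```python
def ffn_cost(n: int, d: int, d_ff: int) -> tuple:
    # Each of the two matrix products performs n * d * d_ff multiply-adds.
    multiply_adds = 2 * n * d * d_ff
    # The hidden activation matrix has shape (n, d_ff).
    hidden_elements = n * d_ff
    return multiply_adds, hidden_elements

# Hypothetical sizes: 10 positions, model dimension 512, hidden dimension 2048.
macs, hidden = ffn_cost(n=10, d=512, d_ff=2048)
print(macs)    # 20971520
print(hidden)  # 20480
```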
