Implement Position-wise Feed-Forward Block with Residual and Dropout

#178 · Deep Learning · Medium

Problem

Implement a Position-wise Feed-Forward Block with residual connection and dropout as used in Transformer architectures. The block applies two linear transformations with a ReLU activation in between, followed by dropout and a residual addition.
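
As a rough sketch of the shapes involved, the snippet below runs the deterministic path (dropout disabled) on random inputs. The sizes n, d and d_ff and the shape convention x: (n, d), W1: (d, d_ff), W2: (d_ff, d) are assumptions chosen for illustration; the problem statement does not fix them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 32  # illustrative sizes, not taken from the problem statement

x = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d)) * 0.1
b2 = np.zeros(d)

# Deterministic path (no dropout): x + ReLU(x W1 + b1) W2 + b2
hidden = np.maximum(0, x @ W1 + b1)  # shape (n, d_ff)
y = x + (hidden @ W2 + b2)           # shape (n, d), same as x

print(hidden.shape, y.shape)  # (4, 32) (4, 8)
```

Because W2 maps back to width d, the result can be added to x element-wise.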

Solution

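import numpy as np

def feed_forward_block(x: np.ndarray, W1: np.ndarray, b1: np.ndarray,
                       W2: np.ndarray, b2: np.ndarray,
                       dropout_rate: float = 0.1,
                       training: bool = True) -> np.ndarray:
    """Position-wise feed-forward block with dropout and a residual connection."""
    # A rate of 1.0 would divide by zero in the rescaling below.
    if not 0.0 <= dropout_rate < 1.0:
        raise ValueError("dropout_rate must be in [0, 1)")

    # First linear transformation followed by ReLU.
    hidden = x @ W1 + b1
    hidden = np.maximum(0, hidden)

    # Inverted dropout on the hidden activations (training only).
    if training and dropout_rate > 0:
        mask = (np.random.rand(*hidden.shape) > dropout_rate).astype(float)
        hidden = hidden * mask / (1 - dropout_rate)

    # Second linear transformation, projecting back to the shape of x.
    output = hidden @ W2 + b2

    # Inverted dropout on the block output, with a fresh mask (training only).
    if training and dropout_rate > 0:
        mask = (np.random.rand(*output.shape) > dropout_rate).astype(float)
        output = output * mask / (1 - dropout_rate)

    # Residual connection.
    return x + output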

Explanation

  1. Apply the first linear transformation: hidden = x @ W1 + b1.
  2. Apply ReLU activation: max(0, hidden).
  3. Apply dropout to the hidden layer (if training): zero each element with probability p = dropout_rate and scale the survivors by 1/(1-p), so the expected activation is unchanged.
  4. Apply the second linear transformation: output = hidden @ W2 + b2.
  5. Apply dropout to the output (if training), using a fresh random mask and the same 1/(1-p) scaling.
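  6. Add the residual connection: return x + output. The addition requires output to have the same shape as x, which is why W2 projects back to the model dimension. The skip connection lets gradients flow directly from the block's output to its input.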
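
The dropout in steps 3 and 5 can be checked in isolation. The sketch below uses a hypothetical helper, inverted_dropout, and a seeded np.random.default_rng generator; both are assumptions made for reproducibility, whereas the solution itself calls np.random.rand inline. It shows that surviving elements are scaled by 1/(1-p) and that the mean stays close to the input mean.

```python
import numpy as np

def inverted_dropout(a: np.ndarray, p: float, rng: np.random.Generator) -> np.ndarray:
    """Zero each element with probability p and scale survivors by 1/(1-p).

    Hypothetical helper for illustration; not part of the original solution.
    """
    if p <= 0:
        return a
    keep = rng.random(a.shape) > p  # True with probability 1 - p
    return a * keep / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(100_000)
dropped = inverted_dropout(a, 0.1, rng)

print(np.unique(dropped))  # two distinct values: 0 and 1/0.9
print(dropped.mean())      # close to 1.0, the mean of the input
```

With training=False the block skips both masks, so its output is deterministic.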

Complexity

  • Time: O(n * d * d_ff), dominated by the two matrix multiplications, where n is the sequence length, d is the model dimension, and d_ff is the feed-forward dimension
  • Space: O(n * d_ff) for the hidden activations
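
To make these bounds concrete, here is a small sketch that counts multiply-accumulate operations and hidden-activation elements for one set of sizes. The helper ffn_cost and the sizes n=128, d=512, d_ff=2048 are assumptions for illustration; bias additions, ReLU and dropout are lower-order terms and are ignored.

```python
def ffn_cost(n: int, d: int, d_ff: int) -> tuple[int, int]:
    """Return (multiply-accumulates, hidden activation elements).

    Hypothetical helper for illustration; counts only the two matmuls.
    """
    macs = n * d * d_ff + n * d_ff * d  # x @ W1, then hidden @ W2
    hidden_elems = n * d_ff             # size of the intermediate activation
    return macs, hidden_elems

macs, hidden_elems = ffn_cost(n=128, d=512, d_ff=2048)
print(macs)          # 268435456
print(hidden_elems)  # 262144
```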