#178 · Deep Learning · Medium
Implement a Position-wise Feed-Forward Block with residual connection and dropout, as used in Transformer architectures. The block applies two linear transformations with a ReLU activation in between, followed by dropout and a residual addition.
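import numpy as np

def feed_forward_block(x: np.ndarray, W1: np.ndarray, b1: np.ndarray,
                       W2: np.ndarray, b2: np.ndarray,
                       dropout_rate: float = 0.1,
                       training: bool = True) -> np.ndarray:
    # First linear transformation into the hidden dimension
    hidden = x @ W1 + b1
    # ReLU activation
    hidden = np.maximum(0, hidden)
    # Inverted dropout on the hidden activations (training only)
    if training and dropout_rate > 0:
        mask = (np.random.rand(*hidden.shape) > dropout_rate).astype(float)
        hidden = hidden * mask / (1 - dropout_rate)
    # Second linear transformation back to the input dimension
    output = hidden @ W2 + b2
    # Inverted dropout on the output, applied before the residual addition
    if training and dropout_rate > 0:
        mask = (np.random.rand(*output.shape) > dropout_rate).astype(float)
        output = output * mask / (1 - dropout_rate)
    # Residual (skip) connection
    return x + output

1. First linear transformation: hidden = x @ W1 + b1.
2. ReLU activation: max(0, hidden).
3. Dropout (training only): each element is dropped with probability p = dropout_rate, and the surviving elements are scaled by 1/(1-p) so the expected value is unchanged.
4. Second linear transformation: output = hidden @ W2 + b2.
5. Residual connection: return x + output. This allows gradients to flow directly through the skip connection.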
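As a quick sanity check, the block can be exercised on small random inputs. The sketch below is only an illustration, not the reference solution: it restates the function compactly and adds a hypothetical `rng` argument (not part of the original signature) so that the dropout masks are reproducible. The shapes (batch 2, sequence length 3, model dimension 4, hidden dimension 8) are arbitrary assumptions.

```python
import numpy as np

def feed_forward_block(x, W1, b1, W2, b2,
                       dropout_rate=0.1, training=True, rng=None):
    """Position-wise feed-forward block with dropout and a residual connection.

    `rng` is a hypothetical extra argument (an assumption, not in the
    original signature) used to make the dropout masks reproducible.
    """
    rng = np.random.default_rng() if rng is None else rng

    def dropout(a):
        # Inverted dropout: drop with probability p, scale survivors by 1/(1-p).
        if not training or dropout_rate <= 0:
            return a
        mask = rng.random(a.shape) > dropout_rate
        return a * mask / (1 - dropout_rate)

    hidden = dropout(np.maximum(0, x @ W1 + b1))  # linear -> ReLU -> dropout
    output = dropout(hidden @ W2 + b2)            # linear -> dropout
    return x + output                             # residual connection

# Illustrative shapes: batch=2, seq_len=3, model dim=4, hidden dim=8.
init = np.random.default_rng(0)
x = init.standard_normal((2, 3, 4))
W1 = init.standard_normal((4, 8)) * 0.1
b1 = np.zeros(8)
W2 = init.standard_normal((8, 4)) * 0.1
b2 = np.zeros(4)

# Inference mode: dropout is skipped, so the output is deterministic.
eval_out = feed_forward_block(x, W1, b1, W2, b2, training=False)
expected = x + (np.maximum(0, x @ W1 + b1) @ W2 + b2)
print(eval_out.shape)                    # (2, 3, 4)
print(np.allclose(eval_out, expected))   # True

# Training mode: the same seed produces the same dropout masks.
train_a = feed_forward_block(x, W1, b1, W2, b2, rng=np.random.default_rng(1))
train_b = feed_forward_block(x, W1, b1, W2, b2, rng=np.random.default_rng(1))
print(np.allclose(train_a, train_b))     # True
```

Note that the residual addition requires the output of the second linear layer to have the same last dimension as x, which is why W2 maps the hidden dimension back to the input dimension.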