Train Softmax Regression with Gradient Descent

#105 · Machine Learning · Hard

Problem

Train a softmax regression (multinomial logistic regression) model using gradient descent. Given features and multi-class labels, learn weight parameters that minimize the cross-entropy loss and return the trained weights.

Solution

import numpy as np

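def softmax_regression(X: np.ndarray, y: np.ndarray, n_classes: int, lr: float = 0.1, epochs: int = 1000) -> np.ndarray:
    """Fit softmax regression with full-batch gradient descent.

    X: (n_samples, n_features) feature matrix.
    y: (n_samples,) integer class labels in [0, n_classes).
    Returns W with shape (n_features, n_classes). No bias term is learned;
    append a column of ones to X if an intercept is needed.
    """
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))  # zero initialization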

    # One-hot encode labels
    Y_onehot = np.zeros((n_samples, n_classes))
    Y_onehot[np.arange(n_samples), y.astype(int)] = 1

    for _ in range(epochs):
        # Compute softmax probabilities
        logits = X @ W
        logits -= np.max(logits, axis=1, keepdims=True)  # numerical stability
        exp_logits = np.exp(logits)
        probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)

        # Gradient of cross-entropy loss
        gradient = X.T @ (probs - Y_onehot) / n_samples

        W -= lr * gradient

    return W
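
The returned W has shape (n_features, n_classes), so predictions and the loss can be computed directly from X @ W. The following is a minimal sketch of how that might look; the helper names predict_proba, predict, and cross_entropy are hypothetical and not part of the original solution.

```python
import numpy as np

def predict_proba(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Hypothetical helper: row-wise softmax of X @ W."""
    logits = X @ W
    logits = logits - np.max(logits, axis=1, keepdims=True)  # numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits, axis=1, keepdims=True)

def predict(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Hypothetical helper: most likely class per row.

    Softmax is monotonic within a row, so argmax over the logits suffices.
    """
    return np.argmax(X @ W, axis=1)

def cross_entropy(X: np.ndarray, y: np.ndarray, W: np.ndarray) -> float:
    """Hypothetical helper: mean negative log-probability of the true class."""
    probs = predict_proba(X, W)
    n_samples = X.shape[0]
    return float(-np.mean(np.log(probs[np.arange(n_samples), y.astype(int)])))
```

With the zero initialization used in the solution, every class gets probability 1 / n_classes, so the loss starts at log(n_classes). Comparing cross_entropy before and after training is a quick way to check that the chosen lr and epochs actually reduced the loss.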

Explanation

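One-hot encoding: Convert integer class labels to an (n, C) binary matrix so the gradient can be computed for all samples and classes in one vectorized step.
Softmax function: Convert raw logits z to probabilities via exp(z_i) / sum_j exp(z_j), applied to each row. Subtracting the row maximum before exponentiating leaves the result unchanged and prevents overflow.
Cross-entropy gradient: The gradient of the mean cross-entropy loss with respect to W is X^T (probs - Y_onehot) / n, where n is the number of samples. It generalizes the binary logistic regression gradient to multiple classes.
Gradient descent: Repeat W -= lr * gradient for a fixed number of epochs, each using the full dataset, to minimize the cross-entropy loss.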
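
The gradient formula X^T (probs - Y_onehot) / n can be checked numerically against central finite differences of the loss. The sketch below is an illustrative check on small random data, not part of the original solution; the function names and the random inputs are assumptions made for the example.

```python
import numpy as np

def softmax(Z: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the max-subtraction trick."""
    Z = Z - np.max(Z, axis=1, keepdims=True)
    E = np.exp(Z)
    return E / np.sum(E, axis=1, keepdims=True)

def loss(W: np.ndarray, X: np.ndarray, Y_onehot: np.ndarray) -> float:
    """Mean cross-entropy over all samples."""
    probs = softmax(X @ W)
    return float(-np.sum(Y_onehot * np.log(probs)) / X.shape[0])

def analytic_gradient(W: np.ndarray, X: np.ndarray, Y_onehot: np.ndarray) -> np.ndarray:
    """Closed-form gradient used in the solution."""
    return X.T @ (softmax(X @ W) - Y_onehot) / X.shape[0]

def numerical_gradient(W: np.ndarray, X: np.ndarray, Y_onehot: np.ndarray,
                       eps: float = 1e-6) -> np.ndarray:
    """Central finite differences, one weight entry at a time."""
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            step = np.zeros_like(W)
            step[i, j] = eps
            grad[i, j] = (loss(W + step, X, Y_onehot)
                          - loss(W - step, X, Y_onehot)) / (2 * eps)
    return grad

# Small random problem (assumed for illustration): 5 samples, 3 features, 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0, 1, 2, 1, 0])
Y_onehot = np.eye(3)[y]
W = rng.normal(size=(3, 3))

max_diff = np.max(np.abs(analytic_gradient(W, X, Y_onehot)
                         - numerical_gradient(W, X, Y_onehot)))
print(f"max abs difference: {max_diff:.2e}")
```

If the two gradients agree to within floating-point error, the closed-form expression matches the loss being minimized.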

Complexity

Time: O(epochs * n * d * C), where n is the number of samples, d the number of features, and C the number of classes
Space: O(d * C + n * C) for the weight matrix and the per-sample probability matrix
