Principal Component Analysis (PCA) Implementation

#19 · Machine Learning · Medium

Problem

Implement Principal Component Analysis (PCA) from scratch. Given a dataset, reduce its dimensionality to a specified number of principal components.

Solution

import numpy as np

def pca(data: list[list[float]], num_components: int) -> list[list[float]]:
    X = np.array(data, dtype=float)
    n = X.shape[0]

    # Center the data (subtract mean of each feature)
    mean = X.mean(axis=0)
    X_centered = X - mean

    # Compute covariance matrix
    cov_matrix = (X_centered.T @ X_centered) / (n - 1)

    # Compute eigenvalues and eigenvectors
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # Sort by eigenvalue descending
    idx = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, idx]

    # Select top num_components eigenvectors
    W = eigenvectors[:, :num_components]

    # Project data
    projected = X_centered @ W

    return np.round(projected, 4).tolist()

Explanation

Center the data by subtracting the mean of each feature.
Compute the covariance matrix of the centered data.
Find the eigenvalues and eigenvectors of the covariance matrix.
Sort eigenvectors by eigenvalue in descending order and select the top num_components.
Project the centered data onto the selected eigenvectors to get the lower-dimensional representation.

Complexity

Time: O(n * f^2 + f^3) where n is samples and f is features
Space: O(f^2) for the covariance matrix

← #18 #20 →