Calculate Correlation Matrix

#37 · Linear Algebra · Medium

Problem

Calculate the correlation matrix for a dataset. Given a 2D NumPy array where each column is a feature, compute the Pearson correlation coefficient between every pair of features.
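For reference, the coefficient being asked for can be written out explicitly. The notation here is an assumption introduced for illustration: x_{ki} is the value of feature i in sample k, n is the number of samples, and the population (1/n) convention is used to match `ddof=0` in the solution.

```latex
r_{ij} = \frac{\frac{1}{n}\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sigma_i \, \sigma_j},
\qquad
\bar{x}_i = \frac{1}{n}\sum_{k=1}^{n} x_{ki},
\qquad
\sigma_i = \sqrt{\frac{1}{n}\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)^2}
```

Because the same 1/n factor appears in the numerator and in the product of standard deviations, using the sample (1/(n-1)) convention consistently in both places gives the same coefficient.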

Solution

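import numpy as np

def calculate_correlation_matrix(X):
    # Rows are samples, columns are features.
    n_features = X.shape[1]
    means = np.mean(X, axis=0)
    # Population standard deviation (ddof=0), matching the 1/n covariance below.
    stds = np.std(X, axis=0, ddof=0)
    corr = np.zeros((n_features, n_features))

    for i in range(n_features):
        for j in range(n_features):
            if stds[i] == 0 or stds[j] == 0:
                # Correlation is undefined for a constant feature; by convention
                # use 1.0 on the diagonal and 0.0 elsewhere.
                corr[i, j] = 0.0 if i != j else 1.0
            else:
                # Covariance: mean product of deviations from the column means.
                cov = np.mean((X[:, i] - means[i]) * (X[:, j] - means[j]))
                corr[i, j] = cov / (stds[i] * stds[j])

    return corr.tolist()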

Explanation

  1. Compute the mean and standard deviation of each feature column.
  2. For each pair of features (i, j), compute the covariance as the mean of the product of deviations from the mean.
  3. Divide the covariance by the product of the two standard deviations to get the Pearson correlation coefficient.
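  4. Handle zero-variance features, for which the coefficient is undefined (division by zero): if either feature in a pair is constant, set the entry to 1.0 on the diagonal and 0.0 elsewhere.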
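Steps 1 to 3 can be traced by hand for a single pair of features. The sketch below uses a small made-up dataset (the values are hypothetical, chosen only for illustration) and cross-checks the result against NumPy's built-in `np.corrcoef`.

```python
import numpy as np

# Hypothetical data: 3 samples (rows), 2 features (columns).
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 7.0]])
x, y = X[:, 0], X[:, 1]

# Step 1: mean and population standard deviation of each feature.
mean_x, mean_y = x.mean(), y.mean()
std_x, std_y = x.std(ddof=0), y.std(ddof=0)

# Step 2: covariance as the mean product of deviations from the means.
cov_xy = np.mean((x - mean_x) * (y - mean_y))

# Step 3: normalize by the product of the standard deviations.
r_xy = cov_xy / (std_x * std_y)

# Cross-check: rowvar=False tells np.corrcoef that columns are the variables.
reference = np.corrcoef(X, rowvar=False)[0, 1]
print(r_xy, reference)
```

The two values should agree up to floating-point rounding. The comparison only holds for non-constant columns: for a zero-variance feature `np.corrcoef` produces NaN, whereas the solution above substitutes 0.0 or 1.0 as described in step 4.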

Complexity

  • Time: O(n * f^2) where n is the number of samples and f is the number of features
  • Space: O(f^2) for the correlation matrix
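The asymptotic cost stays the same, but the double Python loop can be replaced by a single matrix product so that the O(n * f^2) work runs inside NumPy. The following is a minimal sketch of that alternative, not the solution given above; the function name `correlation_matrix_vectorized` is hypothetical, and it assumes the same zero-variance convention (1.0 on the diagonal, 0.0 elsewhere).

```python
import numpy as np

def correlation_matrix_vectorized(X):
    # Hypothetical vectorized variant; rows are samples, columns are features.
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]

    centered = X - X.mean(axis=0)
    stds = X.std(axis=0, ddof=0)

    # Avoid dividing by zero: constant columns are already all zeros after
    # centering, so dividing them by 1.0 leaves them as zeros.
    safe_stds = np.where(stds == 0, 1.0, stds)
    Z = centered / safe_stds

    # Entry (i, j) is the mean product of standardized features i and j.
    corr = (Z.T @ Z) / n_samples

    # Pin the diagonal to exactly 1.0, including for constant features.
    np.fill_diagonal(corr, 1.0)
    return corr.tolist()

# Small usage example with a constant third column (made-up values).
X = np.array([[1, 2, 5],
              [2, 4, 5],
              [3, 7, 5]])
result = correlation_matrix_vectorized(X)
```

This version needs an extra O(n * f) array for the standardized data in addition to the O(f^2) output. It also sets the diagonal to exactly 1.0, whereas the loop version computes it and may differ from 1.0 by rounding error.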