← back

Implement K-Means++ Initialization

#362 · Machine Learning · Medium

⊣ Solve on deep-ml.com

Problem

Implement K-Means++ initialization, which selects initial cluster centers in a way that spreads them out, leading to faster convergence and better clustering results than random initialization.

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import numpy as np

def kmeans_plus_plus(X: np.ndarray, k: int) -> np.ndarray:
    n, d = X.shape
    centers = np.zeros((k, d))

    # Choose first center uniformly at random
    idx = np.random.randint(0, n)
    centers[0] = X[idx]

    for c in range(1, k):
        # Compute squared distances from each point to nearest existing center
        dists = np.min(
            np.sum((X[:, None] - centers[:c][None, :]) ** 2, axis=2),
            axis=1
        )
        # Choose next center with probability proportional to distance squared
        probs = dists / dists.sum()
        idx = np.random.choice(n, p=probs)
        centers[c] = X[idx]

    return centers

Explanation

  1. Select the first center uniformly at random from the data points.
  2. For each subsequent center, compute the squared distance from every point to its nearest existing center.
  3. Select the next center with probability proportional to the squared distance, so points far from existing centers are more likely to be chosen.
  4. Repeat until k centers are selected.

Complexity

  • Time: O(n k d) for distance calculations at each step
  • Space: O(n) for the distance array