Implement K-Means++ Initialization

#362 · Machine Learning · Medium

Problem

Implement K-Means++ initialization, which selects initial cluster centers in a way that spreads them out, leading to faster convergence and better clustering results than random initialization.

Solution

import numpy as np

def kmeans_plus_plus(X: np.ndarray, k: int) -> np.ndarray:
    n, d = X.shape
    centers = np.zeros((k, d))

    # Choose first center uniformly at random
    idx = np.random.randint(0, n)
    centers[0] = X[idx]

    for c in range(1, k):
        # Compute squared distances from each point to nearest existing center
        dists = np.min(
            np.sum((X[:, None] - centers[:c][None, :]) ** 2, axis=2),
            axis=1
        )
        # Choose next center with probability proportional to distance squared
        probs = dists / dists.sum()
        idx = np.random.choice(n, p=probs)
        centers[c] = X[idx]

    return centers

Explanation

Select the first center uniformly at random from the data points.
For each subsequent center, compute the squared distance from every point to its nearest existing center.
Select the next center with probability proportional to the squared distance, so points far from existing centers are more likely to be chosen.
Repeat until k centers are selected.

Complexity

Time: O(n k d) for distance calculations at each step
Space: O(n) for the distance array

← #361 #363 →