K-Means Clustering

#17 · Machine Learning · Medium

Problem

Implement the K-Means clustering algorithm from scratch. Given a dataset and K initial centroids, iteratively assign points to the nearest centroid and update centroids until convergence or a maximum number of iterations.

Solution

def k_means_clustering(points: list[tuple[float, float]], k: int,
                        initial_centroids: list[tuple[float, float]],
                        max_iterations: int) -> list[tuple[float, float]]:
    centroids = [list(c) for c in initial_centroids]
    num_features = len(points[0])

    for _ in range(max_iterations):
        # Assign each point to nearest centroid
        clusters = [[] for _ in range(k)]
        for point in points:
            min_dist = float('inf')
            best = 0
            for i, centroid in enumerate(centroids):
                dist = sum((point[d] - centroid[d]) ** 2 for d in range(num_features))
                if dist < min_dist:
                    min_dist = dist
                    best = i
            clusters[best].append(point)

        # Update centroids
        new_centroids = []
        for i in range(k):
            if clusters[i]:
                new_centroid = [
                    sum(p[d] for p in clusters[i]) / len(clusters[i])
                    for d in range(num_features)
                ]
                new_centroids.append(new_centroid)
            else:
                new_centroids.append(centroids[i])

        # Check convergence
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return [tuple(round(c, 4) for c in centroid) for centroid in centroids]

Explanation

Initialize centroids from the provided starting positions.
Assignment step: For each point, find the nearest centroid using Euclidean distance (squared).
Update step: Recompute each centroid as the mean of all points assigned to it.
Repeat until centroids no longer change or max iterations is reached.

Complexity

Time: O(iterations n k * f) where n is points, k is clusters, f is features
Space: O(n) for cluster assignments

← #16 #18 →