← back

Implement DBSCAN Clustering Algorithm

#259 · Machine Learning · Medium

⊣ Solve on deep-ml.com

Problem

Implement the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm from scratch. Given data points, an epsilon radius, and a minimum number of points, cluster the data and identify noise points.

Solution

For each unvisited point, find its epsilon-neighborhood. If it has enough neighbors it becomes a core point and starts a cluster; otherwise it is (initially) labeled as noise. Expand clusters by recursively adding density-reachable points.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
import math

def dbscan(
    X: list[list[float]], eps: float = 0.5, min_pts: int = 5
) -> list[int]:
    n = len(X)
    labels = [-1] * n  # -1 means noise
    visited = [False] * n
    cluster_id = 0

    def distance(a: list[float], b: list[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def region_query(idx: int) -> list[int]:
        neighbors = []
        for j in range(n):
            if distance(X[idx], X[j]) <= eps:
                neighbors.append(j)
        return neighbors

    def expand_cluster(idx: int, neighbors: list[int], cid: int) -> None:
        labels[idx] = cid
        queue = list(neighbors)
        i = 0
        while i < len(queue):
            q = queue[i]
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(q)
                if len(q_neighbors) >= min_pts:
                    queue.extend(q_neighbors)
            if labels[q] == -1:
                labels[q] = cid
            i += 1

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = -1  # noise (may be reassigned later)
        else:
            expand_cluster(i, neighbors, cluster_id)
            cluster_id += 1

    return labels

Explanation

  1. For each unvisited point, find all neighbors within distance eps.
  2. If the point has at least min_pts neighbors, it is a core point — start a new cluster.
  3. Expand the cluster: for each neighbor, if it is also a core point, add its neighbors to the expansion queue.
  4. Points that are density-reachable from any core point join its cluster.
  5. Points not reachable from any core point remain labeled as noise (-1).

Complexity

  • Time: O(n^2) for naive neighbor search (can be improved to O(n log n) with spatial indexing)
  • Space: O(n) for labels and visited flags