Handle Imbalanced Data with SMOTE

#357 · Data Preprocessing · Medium

Problem

Implement SMOTE (Synthetic Minority Over-sampling Technique) to handle imbalanced classification data. Given minority class samples and a desired number of synthetic samples, generate new samples by interpolating between each minority sample and its nearest neighbors.

Solution

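import numpy as np

def smote(X_minority: np.ndarray, n_synthetic: int, k: int = 5) -> np.ndarray:
    # Work in floating point so distances can be masked with infinity below
    X_minority = np.asarray(X_minority, dtype=float)
    n_samples, n_features = X_minority.shape
    if n_samples < 2:
        raise ValueError("SMOTE needs at least two minority samples to interpolate")
    k = min(k, n_samples - 1)  # a sample has at most n_samples - 1 neighbors

    # Squared Euclidean distances; they rank neighbors the same way Euclidean distances do
    dists = np.sum((X_minority[:, None] - X_minority[None, :]) ** 2, axis=2)
    # Mask self-distances so a sample is never its own neighbor, even with duplicate rows
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.zeros((n_synthetic, n_features))
    for i in range(n_synthetic):
        idx = i % n_samples  # cycle through the minority samples in order
        nn_idx = neighbors[idx, np.random.randint(0, k)]  # one of its k neighbors, at random
        lam = np.random.random()  # interpolation factor in [0, 1)
        synthetic[i] = X_minority[idx] + lam * (X_minority[nn_idx] - X_minority[idx])

    return synthetic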
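
The generation loop can also be written without Python-level iteration. The following is a minimal sketch under a few assumptions that are not part of the original solution: the name `smote_vectorized` and the `seed` parameter are hypothetical, and NumPy's `default_rng` generator replaces the global `np.random` functions so that runs are reproducible.

```python
import numpy as np

def smote_vectorized(X_minority, n_synthetic, k=5, seed=None):
    """Sketch of SMOTE generation using array indexing instead of a loop."""
    rng = np.random.default_rng(seed)  # assumption: local generator for reproducibility
    X_minority = np.asarray(X_minority, dtype=float)
    n_samples, n_features = X_minority.shape
    if n_samples < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n_samples - 1)

    # Squared distances with the diagonal masked so no sample picks itself
    diff = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sum(diff ** 2, axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Base samples cycle in order; each gets one random neighbor and one random factor
    base = np.arange(n_synthetic) % n_samples
    picks = neighbors[base, rng.integers(0, k, size=n_synthetic)]
    lam = rng.random((n_synthetic, 1))  # values in [0, 1), broadcast across features
    return X_minority[base] + lam * (X_minority[picks] - X_minority[base])

# Four corners of the unit square; with k=2 each corner's neighbors are the adjacent corners
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = smote_vectorized(X_demo, 6, k=2, seed=0)
```

In this demo every synthetic point falls on an edge of the square, because each one is interpolated between a corner and an adjacent corner.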

Explanation

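  1. For each minority sample, find its k nearest neighbors. The code ranks neighbors by squared Euclidean distance, which gives the same ordering as Euclidean distance without the square root.
  2. To generate a synthetic sample, take the next minority sample (the code cycles through them in order) and pick one of its k nearest neighbors at random.
  3. Interpolate between the two with a random factor lambda drawn uniformly from [0, 1): x_new = x + lambda * (x_neighbor - x). The new point lies on the line segment between x and x_neighbor.
  4. Repeat until the desired number of synthetic samples is reached.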
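
To make the interpolation step concrete, here is a small worked example with a fixed factor instead of a random one. The helper name `interpolate` and the sample values are illustrative assumptions, not part of the solution.

```python
import numpy as np

def interpolate(x: np.ndarray, x_neighbor: np.ndarray, lam: float) -> np.ndarray:
    """Return the point a fraction `lam` of the way from x to x_neighbor."""
    return x + lam * (x_neighbor - x)

x = np.array([1.0, 2.0])
x_neighbor = np.array([3.0, 6.0])

# A quarter of the way along the segment: [1 + 0.25 * 2, 2 + 0.25 * 4] = [1.5, 3.0]
x_new = interpolate(x, x_neighbor, 0.25)
```

A factor of 0 reproduces the original sample, and factors close to 1 land near the neighbor.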

Complexity

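  • Time: O(n^2 * d) for pairwise distances + O(n^2 log n) for sorting each row of the distance matrix + O(n_synthetic * d) for generation
  • Space: O(n^2 * d) for the intermediate array of broadcast differences, which dominates the O(n^2) distance matrix, + O(n_synthetic * d) for synthetic samples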
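
The broadcast subtraction in the solution materializes an n x n x d array before summing. If that becomes a problem, one possible alternative is sketched below. It relies on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b to keep peak memory at O(n^2), and on `np.argpartition` to select the k smallest distances per row without a full sort. The function name `knn_indices` is hypothetical, and the returned neighbors are not ordered by distance, which is fine here because SMOTE picks among them at random.

```python
import numpy as np

def knn_indices(X: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest other rows for each row of X (unordered within a row)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of rows")

    sq_norms = np.einsum("ij,ij->i", X, X)  # squared norm of every row
    # Squared distances via the expansion; rounding can make tiny values slightly negative
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # exclude each row from its own neighbor set

    # argpartition moves the k smallest entries of each row into the first k columns
    return np.argpartition(d2, k - 1, axis=1)[:, :k]

# Two well-separated pairs on a line: each point's nearest neighbor is its partner
X_line = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
nearest = knn_indices(X_line, 1)
```

The expansion trades a little numerical precision for memory, so near-ties between neighbors may be ranked differently than with the direct subtraction.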