Data Quality Scoring for ML Pipelines

#252 · MLOps · Medium

Problem

Compute a data quality score for an ML pipeline. Given a dataset (a list of dicts, one per row), evaluate four quality dimensions (completeness, uniqueness of a key column, value-range validity, and type consistency), then return each dimension's score together with a weighted overall score, all between 0 and 1.

Solution

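Score each dimension independently as a ratio in [0, 1], then combine the four ratios with a weighted average. The function returns a dict holding each dimension's score and the overall score; an empty dataset scores 0 on everything.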

def data_quality_score(
    data: list[dict],
    key_column: str,
    numeric_columns: list[str] | None = None,
    valid_ranges: dict | None = None,
    weights: dict | None = None,
) -> dict:
    if not data:
        return {"overall": 0.0, "completeness": 0.0, "uniqueness": 0.0,
                "validity": 0.0, "consistency": 0.0}

    if weights is None:
        weights = {"completeness": 0.3, "uniqueness": 0.2,
                   "validity": 0.25, "consistency": 0.25}
    if numeric_columns is None:
        numeric_columns = []
    if valid_ranges is None:
        valid_ranges = {}

    n = len(data)
    all_keys = set()
    for row in data:
        all_keys.update(row.keys())

    # Completeness: fraction of non-None cells
    total_cells = n * len(all_keys)
    non_null = sum(1 for row in data for k in all_keys if row.get(k) is not None)
    completeness = non_null / total_cells if total_cells else 1.0

    # Uniqueness of key column
    key_values = [row.get(key_column) for row in data if row.get(key_column) is not None]
    uniqueness = len(set(key_values)) / len(key_values) if key_values else 0.0

    # Validity: numeric values within expected ranges
    valid_count, range_total = 0, 0
    for col in numeric_columns:
        lo, hi = valid_ranges.get(col, (float("-inf"), float("inf")))
        for row in data:
            v = row.get(col)
            if v is not None:
                range_total += 1
                # Non-numeric values count as invalid instead of raising TypeError
                if isinstance(v, (int, float)) and lo <= v <= hi:
                    valid_count += 1
    validity = valid_count / range_total if range_total else 1.0

    # Consistency: type uniformity per column
    consistent_cols = 0
    for col in all_keys:
        types = set(type(row[col]) for row in data if col in row and row[col] is not None)
        if len(types) <= 1:
            consistent_cols += 1
    consistency = consistent_cols / len(all_keys) if all_keys else 1.0

    overall = (
        weights["completeness"] * completeness
        + weights["uniqueness"] * uniqueness
        + weights["validity"] * validity
        + weights["consistency"] * consistency
    )

    return {
        "overall": round(overall, 4),
        "completeness": round(completeness, 4),
        "uniqueness": round(uniqueness, 4),
        "validity": round(validity, 4),
        "consistency": round(consistency, 4),
    }
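
The weighted sum above only stays within [0, 1] when the weights are non-negative and sum to 1, as the defaults do. One way to make custom weights safe is to divide by their total. The following is a minimal sketch of that idea; `combine_scores` is a hypothetical helper, not part of the original solution, and it assumes non-negative weights.

```python
def combine_scores(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of dimension scores, normalizing weights to sum to 1.

    Hypothetical helper for illustration; assumes non-negative weights.
    """
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(weights[name] * scores[name] for name in scores) / total


scores = {"completeness": 0.9, "uniqueness": 1.0,
          "validity": 0.8, "consistency": 0.5}

# The default weights from the solution, which already sum to 1.
default_weights = {"completeness": 0.3, "uniqueness": 0.2,
                   "validity": 0.25, "consistency": 0.25}

# The same proportions doubled: these sum to 2, so an unnormalized
# weighted sum could reach 2.0. Dividing by the total avoids that.
doubled_weights = {name: 2 * w for name, w in default_weights.items()}

default_overall = combine_scores(scores, default_weights)
doubled_overall = combine_scores(scores, doubled_weights)
print(round(default_overall, 4), round(doubled_overall, 4))
```

Because only the proportions between weights matter after normalization, both calls produce the same overall score.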

Explanation

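  1. Completeness: ratio of non-null cells to total cells, where total cells is the number of rows times the number of distinct keys seen in any row, so a key missing from a row counts as null.
  2. Uniqueness: ratio of distinct values to non-null values in the key column; 0 if the key column has no non-null values.
  3. Validity: fraction of non-null values in numeric_columns that fall within their inclusive range from valid_ranges. A column with no configured range always passes, and the score is 1 when there is nothing to check.
  4. Consistency: fraction of columns whose non-null values all share one exact Python type, so a column mixing int and float counts as inconsistent.
  5. Overall: weighted sum of the four dimension scores. It stays within [0, 1] as long as the weights are non-negative and sum to 1, as the defaults do. All returned scores are rounded to four decimal places.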
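
To make the four ratios concrete, here is a small worked example that computes each one by hand on a four-row dataset. The dataset, the `zip` column, and the age range of 0 to 120 are made up for illustration; the weights are the defaults from the solution.

```python
rows = [
    {"id": 1, "age": 34, "zip": "10115"},
    {"id": 2, "age": None, "zip": "75001"},  # null age
    {"id": 2, "age": 150, "zip": 28013},     # duplicate id, age out of range, zip as int
    {"id": 4, "age": 41},                    # zip key missing
]
columns = ["id", "age", "zip"]

# 1. Completeness: 12 cells, 2 of them null or missing -> 10 / 12
non_null = sum(1 for row in rows for col in columns if row.get(col) is not None)
completeness = non_null / (len(rows) * len(columns))

# 2. Uniqueness: ids are [1, 2, 2, 4], 3 distinct out of 4 -> 3 / 4
ids = [row["id"] for row in rows if row.get("id") is not None]
uniqueness = len(set(ids)) / len(ids)

# 3. Validity: non-null ages are [34, 150, 41], 2 within [0, 120] -> 2 / 3
ages = [row["age"] for row in rows if row.get("age") is not None]
validity = sum(1 for a in ages if 0 <= a <= 120) / len(ages)

# 4. Consistency: id and age are all int, zip mixes str and int -> 2 / 3
def is_uniform(col: str) -> bool:
    types = {type(row[col]) for row in rows if row.get(col) is not None}
    return len(types) <= 1

consistency = sum(1 for col in columns if is_uniform(col)) / len(columns)

# 5. Overall: weighted sum with the default weights
overall = (0.3 * completeness + 0.2 * uniqueness
           + 0.25 * validity + 0.25 * consistency)

print(round(completeness, 4), round(uniqueness, 4),
      round(validity, 4), round(consistency, 4), round(overall, 4))
# prints: 0.8333 0.75 0.6667 0.6667 0.7333
```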

Complexity

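  • Time: O(n * c) where n is the number of rows and c is the number of distinct columns (keys) across all rows
  • Space: O(n + c) auxiliary, for the set of column names and the list of key-column values; the input dataset itself occupies O(n * c)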