Cold Start Latency Budget Breakdown

#450 · Machine Learning · Medium

Problem

Break down the cold start latency budget for an ML inference service. Given component latencies (container pull, model download, model load into GPU memory, warmup inference, health check registration), compute the total cold start time and identify which components dominate the budget. Support analyzing the impact of optimizations like pre-pulled images or cached model weights.

Solution

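def cold_start_budget(
    components: list[dict],
    optimizations: list[dict] | None = None
) -> dict:
    """Break down cold start latency and identify the dominant component.

    components: dicts with "name" and "latency_ms".
    optimizations: optional dicts with "component" and "reduction_ms".
    """
    # Map each optimized component to its latency reduction in ms.
    # If a component is listed more than once, the last entry wins.
    opt_map = {}
    if optimizations:
        for opt in optimizations:
            opt_map[opt["component"]] = opt["reduction_ms"]

    breakdown = []
    total_ms = 0.0

    for comp in components:
        name = comp["name"]
        base_ms = comp["latency_ms"]
        reduction = opt_map.get(name, 0)
        # Clamp at zero: a reduction cannot make a component's latency negative.
        effective_ms = max(0, base_ms - reduction)
        total_ms += effective_ms
        breakdown.append({
            "name": name,
            "base_ms": base_ms,
            "reduction_ms": reduction,
            "effective_ms": effective_ms
        })

    # Each component's share of the total; guard against an all-zero budget.
    for item in breakdown:
        item["fraction"] = round(item["effective_ms"] / total_ms, 4) if total_ms > 0 else 0.0

    # Largest effective latency first, so the bottleneck is the first entry.
    breakdown.sort(key=lambda x: x["effective_ms"], reverse=True)
    bottleneck = breakdown[0]["name"] if breakdown else "none"

    return {
        "breakdown": breakdown,
        "total_cold_start_ms": round(total_ms, 2),
        "total_cold_start_sec": round(total_ms / 1000, 2),
        "bottleneck": bottleneck
    }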
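
To compare optimizations against each other, it can help to rank them by how much time each one actually saves. The sketch below is one way to do that, not part of the original solution: `rank_optimizations` is a hypothetical helper, the input dicts follow the same shape as the solution's arguments, and all latency numbers are made up for illustration. A reduction larger than the component's base latency only saves the base latency, which mirrors the max(0, ...) clamp in the solution.

```python
# Hypothetical latencies in milliseconds (illustrative, not measurements).
components = [
    {"name": "container_pull", "latency_ms": 8000},
    {"name": "model_download", "latency_ms": 15000},
    {"name": "gpu_load", "latency_ms": 6000},
    {"name": "warmup", "latency_ms": 2000},
    {"name": "health_check", "latency_ms": 500},
]
optimizations = [
    {"component": "container_pull", "reduction_ms": 7000},   # assumed: pre-pulled image
    {"component": "model_download", "reduction_ms": 20000},  # assumed: cached weights; exceeds base
]


def rank_optimizations(components, optimizations):
    """Return each optimization's actual saving, largest first (hypothetical helper)."""
    base = {c["name"]: c["latency_ms"] for c in components}
    baseline_total = sum(base.values())
    ranked = []
    for opt in optimizations:
        name = opt["component"]
        # An optimization cannot save more than the component costs.
        saved = min(base.get(name, 0), opt["reduction_ms"])
        ranked.append({
            "component": name,
            "saved_ms": saved,
            "saved_fraction": round(saved / baseline_total, 4) if baseline_total > 0 else 0.0,
        })
    ranked.sort(key=lambda r: r["saved_ms"], reverse=True)
    return ranked


ranking = rank_optimizations(components, optimizations)
for row in ranking:
    print(row["component"], row["saved_ms"], row["saved_fraction"])
```

With these assumed numbers, the cached-weights optimization ranks first even though its nominal reduction is clamped to the 15000 ms the download originally cost.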

Explanation

  1. Each cold start component contributes a latency: container image pull, model weight download, GPU memory loading, warmup inference passes, and health-check registration.
  2. Optimizations (e.g., pre-pulled images, cached weights on local NVMe) reduce specific component latencies by a fixed amount.
  3. Compute the effective latency per component as max(0, base - reduction), then sum for the total cold start time.
  4. Calculate each component's fraction of the total, then sort the components by effective latency in descending order; the first entry is the bottleneck.
  5. Typical findings: model download and GPU loading often dominate cold starts, so pre-caching weights or using quantized models (smaller weights mean less to download and load) tends to provide the largest improvements.
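
The steps above can be traced by hand on a small example. This is a minimal sketch with made-up latencies and a single assumed optimization (weights cached on local NVMe); the variable names are illustrative. Summing the components treats the phases as running one after another, which is the same model the solution uses.

```python
# Hypothetical base latencies in ms (illustrative, not measured).
base_ms = {
    "container_pull": 8000,
    "model_download": 15000,
    "gpu_load": 6000,
    "warmup": 2000,
    "health_check": 500,
}
# Assumed optimization: cached weights cut 14 s from the model download.
reduction_ms = {"model_download": 14000}

# Step 3: effective latency per component, clamped at zero, then summed.
effective_ms = {
    name: max(0, ms - reduction_ms.get(name, 0)) for name, ms in base_ms.items()
}
total_ms = sum(effective_ms.values())

# Step 4: each component's share of the total (total_ms is positive here).
fraction = {name: round(ms / total_ms, 4) for name, ms in effective_ms.items()}

# The bottleneck is the component with the largest latency, before and after.
bottleneck_before = max(base_ms, key=base_ms.get)
bottleneck_after = max(effective_ms, key=effective_ms.get)

print(total_ms, bottleneck_before, bottleneck_after, fraction[bottleneck_after])
```

In this example the total drops from 31500 ms to 17500 ms, and the bottleneck shifts from the model download to the container pull, which suggests pre-pulling the image as the next optimization to evaluate.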

Complexity

  • Time: O(C log C) where C is the number of components; the final sort dominates the O(C) passes that apply reductions and compute fractions
  • Space: O(C) for the breakdown list