← back

Disaggregated Prefill-Decode Serving Simulator

#440 · Machine Learning · Hard

⊣ Solve on deep-ml.com

Problem

Build a disaggregated prefill-decode serving simulator. In this architecture, prefill (prompt processing) and decode (token generation) run on separate GPU pools. Given a stream of requests with prompt lengths and output lengths, pool sizes, prefill throughput (tokens/sec per GPU), and decode throughput (tokens/sec per GPU), simulate the system and return per-request latencies and overall throughput.

Solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
def disaggregated_serving_sim(
    requests: list[dict],
    prefill_gpus: int,
    decode_gpus: int,
    prefill_tok_per_sec: float,
    decode_tok_per_sec: float,
    kv_transfer_ms: float
) -> dict:
    import heapq

    prefill_pool = [0.0] * prefill_gpus
    decode_pool = [0.0] * decode_gpus

    results = []
    total_output_tokens = 0

    for req in requests:
        arrival = req["arrival_ms"]
        prompt_len = req["prompt_length"]
        output_len = req["output_length"]

        prefill_time_ms = (prompt_len / prefill_tok_per_sec) * 1000

        gpu_idx = prefill_pool.index(min(prefill_pool))
        prefill_start = max(arrival, prefill_pool[gpu_idx])
        prefill_end = prefill_start + prefill_time_ms
        prefill_pool[gpu_idx] = prefill_end

        kv_ready = prefill_end + kv_transfer_ms

        decode_time_ms = (output_len / decode_tok_per_sec) * 1000

        dgpu_idx = decode_pool.index(min(decode_pool))
        decode_start = max(kv_ready, decode_pool[dgpu_idx])
        decode_end = decode_start + decode_time_ms
        decode_pool[dgpu_idx] = decode_end

        latency = decode_end - arrival
        results.append({
            "latency_ms": round(latency, 2),
            "prefill_ms": round(prefill_time_ms, 2),
            "decode_ms": round(decode_time_ms, 2),
            "kv_transfer_ms": kv_transfer_ms
        })
        total_output_tokens += output_len

    last_finish = max(max(decode_pool), max(prefill_pool))
    first_arrival = requests[0]["arrival_ms"] if requests else 0
    total_time_sec = (last_finish - first_arrival) / 1000 if last_finish > first_arrival else 1
    throughput = total_output_tokens / total_time_sec

    return {
        "per_request": results,
        "throughput_tok_per_sec": round(throughput, 2)
    }

Explanation

  1. Maintain two pools of GPU availability times: one for prefill, one for decode.
  2. For each request, assign it to the earliest-available prefill GPU. The prefill duration is prompt_length / prefill_throughput.
  3. After prefill completes, add a KV-cache transfer delay before the request can begin decoding.
  4. Assign the request to the earliest-available decode GPU. Decode duration is output_length / decode_throughput.
  5. Per-request latency is measured from arrival to decode completion. Overall throughput is total output tokens divided by the time span from first arrival to last completion.

Complexity

  • Time: O(n * max(P, D)) where n is the number of requests and P, D are pool sizes
  • Space: O(n) for storing results plus O(P + D) for pool state