End-to-End Latency Decomposition

#413 · Machine Learning · Medium

Problem

Decompose the end-to-end latency of an LLM inference request into its component parts: tokenization time, prefill (prompt processing) time, decode (token generation) time, and detokenization time. Given a mapping from stage name to measured time, compute the total latency and the percentage contribution of each stage.

Solution

def decompose_latency(stage_times: dict[str, float]) -> dict:
    """Return total latency plus each stage's time and percentage share."""
    total = sum(stage_times.values())
    result = {"total_latency": round(total, 4)}
    for stage, t in stage_times.items():
        # Guard against division by zero when every stage time is 0.
        pct = (t / total * 100) if total > 0 else 0.0
        result[stage] = {"time": round(t, 4), "percentage": round(pct, 2)}
    return result
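
The solution assumes the per-stage timings are already available. One way they might be collected is sketched below, using `time.perf_counter` and a small context manager. The names `timed` and `run_request` are hypothetical, and the stage bodies are placeholders (`time.sleep` and character codes) standing in for a real tokenizer and model.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, stage_times: dict[str, float]):
    """Add the wall-clock time of the enclosed block to stage_times[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        stage_times[stage] = stage_times.get(stage, 0.0) + elapsed


def run_request(prompt: str) -> tuple[str, dict[str, float]]:
    """Hypothetical pipeline; every stage body is a stand-in, not a real model."""
    stage_times: dict[str, float] = {}
    with timed("tokenization", stage_times):
        token_ids = [ord(c) for c in prompt]  # placeholder tokenizer
    with timed("prefill", stage_times):
        time.sleep(0.01)  # placeholder for the prompt forward pass
    with timed("decode", stage_times):
        output_ids = []
        for tid in token_ids[:5]:  # placeholder: "generate" five tokens
            time.sleep(0.002)  # placeholder for one decode step
            output_ids.append(tid)
    with timed("detokenization", stage_times):
        text = "".join(chr(i) for i in output_ids)  # placeholder detokenizer
    return text, stage_times


text, stage_times = run_request("hello world")
print(text, stage_times)
```

The resulting `stage_times` dictionary has the shape that `decompose_latency` expects. Accumulating into the dictionary, rather than overwriting, lets a stage be timed in several separate blocks.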

Explanation

Sum all stage times to get total end-to-end latency.
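For each stage, divide its time by the total and multiply by 100 to get its percentage share. If the total is 0, every share is reported as 0.0 to avoid division by zero.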
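
As a worked example of these two steps, the snippet below uses made-up timings in seconds. The values are illustrative assumptions, not measurements.

```python
# Illustrative stage timings in seconds (assumed values, not measured).
stage_times = {
    "tokenization": 0.002,
    "prefill": 0.150,
    "decode": 1.840,
    "detokenization": 0.008,
}

# Step 1: total end-to-end latency is the sum of all stage times.
total = sum(stage_times.values())

# Step 2: each stage's share of the total, as a percentage.
percentages = {
    stage: round(t / total * 100, 2) for stage, t in stage_times.items()
}

print(round(total, 4))  # 2.0
print(percentages)
# {'tokenization': 0.1, 'prefill': 7.5, 'decode': 92.0, 'detokenization': 0.4}
```

The percentages sum to 100 up to rounding error.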
Typical LLM inference stages include:
- Tokenization: converting input text to token IDs (usually negligible).
- Prefill: processing all prompt tokens through the model in one forward pass.
- Decode: autoregressive generation of output tokens, one at a time.
- Detokenization: converting output token IDs back to text.
The decode stage usually dominates when the output is long, because every output token requires its own forward pass. Prefill dominates when a long prompt is paired with a short output.
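
The trade-off between prefill and decode can be illustrated with a deliberately simple cost model. This is a rough sketch: it assumes both stages scale linearly with token count, and the per-token costs are hypothetical defaults chosen only for illustration. Real systems deviate from this (attention cost grows faster than linearly with prompt length, and batching changes per-token cost). The names `estimate_stage_times` and `dominant_stage` are not part of the original solution.

```python
def estimate_stage_times(
    n_prompt_tokens: int,
    n_output_tokens: int,
    prefill_s_per_token: float = 0.0002,  # hypothetical cost, not a benchmark
    decode_s_per_token: float = 0.02,     # hypothetical cost, not a benchmark
) -> dict[str, float]:
    """Toy linear model: each stage's time is token count times a fixed cost."""
    return {
        "prefill": n_prompt_tokens * prefill_s_per_token,
        "decode": n_output_tokens * decode_s_per_token,
    }


def dominant_stage(stage_times: dict[str, float]) -> str:
    """Return the name of the stage with the largest time."""
    return max(stage_times, key=stage_times.get)


# Long prompt, short output: prefill takes the larger share.
long_prompt = estimate_stage_times(n_prompt_tokens=8000, n_output_tokens=10)
print(dominant_stage(long_prompt))  # prefill

# Short prompt, long output: decode takes the larger share.
long_output = estimate_stage_times(n_prompt_tokens=100, n_output_tokens=500)
print(dominant_stage(long_output))  # decode
```

Under these assumed costs, one decode step is 100 times more expensive than prefilling one prompt token, which is why a few hundred output tokens can outweigh thousands of prompt tokens.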

Complexity

Time: O(s) where s is the number of stages
Space: O(s) for the result dictionary
