
End-to-End Latency Decomposition

#413 · Machine Learning · Medium


Problem

Decompose the end-to-end latency of an LLM inference request into its component parts: tokenization time, prefill (prompt processing) time, decode (token generation) time, and detokenization time. Given a mapping from stage name to stage time, compute the total latency and each stage's percentage contribution to that total.

Solution

def decompose_latency(stage_times: dict[str, float]) -> dict:
    """Return the total latency plus each stage's time and percentage share."""
    total = sum(stage_times.values())
    result = {"total_latency": round(total, 4)}
    for stage, t in stage_times.items():
        # Guard against division by zero when no time was recorded.
        pct = (t / total * 100) if total > 0 else 0.0
        result[stage] = {"time": round(t, 4), "percentage": round(pct, 2)}
    return result
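
As a quick check of the function, the sketch below calls it on a set of made-up timings. The values and the choice of seconds as the unit are assumptions for illustration, since the problem statement does not fix a unit. The function body is repeated only so the example runs on its own.

```python
def decompose_latency(stage_times: dict[str, float]) -> dict:
    # Same logic as the solution above, repeated so this example runs on its own.
    total = sum(stage_times.values())
    result = {"total_latency": round(total, 4)}
    for stage, t in stage_times.items():
        pct = (t / total * 100) if total > 0 else 0.0
        result[stage] = {"time": round(t, 4), "percentage": round(pct, 2)}
    return result


# Hypothetical timings in seconds; the values are illustrative, not measured.
example_times = {
    "tokenization": 0.002,
    "prefill": 0.150,
    "decode": 1.200,
    "detokenization": 0.003,
}

report = decompose_latency(example_times)
print(report["total_latency"])          # 1.355
print(report["decode"]["percentage"])   # 88.56
print(report["prefill"]["percentage"])  # 11.07

# With no recorded time, every share falls back to 0.0 instead of dividing by zero.
empty_report = decompose_latency({"prefill": 0.0, "decode": 0.0})
print(empty_report["decode"]["percentage"])  # 0.0
```

Because each percentage is rounded to two decimals independently, the reported shares may not always sum to exactly 100.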

Explanation

  1. Sum all stage times to get the total end-to-end latency.
  2. For each stage, compute its share of the total as a percentage: stage time / total × 100. If the total is zero, every share is reported as 0.0 to avoid dividing by zero.
  3. Typical LLM inference stages include:
     - Tokenization: converting input text to token IDs (usually negligible).
     - Prefill: processing all prompt tokens through the model, typically in one forward pass.
     - Decode: autoregressive generation of output tokens, one at a time.
     - Detokenization: converting output token IDs back to text.
  4. The decode stage usually dominates for long outputs, while prefill dominates for long prompts with short outputs.
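
The last point can be illustrated with a rough sketch. It assumes a deliberately simplified linear cost model in which prefill time grows with the number of prompt tokens and decode time grows with the number of output tokens. The per-token constants and the helper names `estimate_stage_times` and `dominant_stage` are hypothetical, and real systems will not follow a purely linear model. Tokenization and detokenization are left out because they are usually negligible.

```python
# Hypothetical per-token costs in seconds (assumed for illustration only).
PREFILL_SECONDS_PER_PROMPT_TOKEN = 0.0002
DECODE_SECONDS_PER_OUTPUT_TOKEN = 0.02


def estimate_stage_times(prompt_tokens: int, output_tokens: int) -> dict[str, float]:
    """Estimate prefill and decode time under a simple linear cost model."""
    return {
        "prefill": prompt_tokens * PREFILL_SECONDS_PER_PROMPT_TOKEN,
        "decode": output_tokens * DECODE_SECONDS_PER_OUTPUT_TOKEN,
    }


def dominant_stage(stage_times: dict[str, float]) -> str:
    """Return the name of the stage with the largest time."""
    return max(stage_times, key=stage_times.get)


# Short prompt, long output: decode accounts for most of the latency.
long_output = estimate_stage_times(prompt_tokens=200, output_tokens=500)
# Long prompt, short output: prefill accounts for most of the latency.
long_prompt = estimate_stage_times(prompt_tokens=8000, output_tokens=10)

print(dominant_stage(long_output))  # decode
print(dominant_stage(long_prompt))  # prefill
```

Under these assumed constants a single decoded token costs as much as 100 prompt tokens, which is why the output length tends to drive total latency unless the prompt is very long.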

Complexity

  • Time: O(s) where s is the number of stages
  • Space: O(s) for the result dictionary