GPU Ops:Byte Ratio Calculation from Spec Sheet

#421 · Inference · Easy

Problem

Given a GPU spec sheet with peak FLOP/s and peak memory bandwidth (bytes/s), compute the ops:byte ratio (also known as the machine balance or ridge point of the roofline model). This ratio tells you the minimum arithmetic intensity an operation needs to be compute-bound on this hardware.
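The term "ridge point" can be made concrete with the roofline bound. This is a sketch in notation assumed here, not taken from the problem statement: $P_{\text{peak}}$ is peak FLOP/s, $B_{\text{peak}}$ is peak bandwidth in bytes/s, and $I$ is an operation's arithmetic intensity in FLOP/byte.

```latex
\[
P_{\text{attainable}}(I) = \min\bigl(P_{\text{peak}},\; I \cdot B_{\text{peak}}\bigr),
\qquad
I^{*} = \frac{P_{\text{peak}}}{B_{\text{peak}}}
\]
```

The two arguments of the minimum are equal at $I = I^{*}$, which is the ops:byte ratio: below it the bandwidth term is the smaller one, and at or above it the compute peak is.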

Solution

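def gpu_ops_byte_ratio(
    peak_flops: float,
    peak_bandwidth_bytes_per_sec: float,
    dtype: str = "fp16"
) -> dict:
    """Compute the ops:byte ratio (roofline ridge point) from spec-sheet peaks."""
    # Guard against a zero or negative bandwidth, which would make the ratio meaningless.
    if peak_bandwidth_bytes_per_sec <= 0:
        raise ValueError("peak_bandwidth_bytes_per_sec must be positive")

    # FLOPs the GPU can execute per byte it can move to or from memory.
    ops_byte_ratio = peak_flops / peak_bandwidth_bytes_per_sec

    # Storage size of one element; unrecognized dtypes fall back to 2 bytes.
    dtype_bytes = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1, "int8": 1, "int4": 0.5}
    bpe = dtype_bytes.get(dtype, 2)

    # Ops:element ratio = ops:byte * bytes_per_element
    ops_per_element = ops_byte_ratio * bpe

    return {
        "ops_byte_ratio": round(ops_byte_ratio, 2),
        "dtype": dtype,
        "bytes_per_element": bpe,
        "ops_per_element": round(ops_per_element, 2),
        "interpretation": (
            f"An operation must perform at least {round(ops_byte_ratio, 1)} FLOPs "
            f"per byte transferred to be compute-bound. "
            f"For {dtype} ({bpe}B per element), that is {round(ops_per_element, 1)} "
            f"FLOPs per element."
        )
    }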

Explanation

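  1. The ops:byte ratio = peak FLOP/s / peak memory bandwidth (bytes/s). It is the number of FLOPs the GPU can execute in the time it takes to move one byte to or from memory.
  2. For example, NVIDIA A100 SXM: 312 TFLOP/s (dense FP16 Tensor Core) / about 2.0 TB/s = 156 FLOP/byte.
  3. Any operation with arithmetic intensity below this threshold is limited by memory bandwidth; only at or above it can the operation be compute-bound.
  4. Multiplying by bytes per element converts the ratio to ops per element, a more intuitive number: how many FLOPs you need per data element loaded or stored to keep the compute units busy.
  5. Bytes per element depend on the dtype, so the per-element threshold scales with element size: at 156 FLOP/byte, fp16 (2 bytes) needs 312 FLOPs per element and fp32 (4 bytes) needs 624. Spec sheets usually list a different peak FLOP/s for each dtype, so pass the peak that matches the dtype.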
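To see how the threshold is used in practice, here is a minimal sketch that compares an operation's arithmetic intensity with the ops:byte ratio. The helper `classify_kernel` and both workloads are hypothetical illustrations, not part of the original solution. The byte counts assume each tensor crosses the memory bus exactly once, which is an idealization.

```python
def classify_kernel(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Hypothetical helper: roofline check for a single operation."""
    ridge = peak_flops / peak_bandwidth      # ops:byte ratio (FLOP/byte)
    intensity = flops / bytes_moved          # arithmetic intensity (FLOP/byte)
    # Roofline bound: capped by compute or by bandwidth, whichever is lower.
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "compute" if intensity >= ridge else "memory"
    return {
        "ridge": ridge,
        "intensity": intensity,
        "attainable_flops": attainable,
        "bound": bound,
    }


# A100 SXM figures from the explanation above.
PEAK_FLOPS = 312e12      # FLOP/s
PEAK_BANDWIDTH = 2.0e12  # bytes/s

# Assumed workload 1: elementwise add of two fp16 vectors.
# 1 FLOP per element; 2 reads + 1 write of 2 bytes each = 6 bytes per element.
n = 1_000_000
add = classify_kernel(n, 6 * n, PEAK_FLOPS, PEAK_BANDWIDTH)

# Assumed workload 2: square fp16 matmul with N = 4096.
# 2 * N^3 FLOPs; three N x N matrices of 2-byte elements moved once each.
N = 4096
mm = classify_kernel(2 * N**3, 3 * N * N * 2, PEAK_FLOPS, PEAK_BANDWIDTH)

for name, result in (("vector add", add), ("matmul", mm)):
    print(f"{name}: {result['intensity']:.2f} FLOP/byte -> {result['bound']}-bound")
```

Under these assumptions the vector add sits far below the 156 FLOP/byte threshold and the matmul far above it, which is why elementwise operations are typically bandwidth-limited while large matrix multiplies can approach peak compute.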

Complexity

  • Time: O(1)
  • Space: O(1)