Number Format Precision Comparison (FP16 vs BF16 vs FP8 vs FP4)

#428 · Deep Learning · Hard

Problem

Compare the precision characteristics of different number formats used in deep learning: FP16, BF16, FP8 (E4M3 and E5M2), and FP4 (E2M1). For each format, compute the maximum representable value, minimum positive normal value, machine epsilon, and the number of representable values.
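As a reference point, the four quantities have closed forms under an IEEE 754-style layout. The symbols are introduced here for illustration: $e$ exponent bits, $m$ mantissa bits, bias $b = 2^{e-1} - 1$, and the all-ones exponent reserved for inf/NaN. The OCP definitions of FP8 E4M3 and FP4 E2M1 do not reserve the whole all-ones exponent, so their maximum value and value count differ from these formulas.

```latex
\begin{aligned}
x_{\max} &= \left(2 - 2^{-m}\right) \cdot 2^{\,(2^{e} - 2) - b} \\
x_{\min,\text{normal}} &= 2^{\,1 - b} \\
\varepsilon &= 2^{-m} \\
N_{\text{finite}} &= \underbrace{2\,(2^{e} - 2)\,2^{m}}_{\text{normals}}
  + \underbrace{2\,(2^{m} - 1)}_{\text{subnormals}}
  + \underbrace{2}_{\pm 0}
  = 2\,(2^{e} - 1)\,2^{m}
\end{aligned}
```

For FP16 ($e = 5$, $m = 10$, $b = 15$) this gives $x_{\max} = (2 - 2^{-10}) \cdot 2^{15} = 65504$ and $N_{\text{finite}} = 2 \cdot 31 \cdot 1024 = 63488$.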

Solution

from typing import Optional


def number_format_comparison() -> list[dict]:
    formats = []

    def compute_props(name: str, sign_bits: int, exp_bits: int, man_bits: int,
                      exp_bias: Optional[int] = None, special: str = "ieee") -> dict:
        # `special` says how the all-ones exponent field is used:
        #   "ieee"     - reserved for inf/NaN (FP16, BF16, FP8 E5M2)
        #   "nan_only" - finite, except the all-ones mantissa is NaN (OCP FP8 E4M3)
        #   "none"     - every code is finite (OCP MX FP4 E2M1)
        if exp_bias is None:
            exp_bias = (1 << (exp_bits - 1)) - 1

        # Largest exponent field that still encodes finite normal values
        top_exp_field = (1 << exp_bits) - (2 if special == "ieee" else 1)
        max_exp = top_exp_field - exp_bias
        # Largest mantissa field usable at that exponent
        top_man = (1 << man_bits) - (2 if special == "nan_only" else 1)
        # Max value: (1 + top_man / 2^man_bits) * 2^max_exp
        max_val = (1.0 + top_man / (1 << man_bits)) * (2.0 ** max_exp)
        # Min positive normal: 1.0 * 2^(1 - bias)
        min_normal = 2.0 ** (1 - exp_bias)
        # Machine epsilon (gap between 1.0 and the next value): 2^(-man_bits)
        epsilon = 2.0 ** (-man_bits)
        # Min positive subnormal: 2^(1 - bias) * 2^(-man_bits)
        min_subnormal = min_normal * (2.0 ** (-man_bits))
        # Finite representable values: normals, subnormals, +0 and -0 (inf/NaN excluded)
        num_normals = 2 * top_exp_field * (1 << man_bits)  # signs * exponents * mantissas
        if special == "nan_only":
            num_normals -= 2  # the +NaN and -NaN codes
        num_subnormals = 2 * ((1 << man_bits) - 1)  # signs * (mantissa != 0)
        total_values = num_normals + num_subnormals + 2  # +0, -0

        return {
            "name": name,
            "total_bits": sign_bits + exp_bits + man_bits,
            "sign_bits": sign_bits,
            "exponent_bits": exp_bits,
            "mantissa_bits": man_bits,
            "exp_bias": exp_bias,
            "max_value": round(max_val, 6),
            "min_normal": min_normal,
            "min_subnormal": min_subnormal,
            "epsilon": epsilon,
            "num_representable": total_values
        }

    formats.append(compute_props("FP16", 1, 5, 10))                         # IEEE 754 half
    formats.append(compute_props("BF16", 1, 8, 7))                          # Brain float
    formats.append(compute_props("FP8_E4M3", 1, 4, 3, special="nan_only"))  # FP8 E4M3, max 448
    formats.append(compute_props("FP8_E5M2", 1, 5, 2))                      # FP8 E5M2
    formats.append(compute_props("FP4_E2M1", 1, 2, 1, special="none"))      # FP4 E2M1, max 6

    return formats
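One way to sanity-check closed-form values like these is to decode every bit pattern of a small format and inspect the results directly. The sketch below does that for FP8 E5M2, assuming IEEE 754-style fields (implicit leading 1, subnormals, all-ones exponent reserved for inf/NaN). The helper names `decode_ieee_like` and `enumerate_finite` are made up for this illustration and are not part of the solution above.

```python
import math


def decode_ieee_like(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode one bit pattern, assuming IEEE 754-style fields and bias."""
    bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp_field = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man_field = bits & ((1 << man_bits) - 1)
    if exp_field == (1 << exp_bits) - 1:
        # All-ones exponent: infinity (mantissa 0) or NaN (mantissa != 0).
        return sign * math.inf if man_field == 0 else math.nan
    if exp_field == 0:
        # Zero or subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * man_field * 2.0 ** (1 - bias - man_bits)
    # Normal: implicit leading 1.
    return sign * (1.0 + man_field / (1 << man_bits)) * 2.0 ** (exp_field - bias)


def enumerate_finite(exp_bits: int, man_bits: int) -> list[float]:
    """Decode all 2^(1 + exp_bits + man_bits) patterns and keep the finite ones."""
    total_bits = 1 + exp_bits + man_bits
    values = [decode_ieee_like(b, exp_bits, man_bits) for b in range(1 << total_bits)]
    return [v for v in values if math.isfinite(v)]


finite = enumerate_finite(exp_bits=5, man_bits=2)  # FP8 E5M2
positives = sorted(v for v in finite if v > 0)
print("finite codes:", len(finite))
print("max value:", max(finite))
print("min positive subnormal:", positives[0])
```

For E5M2 the enumeration finds 248 finite codes (256 patterns minus the 8 with an all-ones exponent), a maximum of 57344, and a smallest positive value of 2^-16. Brute force is only practical for narrow formats, but it is independent of the formulas, which makes it a useful cross-check.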

Explanation

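  1. FP16 (1-5-10): Max 65504, epsilon 2^-10 ≈ 9.77e-4. The highest precision of the five formats, with moderate range. Widely used in mixed-precision training.
  2. BF16 (1-8-7): Max ≈ 3.39e38, epsilon 2^-7 ≈ 7.81e-3. It has the same 8 exponent bits as FP32, so nearly the same range, but fewer mantissa bits than FP16. Popular for training.
  3. FP8 E4M3 (1-4-3): Max 448, epsilon 0.125. The OCP FP8 definition reaches 448 by using the all-ones exponent for finite values and reserving only its all-ones mantissa for NaN; reserving the whole exponent IEEE-style would cap the maximum at 240. Balanced for weights and activations in inference.
  4. FP8 E5M2 (1-5-2): Max 57344, epsilon 0.25. More range and less precision than E4M3. Good for gradients.
  5. FP4 E2M1 (1-2-1): Max 6, epsilon 0.5. It has no inf or NaN encodings, so all 16 codes are finite; an IEEE-style reserved exponent would cap the maximum at 3. Precision is so low that it needs block scaling (MXFP4).
  6. The key trade-off is dynamic range (exponent bits) versus precision (mantissa bits): at a fixed bit width, every bit given to the exponent is taken from the mantissa.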
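To make the FP4 point concrete, the sketch below lists the eight non-negative E2M1 magnitudes and rounds a small block of values to them after dividing by a shared power-of-two scale. The helper `quantize_block` and its scale rule (the smallest power of two that brings the largest magnitude down to at most 6) are assumptions chosen for illustration, not the exact scale selection of the MX specification.

```python
import math

# Non-negative E2M1 magnitudes (bias 1, one subnormal, no inf/NaN encodings).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block: list[float]) -> tuple[float, list[float]]:
    """Round a block to E2M1 values times one shared power-of-two scale.

    Hypothetical helper: the scale rule here is illustrative only.
    """
    amax = max(abs(x) for x in block)
    # Smallest power of two such that amax / scale <= 6 (the E2M1 maximum).
    scale = 1.0 if amax == 0 else 2.0 ** math.ceil(math.log2(amax / 6.0))
    quantized = []
    for x in block:
        # Nearest grid magnitude to the scaled input, then restore sign and scale.
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        quantized.append(math.copysign(nearest, x) * scale)
    return scale, quantized


scale, q = quantize_block([0.02, -0.11, 0.3, 0.75])
print("shared scale:", scale)
print("dequantized block:", q)
```

With this input the shared scale comes out as 0.125 and the block dequantizes to [0.0, -0.125, 0.25, 0.75]. The largest element is preserved exactly, while the smallest one falls below half a grid step and rounds to zero, which is the kind of error a per-block scale can reduce but not eliminate.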

Complexity

  • Time: O(1) - fixed number of formats
  • Space: O(1)