Tensor Parallelism All-Reduce Communication Cost

#438 · Deep Learning · Medium

Problem

Compute the communication cost of an all-reduce operation in tensor parallelism. Given the tensor size in elements, the size of each element in bytes, the number of GPUs (the tensor-parallel degree), and the inter-GPU bandwidth in Gbps (gigabits per second), calculate the tensor size in bytes, the data each GPU transfers, and the all-reduce time in milliseconds, assuming the ring all-reduce algorithm.
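
As a compact statement of the quantities asked for, the ring all-reduce model can be written as below. The symbols are shorthand introduced here, not part of the original problem: N is the number of elements, b the bytes per element, p the tensor-parallel degree, and B the bandwidth in Gbps.

```latex
\begin{aligned}
D_{\text{per GPU}} &= 2 \cdot \frac{p - 1}{p} \cdot N \cdot b \quad \text{(bytes)} \\
T &= \frac{D_{\text{per GPU}}}{B \cdot 10^{9} / 8} \quad \text{(seconds)}
\end{aligned}
```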

Solution

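def tensor_parallel_allreduce_cost(
    num_elements: int,
    element_bytes: int,
    tp_degree: int,
    bandwidth_gbps: float
) -> dict:
    """Estimate per-GPU traffic and bandwidth-bound time for a ring all-reduce."""
    if tp_degree < 1 or bandwidth_gbps <= 0:
        raise ValueError("tp_degree must be >= 1 and bandwidth_gbps must be > 0")

    # Size of the tensor being reduced, in bytes.
    tensor_bytes = num_elements * element_bytes

    # Reduce-scatter and all-gather each move (tp_degree - 1) / tp_degree
    # of the tensor per GPU, hence the factor of 2.
    data_per_gpu = tensor_bytes * 2 * (tp_degree - 1) / tp_degree

    # Gbps is gigabits per second: scale to bits, then divide by 8 for bytes.
    bandwidth_bytes_per_sec = bandwidth_gbps * 1e9 / 8

    # Bandwidth-bound transfer time, converted from seconds to milliseconds.
    latency_sec = data_per_gpu / bandwidth_bytes_per_sec
    latency_ms = latency_sec * 1000

    return {
        "tensor_bytes": tensor_bytes,
        "data_per_gpu_bytes": round(data_per_gpu, 2),
        "latency_ms": round(latency_ms, 4)
    }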

Explanation

Compute total tensor size: num_elements * element_bytes.
In ring all-reduce, each GPU sends (and receives) tensor_bytes * (tp_degree - 1) / tp_degree bytes in each of the two phases, reduce-scatter and all-gather, which gives the factor of 2.
Convert the link bandwidth from Gbps to bytes/sec: bandwidth_gbps * 1e9 / 8.
Divide the data volume per GPU by the bandwidth to get the communication time in seconds, then multiply by 1000 to report milliseconds.
This models only the bandwidth-bound component; real latency also includes a startup cost for each of the 2 * (tp_degree - 1) ring steps.
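
To make the factor of 2 and the startup-cost caveat concrete, here is a minimal sketch that simulates a ring all-reduce on plain Python lists, counts what each rank sends, and then adds a per-step startup term to the time estimate. The function names, the startup_us parameter, and the example numbers (4 ranks, 8 elements, 2-byte elements, 100 Gbps, 5 microseconds per step) are assumptions chosen for illustration, not part of the original solution.

```python
def simulate_ring_allreduce(data):
    """Sum-all-reduce over a simulated ring.

    data: one equal-length list per rank; the length must be divisible by
    the number of ranks. Returns (buffers, elements_sent_per_rank, steps).
    """
    p = len(data)
    if p == 0 or len(data[0]) % p != 0:
        raise ValueError("need at least one rank and a length divisible by it")
    chunk = len(data[0]) // p
    bufs = [list(d) for d in data]
    sent = [0] * p
    steps = 0

    def span(c):
        # Index range covered by chunk c.
        return slice(c * chunk, (c + 1) * chunk)

    for phase in ("reduce_scatter", "all_gather"):
        # After reduce-scatter, rank r owns the fully reduced chunk (r + 1) % p,
        # so the all-gather phase starts one chunk further along the ring.
        offset = 0 if phase == "reduce_scatter" else 1
        for s in range(p - 1):
            # Snapshot every outgoing message first so all ranks send "at once".
            msgs = []
            for r in range(p):
                c = (r + offset - s) % p
                msgs.append((r, c, bufs[r][span(c)]))
            for r, c, payload in msgs:
                dst = (r + 1) % p
                if phase == "reduce_scatter":
                    # Receiver accumulates the incoming partial sums.
                    bufs[dst][span(c)] = [a + b for a, b in zip(bufs[dst][span(c)], payload)]
                else:
                    # Receiver overwrites with the already-reduced chunk.
                    bufs[dst][span(c)] = payload
                sent[r] += len(payload)
            steps += 1
    return bufs, sent, steps


def allreduce_time_ms(bytes_sent, steps, bandwidth_gbps, startup_us=0.0):
    """Bandwidth term plus a hypothetical fixed startup cost per ring step."""
    bandwidth_bytes_per_sec = bandwidth_gbps * 1e9 / 8
    seconds = bytes_sent / bandwidth_bytes_per_sec + steps * startup_us * 1e-6
    return seconds * 1000


# Illustrative inputs (assumed): 4 ranks, 8 elements each, 2 bytes per element.
ranks, n, element_bytes = 4, 8, 2
data = [[(r + 1) * (i + 1) for i in range(n)] for r in range(ranks)]
bufs, sent, steps = simulate_ring_allreduce(data)

formula_elements = 2 * (ranks - 1) / ranks * n   # closed form from the explanation
bytes_per_rank = sent[0] * element_bytes

print("elements sent per rank:", sent, "closed form:", formula_elements)
print("ring steps:", steps)
print("bandwidth-only ms:", allreduce_time_ms(bytes_per_rank, steps, 100.0))
print("with 5 us startup per step, ms:", allreduce_time_ms(bytes_per_rank, steps, 100.0, 5.0))
```

For a payload this small, the assumed startup term dominates the bandwidth term, which is why the bandwidth-only estimate is most trustworthy for large tensors.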

Complexity

Time: O(1) computation
Space: O(1)
