#438 · Deep Learning · Medium
⊣ Solve on deep-ml.comCompute the communication cost of an all-reduce operation in tensor parallelism. Given the tensor size in elements, element byte size, the number of GPUs (tensor parallel degree), and the inter-GPU bandwidth, calculate the total data transferred per GPU and the time for the all-reduce using the ring all-reduce algorithm.
def tensor_parallel_allreduce_cost(
num_elements: int,
element_bytes: int,
tp_degree: int,
bandwidth_gbps: float
) -> dict:
tensor_bytes = num_elements * element_bytes
data_per_gpu = tensor_bytes * 2 * (tp_degree - 1) / tp_degree
bandwidth_bytes_per_sec = bandwidth_gbps * 1e9 / 8
latency_sec = data_per_gpu / bandwidth_bytes_per_sec
latency_ms = latency_sec * 1000
return {
"tensor_bytes": tensor_bytes,
"data_per_gpu_bytes": round(data_per_gpu, 2),
"latency_ms": round(latency_ms, 4)
}num_elements * element_bytes.tensor_bytes * (tp_degree - 1) / tp_degree in both the reduce-scatter and all-gather phases, giving a factor of 2.bandwidth_gbps * 1e9 / 8.