Calculate Computational Efficiency of MoE

#123 · Deep Learning · Easy

Problem

Calculate the computational efficiency gain of a Mixture of Experts (MoE) model compared to a dense model. Given the total number of experts, the number of experts active per token (top-k), and the per-token FLOPs of the equivalent dense model, compute the effective FLOPs used per token and the resulting efficiency gain (dense FLOPs divided by effective FLOPs).

Solution

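def moe_efficiency(total_experts: int, active_experts: int, dense_flops: float) -> dict:
    """Return per-token active FLOPs and the efficiency gain over the dense model."""
    # Assume the dense FLOPs are split evenly across all experts.
    expert_flops = dense_flops / total_experts
    # Only the top-k experts run for each token.
    active_flops = active_experts * expert_flops
    # Simplified gating cost: one FLOP per expert scored by the router.
    gating_flops = total_experts
    total_active_flops = active_flops + gating_flops
    # A gain above 1 means the MoE model uses fewer FLOPs per token than the dense model.
    efficiency_gain = dense_flops / total_active_flops
    return {
        "active_flops": round(total_active_flops, 2),
        "efficiency_gain": round(efficiency_gain, 2),
    }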

Explanation

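  1. The dense model's FLOPs are assumed to be split evenly across experts, so each expert costs dense_flops / total_experts.
  2. Only active_experts (top-k) are activated per token, so the expert compute per token is active_experts * (dense_flops / total_experts).
  3. A simplified gating cost of one FLOP per expert (total_experts in total) is added to account for the router scoring every expert.
  4. The efficiency gain is the ratio of dense FLOPs to total active FLOPs (expert compute plus gating cost). When the gating cost is negligible, this ratio approaches total_experts / active_experts.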
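
The steps above can be traced with concrete numbers. The sketch below restates the solution's function so it runs on its own, then evaluates it for 8 experts, top-2 routing, and a dense cost of 1e9 FLOPs per token. These input values are illustrative assumptions, not taken from the problem's test cases.

```python
def moe_efficiency(total_experts: int, active_experts: int, dense_flops: float) -> dict:
    # Same logic as the solution: even split across experts plus a
    # simplified gating cost of one FLOP per expert.
    expert_flops = dense_flops / total_experts
    active_flops = active_experts * expert_flops
    gating_flops = total_experts
    total_active_flops = active_flops + gating_flops
    efficiency_gain = dense_flops / total_active_flops
    return {
        "active_flops": round(total_active_flops, 2),
        "efficiency_gain": round(efficiency_gain, 2),
    }


# Illustrative inputs (assumed for this example only).
result = moe_efficiency(total_experts=8, active_experts=2, dense_flops=1e9)

# Step 1: 1e9 / 8 = 1.25e8 FLOPs per expert
# Step 2: 2 * 1.25e8 = 2.5e8 FLOPs for the two active experts
# Step 3: 2.5e8 + 8 = 250000008 FLOPs including gating
# Step 4: 1e9 / 250000008 is just under 4, which rounds to 4.0
print(result)  # {'active_flops': 250000008.0, 'efficiency_gain': 4.0}
```

With a gating cost this small relative to the expert compute, the gain is effectively total_experts / active_experts = 8 / 2 = 4.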

Complexity

  • Time: O(1)
  • Space: O(1)