Calculate Computational Efficiency of MoE

#123 · Deep Learning · Easy

Problem

Calculate the computational efficiency gain of a Mixture of Experts (MoE) model compared to a dense model. Given the total number of experts, the number of active experts per token (top-k), and the total FLOPs of the dense equivalent, compute the effective FLOPs used per token and the resulting efficiency gain, each rounded to two decimal places.

Solution

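def moe_efficiency(total_experts: int, active_experts: int, dense_flops: float) -> dict:
    """Return per-token active FLOPs and the gain over the dense model."""
    # Guard against division by zero and an impossible top-k.
    if not 0 < active_experts <= total_experts:
        raise ValueError("require 0 < active_experts <= total_experts")
    # Assume the dense FLOPs are split evenly across all experts.
    expert_flops = dense_flops / total_experts
    # Only the top-k experts run for each token.
    active_flops = active_experts * expert_flops
    # Simplified gating cost: one FLOP per expert.
    gating_flops = total_experts
    total_active_flops = active_flops + gating_flops
    efficiency_gain = dense_flops / total_active_flops
    return {
        "active_flops": round(total_active_flops, 2),
        "efficiency_gain": round(efficiency_gain, 2),
    }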
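
As a quick check, the sketch below runs the function for a few top-k values. The configuration (8 experts, 1e9 dense FLOPs per token) is hypothetical and chosen only for illustration. The function body repeats the arithmetic of the solution above so that the snippet runs on its own.

```python
def moe_efficiency(total_experts: int, active_experts: int, dense_flops: float) -> dict:
    # Same arithmetic as the solution above, repeated so this snippet is standalone.
    expert_flops = dense_flops / total_experts
    total_active_flops = active_experts * expert_flops + total_experts
    return {
        "active_flops": round(total_active_flops, 2),
        "efficiency_gain": round(dense_flops / total_active_flops, 2),
    }


# Hypothetical configuration: 8 experts, 1e9 dense FLOPs per token.
for k in (1, 2, 8):
    print(k, moe_efficiency(8, k, 1e9))
# 1 {'active_flops': 125000008.0, 'efficiency_gain': 8.0}
# 2 {'active_flops': 250000008.0, 'efficiency_gain': 4.0}
# 8 {'active_flops': 1000000008.0, 'efficiency_gain': 1.0}
```

With a gating cost this small, the gain is close to total_experts / active_experts. Activating every expert brings it back to about 1, and the unrounded value sits slightly below 1 because of the gating term.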

Explanation

The model assumes the dense model's FLOPs are split evenly across the experts, so each expert costs dense_flops / total_experts.
Only active_experts (top-k) run for each token, so the expert compute per token is active_experts * (dense_flops / total_experts).
A simplified gating cost of one FLOP per expert (total_experts in all) is added on top.
The efficiency gain is dense_flops divided by the total active FLOPs, gating included.
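
The same steps can be written as a single formula. The symbols N (total_experts), k (active_experts), and F (dense_flops) are shorthand introduced here, not part of the original problem:

```latex
\[
F_{\text{active}} = k \cdot \frac{F}{N} + N,
\qquad
\text{gain} = \frac{F}{F_{\text{active}}}
            = \frac{F}{kF/N + N}
            = \frac{N}{k + N^{2}/F}
            \approx \frac{N}{k}
            \quad \text{when } N^{2} \ll kF .
\]
```

Under this simplified model the gating term only matters when N^2 is comparable to kF. Otherwise the gain is essentially the ratio of total to active experts.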

Complexity

Time: O(1)
Space: O(1)
