Calculate the computational efficiency gain of a Mixture of Experts (MoE) model compared to a dense model. Given the total number of experts, the number of active experts per token (top-k), and the total model FLOPs for the dense equivalent, compute the effective FLOPs used per token.
def moe_efficiency(total_experts: int, active_experts: int, dense_flops: float) -> dict:
    # FLOPs attributed to a single expert: an equal share of the dense model's FLOPs
    expert_flops = dense_flops / total_experts
    # FLOPs spent in the top-k experts selected for a token
    active_flops = active_experts * expert_flops
    gating_flops = total_experts  # simplified gating cost
    total_active_flops = active_flops + gating_flops
    # Ratio of dense compute to MoE compute per token
    efficiency_gain = dense_flops / total_active_flops
    return {
        "active_flops": round(total_active_flops, 2),
        "efficiency_gain": round(efficiency_gain, 2)
    }

Each expert accounts for 1/total_experts of the dense model's FLOPs.
Only active_experts (top-k) are activated per token, so effective compute is active_experts * (dense_flops / total_experts), before the simplified gating cost is added.
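
As a quick check of the formula, here is a minimal worked example. The input values (8 experts, top-2 routing, 1e9 dense FLOPs) are hypothetical and chosen only for illustration. The function is repeated in compact form so the snippet runs on its own.

```python
def moe_efficiency(total_experts: int, active_experts: int, dense_flops: float) -> dict:
    # Same formula as the solution: equal per-expert share of the dense FLOPs,
    # times the number of active experts, plus a simplified gating cost.
    expert_flops = dense_flops / total_experts
    total_active_flops = active_experts * expert_flops + total_experts
    return {
        "active_flops": round(total_active_flops, 2),
        "efficiency_gain": round(dense_flops / total_active_flops, 2),
    }

# Hypothetical inputs: 8 experts, top-2 routing, 1e9 dense FLOPs per token.
result = moe_efficiency(8, 2, 1e9)
print(result)  # {'active_flops': 250000008.0, 'efficiency_gain': 4.0}

# With every expert active, the gating term pushes the gain slightly below 1.
all_active = moe_efficiency(4, 4, 100.0)
print(all_active)  # {'active_flops': 104.0, 'efficiency_gain': 0.96}
```

When dense_flops is large relative to total_experts, the gating term is negligible and the gain approaches total_experts / active_experts (here 8 / 2 = 4).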