Implement Sigmoid MoE Router with Bias Correction

#458 · Deep Learning · Hard

Problem

Implement a Sigmoid Mixture-of-Experts (MoE) router with bias correction. Unlike a softmax router that produces a probability distribution, a sigmoid router scores each expert independently with a sigmoid, then normalizes. Include a learned bias term per expert to correct for load imbalance.
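
The contrast with softmax can be seen on a small example. The sketch below is illustrative only: the helper names `softmax` and `sigmoid_scores` and the sample logits are assumptions, not part of the problem statement.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]

def sigmoid_scores(logits):
    # Each logit is squashed on its own; no coupling between experts.
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Hypothetical logits for three experts; two experts match the token equally well.
logits = [2.0, 2.0, -1.0]
soft = softmax(logits)
sig = sigmoid_scores(logits)

print("softmax:", [round(v, 3) for v in soft])  # scores compete and sum to 1
print("sigmoid:", [round(v, 3) for v in sig])   # scores are independent
```

With softmax, the two strong experts must split the probability mass, so neither can exceed 0.5. With sigmoid, both score close to 0.88, and the scores only sum to 1 after the selected ones are renormalized.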

Solution

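import math

def sigmoid(x: float) -> float:
    # Clamp the input so math.exp(-x) cannot overflow for very negative x.
    x = max(-500.0, min(500.0, x))
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_moe_router(
    token_embeddings: list[list[float]],
    expert_weights: list[list[float]],
    expert_bias: list[float],
    top_k: int
) -> tuple[list[list[int]], list[list[float]]]:
    # B = number of tokens in the batch, E = number of experts.
    B, E = len(token_embeddings), len(expert_weights)
    all_indices, all_scores = [], []
    for b in range(B):
        scored = []
        for e in range(E):
            # Logit for expert e: bias plus dot product of token and expert weights.
            dot = expert_bias[e] + sum(token_embeddings[b][j] * expert_weights[e][j] for j in range(len(token_embeddings[0])))
            scored.append((sigmoid(dot), e))
        # Stable sort by descending score, so ties keep the lower expert index first.
        scored.sort(key=lambda x: -x[0])
        top = scored[:top_k]
        # Sum of the selected scores; fall back to 1.0 to avoid dividing by zero.
        s = sum(v for v, _ in top) or 1.0
        all_indices.append([idx for _, idx in top])
        # Renormalize the selected scores so they sum to 1.
        all_scores.append([v / s for v, _ in top])
    return all_indices, all_scores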

Explanation

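  1. Logits: Compute the dot product of each token embedding with each expert weight vector, then add a per-expert bias. Raising the bias steers more tokens toward an underused expert; lowering it steers tokens away from an overused one. In this solution the bias is added before the sigmoid, so it shifts both which experts are selected and their normalized weights.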
  2. Sigmoid scoring: Apply sigmoid independently to each expert logit. Unlike softmax, experts are scored non-competitively, so multiple experts can have high scores simultaneously.
  3. Top-k selection: Select the top_k experts with the highest sigmoid scores for each token.
  4. Normalization: Divide each selected score by the sum of the selected scores so the weights sum to 1, which makes the combined expert output a weighted average.
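
The four steps above can be traced for a single token. This is a minimal sketch, not the reference solution: the helper `route_one` and all the numbers are made up for illustration, and it skips the overflow clamp because the logits here are small.

```python
import math

def route_one(token, weights, bias, top_k):
    # Step 1: logit per expert = bias + dot(token, expert weight vector).
    logits = [b + sum(x * w for x, w in zip(token, row))
              for row, b in zip(weights, bias)]
    # Step 2: independent sigmoid score per expert.
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Step 3: indices of the top_k highest scores (stable, so ties keep index order).
    order = sorted(range(len(scores)), key=lambda e: -scores[e])[:top_k]
    # Step 4: renormalize the selected scores so they sum to 1.
    total = sum(scores[e] for e in order) or 1.0
    return order, [scores[e] / total for e in order]

# Hypothetical token and three experts in a 2-dimensional embedding space.
token = [1.0, 0.0]
weights = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Without bias the logits are [2, 1, 0], so experts 0 and 1 are selected.
no_bias = route_one(token, weights, [0.0, 0.0, 0.0], top_k=2)
# A bias of 3 on expert 2 lifts its logit to 3, so it now ranks first.
with_bias = route_one(token, weights, [0.0, 0.0, 3.0], top_k=2)

print(no_bias)
print(with_bias)
```

Expert 2 never matches this token through its weights alone, yet a large enough bias pulls it into the top 2. That is the lever the bias provides for rebalancing load.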

Complexity

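  • Time: O(B * E * d + B * E log E), where B is the batch size, E is the number of experts, and d is the embedding dimension. The first term is the logit computation; the second is the per-token sort.
  • Space: O(E) working space for one token's scores, plus O(B * top_k) for the returned indices and weights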
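
The solution fully sorts each token's E scores. When top_k is much smaller than E, a heap-based selection is a possible alternative. The sketch below uses the standard library's `heapq.nlargest`; the helper name `top_k_indices` and the sample scores are assumptions for illustration.

```python
import heapq

def top_k_indices(scores, k):
    # Return the indices of the k largest scores, highest first,
    # without sorting the whole list.
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

# Hypothetical sigmoid scores for four experts.
scores = [0.88, 0.73, 0.95, 0.50]
picked = top_k_indices(scores, 2)
print(picked)
```

This changes only how the top-k step is computed. The logit computation still dominates whenever d is larger than log E.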