DeepSeekMoE, 2024

Fine-grained / many experts
Each MoE layer is split into two groups of experts:
1. Shared experts (a small, fixed set)
2. Routed experts (the large MoE pool)
For every token:
- All shared experts are executed unconditionally (no router decision).
- Only a subset of the routed experts is executed: the router scores every routed expert and selects the top-K.
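
The per-token flow above can be sketched in NumPy as follows. This is a minimal illustration and not the reference implementation. The helper names (`make_expert`, `run_expert`, `moe_layer`), the two-matrix ReLU expert, the softmax router, the residual connection, and all sizes are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_expert(rng, d_model, d_hidden):
    # Hypothetical tiny feed-forward expert: two weight matrices.
    w_in = rng.standard_normal((d_model, d_hidden)) * 0.1
    w_out = rng.standard_normal((d_hidden, d_model)) * 0.1
    return w_in, w_out

def run_expert(expert, x):
    w_in, w_out = expert
    return np.maximum(x @ w_in, 0.0) @ w_out  # ReLU feed-forward

def moe_layer(x, shared, routed, w_router, top_k):
    """x has shape (tokens, d_model). Returns (output, chosen expert indices)."""
    out = x.copy()  # residual connection (assumption)

    # Shared experts: run on every token, no router decision.
    for expert in shared:
        out += run_expert(expert, x)

    # Router: one score per (token, routed expert), then top-K per token.
    scores = softmax(x @ w_router, axis=-1)            # (tokens, n_routed)
    top_idx = np.argsort(-scores, axis=-1)[:, :top_k]  # (tokens, top_k)

    # Routed experts: each one runs only on the tokens that selected it.
    for e, expert in enumerate(routed):
        mask = (top_idx == e).any(axis=-1)
        if not mask.any():
            continue
        gate = scores[mask, e][:, None]  # router score used as the gate weight
        out[mask] += gate * run_expert(expert, x[mask])
    return out, top_idx

# Illustrative sizes only: 2 shared experts, 8 routed experts, top-K = 2.
rng = np.random.default_rng(0)
d_model, d_hidden, n_tokens = 16, 8, 5
shared = [make_expert(rng, d_model, d_hidden) for _ in range(2)]
routed = [make_expert(rng, d_model, d_hidden) for _ in range(8)]
w_router = rng.standard_normal((d_model, len(routed))) * 0.1
x = rng.standard_normal((n_tokens, d_model))

y, chosen = moe_layer(x, shared, routed, w_router, top_k=2)
print(y.shape, chosen.shape)
```

In this sketch each token always pays for the 2 shared experts plus its 2 selected routed experts, so 4 of the 10 experts run per token regardless of how large the routed pool is.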