Supports various forms of parallelism (everything except pipeline parallelism):
| Short Name | Flexible Parallelism Configurations | Benefit |
|---|---|---|
| E | Expert | Scales the model size by increasing the number of experts |
| E + D | Expert + Data | Accelerates training throughput by scaling to multiple data parallel groups |
| E + Z | Expert + ZeRO-powered data | Partitions the nonexpert parameters to support larger base models |
| E + D + M | Expert + Data + Model | Supports massive hidden sizes and even larger base models than E+Z |
| E + D + Z | Expert + Data + ZeRO-powered data | Supports massive hidden sizes and even larger base models than E+D+M |
| E + Z-Off + M | Expert + ZeRO-Offload + Model | Leverages both GPU and CPU memory for large MoE models on limited # of GPUs |
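
As a concrete example, the E + D style is roughly what the stock DeepSpeed MoE layer gives you: experts are sharded across an expert-parallel group of size `ep_size`, and the remaining data-parallel replicas of each expert shard form the expert data-parallel group. A minimal sketch, assuming the `deepspeed.moe.layer.MoE` signature from the DeepSpeed MoE tutorial (parameter names and return values are from memory, treat them as assumptions):

```python
# Minimal sketch of an E + D setup with DeepSpeed's MoE layer.
# Assumes the deepspeed.moe.layer.MoE signature from the tutorial;
# parameter names / return values are from memory and may need checking.
import torch
from deepspeed.moe.layer import MoE

class FFNWithMoE(torch.nn.Module):
    def __init__(self, hidden_size=1024, num_experts=8, ep_size=4):
        super().__init__()
        expert = torch.nn.Sequential(              # the per-expert FFN
            torch.nn.Linear(hidden_size, 4 * hidden_size),
            torch.nn.GELU(),
            torch.nn.Linear(4 * hidden_size, hidden_size),
        )
        # Experts are sharded across ep_size ranks (E); the remaining
        # data-parallel replicas of each expert shard form the expert
        # data-parallel group (D).
        self.moe = MoE(hidden_size=hidden_size,
                       expert=expert,
                       num_experts=num_experts,
                       ep_size=ep_size,
                       k=1)

    def forward(self, x):
        out, aux_loss, _ = self.moe(x)   # (output, l_aux, exp_counts)
        return out, aux_loss
```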
DeepSpeed TED: uses a 2D layout (data, tensor) for the non-expert/attention parameters and a 3D layout (data, tensor, expert) for the expert/MLP parameters. The groups are arranged so that each DP group can communicate with each EP group: e.g. with TP=2, EP=4, DP=16, EDP=2, each of the 2 TP ranks in any DP group maps to each of the 4 EP ranks (the one with the matching TP index) in any EDP group.
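
To make that group arrangement concrete, here is a minimal sketch (not DeepSpeed's actual code) that enumerates the four kinds of groups, assuming ranks are laid out with TP fastest-varying, then EP, then expert-data-parallel; the exact rank ordering is an assumption and may differ from the real Megatron/DeepSpeed grid:

```python
# Sketch of how the TED process groups could be enumerated.
# Assumes a rank layout of (edp, ep, tp) with tp fastest-varying;
# the real DeepSpeed/Megatron ordering may differ.
def ted_groups(world_size: int, tp: int, ep: int):
    assert world_size % (tp * ep) == 0
    dp = world_size // tp          # data-parallel degree (non-expert params)
    edp = world_size // (tp * ep)  # expert-data-parallel degree (expert params)

    # Non-expert 2D mesh (dp, tp): TP groups are contiguous blocks,
    # DP groups stride across them with a fixed TP index.
    tp_groups = [list(range(d * tp, (d + 1) * tp)) for d in range(dp)]
    dp_groups = [list(range(t, world_size, tp)) for t in range(tp)]

    # Expert 3D mesh (edp, ep, tp): each EP group keeps a fixed TP index,
    # so a rank's expert shards line up with its tensor shard of the MLP.
    ep_groups = [[(de * ep + e) * tp + t for e in range(ep)]
                 for de in range(edp) for t in range(tp)]
    edp_groups = [[(de * ep + e) * tp + t for de in range(edp)]
                  for e in range(ep) for t in range(tp)]
    return tp_groups, dp_groups, ep_groups, edp_groups

# Example: 16 ranks, TP=2, EP=4 -> DP=8, EDP=2.
if __name__ == "__main__":
    tp_g, dp_g, ep_g, edp_g = ted_groups(16, tp=2, ep=4)
    print("EP groups:", ep_g)    # [[0, 2, 4, 6], [1, 3, 5, 7], [8, 10, ...], ...]
    print("EDP groups:", edp_g)  # [[0, 8], [1, 9], [2, 10], ...]
```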

So the key differences from MbMoE are:
But how does this ensure that the router is kept in sync across the EP*TP group? TODO