Supports various forms of parallelism (everything except pipeline parallelism):
Short Name | Flexible Parallelism Configurations | Benefit |
---|---|---|
E | Expert | Scales the model size by increasing the number of experts |
E + D | Expert + Data | Accelerates training throughput by scaling to multiple data parallel groups |
E + Z | Expert + ZeRO-powered data | Partitions the nonexpert parameters to support larger base models |
E + D + M | Expert + Data + Model | Supports massive hidden sizes and even larger base models than E+Z |
E + D + Z | Expert + Data + ZeRO-powered data | Supports massive hidden sizes and even larger base models than E+Z |
E + Z-Off + M | Expert + ZeRO-Offload + Model | Leverages both GPU and CPU memory for large MoE models on limited # of GPUs |
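As a rough illustration of how one of these rows is expressed in code, here is a minimal sketch of an E + Z style setup using DeepSpeed's `deepspeed.moe.layer.MoE` layer. The MoE constructor arguments shown (`hidden_size`, `expert`, `num_experts`, `ep_size`, `k`) are DeepSpeed's public API; the toy expert module, the stand-in attention layer, and the config values are my own assumptions, and it needs a distributed launch (e.g. the `deepspeed` launcher) to actually run.

```python
# Minimal sketch of an "E + Z" (Expert + ZeRO-powered data) configuration.
import torch
import deepspeed
from deepspeed.moe.layer import MoE

hidden = 1024

# One expert is an ordinary FFN; DeepSpeed instantiates num_experts copies and
# spreads them over ep_size ranks (the "E" part).
def make_expert():
    return torch.nn.Sequential(
        torch.nn.Linear(hidden, 4 * hidden),
        torch.nn.GELU(),
        torch.nn.Linear(4 * hidden, hidden),
    )

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for attention, i.e. the non-expert parameters (assumption).
        self.attn_proxy = torch.nn.Linear(hidden, hidden)
        self.moe = MoE(hidden_size=hidden, expert=make_expert(),
                       num_experts=8, ep_size=4, k=1)

    def forward(self, x):
        x = self.attn_proxy(x)
        x, aux_loss, _ = self.moe(x)  # MoE returns (output, l_aux, exp_counts)
        return x, aux_loss

model = Block()

# The "Z" part: ZeRO partitions the non-expert parameters across the
# data-parallel group, which is what lets the base (non-expert) model grow.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},
}

engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)
```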
DeepSpeed TED: uses 2D parallelism (data, tensor) for the non-expert/attention parameters and 3D parallelism (data, tensor, expert) for the expert/MLP parameters. The process groups are arranged so that each DP group can communicate with each EP group: with TP=2, EP=4, DP=16, EDP=2, each of the 2 TP ranks within any DP replica talks to each of the 4 EP ranks (specifically the one holding the matching TP shard) within any EDP replica.
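To make that arrangement concrete, below is a small standalone sketch (not DeepSpeed's actual group-construction code) that enumerates one plausible rank layout. It assumes TP is the fastest-varying dimension, the non-expert DP degree is world/TP, and the expert-data-parallel degree is world/(TP*EP); the 16-GPU world size and the rank ordering are assumptions, chosen only to match the TP=2 / EP=4 / EDP=2 figures above.

```python
# Illustrative only: enumerate one possible TED-style rank layout.
# Assumptions (not taken from DeepSpeed source): ranks vary fastest over TP,
# then EP, then EDP; non-experts use (DP x TP) groups, experts use
# (EDP x EP x TP) groups.
WORLD, TP, EP = 16, 2, 4
DP = WORLD // TP          # data-parallel degree for non-expert (attention) params
EDP = WORLD // (TP * EP)  # expert-data-parallel degree for expert (MLP) params

ranks = list(range(WORLD))

# 2D view for non-experts: each row is a TP group, replicated DP times.
tp_groups = [ranks[i:i + TP] for i in range(0, WORLD, TP)]

# 3D view for experts: each EDP replica owns one EP x TP block of ranks.
ep_tp_groups = [ranks[i:i + EP * TP] for i in range(0, WORLD, EP * TP)]

# EDP groups: the ranks holding the *same* expert/TP shard across EDP replicas.
edp_groups = [[base + r * EP * TP for r in range(EDP)] for base in range(EP * TP)]

print(f"TP={TP} EP={EP} DP={DP} EDP={EDP}")
print("TP groups (non-expert tensor shards):      ", tp_groups)
print("EPxTP groups (expert shards per replica):  ", ep_tp_groups)
print("EDP groups (same expert shard, replicated):", edp_groups)
```

Printing the groups for these sizes shows each TP pair inside a DP replica lining up against the four EP shards of one EDP replica, which is the mapping described above.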
So the key differences from MbMoE are:
But how does this ensure that the router is kept in sync across the EP*TP group? TODO