See also
General resources
DDP: simple, but requires the full model on each node (minimal sketch after this list)
Model parallelism: ambiguous term, refers to either pipeline or tensor parallelism, usually tensor
Pipeline parallelism: inter-layer
Tensor parallelism: intra-layer, chatty
3D parallelism: DP, MP, PP combined
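To make the DDP item above concrete, a minimal PyTorch sketch (assuming a torchrun launch; the module, sizes, and the script name ddp_min.py are made up for illustration):
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# launch with: torchrun --nproc_per_node=8 ddp_min.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).to(local_rank)  # full model copy on every rank
model = DDP(model, device_ids=[local_rank])         # hooks grad all-reduce into backward

x = torch.randn(32, 1024, device=local_rank)        # each rank loads different data
model(x).sum().backward()                           # grads are now averaged across ranks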
Maybe helpful:
ZeRO/FSDP: partitions optimizer state, weights, gradients
Sequence parallelism (e.g. in DS Ulysses): partition the input sequence across devices. One variant uses the ring self-attention algorithm, computing attention by passing keys/values around a ring so every device eventually sees the full sequence (Ulysses itself instead uses all-to-alls to re-partition by attention head); see the sketch after this list
Expert choice routing: experts pick their top tokens rather than tokens picking experts, so no aux load-balancing loss is needed (sketch below)
Sinkhorn routing https://arxiv.org/pdf/2202.01169.pdf
MoE review https://arxiv.org/abs/2209.01667
What if we didn't use load balancing? https://arxiv.org/pdf/2103.16716.pdf
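A sketch of the expert-choice idea from above (my own illustration, not code from any of the linked papers; names and shapes are made up):
import torch
import torch.nn.functional as F

def expert_choice_route(logits: torch.Tensor, capacity: int):
    # logits: [tokens, experts] gate scores.
    # Each expert picks its top-`capacity` tokens, so load is balanced by
    # construction and no auxiliary loss is needed. Sketch only: ignores
    # tokens that end up picked by no expert.
    scores = F.softmax(logits, dim=-1)                 # token-to-expert affinities
    weights, token_idx = scores.topk(capacity, dim=0)  # [capacity, experts]
    return weights, token_idx  # expert e processes tokens token_idx[:, e]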
MoE: mixture of experts
Switch Transformers: simplified top-1 routing
DeepSpeed: allows tensor parallelism within each expert ("expert slices"). Their Figure 3 shows how everything comes together. See also Megatron-DeepSpeed.
DeepSpeed TED: uses 2D parallelism (data, tensor) for the non-expert/attention parts and 3D (expert-data, expert, tensor) for the expert/MLP parts. Arranged so the DP and EP groups line up: e.g. with TP=2, EP=4, EDP=2 (hence DP = EP × EDP = 8, world size TP × DP = 16), each of the 2 TP ranks within a DP group maps onto the matching TP rank in each of the 4 EP groups of its EDP group.
Megablocks: key insight is to cast the MoE expert computation as a block-sparse matmul (avoids token dropping/padding)
Sequence parallelism/context parallelism: doesn't partition any model state (weights/optimizer), so it's closer to data parallelism (pure compute parallelism), but with the communication of tensor parallelism (all-to-all/ring exchanges are needed for attention)
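A single-process simulation of the ring self-attention idea from the sequence-parallelism item above (numpy, non-causal, no real communication; function name and shapes are made up, and a real implementation overlaps the ring send/recv with compute):
import numpy as np

def ring_attention_sim(Q, K, V, P):
    # Split the sequence across P "workers"; K/V chunks rotate around the
    # ring while each worker folds them into an online-softmax accumulator.
    d = Q.shape[1]
    Qs, Ks, Vs = (np.array_split(X, P) for X in (Q, K, V))
    out = []
    for i in range(P):                      # worker i owns Q chunk i
        q = Qs[i]
        m = np.full((len(q), 1), -np.inf)   # running row max
        l = np.zeros((len(q), 1))           # running softmax denominator
        o = np.zeros_like(q)                # running unnormalized output
        for step in range(P):
            j = (i + step) % P              # K/V chunk held after `step` hops
            s = q @ Ks[j].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1, keepdims=True))
            scale = np.exp(m - m_new)       # rescale old partial sums
            p = np.exp(s - m_new)
            l = l * scale + p.sum(axis=1, keepdims=True)
            o = o * scale + p @ Vs[j]
            m = m_new
        out.append(o / l)
    return np.vstack(out)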
How parallel groups work, worked out
World size must be the product of all parallel group sizes; data parallelism is often the free variable, auto-inferred from the others
Think of each parallel dimension as one axis of a multi-dimensional grid. Its size is the number of values along that axis, so the total world size is the full cross product of the axis sizes.
Confusing terminology: parallel size = group size = (qualified) world size (like “expert world size” just means “expert group size”)
So with 8 GPUs, a TP size of 2 means there's only TP=0 and TP=1. The actual TP groups might be {0,1}, {2,3}, {4,5}, {6,7}. If DP is the other dimension, then DP has size 4, with DP=0 through DP=3; the actual DP groups are {0,2,4,6} and {1,3,5,7}. Gradients are reduced within each DP group.
With DP size 1 and TP size 4, the DP groups are just the singletons {0}, {1}, {2}, {3}, and the single TP group is {0,1,2,3}. Every rank has DP index 0, so every rank receives the same data (only ranks at different DP indices would receive different data). Gradients are still reduced within each DP group, but the groups are singletons, so there are no reductions across GPUs.
If it helps, think of each parallel dimension independently, as if it were the only parallelism.
e.g. TP on its own: 0 1; DP on its own: 0 1 2 3
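The grid picture in code, as a toy check of the TP=2/DP=4 example above (my own illustration):
import numpy as np

dp_size, tp_size = 4, 2          # 8 GPUs as a (dp, tp) grid, tp fastest-varying
for rank in range(dp_size * tp_size):
    dp, tp = np.unravel_index(rank, (dp_size, tp_size))
    print(f"rank {rank} -> dp {dp}, tp {tp}")
# TP groups = rows:    [0,1], [2,3], [4,5], [6,7]
# DP groups = columns: [0,2,4,6], [1,3,5,7]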
With tensor and pipeline together, there are multiple possible placements, but since tensor parallelism is chattier than pipeline, you'll want to keep each tensor group on the same host (assuming adjacent rank numbers are more likely to land on the same host). With TP=2 and PP=4 you need at least 2 * 4 = 8 GPUs.
pipeline ->
0 1 2 3
4 5 6 7
(tensor/model dimension goes down)
or better for locality (each tensor pair adjacent, hence same host):
0 2 4 6
1 3 5 7
i.e., mapping from (pipeline parallel rank, tensor parallel rank) to GPU:
pp0 tp0 -> 0
pp0 tp1 -> 1
pp1 tp0 -> 2
pp1 tp1 -> 3
pp2 tp0 -> 4
...
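The two placements as formulas (toy code, my own; naive is the first grid above, local the locality-friendly one):
tp_size, pp_size = 2, 4

naive = lambda pp, tp: tp * pp_size + pp  # tp groups {0,4},{1,5},... span hosts
local = lambda pp, tp: pp * tp_size + tp  # tp groups {0,1},{2,3},... stay adjacent

for pp in range(pp_size):
    for tp in range(tp_size):
        print(f"pp{pp} tp{tp} -> naive {naive(pp, tp)}, local {local(pp, tp)}")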
With tensor and data, you can now have:
dp0 tp0: 0
dp0 tp1: 1
dp1 tp0: 2
dp1 tp1: 3
With tensor, pipeline, data, you can now have 16:
         pp0  pp1  pp2  pp3
dp0 tp0    0    4    8   12
dp0 tp1    1    5    9   13
dp1 tp0    2    6   10   14
dp1 tp1    3    7   11   15
How to determine the groups?
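One way, sketched in plain Python (my own sketch, not Megatron's code; assumes tp fastest-varying, then dp, then pp, which matches the tables above and reproduces the Megatron comment quoted further down):
def build_groups(tp_size, dp_size, pp_size):
    # rank = pp * (dp_size * tp_size) + dp * tp_size + tp
    rank = lambda pp, dp, tp: (pp * dp_size + dp) * tp_size + tp
    tp_groups = [[rank(pp, dp, tp) for tp in range(tp_size)]
                 for pp in range(pp_size) for dp in range(dp_size)]
    dp_groups = [[rank(pp, dp, tp) for dp in range(dp_size)]
                 for pp in range(pp_size) for tp in range(tp_size)]
    pp_groups = [[rank(pp, dp, tp) for pp in range(pp_size)]
                 for dp in range(dp_size) for tp in range(tp_size)]
    return tp_groups, dp_groups, pp_groups

# build_groups(2, 2, 4) gives exactly the 16-GPU groups in the Megatron comment below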
Now let’s add expert parallelism (and ignore PP). How does this work?
dp0 tp0 ep0 0
dp0 tp0 ep1 1
dp0 tp1 ep0 2
dp0 tp1 ep1 3
dp1 tp0 ep0 4
dp1 tp0 ep1 5
dp1 tp1 ep0 6
dp1 tp1 ep1 7
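The same enumeration generated mechanically (toy code; ep fastest-varying, then tp, then dp):
import itertools

dp_size, tp_size, ep_size = 2, 2, 2
dims = (range(dp_size), range(tp_size), range(ep_size))
for rank, (dp, tp, ep) in enumerate(itertools.product(*dims)):
    print(f"dp{dp} tp{tp} ep{ep} -> {rank}")  # itertools.product varies ep fastest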
With TED, their example:
tp=2 dp=2 dep=1 ep=2
dp=non-expert data parallelism
dep=expert data parallelism
dp0 tp0, dep0 ep0 tp0 0
dp0 tp1, dep0 ep0 tp1 1
dp1 tp0, dep0 ep1 tp0 2
dp1 tp1, dep0 ep1 tp1 3
With just dep=1, each expert has only a single replica, so there are no expert-gradient reductions across the network... so how does the router stay in sync?! (Presumably because the router/gate weights count as non-expert parameters, which are still all-reduced over the dp groups.)
What if we scaled this up? (tp=2, ep=2, dep=2, world size 8)
dp0 tp0, dep0 ep0 tp0 0
dp0 tp1, dep0 ep0 tp1 1
dp1 tp0, dep0 ep1 tp0 2
dp1 tp1, dep0 ep1 tp1 3
dp2 tp0, dep1 ep0 tp0 4
dp2 tp1, dep1 ep0 tp1 5
dp3 tp0, dep1 ep1 tp0 6
dp3 tp1, dep1 ep1 tp1 7
And further (tp=2, ep=4, dep=2, world size 16):
dp0 tp0, dep0 ep0 tp0 0
dp0 tp1, dep0 ep0 tp1 1
dp1 tp0, dep0 ep1 tp0 2
dp1 tp1, dep0 ep1 tp1 3
dp2 tp0, dep0 ep2 tp0 4
dp2 tp1, dep0 ep2 tp1 5
dp3 tp0, dep0 ep3 tp0 6
dp3 tp1, dep0 ep3 tp1 7
dp4 tp0, dep1 ep0 tp0 8
dp4 tp1, dep1 ep0 tp1 9
dp5 tp0, dep1 ep1 tp0 10
dp5 tp1, dep1 ep1 tp1 11
dp6 tp0, dep1 ep2 tp0 12
dp6 tp1, dep1 ep2 tp1 13
dp7 tp0, dep1 ep3 tp0 14
dp7 tp1, dep1 ep3 tp1 15
if ep=1 (so dep=dp):
dp0 → dep0 ep0
dp1 → dep1 ep0
if ep=2 (so dep=1):
dp0 → dep0 ep0
dp1 → dep0 ep1
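The ep/dep split in code (toy sketch of my own, tp omitted; it reproduces the group lists in the DeepSpeed docstring quoted further below):
world, ep_size = 16, 2

# consecutive ranks form an expert-parallel group (all-to-all token exchange);
# ranks at the same position across those groups form an expert-data-parallel
# group (all-reduce on expert grads)
ep_groups  = [list(range(i, i + ep_size)) for i in range(0, world, ep_size)]
edp_groups = [list(range(i, world, ep_size)) for i in range(ep_size)]
# ep_groups:  [0,1], [2,3], ..., [14,15]
# edp_groups: [0,2,4,...,14], [1,3,5,...,15]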
Authoritative comment from Megatron:
Let's say we have a total of 16 GPUs denoted by g0 ... g15 and we
use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize
the model pipeline. The present function will
create 8 tensor model-parallel groups, 4 pipeline model-parallel groups
and 8 data-parallel groups as:
8 data_parallel groups:
[g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]
8 tensor model-parallel groups:
[g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]
4 pipeline model-parallel groups:
[g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]
Also this from the DS docs:
Given a total number of GPUs in our world size and a subset of GPUs in our expert-parallel world as follows.
WORLD_SIZE = 4
EP_WORLD_SIZE = 2
EXPERTS = [8]
The model code needs to use the deepspeed.moe.layer.MoE API as follows.
self.experts = deepspeed.moe.layer.MoE(hidden_size=input_dim, expert=ExpertModule(), num_experts=EXPERTS, ep_size=EP_WORLD_SIZE)
With the above two commands, the DeepSpeed runtime will be set to train an MoE model with a total of 8 experts on 4 GPUs in 4 experts/GPU mode. We call this the E + D mode as described earlier in the table.
…
[There's also this snippet, which just means that the expert-parallel size never exceeds the number of experts, i.e. expert parallelism parallelizes across different experts, never within a single expert!]
expert_parallel_size = min(world_size, args.num_experts)
Also this on MoE:
Example - E + D parallel
world_size = 16
expert_parallel_size = 2 # number of experts in same group
expert_data_parallel_group = [0,2,4,6,8,10,12,14], [1,3,5,7,9,11,13,15] - all reduce is only on MoE params
expert_parallel_group = [0, 1], [2,3], [4,5], [6,7], [8,9] - no all reduce, but all to all
data_parallel_group = [0,1,...,15] - all reduce is only on non-MoE
use_data_before_expert_parallel_ (bool): Use the D + E instead of E + D topology
Offload, ZeRO Offload