self.module.language_model.encoder.layers[3].mlp.deepspeed_moe.experts.deepspeed_experts[0].dense_4h_to_h.weight.flatten()[0]
ZeRO-2 and ZeRO-3 are incompatible with our pipeline parallelism engine. ZeRO-2 partitions gradients that the pipeline engine assumes are intact. Similarly, ZeRO-3 partitions parameters that the pipeline engine assumes are intact. Note that pipeline parallelism already offers some of these advantages by partitioning the model directly, and then ZeRO-1 (with optional offload) can be combined to further partition the optimizer. (source)
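To make the supported combination concrete, a minimal config sketch with ZeRO stage 1 plus optional optimizer offload (batch sizes and optimizer settings below are placeholder values, not from the source):

```python
# ZeRO-1 (optionally with optimizer offload) is the only ZeRO stage that can be
# combined with the pipeline engine; stages 2/3 partition things the pipeline
# engine assumes are intact. Values below are placeholders.
ds_config = {
    "train_batch_size": 256,
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {"device": "cpu"},  # the "optional offload" part
    },
}
```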
| Short Name | Flexible Parallelism Configurations | Benefit |
|---|---|---|
| E | Expert | Scales the model size by increasing the number of experts |
| E + D | Expert + Data | Accelerates training throughput by scaling to multiple data parallel groups |
| E + Z | Expert + ZeRO-powered data | Partitions the nonexpert parameters to support larger base models |
| E + D + M | Expert + Data + Model | Supports massive hidden sizes and even larger base models than E+Z |
| E + D + Z | Expert + Data + ZeRO-powered data | Supports massive hidden sizes and even larger base models than E+Z |
| E + Z-Off + M | Expert + ZeRO-Offload + Model | Leverages both GPU and CPU memory for large MoE models on limited # of GPUs |
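For orientation, these configurations are expressed through the deepspeed.moe.layer.MoE wrapper; a rough E+D-style sketch (hidden size, expert count, and ep_size are made-up values, and the exact kwargs depend on the DeepSpeed version):

```python
import torch
from deepspeed.moe.layer import MoE

# Placeholder FFN expert; MoE deep-clones it into the local experts.
expert = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# E + D style: 16 experts spread across an expert-parallel group of 4 ranks,
# with the remaining ranks acting as data-parallel replicas of each expert.
# Assumes the distributed backend has already been initialized.
moe_layer = MoE(hidden_size=1024, expert=expert, num_experts=16, ep_size=4, k=1)
```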
PR-MoE
Phenomenon I in Section 4.1.1 suggests that applying MoE in the later layers brings more benefit, so our new architecture uses more experts in the last few layers than in the earlier ones. This gives the Pyramid-MoE design, illustrated in Figure 3 (right): the last two layers have twice as many experts as the earlier layers. Meanwhile, motivated by Phenomenon II, we propose the Residual-MoE architecture, where each token passes separately through one fixed MLP module and one chosen expert, as shown in Figure 3 (right), where the orange blocks are the fixed MLP.
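A rough sketch of how both ideas map onto the MoE wrapper, assuming a DeepSpeed version with the use_residual flag (layer counts, expert counts, and sizes are illustrative, not the paper's exact setup):

```python
import torch
from deepspeed.moe.layer import MoE

num_layers, hidden = 12, 1024

def make_expert():
    # Placeholder FFN expert; the fixed-MLP residual branch is added internally
    # when use_residual=True.
    return torch.nn.Sequential(torch.nn.Linear(hidden, 4 * hidden),
                               torch.nn.GELU(),
                               torch.nn.Linear(4 * hidden, hidden))

moe_layers = torch.nn.ModuleList()
for layer_idx in range(num_layers):
    # Pyramid-MoE: the last two layers get twice as many experts.
    num_experts = 128 if layer_idx >= num_layers - 2 else 64
    moe_layers.append(
        MoE(hidden_size=hidden,
            expert=make_expert(),
            num_experts=num_experts,
            k=1,
            use_residual=True))  # Residual-MoE: fixed MLP + one chosen expert
```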
Megatron
gpt-neox
DS
user code: model = PipelineModule.ctor
user code: MoE(expert=SomeModelToClone())
Experts() deep-clones the expert num_local_experts times and tags each parameter with .allreduce = False and .group_name
import copy
import torch

class Experts(torch.nn.Module):
    def __init__(self, expert, num_local_experts=1, expert_group_name=None):
        super(Experts, self).__init__()
        self.deepspeed_experts = torch.nn.ModuleList([copy.deepcopy(expert) for i in range(num_local_experts)])
        self.num_local_experts = num_local_experts
        # TODO: revisit allreduce for moe.gate...
        for expert in self.deepspeed_experts:
            # TODO: Create param groups to handle expert + data case (e.g. param.group = moe_group)
            for name, param in expert.named_parameters():
                # Tag expert params so gradient reduction skips the global
                # data-parallel allreduce and uses the expert group instead.
                param.allreduce = False
                param.group_name = expert_group_name
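These flags are what gradient reduction keys off later; a conceptual sketch of that routing (illustrative helper, not DeepSpeed's exact code):

```python
import torch.distributed as dist

def reduce_gradients(model, data_parallel_group, expert_data_parallel_groups):
    # Illustrative only: params tagged allreduce=False are averaged inside the
    # expert data-parallel group named by param.group_name; everything else is
    # averaged across the regular data-parallel group.
    for param in model.parameters():
        if param.grad is None:
            continue
        if getattr(param, 'allreduce', True):
            group = data_parallel_group
        else:
            group = expert_data_parallel_groups[param.group_name]
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=group)
        param.grad.div_(dist.get_world_size(group=group))
```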
user code: engine, optimizer, _, _ = deepspeed.initialize(model=model, ...)
user code: engine.train_batch(...), i.e. PipelineEngine.train_batch
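At the call site this is just the pipeline training loop; a minimal sketch (engine, train_loader, and num_steps are placeholders):

```python
# With pipeline parallelism the engine pulls micro-batches itself, so the user
# hands train_batch an iterator instead of running forward/backward/step manually.
train_iter = iter(train_loader)
for step in range(num_steps):
    loss = engine.train_batch(data_iter=train_iter)
```

Inside, train_batch builds a schedule and executes it: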
# Do the work
sched = schedule.TrainSchedule(micro_batches=self.micro_batches,
                               stages=self.num_stages,
                               stage_id=self.stage_id)
self._exec_schedule(sched)
self.agg_train_loss = self._aggregate_total_loss()
The schedule instructions are dispatched through:
_INSTRUCTION_MAP = {
    schedule.OptimizerStep: _exec_optimizer_step,
    schedule.ReduceGrads: _exec_reduce_grads,
    schedule.ReduceTiedGrads: _exec_reduce_tied_grads,
    schedule.LoadMicroBatch: _exec_load_micro_batch,
    schedule.ForwardPass: _exec_forward_pass,
    schedule.BackwardPass: _exec_backward_pass,
    schedule.SendActivation: _exec_send_activations,
    schedule.RecvActivation: _exec_recv_activations,
    schedule.SendGrad: _exec_send_grads,
    schedule.RecvGrad: _exec_recv_grads,
}
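The map is consumed by a simple dispatch loop; roughly (a paraphrase of _exec_schedule, not the verbatim source):

```python
def _exec_schedule(self, pipe_schedule):
    # Paraphrase: reserve pipeline buffers, then walk the generated schedule and
    # dispatch each instruction to its handler from _INSTRUCTION_MAP.
    self._reserve_pipe_buffers(pipe_schedule.num_pipe_buffers())
    for step_cmds in pipe_schedule:      # one entry per pipeline "clock tick"
        for cmd in step_cmds:            # e.g. LoadMicroBatch, ForwardPass, SendActivation
            handler = self._INSTRUCTION_MAP[type(cmd)]
            handler(self, **cmd.kwargs)  # handlers are plain functions, so pass self
```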
_exec_forward_pass: the pipeline engine has its own p2p comm system (used by the send/recv activation and grad instructions)
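The underlying idea is ordinary point-to-point sends/receives between adjacent stages; a generic sketch with torch.distributed (not DeepSpeed's actual p2p module):

```python
import torch
import torch.distributed as dist

# Generic illustration of stage-to-stage activation passing; DeepSpeed wraps this
# pattern in its own p2p helper rather than calling torch.distributed directly.
def send_activation(tensor, next_stage_rank):
    dist.send(tensor.contiguous(), dst=next_stage_rank)

def recv_activation(shape, dtype, prev_stage_rank, device):
    buffer = torch.empty(shape, dtype=dtype, device=device)
    dist.recv(buffer, src=prev_stage_rank)
    return buffer
```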
DSE.backward (DeepSpeedEngine.backward)
writing _EXPERT_DATA_PARALLEL_GROUP: