# First element of expert 0's dense_4h_to_h weight in encoder layer 3's MoE block (Megatron-DeepSpeed naming)
self.module.language_model.encoder.layers[3].mlp.deepspeed_moe.experts.deepspeed_experts[0].dense_4h_to_h.weight.flatten()[0]
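
A more general spot check, as a minimal sketch: assume model refers to the same wrapped Megatron-DeepSpeed module accessed as self.module above, and that per-expert submodules carry "deepspeed_experts" in their parameter names (as in the path above). Then walking the full parameter list avoids hard-coding one layer/expert:

# Minimal sketch: print the first element of every expert weight tensor.
# `model` is assumed to be the wrapped module referenced as self.module above.
for name, param in model.named_parameters():
    if "deepspeed_experts" in name and name.endswith(".weight"):
        print(f"{name}: {param.flatten()[0].item():.6e}")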

Parallelism overviews

ZeRO-2 and ZeRO-3 are incompatible with the DeepSpeed pipeline parallelism engine: ZeRO-2 partitions gradients that the pipeline engine assumes are intact, and ZeRO-3 likewise partitions parameters that the pipeline engine assumes are intact. Pipeline parallelism already provides some of the same memory savings by partitioning the model directly, and ZeRO-1 (with optional optimizer offload) can be combined with it to further partition the optimizer states. (source)
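
As a concrete illustration of the combination that does work, here is a minimal sketch of ZeRO stage 1 (with optional optimizer offload) alongside a pipeline-parallel model. Layer sizes, batch sizes, and the two-stage split are illustrative placeholders, and the script is meant to run under the deepspeed launcher:

# Minimal sketch: ZeRO-1 (optionally with CPU offload of optimizer states)
# combined with the DeepSpeed pipeline engine. All values are placeholders.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()  # required before building a PipelineModule

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 1,  # stages 2/3 would partition gradients/parameters the pipeline engine needs intact
        "offload_optimizer": {"device": "cpu"},  # optional ZeRO-Offload of optimizer states
    },
}

# The pipeline engine partitions a flat list of layers into stages.
layers = [nn.Linear(1024, 1024) for _ in range(8)]
model = PipelineModule(layers=layers, num_stages=2, loss_fn=nn.MSELoss())

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

Training then goes through engine.train_batch(data_iter), which runs the micro-batch schedule across the pipeline stages.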

Short Name | Flexible Parallelism Configurations | Benefit
E | Expert | Scales the model size by increasing the number of experts
E + D | Expert + Data | Accelerates training throughput by scaling to multiple data parallel groups
E + Z | Expert + ZeRO-powered data | Partitions the nonexpert parameters to support larger base models
E + D + M | Expert + Data + Model | Supports massive hidden sizes and even larger base models than E+Z
E + D + Z | Expert + Data + ZeRO-powered data | Supports massive hidden sizes and even larger base models than E+Z
E + Z-Off + M | Expert + ZeRO-Offload + Model | Leverages both GPU and CPU memory for large MoE models on limited # of GPUs
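
The E building block in every row above is DeepSpeed's MoE layer. A minimal sketch of adding expert parallelism to an ordinary feed-forward block follows; the hidden size, expert count, ep_size, and top-k are illustrative placeholders, and it assumes a distributed job launched with the deepspeed launcher:

# Minimal sketch: the "E" (expert parallelism) building block from the table.
# num_experts and ep_size are placeholders; ep_size is the number of GPUs the
# experts are partitioned across (the expert-parallel group size).
import torch.nn as nn
from deepspeed.moe.layer import MoE

hidden_size = 1024

# An ordinary feed-forward block serves as the expert; DeepSpeed instantiates
# num_experts copies and spreads them over the expert-parallel group.
expert = nn.Sequential(
    nn.Linear(hidden_size, 4 * hidden_size),
    nn.GELU(),
    nn.Linear(4 * hidden_size, hidden_size),
)

moe_layer = MoE(
    hidden_size=hidden_size,
    expert=expert,
    num_experts=8,  # total experts
    ep_size=4,      # expert-parallel degree
    k=1,            # top-1 gating
)

The layer's forward pass returns the output together with an auxiliary load-balancing loss term that is added to the training loss; the other letters in the table (D, Z, M) come from the surrounding data-parallel group, the ZeRO stage in the DeepSpeed config, and model parallelism, respectively.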

Code flows

Misc notes

Parallelism details