Supports various forms of parallelism (everything except pipeline parallelism):
Short Name | Flexible Parallelism Configurations | Benefit |
---|---|---|
E | Expert | Scales the model size by increasing the number of experts |
E + D | Expert + Data | Accelerates training throughput by scaling to multiple data parallel groups |
E + Z | Expert + ZeRO-powered data | Partitions the nonexpert parameters to support larger base models |
E + D + M | Expert + Data + Model | Supports massive hidden sizes and even larger base models than E+Z |
E + D + Z | Expert + Data + ZeRO-powered data | Supports massive hidden sizes and even larger base models than E+Z |
E + Z-Off + M | Expert + ZeRO-Offload + Model | Leverages both GPU and CPU memory for large MoE models on limited # of GPUs |
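As a rough illustration of how one of these rows is expressed in code, here is a minimal sketch of an E + Z style setup using DeepSpeed's `deepspeed.moe.layer.MoE` layer. The MoE constructor arguments shown (`hidden_size`, `expert`, `num_experts`, `ep_size`, `k`) are DeepSpeed's public API; the toy expert module, the stand-in attention layer, and the config values are my own assumptions, and it needs a distributed launch (e.g. the `deepspeed` launcher) to actually run.

```python
# Minimal sketch of an "E + Z" (Expert + ZeRO-powered data) configuration.
import torch
import deepspeed
from deepspeed.moe.layer import MoE

hidden = 1024

# One expert is an ordinary FFN; DeepSpeed instantiates num_experts copies and
# spreads them over ep_size ranks (the "E" part).
def make_expert():
    return torch.nn.Sequential(
        torch.nn.Linear(hidden, 4 * hidden),
        torch.nn.GELU(),
        torch.nn.Linear(4 * hidden, hidden),
    )

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for attention, i.e. the non-expert parameters (assumption).
        self.attn_proxy = torch.nn.Linear(hidden, hidden)
        self.moe = MoE(hidden_size=hidden, expert=make_expert(),
                       num_experts=8, ep_size=4, k=1)

    def forward(self, x):
        x = self.attn_proxy(x)
        x, aux_loss, _ = self.moe(x)  # MoE returns (output, l_aux, exp_counts)
        return x, aux_loss

model = Block()

# The "Z" part: ZeRO partitions the non-expert parameters across the
# data-parallel group, which is what lets the base (non-expert) model grow.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},
}

engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)
```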
DeepSpeed TED: uses 2D parallelism (data, tensor) for the non-expert/attention parameters and 3D parallelism (data, tensor, expert) for the expert/MLP parameters. The process groups are arranged so that each DP group can communicate with each EP group: with TP=2, EP=4, DP=16, EDP=2, each of the 2 TP ranks within any DP replica talks to each of the 4 EP ranks (specifically the one holding the matching TP shard) within any EDP replica.
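To make that arrangement concrete, below is a small standalone sketch (not DeepSpeed's actual group-construction code) that enumerates one plausible rank layout. It assumes TP is the fastest-varying dimension, the non-expert DP degree is world/TP, and the expert-data-parallel degree is world/(TP*EP); the 16-GPU world size and the rank ordering are assumptions, chosen only to match the TP=2 / EP=4 / EDP=2 figures above.

```python
# Illustrative only: enumerate one possible TED-style rank layout.
# Assumptions (not taken from DeepSpeed source): ranks vary fastest over TP,
# then EP, then EDP; non-experts use (DP x TP) groups, experts use
# (EDP x EP x TP) groups.
WORLD, TP, EP = 16, 2, 4
DP = WORLD // TP          # data-parallel degree for non-expert (attention) params
EDP = WORLD // (TP * EP)  # expert-data-parallel degree for expert (MLP) params

ranks = list(range(WORLD))

# 2D view for non-experts: each row is a TP group, replicated DP times.
tp_groups = [ranks[i:i + TP] for i in range(0, WORLD, TP)]

# 3D view for experts: each EDP replica owns one EP x TP block of ranks.
ep_tp_groups = [ranks[i:i + EP * TP] for i in range(0, WORLD, EP * TP)]

# EDP groups: the ranks holding the *same* expert/TP shard across EDP replicas.
edp_groups = [[base + r * EP * TP for r in range(EDP)] for base in range(EP * TP)]

print(f"TP={TP} EP={EP} DP={DP} EDP={EDP}")
print("TP groups (non-expert tensor shards):      ", tp_groups)
print("EPxTP groups (expert shards per replica):  ", ep_tp_groups)
print("EDP groups (same expert shard, replicated):", edp_groups)
```

Printing the groups for these sizes shows each TP pair inside a DP replica lining up against the four EP shards of one EDP replica, which is the mapping described above.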
So the key differences from MbMoE are:
But how does this ensure that the router is kept in sync across the EP*TP group? TODO