Background on parallelism

Proposal

  1. Add DS MoE (DeepSpeed MoE), which is a clean integration (a usage sketch follows this list)
  2. Add MbMoE. The challenge here is making it work with gpt-neox's parallelism, which is largely based on DS parallelism
    1. We want to avoid introducing new communication/synchronization barriers, and to make only minimal changes to the DS parallelism engines, which are complex.
    2. The key strategy is to fully reuse DS expert parallelism (despite not having any DS MoE layers) by making the MbMoE expert params appear like native DS MoE params, so they receive the same treatment (see the param-tagging sketch after this list). This seems to work…
  3. Later, implement pipeline parallelism for MoE (both DS MoE and MbMoE).
    1. The key idea is to pipe the MoE auxiliary (load-balancing) losses through the pipeline like normal model output, so that backprop can drive load balancing (see the pipeline sketch after this list). Although, if it were this simple, it is not clear why DS doesn't already do this.
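
A minimal usage sketch for item 1, assuming DeepSpeed's `deepspeed.moe.layer.MoE` API (constructor arguments and the `(output, l_aux, exp_counts)` return signature may differ across DS versions). The sizes, expert counts, and gating settings are placeholders, and distributed process groups must already be initialized for expert parallelism to be set up.

```python
import torch
import deepspeed
from deepspeed.moe.layer import MoE  # DeepSpeed's MoE layer

hidden_size = 1024  # placeholder size

# An ordinary dense FFN used as the expert definition.
expert_mlp = torch.nn.Sequential(
    torch.nn.Linear(hidden_size, 4 * hidden_size),
    torch.nn.GELU(),
    torch.nn.Linear(4 * hidden_size, hidden_size),
)

# Drop-in replacement for the dense FFN inside a transformer block.
# Requires deepspeed.init_distributed() to have been called so that
# expert-parallel process groups can be created.
moe_ffn = MoE(
    hidden_size=hidden_size,
    expert=expert_mlp,
    num_experts=8,   # placeholder
    ep_size=4,       # expert-parallel world size (placeholder)
    k=1,             # top-1 gating
)

# Inside the block's forward pass:
#   output, l_aux, exp_counts = moe_ffn(hidden_states)
# l_aux is the load-balancing loss to add to the training loss.
```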
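A minimal param-tagging sketch for the strategy in item 2.2. It assumes DeepSpeed's internal convention that expert params are identified by `param.allreduce = False` and `param.group_name = <expert-parallel group name>` (this is what `deepspeed.moe.utils.is_moe_param` checks); those attribute names could change across DS versions, and the `"experts"` substring filter for locating MbMoE expert weights is a hypothetical placeholder.

```python
import torch
from deepspeed.moe.utils import is_moe_param  # checks the .allreduce flag

def tag_mb_experts_as_ds_moe(model: torch.nn.Module, expert_group_name: str):
    """Mark MbMoE (MegaBlocks) expert weights so the DeepSpeed engine treats
    them like native DS MoE params: no dense data-parallel all-reduce, with
    gradient reduction inside the named expert-parallel group instead."""
    for name, param in model.named_parameters():
        if "experts" in name:                     # hypothetical filter for MbMoE expert weights
            param.allreduce = False               # DS convention for expert params
            param.group_name = expert_group_name  # DS expert-parallel group to reduce in
            assert is_moe_param(param)            # now recognized by DS as an expert param
```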
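A minimal pipeline sketch for the idea in item 3.1, assuming a DeepSpeed-style pipeline where each stage passes a tuple of tensors downstream and the final stage applies a `loss_fn(outputs, labels)`. The wrapper and loss-function names are hypothetical, the first stage would need to inject a zero-valued aux-loss tensor, and `aux_coef` is a placeholder coefficient.

```python
import torch
import torch.nn.functional as F

class PipeMoELayer(torch.nn.Module):
    """Hypothetical wrapper: threads the accumulated MoE aux loss through the
    pipeline alongside the hidden states, like a normal activation."""

    def __init__(self, moe_layer):
        super().__init__()
        self.moe_layer = moe_layer  # an MoE layer returning (hidden_states, l_aux)

    def forward(self, inputs):
        hidden, aux_loss = inputs             # first stage injects aux_loss = 0
        hidden, l_aux = self.moe_layer(hidden)
        return hidden, aux_loss + l_aux       # accumulate and send downstream

def lm_loss_with_aux(outputs, labels, aux_coef=0.01):
    """Final-stage loss: add the piped aux loss so backprop through the
    pipeline drives the routers toward balanced loads."""
    logits, aux_loss = outputs
    lm_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    return lm_loss + aux_coef * aux_loss
```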

Remaining state

Yang’s remaining questions