Paper
- Summary
- Before the MoE MLP, down-project tokens to a low-rank dim (half the hidden size), do the all-to-all comms at that size, then up-project afterwards
- The MLP size can be doubled at equal FLOPs (iso-FLOPs)
- Aimed at fine-grained MoE, where top-k routing uses a larger k (e.g. 8+)
- Normally
- tokens routed to experts with all-to-all at size h
- experts do Wdown → activation → Wup, so each expert does about 2 · h · small_h FLOPs per token (counting a multiply-add as one)
- tokens routed back with all-to-all at size h
- BigMac
- tokens go through Wdown (this is global, and only happens once instead of per expert!)
- tokens routed to experts with all-to-all at size small_h
- experts can now afford to do Wup → activation → Wdown, i.e. expand from small_h and project back to small_h (they can be bigger now at the same FLOPs)
- tokens routed back with all-to-all at size small_h
- tokens go through Wup (this is global)
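
The two data flows above can be compared with some rough per-token arithmetic. This is a minimal sketch, not the paper's own accounting: it counts multiply-accumulates only (no activation, router, or gating cost), takes `small_h = h / 2` from the summary, and introduces `expert_h` as a hypothetical name for the width experts expand to in the BigMac layout. Setting `expert_h = h` below is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    all_to_all_elems: int  # elements moved per token (dispatch + combine)
    macs: int              # multiply-accumulates per token


def standard_moe_cost(h: int, small_h: int, k: int) -> Cost:
    """Normal layout: all-to-all at width h, experts do Wdown -> act -> Wup."""
    comm = 2 * k * h              # k copies out and k copies back, each of width h
    macs = k * (2 * h * small_h)  # per expert: h -> small_h, then small_h -> h
    return Cost(comm, macs)


def bigmac_cost(h: int, small_h: int, expert_h: int, k: int) -> Cost:
    """BigMac layout as described in the notes (expert_h is an assumed name)."""
    comm = 2 * k * small_h                      # all-to-all now at width small_h
    global_macs = 2 * h * small_h               # global Wdown + global Wup, once per token
    expert_macs = k * (2 * small_h * expert_h)  # per expert: Wup, then Wdown
    return Cost(comm, global_macs + expert_macs)


h, k = 4096, 8       # illustrative values, not taken from the paper
small_h = h // 2     # "half size" from the summary

std = standard_moe_cost(h, small_h, k)
big = bigmac_cost(h, small_h, expert_h=h, k=k)

comm_ratio = std.all_to_all_elems / big.all_to_all_elems
print(f"all-to-all reduction: {comm_ratio}x")  # 2.0x when small_h = h / 2
print(f"standard MACs/token:  {std.macs}")
print(f"BigMac MACs/token:    {big.macs}")
```

Under these assumptions the all-to-all volume shrinks by `h / small_h`, and the two global projections add a fixed `2 · h · small_h` multiply-accumulates per token that does not grow with k, which is why the trade looks better for larger k.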