Collectives
Implementing collectives
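Regardless of tree vs ring all-reduce, busbw uses the same correction factor: busbw = algbw · 2(n-1)/n, where algbw = S/t, S is the array size, t is the elapsed time and n is the number of ranks. The factor comes from the 2(n-1) transfers of S/n-sized chunks that each rank makes in the ring algorithm, i.e. 2S(n-1)/n bytes per rank (source)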
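As a rough illustration of the bus-bandwidth arithmetic, the sketch below computes busbw from a measured all-reduce time. It assumes the nccl-tests convention that algbw is the array size divided by elapsed time; the function name `allreduce_busbw` and its parameters are made up for this example.

```python
def allreduce_busbw(size_bytes: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth (bytes/s) of an all-reduce, assuming the 2(n-1)/n factor."""
    if n_ranks < 2:
        raise ValueError("an all-reduce needs at least 2 ranks")
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    # Algorithm bandwidth: array size over elapsed time (assumed definition).
    algbw = size_bytes / seconds
    # Each rank moves 2(n-1) chunks of size S/n, hence the 2(n-1)/n factor.
    return algbw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Hypothetical measurement: 1 GB array, 10 ms, 8 ranks.
    print(allreduce_busbw(1e9, 0.01, 8) / 1e9, "GB/s")
```

With two ranks the factor is 1, so busbw equals algbw; as n grows the factor approaches 2.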
Ring all-reduce: first a reduce-scatter, then an all-gather
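The two phases can be simulated in a single process. This is a minimal sketch, not how any particular library implements it: each "rank" is a Python list holding n chunks (one number per chunk for brevity), the chunk-index schedule is one common choice, and `ring_all_reduce` is a name invented here.

```python
def ring_all_reduce(data):
    """Simulate a sum all-reduce over a ring. data[r][c] is rank r's chunk c."""
    n = len(data)
    assert all(len(row) == n for row in data), "need exactly n chunks per rank"
    bufs = [list(row) for row in data]
    steps = 0

    # Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) % n to
    # rank (r + 1) % n, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot all sends first so the step behaves as if simultaneous.
        sends = [(r, (r - s) % n, bufs[r][(r - s) % n]) for r in range(n)]
        for r, c, val in sends:
            bufs[(r + 1) % n][c] += val
        steps += 1
    # Now rank r holds the fully reduced chunk (r + 1) % n.

    # Phase 2, all-gather: at step s, rank r forwards chunk (r + 1 - s) % n,
    # and the receiver overwrites its copy with the reduced value.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, bufs[r][(r + 1 - s) % n]) for r in range(n)]
        for r, c, val in sends:
            bufs[(r + 1) % n][c] = val
        steps += 1

    return bufs, steps


if __name__ == "__main__":
    result, steps = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
    print(result, steps)
```

Each of the 2(n-1) steps moves one chunk of size S/n per rank, which is where the 2S(n-1)/n bytes per rank comes from.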
Tree all-reduce:
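One simple way to picture a tree all-reduce is a reduce up a binary tree followed by a broadcast back down. The sketch below assumes a heap-style layout (rank 0 is the root, the parent of rank r is (r - 1) // 2) and sends whole vectors; real implementations typically pipeline chunks and may use different tree shapes. `tree_all_reduce` is a hypothetical name.

```python
def tree_all_reduce(data):
    """Simulate a sum all-reduce over a binary tree. data[r] is rank r's vector."""
    n = len(data)
    bufs = [list(vec) for vec in data]
    messages = 0

    # Reduce phase: visit ranks from the leaves toward the root, so every rank
    # has already absorbed its children before sending to its parent.
    for r in range(n - 1, 0, -1):
        parent = (r - 1) // 2
        bufs[parent] = [a + b for a, b in zip(bufs[parent], bufs[r])]
        messages += 1

    # Broadcast phase: visit ranks from the root downward, so every parent
    # already holds the final result when its children copy it.
    for r in range(1, n):
        parent = (r - 1) // 2
        bufs[r] = list(bufs[parent])
        messages += 1

    return bufs, messages


if __name__ == "__main__":
    out, messages = tree_all_reduce([[1, 2], [10, 20], [100, 200], [1000, 2000]])
    print(out, messages)
```

In this sketch there are n-1 messages up and n-1 messages down, 2(n-1) in total, and the critical path is proportional to the tree depth rather than to n.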
Ring all-gather