2024-03-21
2024-02-29
2024-02-26
notes
loss calculated on non-final pipeline layers?
is it using the data parallel group for expert parallelism?
how should the router be replicated within a TP group?
how are losses communicated?
is this row slicing the TP split? what is hidden_sharding_degree?
if not args.moe_expert_model_parallelism:
    return master_weights

# Calculate the amount of sharding in each dimension.
expert_sharding_degree = mpu.expert_sharding_degree(args)
hidden_sharding_degree = mpu.hidden_sharding_degree(args)

# Calculate the experts per rank.
#
# NOTE: We assign ranks to be expert parallel before going
# tensor parallel.
rank = mpu.get_expert_parallel_rank(args)
expert_rank = rank % expert_sharding_degree
num_experts_per_rank = num_experts // expert_sharding_degree
start_expert = expert_rank * num_experts_per_rank
end_expert = (expert_rank + 1) * num_experts_per_rank

# Calculate the rows per rank.
row_rank = rank // expert_sharding_degree
num_rows_per_rank = ffn_hidden_size // hidden_sharding_degree
start_row = row_rank * num_rows_per_rank
end_row = (row_rank + 1) * num_rows_per_rank

# Slice the weight matrix to get the chunk for this rank.
with torch.no_grad():
    weights = master_weights[
        start_expert:end_expert, start_row:end_row]
return weights
Understand this (source)
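A minimal sketch of the rank-to-shard mapping in the snippet above. The concrete sizes and sharding degrees here are hypothetical, chosen only to make the "expert parallel first, then tensor parallel" ordering visible:

```python
# Hypothetical sizes; the real values come from args/mpu in MegaBlocks.
num_experts = 8
ffn_hidden_size = 3072
expert_sharding_degree = 4   # experts split across 4 ranks
hidden_sharding_degree = 2   # each expert's FFN rows split across 2 ranks

def shard_for_rank(rank):
    # Ranks are assigned expert-parallel first, then row-parallel,
    # mirroring the modulo/floor-division math in the excerpt.
    expert_rank = rank % expert_sharding_degree
    row_rank = rank // expert_sharding_degree
    experts_per_rank = num_experts // expert_sharding_degree
    rows_per_rank = ffn_hidden_size // hidden_sharding_degree
    return (expert_rank * experts_per_rank,
            (expert_rank + 1) * experts_per_rank,
            row_rank * rows_per_rank,
            (row_rank + 1) * rows_per_rank)

for rank in range(expert_sharding_degree * hidden_sharding_degree):
    print(rank, shard_for_rank(rank))
```

Ranks 0-3 hold row block 0 of experts (0,1), (2,3), (4,5), (6,7); ranks 4-7 hold row block 1 of the same expert pairs.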
Our Megatron fork is mostly for small-scale experiments and uses the data-parallel process group for expert model parallelism. If you scale out to multiple nodes with both data parallelism and expert parallelism enabled, you'll end up doing expert parallelism across nodes, which can be slow because the all-to-alls get expensive.
One thing you could try is using pipeline parallelism between nodes. If you were to use MegaBlocks in a custom framework, I'd recommend using something like FSDP across nodes and expert parallelism within each node.
I do not have reference scripts for multi-node training, but for pipeline parallelism the flags are the same as they are in upstream Megatron-LM. I hope this helps!
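The suggested layout (expert parallelism within a node, FSDP/data parallelism across nodes) amounts to a 2-D rank grid. A pure-Python sketch of the group assignment, assuming a hypothetical 16-GPU cluster with 8 GPUs per node (the actual process groups would be built with torch.distributed):

```python
# Hypothetical cluster: 2 nodes x 8 GPUs. Expert-parallel groups live
# within a node so all-to-alls stay on fast intra-node links; data-parallel
# (e.g. FSDP) groups span nodes at matching local ranks.
world_size = 16
gpus_per_node = 8

expert_groups = [list(range(n * gpus_per_node, (n + 1) * gpus_per_node))
                 for n in range(world_size // gpus_per_node)]
data_groups = [list(range(local, world_size, gpus_per_node))
               for local in range(gpus_per_node)]

print(expert_groups)  # intra-node all-to-all traffic
print(data_groups)    # inter-node gradient/param sync only
```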
2024-01-23
gpt-neox inference: couldn't figure out how to load pretrained models from HF.
For s3 checkpointing, please install hf_transfer either using requirements/requirements-s3.txt or <https://github.com/huggingface/hf_transfer>
2024-01-24:01:24:33,675 INFO [utils.py:160] NumExpr defaulting to 4 threads.
2024-01-24:01:24:33,930 INFO [config.py:58] PyTorch version 1.13.0+cu117 available.
======================================================================
Warning the following script will delete files within checkpoints/neox_converted/pythia/70m
Warning the following script will delete this directory /tmp/ckpt_tmp_dir
======================================================================
NeoXArgs.from_ymls() ['configs/pythia/70M.yml', 'configs/local_setup.yml']
2024-01-24:01:24:35,262 INFO [arguments.py:849] NeoXArgs.calculate_derived() Total number of GPUs determined to be: 1
NeoXArgs.configure_distributed_args() using world size: 1 and model-parallel size: 1
> building HFTokenizer tokenizer ...
Traceback (most recent call last):
  File "tools/ckpts/convert_hf_to_sequential.py", line 505, in <module>
    neox_args.build_tokenizer()
  File "/home/ubuntu/gpt-neox/megatron/neox_arguments/arguments.py", line 147, in build_tokenizer
    self.tokenizer = build_tokenizer(self)
  File "/home/ubuntu/gpt-neox/megatron/tokenizer/tokenizer.py", line 45, in build_tokenizer
    tokenizer = HFTokenizer(args.vocab_file)
  File "/home/ubuntu/gpt-neox/megatron/tokenizer/tokenizer.py", line 229, in __init__
    self.tokenizer = Tokenizer.from_file(vocab_file)
Exception: expected `,` or `}` at line 1 column 5
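That parse error comes from the Rust `tokenizers` JSON parser inside `Tokenizer.from_file`, which suggests the configured `vocab_file` is not a full `tokenizer.json`-style file (e.g. it is a GPT-2 `vocab.json`/`merges.txt` pair instead). A rough sanity check one could run on the path before handing it to HFTokenizer; the "model" key heuristic is an assumption, not an official validator:

```python
import json

def looks_like_hf_tokenizer_json(path):
    # Tokenizer.from_file expects a serialized tokenizers-library file:
    # a JSON object with top-level keys like "model" and "added_tokens",
    # not a plain vocab mapping or a merges text file.
    try:
        with open(path) as f:
            obj = json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    return isinstance(obj, dict) and "model" in obj
```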
MTDS inference with 345m
request IP: 127.0.0.1
{"prompts": ["Hello my name is"], "tokens_to_generate": 32}
start time: 2024-01-23 23:58:31.383247
127.0.0.1 - - [23/Jan/2024 23:58:32] "PUT /api HTTP/1.1" 200 -
(the same request was repeated ten more times between 23:59:06 and 23:59:16; each 32-token generation returned HTTP 200 in roughly one second)
gpt-neox moe
nexperts=2...
[2024-01-23 19:43:51,492] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 492.27 | backward_microstep: 1273.92 | backward_inner_microstep: 1252.38 | backward_allreduce_microstep: 21.28 | step_microstep: 61.84
[2024-01-23 19:43:51,492] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 492.24 (forward_moe: 371.40, 1st alltoall: 1.32, 2nd alltoall: 1.35, top-k: 197.07)
[2024-01-23 19:43:51,492] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 0.00 | backward: 1273.92 | backward_inner: 1252.40 | backward_allreduce: 21.28 | step: 61.85
(eight more near-identical timing blocks follow for the steps leading up to 100: forward_microstep 490-496 ms, of which top-k 197-199 ms and each alltoall ~1.3 ms; backward_microstep 1270-1277 ms; step_microstep 61-71 ms)
[2024-01-23 19:44:08,283] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 8.50 | optimizer_step: 16.69
[2024-01-23 19:44:08,283] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=1, lr=[1.8562499999999997e-05, 1.8562499999999997e-05, 1.8562499999999997e-05, 1.8562499999999997e-05], mom=[[0.9, 0.95], [0.9, 0.95], [0.9, 0.95], [0.9, 0.95]]
[2024-01-23 19:44:08,284] [INFO] [timer.py:215:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=1.7303383609839986, CurrSamplesPerSec=2.1486294003304676, MemAllocated=3.55GB, MaxMemAllocated=7.46GB
[2024-01-23 19:44:08,285] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 493.25 | backward_microstep: 1274.53 | backward_inner_microstep: 1252.74 | backward_allreduce_microstep: 21.50 | step_microstep: 62.48
[2024-01-23 19:44:08,285] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 493.22 (forward_moe: 371.95, 1st alltoall: 1.32, 2nd alltoall: 1.35, top-k: 197.59)
[2024-01-23 19:44:08,285] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 0.00 | backward: 1274.51 | backward_inner: 1252.76 | backward_allreduce: 21.50 | step: 62.52
samples/sec: 0.877 | iteration 100/ 320000 | elapsed time per iteration (ms): 4563.5 | learning rate: 1.856E-05 | approx flops per GPU: 2.2TFLOPS | lm_loss: 9.289667E+00 | loss scale: 65536.0 | number of skipped iterations: 1 | number of nan iterations: 0 |
Base logs
nexperts=1
[2024-01-23 19:47:38,254] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.53 | overflow_check: 8.96 | unscale_and_clip: 2.70 | basic_step: 10.29 | update_fp16: 2.49
(eight more near-identical optimizer timing lines: compute_norm ~1.5 ms, overflow_check 9-12 ms, unscale_and_clip ~2.7 ms, basic_step ~10.3 ms, update_fp16 ~2.5 ms)
[2024-01-23 19:47:43,891] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=0, lr=[1.4999999999999999e-05, 1.4999999999999999e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
[2024-01-23 19:47:43,893] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 17.56 | forward_microstep: 1397.88 | backward_microstep: 4818.85 | backward_inner_microstep: 4815.64 | backward_allreduce_microstep: 0.72 | step_microstep: 322.56
[2024-01-23 19:47:43,895] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 1397.43 | backward: 4818.82 | backward_inner: 4815.72 | backward_allreduce: 0.79 | step: 322.95
steps: 80 loss: 7.7299 iter time (s): 0.699 samples/sec: 5.721
[2024-01-23 19:47:43,897] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms)
[2024-01-23 19:47:44,603] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.56 | overflow_check: 9.57 | unscale_and_clip: 2.70 | basic_step: 10.30 | update_fp16: 2.49
(nine more near-identical optimizer timing lines, same magnitudes as above)
[2024-01-23 19:47:50,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=0, lr=[1.6874999999999997e-05, 1.6874999999999997e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
[2024-01-23 19:47:50,970] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 23.14 | forward_microstep: 1399.06 | backward_microstep: 4830.62 | backward_inner_microstep: 4827.09 | backward_allreduce_microstep: 0.84 | step_microstep: 324.58
[2024-01-23 19:47:50,972] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 1398.64 | backward: 4830.65 | backward_inner: 4827.24 | backward_allreduce: 0.89 | step: 325.01
steps: 90 loss: 8.3211 iter time (s): 0.702 samples/sec: 5.700
[2024-01-23 19:47:50,974] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms)
[2024-01-23 19:47:57,543] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.57 | overflow_check: 8.96 | unscale_and_clip: 2.69 | basic_step: 10.30 | update_fp16: 2.49
(nine more near-identical optimizer timing lines; per-line costs are unchanged, but the wall-clock gaps between them are irregular, ranging from ~0.7 s up to ~22 s between 19:48:10 and 19:48:32)
[2024-01-23 19:48:56,510] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=0, lr=[1.875e-05, 1.875e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
[2024-01-23 19:48:56,512] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 27.80 | forward_microstep: 1412.57 | backward_microstep: 4820.62 | backward_inner_microstep: 4816.06 | backward_allreduce_microstep: 1.06 | step_microstep: 323.80
[2024-01-23 19:48:56,514] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 1411.98 | backward: 4820.60 | backward_inner: 4816.21 | backward_allreduce: 1.08 | step: 324.33
steps: 100 loss: 7.9945 iter time (s): 0.703 samples/sec: 5.692
[2024-01-23 19:48:56,515] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms)
samples/sec: 2.822 | iteration 100/ 320000 | elapsed time per iteration (ms): 1417.6 | learning rate: 1.875E-05 | approx flops per GPU: 7.0TFLOPS | lm_loss: 9.210952E+00 | loss scale: 65536.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
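Comparing the two step-100 summary lines (2-expert MoE above vs. this dense baseline), quick arithmetic on the logged numbers:

```python
# Numbers copied from the step-100 log lines above.
moe2 = {"samples_per_sec": 0.877, "ms_per_iter": 4563.5, "tflops": 2.2}
dense = {"samples_per_sec": 2.822, "ms_per_iter": 1417.6, "tflops": 7.0}

slowdown = moe2["ms_per_iter"] / dense["ms_per_iter"]
print(f"2-expert MoE is ~{slowdown:.1f}x slower per iteration")  # ~3.2x

# Within the MoE forward pass, top-k routing dominates:
forward_ms, topk_ms = 492.27, 197.07
print(f"top-k routing is {topk_ms / forward_ms:.0%} of forward")  # ~40%
```

So most of the gap is not the all-to-alls (~1.3 ms each) but the top-k gating path.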
MTDS
# moe, num-experts 1
iteration 1/ 1716613 | consumed samples: 256 | consumed tokens: 524288 | elapsed time per iteration (ms): 34374.4 | learning rate: 0.000E+00 | global batch size: 256 | lm loss: 1.088943E+01 | loss scale: 2048.0 | grad norm: 11.603 | num zeros: 0.0 | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 7.447 | tokens per gpu per second (tgs): 15252.280 | TFLOPs: 18.51 |
[2024-01-23 21:49:23,182] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[6.291456e-07, 6.291456e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
steps: 2 loss: 10.8885 iter time (s): 33.112 samples/sec: 7.731
iteration 2/ 1716613 | consumed samples: 512 | consumed tokens: 1048576 | elapsed time per iteration (ms): 33130.1 | learning rate: 6.291E-07 | global batch size: 256 | lm loss: 1.088852E+01 | loss scale: 2048.0 | grad norm: 12.009 | num zeros: 0.0 | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 7.727 | tokens per gpu per second (tgs): 15825.136 | TFLOPs: 19.20 |
[Rank 0] (after 2 iterations) memory (MB) | allocated: 5231.03076171875 | max allocated: 8000.60400390625 | reserved: 9550.0 | max reserved: 9550.0
[2024-01-23 21:49:56,220] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[1.2582912e-06, 1.2582912e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
steps: 3 loss: 10.8943 iter time (s): 33.011 samples/sec: 7.755
iteration 3/ 1716613 | consumed samples: 768 | consumed tokens: 1572864 | elapsed time per iteration (ms): 33037.9 | learning rate: 1.258E-06 | global batch size: 256 | lm loss: 1.089434E+01 | loss scale: 2048.0 | grad norm: 11.706 | num zeros: 0.0 | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 7.749 | tokens per gpu per second (tgs): 15869.295 | TFLOPs: 19.25 |
# moe, num-experts 2
iteration 1/ 1716613 | consumed samples: 256 | consumed tokens: 524288 | elapsed time per iteration (ms): 109878.0 | learning rate: 0.000E+00 | global batch size: 256 | lm loss: 1.087590E+01 | moe loss: 6.700792E-02 | loss scale: 2048.0 | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 2.330 | tokens per gpu per second (tgs): 4771.546 | TFLOPs: 5.79 |
iteration 2/ 1716613 | consumed samples: 512 | consumed tokens: 1048576 | elapsed time per iteration (ms): 43840.7 | learning rate: 6.291E-07 | global batch size: 256 | lm loss: 1.087427E+01 | moe loss: 6.697765E-02 | loss scale: 2048.0 | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 5.839 | tokens per gpu per second (tgs): 11958.931 | TFLOPs: 14.51 |
[Rank 0] (after 2 iterations) memory (MB) | allocated: 2596.92578125 | max allocated: 6963.3564453125 | reserved: 9186.0 | max reserved: 9186.0
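Same comparison for the MTDS runs, using the iteration-2 lines (iteration 1 includes warm-up):

```python
# Elapsed ms/iter and tokens-per-GPU-per-second at iteration 2,
# copied from the num-experts 1 vs num-experts 2 log lines above.
e1_ms, e2_ms = 33130.1, 43840.7
e1_tgs, e2_tgs = 15825.136, 11958.931

print(f"2-expert run: {e2_ms / e1_ms:.2f}x slower per iteration, "
      f"{1 - e2_tgs / e1_tgs:.0%} lower tgs")
```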
# moe, num-experts 128
without vs. with MoE layers:
DeepSpeedEngine(
(module): SequentialWrapper(
(sequential): Sequential(
(0): EmbeddingPipe(
(word_embeddings): VocabParallelEmbedding()
(embedding_dropout): Dropout(p=0.0, inplace=False)
)
(1): Lambda()
(2): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
      (3)-(13): ParallelTransformerLayerPipe(...)  # eleven more layers, identical to layer (2) above, elided
(14): Lambda()
(15): NormPipe(
(norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(16): ParallelLinearPipe(
(final_linear): ColumnParallelLinear()
)
)
)
)
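A rough weight-only parameter count for the dense ParallelMLP vs. the 2-expert MoE block in these printouts (biases, layernorms, and attention ignored; sizes read off the module reprs):

```python
hidden, ffn = 768, 4 * 768                 # from the (768,) layernorms; 4x FFN assumed
mlp = hidden * ffn + ffn * hidden          # dense_h_to_4h + dense_4h_to_h weights
gate = hidden * 2                          # TopKGate wg: Linear(768, 2, bias=False)
moe = 2 * mlp + gate                       # two expert MLPs + router
print(mlp, moe, moe / mlp)                 # MoE MLP has ~2x the dense MLP weights
```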
DeepSpeedEngine(
(module): SequentialWrapper(
(sequential): Sequential(
(0): EmbeddingPipe(
(word_embeddings): VocabParallelEmbedding()
(embedding_dropout): Dropout(p=0.0, inplace=False)
)
(1): Lambda()
(2): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MoE(
(deepspeed_moe): MOELayer(
(gate): TopKGate(
(wg): Linear(in_features=768, out_features=2, bias=False)
)
(experts): Experts(
(deepspeed_experts): ModuleList(
(0): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
(1): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
)
)
)
(3): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MoE(
(deepspeed_moe): MOELayer(
(gate): TopKGate(
(wg): Linear(in_features=768, out_features=2, bias=False)
)
(experts): Experts(
(deepspeed_experts): ModuleList(
(0): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
(1): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
)
)
)
(4): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MoE(
(deepspeed_moe): MOELayer(
(gate): TopKGate(
(wg): Linear(in_features=768, out_features=2, bias=False)
)
(experts): Experts(
(deepspeed_experts): ModuleList(
(0): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
(1): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
)
)
)
(5): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MoE(
(deepspeed_moe): MOELayer(
(gate): TopKGate(
(wg): Linear(in_features=768, out_features=2, bias=False)
)
(experts): Experts(
(deepspeed_experts): ModuleList(
(0): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
(1): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
)
)
)
(6): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MoE(
(deepspeed_moe): MOELayer(
(gate): TopKGate(
(wg): Linear(in_features=768, out_features=2, bias=False)
)
(experts): Experts(
(deepspeed_experts): ModuleList(
(0): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
(1): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
)
)
)
(7): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MoE(
(deepspeed_moe): MOELayer(
(gate): TopKGate(
(wg): Linear(in_features=768, out_features=2, bias=False)
)
(experts): Experts(
(deepspeed_experts): ModuleList(
(0): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
(1): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
)
)
)
(8): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MoE(
(deepspeed_moe): MOELayer(
(gate): TopKGate(
(wg): Linear(in_features=768, out_features=2, bias=False)
)
(experts): Experts(
(deepspeed_experts): ModuleList(
(0): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
(1): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
)
)
)
(9): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MoE(
(deepspeed_moe): MOELayer(
(gate): TopKGate(
(wg): Linear(in_features=768, out_features=2, bias=False)
)
(experts): Experts(
(deepspeed_experts): ModuleList(
(0): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
(1): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
)
)
)
(10): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MoE(
(deepspeed_moe): MOELayer(
(gate): TopKGate(
(wg): Linear(in_features=768, out_features=2, bias=False)
)
(experts): Experts(
(deepspeed_experts): ModuleList(
(0): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
(1): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
)
)
)
(11): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MoE(
(deepspeed_moe): MOELayer(
(gate): TopKGate(
(wg): Linear(in_features=768, out_features=2, bias=False)
)
(experts): Experts(
(deepspeed_experts): ModuleList(
(0): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
(1): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
)
)
)
(12): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MoE(
(deepspeed_moe): MOELayer(
(gate): TopKGate(
(wg): Linear(in_features=768, out_features=2, bias=False)
)
(experts): Experts(
(deepspeed_experts): ModuleList(
(0): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
(1): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
)
)
)
(13): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0.0, inplace=False)
(dense): RowParallelLinear()
)
(post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MoE(
(deepspeed_moe): MOELayer(
(gate): TopKGate(
(wg): Linear(in_features=768, out_features=2, bias=False)
)
(experts): Experts(
(deepspeed_experts): ModuleList(
(0): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
(1): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
)
)
)
(14): Lambda()
(15): NormPipe(
(norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(16): ParallelLinearPipe(
(final_linear): ColumnParallelLinear()
)
)
)
)
2024-01-22
gpt-neox moe
'PipelineParallelGrid' object has no attribute 'get_tensor_model_parallel_world_size'
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/moe/mappings.py", line 103, in drop_tokens
if mpu is None or mpu.get_tensor_model_parallel_world_size() == 1:
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/moe/sharded_moe.py", line 507, in forward
dispatched_input = drop_tokens(dispatched_input, dim=1)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/moe/layer.py", line 115, in forward
output = self.deepspeed_moe(hidden_states, used_token)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/gpt-neox/megatron/model/transformer.py", line 983, in forward
mlp_output, moe_loss, _ = self.mlp(layernorm_output)
File "/home/ubuntu/gpt-neox/megatron/model/transformer.py", line 1010, in forward
return super().forward(hidden_states, attention_mask)[0], attention_mask
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 327, in exec_func
inputs = layer(inputs)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 555, in forward
outputs = run_function(*inputs_cuda)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 713, in checkpoint
CheckpointFunction.apply(function, all_outputs, *args)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 348, in forward
x = self.activation_checkpoint_func(exec_range_func(start_idx, end_idx), *x)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1731, in forward
loss = self.module(*inputs, **kwargs)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 627, in _exec_forward_pass
outputs = super().forward(inputs)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch
self._exec_schedule(sched)
File "/home/ubuntu/gpt-neox/megatron/training.py", line 787, in train_step_pipe
loss = model.train_batch(data_iter=data_iterator)
File "/home/ubuntu/gpt-neox/megatron/training.py", line 736, in train_step
reduced_loss = train_step_pipe(
File "/home/ubuntu/gpt-neox/megatron/training.py", line 831, in train
loss_dict, skipped_iter = train_step(
File "/home/ubuntu/gpt-neox/megatron/training.py", line 228, in pretrain
iteration = train(
File "/home/ubuntu/gpt-neox/train.py", line 78, in main
pretrain(neox_args=neox_args)
File "/home/ubuntu/gpt-neox/train.py", line 82, in <module> (Current frame)
main()
AttributeError: 'PipelineParallelGrid' object has no attribute 'get_tensor_model_parallel_world_size'
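Possible workaround (untested sketch): newer DeepSpeed MoE code (`drop_tokens`) expects the mpu object to expose `get_tensor_model_parallel_*` accessors, while the `PipelineParallelGrid` gpt-neox hands it only has the older `get_model_parallel_*` names. Aliasing them might unblock the forward pass; `patch_grid` below is a hypothetical shim, not a confirmed fix.

```python
def patch_grid(grid):
    """Alias old-style model-parallel accessors to the tensor-parallel
    names newer DeepSpeed MoE code looks up. Hypothetical shim."""
    for new, old in [
        ("get_tensor_model_parallel_world_size", "get_model_parallel_world_size"),
        ("get_tensor_model_parallel_rank", "get_model_parallel_rank"),
    ]:
        # Only add the alias if the old accessor actually exists.
        if not hasattr(grid, new) and hasattr(grid, old):
            setattr(grid, new, getattr(grid, old))
    return grid
```

Would need to be applied to the grid before it's passed as `mpu` into the DeepSpeed MoE layer.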
Minor
2024-01-18
TinyLlama pretrain
Had to download the tokenizer separately, following the lit-gpt instructions for obtaining checkpoints. Fetched the TinyLlama intermediate checkpoint.
Never finished building flash-attn fused layer norms:
ubuntu@ip-172-31-13-251:~/TinyLlama/flash-attention/csrc/layer_norm$ MAX_JOBS=4 pip install .
Processing /home/ubuntu/TinyLlama/flash-attention/csrc/layer_norm
Preparing metadata (setup.py) ... done
Building wheels for collected packages: dropout-layer-norm
Building wheel for dropout-layer-norm (setup.py) ... -
Megatron-DeepSpeed
It runs, both dense and MoE, I think after fixing DATA_PATH
Inference works
ubuntu@ip-172-31-13-251:~/Megatron-DeepSpeed$ python tools/text_generation_cli.py localhost:5000
Enter prompt: Hello my name is
Enter number of tokens to generate: 32
Megatron Response:
Hello my name is perennlington<|endoftext|>
Enter prompt: Traceback (most recent call last):
File "tools/text_generation_cli.py", line 13, in <module>
sentence = input("Enter prompt: ")
EOFError
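The EOFError is `input()` seeing a closed stdin (Ctrl-D). A small guard in the CLI's prompt loop would exit cleanly instead of crashing; this is a sketch assuming the read loop looks roughly like the one in `text_generation_cli.py`, with `ask` made injectable for testing.

```python
def prompt_loop(ask=input):
    """Read prompts until stdin closes, instead of crashing on EOFError.

    `ask` defaults to the built-in input() but can be swapped out in tests.
    """
    prompts = []
    while True:
        try:
            prompts.append(ask("Enter prompt: "))
        except EOFError:
            break  # Ctrl-D / closed stdin: exit gracefully
    return prompts
```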
gpt-neox
Training MoE fails with
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 1.1272997856140137 seconds
KeyError('moe_params_with_weight_decay')
AttributeError("'Tee' object has no attribute 'isatty'")
Traceback (most recent call last):
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/ipdb/__main__.py", line 232, in launch_ipdb_on_exception
yield
File "/home/ubuntu/gpt-neox/megatron/training.py", line 194, in pretrain
model, optimizer, lr_scheduler = setup_model_and_optimizer(
File "/home/ubuntu/gpt-neox/megatron/training.py", line 652, in setup_model_and_optimizer
model, optimizer, _, lr_scheduler = deepspeed.initialize(
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/__init__.py", line 180, in initialize
engine = PipelineEngine(args=args,
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 55, in __init__
super().__init__(*super_args, **super_kwargs)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1180, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1415, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 197, in __init__
self._configure_moe_settings()
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 556, in _configure_moe_settings
self.real_dp_process_group[i] = self.expert_dp_process_group[group['name']]
KeyError: 'moe_params_with_weight_decay'
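The KeyError looks like a naming mismatch: ZeRO stage-1/2 indexes `expert_dp_process_group` by each MoE param group's `'name'`, so a group named `'moe_params_with_weight_decay'` finds no matching expert process group (DeepSpeed's own grouping helper keys these differently). A toy illustration of the lookup; the group names and dict keys below are illustrative, not DeepSpeed's actual values.

```python
def resolve_expert_groups(param_groups, expert_dp_process_group):
    """Mimic the ZeRO lookup: every MoE param group's 'name' must be a
    key in expert_dp_process_group, or we hit the KeyError seen above."""
    resolved = {}
    for group in param_groups:
        if group.get("moe"):
            name = group["name"]
            if name not in expert_dp_process_group:
                raise KeyError(name)
            resolved[name] = expert_dp_process_group[name]
    return resolved
```

So the fix direction is making gpt-neox's MoE param-group names line up with whatever keys DeepSpeed registered its expert data-parallel process groups under.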
Inference fails: the tokenizer throws an error on the token IDs, hitting a None where it expects a str
Generating samples unconditionally and saving results to sample_output.txt
generate_samples_unconditional() generating...
TypeError('sequence item 67: expected str instance, NoneType found')
Traceback (most recent call last):
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/ipdb/__main__.py", line 232, in launch_ipdb_on_exception
yield
File "generate.py", line 46, in main
generate_samples_unconditional(
File "/home/ubuntu/gpt-neox/megatron/text_generation_utils.py", line 680, in generate_samples_unconditional
generated_texts = generate_samples_from_prompt(
File "/home/ubuntu/gpt-neox/megatron/text_generation_utils.py", line 519, in generate_samples_from_prompt
generated_text = neox_args.tokenizer.detokenize(generated_tokens)
File "/home/ubuntu/gpt-neox/megatron/tokenizer/tokenizer.py", line 177, in detokenize
return self.tokenizer.decode(token_ids)
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3750, in decode
return self._decode(
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 1024, in _decode
sub_texts.append(self.convert_tokens_to_string(current_sub_text))
File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/transformers/models/gpt2/tokenization_gpt2.py", line 322, in convert_tokens_to_string
text = "".join(tokens)
TypeError: sequence item 67: expected str instance, NoneType found
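The None is probably a generated token ID outside the tokenizer's real vocab (e.g. a padded-vocab ID): `convert_ids_to_tokens` maps unknown IDs to None, and the final `"".join(tokens)` falls over. A defensive decode that drops unmappable IDs before joining; the dict stands in for the tokenizer's vocab and is made up for illustration.

```python
def safe_decode(token_ids, id_to_token):
    """Decode token IDs, dropping any the vocab can't map (those would
    come back as None and crash ''.join). id_to_token is a plain dict
    standing in for the tokenizer's id->token table."""
    tokens = [id_to_token.get(i) for i in token_ids]
    return "".join(t for t in tokens if t is not None)
```

Equivalently, filtering or clamping `generated_tokens` to the true vocab size before calling `neox_args.tokenizer.detokenize` should avoid the crash.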