• 2024-03-21

    Screenshot 2024-03-21 at 4.59.36 PM.png

  • 2024-02-29

    • megablocks
      • next steps:
        • verify that currently all routers & experts are getting synced in the DP group (not desirable: the experts shouldn’t be); see the check sketch below
        • verify that marking experts as moe_layers prevents that syncing (and doesn’t break anything else in DSE), but that the router is still synced
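        • a rough way to do the verification, as a sketch (how the DP group handle is obtained is an assumption; in practice it would come from whatever mpu/deepspeed exposes for the run):

          import torch
          import torch.distributed as dist

          def find_unsynced_params(model, dp_group=None):
              """Gather a scalar fingerprint (sum) of every parameter from each rank
              in the data-parallel group and report the params that differ.
              Expectation once the fix is in: expert weights may differ across DP
              ranks, the router (and all dense weights) must not."""
              world = dist.get_world_size(group=dp_group)
              unsynced = []
              for name, p in model.named_parameters():
                  fp = p.detach().float().sum().reshape(1)
                  gathered = [torch.zeros_like(fp) for _ in range(world)]
                  dist.all_gather(gathered, fp, group=dp_group)
                  if any(not torch.allclose(gathered[0], g) for g in gathered[1:]):
                      unsynced.append(name)
              return unsynced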
  • 2024-02-26

    • megablocks
      • notes

        • it introduces TP after exhausting EP (source)
        • moe_weight_parallelism is for FSDP…but how does this get fed? (source)
      • is the loss calculated on non-final pipeline layers?

      • is it using the data parallel group for expert parallelism?

        • A: seems like just a hack to make adapting Megatron simpler. They do support FSDP separately too, perhaps for non-Megatron use?
      • how should the router be replicated within the TP group?

        • how do distributed collectives work in the first place? (are all model params automagically initialized identically and synced within the data parallel group? rough answer sketched just below)
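        • rough answer sketch for the init/sync question, based on what torch DDP-style training does (broadcast from rank 0 at construction, then gradient all-reduce each step); not specific to any framework here:

          import torch.distributed as dist

          def broadcast_initial_params(model, src=0, dp_group=None):
              # Overwrite every rank's randomly initialized weights with rank src's
              # copy so all data-parallel replicas start identical; after that the
              # per-step gradient all-reduce keeps them identical.
              for p in model.parameters():
                  dist.broadcast(p.data, src=src, group=dp_group)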
      • how are losses communicated?

      • is this row slicing the TP? what’s hidden_sharding_degree? (toy walk-through after the PR link below)

        if not args.moe_expert_model_parallelism:
            return master_weights

        # Calculate the amount of sharding in each dimension.
        expert_sharding_degree = mpu.expert_sharding_degree(args)
        hidden_sharding_degree = mpu.hidden_sharding_degree(args)

        # Calculate the experts per rank.
        #
        # NOTE: We assign ranks to be expert parallel before going
        # tensor parallel.
        rank = mpu.get_expert_parallel_rank(args)
        expert_rank = rank % expert_sharding_degree
        num_experts_per_rank = num_experts // expert_sharding_degree
        start_expert = expert_rank * num_experts_per_rank
        end_expert = (expert_rank + 1) * num_experts_per_rank

        # Calculate the rows per rank.
        row_rank = rank // expert_sharding_degree
        num_rows_per_rank = ffn_hidden_size // hidden_sharding_degree
        start_row = row_rank * num_rows_per_rank
        end_row = (row_rank + 1) * num_rows_per_rank

        # Slice the weight matrix to get the chunk for this rank.
        with torch.no_grad():
            weights = master_weights[
                start_expert:end_expert, start_row:end_row]
        return weights
        
        • Seems so, this is the PR for TP: https://github.com/stanford-futuredata/megablocks/pull/15/files
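        • toy walk-through of the rank → (expert slice, row slice) mapping above (the numbers are made up, not from a real config):

          # 4-way expert sharding x 2-way hidden sharding = 8 expert-parallel ranks.
          num_experts, ffn_hidden_size = 8, 16
          expert_sharding_degree, hidden_sharding_degree = 4, 2

          for rank in range(expert_sharding_degree * hidden_sharding_degree):
              expert_rank = rank % expert_sharding_degree   # which block of experts
              row_rank = rank // expert_sharding_degree     # which slice of ffn rows
              e0 = expert_rank * (num_experts // expert_sharding_degree)
              e1 = e0 + num_experts // expert_sharding_degree
              r0 = row_rank * (ffn_hidden_size // hidden_sharding_degree)
              r1 = r0 + ffn_hidden_size // hidden_sharding_degree
              print(f"rank {rank}: experts [{e0}:{e1}], rows [{r0}:{r1}]")

          so ranks 0-3 each hold 2 whole experts with the first half of the ffn rows, and ranks 4-7 hold the same 2-expert blocks with the second half of the rows, i.e. hidden_sharding_degree reads like a row-wise TP factor applied on top of EP.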
      • Understand this (source)

        Our Megatron fork is mostly for small-scale experiments and uses the data parallel process group for expert model parallelism. If you scale out to multiple nodes with data parallelism and expert parallelism enabled you'll do expert parallelism across those nodes, which can be slow because the all2alls become a bit expensive.

        One thing you could try is using pipeline parallelism between nodes. If you were to use MegaBlocks in a custom framework, I'd recommend using something like FSDP across nodes and expert parallelism within each node.

        I do not have reference scripts for multi-node training, but for pipeline parallelism the flags are the same as they are in upstream Megatron-LM. I hope this helps!
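      • sketch of the suggested layout (EP within each node, FSDP/DP across nodes) using plain torch.distributed groups; the 8-GPU-per-node figure and the group layout are my assumptions, not megablocks code:

        import torch.distributed as dist

        def build_ep_and_dp_groups(gpus_per_node=8):
            # new_group() must be called by every rank for every group, so loop over
            # all of them and keep the one this rank belongs to.
            world, rank = dist.get_world_size(), dist.get_rank()
            num_nodes = world // gpus_per_node
            ep_group = dp_group = None

            # Expert parallelism inside each node (keeps the all2alls intra-node).
            for n in range(num_nodes):
                ranks = list(range(n * gpus_per_node, (n + 1) * gpus_per_node))
                g = dist.new_group(ranks=ranks)
                if rank in ranks:
                    ep_group = g

            # FSDP / data parallelism across nodes: same local GPU slot on each node.
            for local in range(gpus_per_node):
                ranks = list(range(local, world, gpus_per_node))
                g = dist.new_group(ranks=ranks)
                if rank in ranks:
                    dp_group = g
            return ep_group, dp_group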

  • 2024-01-23

    • gpt-neox inference with

      • gpt-neox inference: couldn’t figure out how to load pretrained models from HF.

        For s3 checkpointing, please install hf_transfer either using requirements/requirements-s3.txt or <https://github.com/huggingface/hf_transfer>
        2024-01-24:01:24:33,675 INFO     [utils.py:160] NumExpr defaulting to 4 threads.
        2024-01-24:01:24:33,930 INFO     [config.py:58] PyTorch version 1.13.0+cu117 available.
        ======================================================================
        Warning the following script will delete files within checkpoints/neox_converted/pythia/70m
        Warning the following script will delete this directory /tmp/ckpt_tmp_dir
        ======================================================================
        NeoXArgs.from_ymls() ['configs/pythia/70M.yml', 'configs/local_setup.yml']
        2024-01-24:01:24:35,262 INFO     [arguments.py:849] NeoXArgs.calculate_derived() Total number of GPUs determined to be: 1
        NeoXArgs.configure_distributed_args() using world size: 1 and model-parallel size: 1
        > building HFTokenizer tokenizer ...
        Traceback (most recent call last):
          File "tools/ckpts/convert_hf_to_sequential.py", line 505, in <module>
            neox_args.build_tokenizer()
          File "/home/ubuntu/gpt-neox/megatron/neox_arguments/arguments.py", line 147, in build_tokenizer
            self.tokenizer = build_tokenizer(self)
          File "/home/ubuntu/gpt-neox/megatron/tokenizer/tokenizer.py", line 45, in build_tokenizer
            tokenizer = HFTokenizer(args.vocab_file)
          File "/home/ubuntu/gpt-neox/megatron/tokenizer/tokenizer.py", line 229, in __init__
            self.tokenizer = Tokenizer.from_file(vocab_file)
        Exception: expected `,` or `}` at line 1 column 5
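
        • the `expected ',' or '}'` error suggests whatever is set as vocab_file isn’t a valid `tokenizers` JSON (Tokenizer.from_file wants the single-file tokenizer.json format). One thing to try (untested sketch; the pythia model id and output path are just examples):

          from transformers import AutoTokenizer

          # Export the fast tokenizer's backend to the single-file JSON that
          # tokenizers.Tokenizer.from_file() (what HFTokenizer calls) expects,
          # then point vocab_file in the neox config at it.
          tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
          tok.backend_tokenizer.save("data/pythia_tokenizer.json")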
        
      • MTDS inference with 345m

        request IP: 127.0.0.1
        {"prompts": ["Hello my name is"], "tokens_to_generate": 32}
        start time:  2024-01-23 23:58:31.383247
        127.0.0.1 - - [23/Jan/2024 23:58:32] "PUT /api HTTP/1.1" 200 -
        request IP: 127.0.0.1
        {"prompts": ["Hello my name is"], "tokens_to_generate": 32}
        start time:  2024-01-23 23:59:06.512734
        127.0.0.1 - - [23/Jan/2024 23:59:07] "PUT /api HTTP/1.1" 200 -
        request IP: 127.0.0.1
        {"prompts": ["Hello my name is"], "tokens_to_generate": 32}
        start time:  2024-01-23 23:59:07.587983
        127.0.0.1 - - [23/Jan/2024 23:59:08] "PUT /api HTTP/1.1" 200 -
        request IP: 127.0.0.1
        {"prompts": ["Hello my name is"], "tokens_to_generate": 32}
        start time:  2024-01-23 23:59:08.611759
        127.0.0.1 - - [23/Jan/2024 23:59:09] "PUT /api HTTP/1.1" 200 -
        request IP: 127.0.0.1
        {"prompts": ["Hello my name is"], "tokens_to_generate": 32}
        start time:  2024-01-23 23:59:09.683290
        127.0.0.1 - - [23/Jan/2024 23:59:10] "PUT /api HTTP/1.1" 200 -
        request IP: 127.0.0.1
        {"prompts": ["Hello my name is"], "tokens_to_generate": 32}
        start time:  2024-01-23 23:59:10.690690
        127.0.0.1 - - [23/Jan/2024 23:59:11] "PUT /api HTTP/1.1" 200 -
        request IP: 127.0.0.1
        {"prompts": ["Hello my name is"], "tokens_to_generate": 32}
        start time:  2024-01-23 23:59:11.728337
        127.0.0.1 - - [23/Jan/2024 23:59:12] "PUT /api HTTP/1.1" 200 -
        request IP: 127.0.0.1
        {"prompts": ["Hello my name is"], "tokens_to_generate": 32}
        start time:  2024-01-23 23:59:12.765529
        127.0.0.1 - - [23/Jan/2024 23:59:13] "PUT /api HTTP/1.1" 200 -
        request IP: 127.0.0.1
        {"prompts": ["Hello my name is"], "tokens_to_generate": 32}
        start time:  2024-01-23 23:59:13.819065
        127.0.0.1 - - [23/Jan/2024 23:59:14] "PUT /api HTTP/1.1" 200 -
        request IP: 127.0.0.1
        {"prompts": ["Hello my name is"], "tokens_to_generate": 32}
        start time:  2024-01-23 23:59:14.842124
        127.0.0.1 - - [23/Jan/2024 23:59:15] "PUT /api HTTP/1.1" 200 -
        request IP: 127.0.0.1
        {"prompts": ["Hello my name is"], "tokens_to_generate": 32}
        start time:  2024-01-23 23:59:15.872852
        127.0.0.1 - - [23/Jan/2024 23:59:16] "PUT /api HTTP/1.1" 200 -
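
        • for reference, a minimal client matching these requests (the port isn’t in the log, so 5000 is just a guess):

          import requests

          resp = requests.put(
              "http://127.0.0.1:5000/api",  # port assumed; use whatever the server binds
              json={"prompts": ["Hello my name is"], "tokens_to_generate": 32},
          )
          print(resp.json())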
        
    • gpt-neox moe

      • Figuring out the parallelism config
      • Logs look different from base. Same speed with nexperts=1, much slower with nexperts=2 (slowdown worked out after the base logs below).
      nexperts=2...
      [2024-01-23 19:43:51,492] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 492.27 | backward_microstep: 1273.92 | backward_inner_microstep: 1252.38 | backward_allreduce_microstep: 21.28 | step_microstep: 61.84
      [2024-01-23 19:43:51,492] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 492.24 (forward_moe: 371.40, 1st alltoall: 1.32, 2nd alltoall: 1.35, top-k: 197.07)
      [2024-01-23 19:43:51,492] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 0.00 | backward: 1273.92 | backward_inner: 1252.40 | backward_allreduce: 21.28 | step: 61.85
      [2024-01-23 19:43:53,357] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 8.49 | optimizer_step: 16.70
      [2024-01-23 19:43:53,357] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 490.92 | backward_microstep: 1275.11 | backward_inner_microstep: 1254.66 | backward_allreduce_microstep: 20.21 | step_microstep: 61.67
      [2024-01-23 19:43:53,358] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 490.89 (forward_moe: 373.46, 1st alltoall: 1.32, 2nd alltoall: 1.35, top-k: 198.63)
      [2024-01-23 19:43:53,358] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 0.00 | backward: 1275.11 | backward_inner: 1254.67 | backward_allreduce: 20.22 | step: 61.67
      [2024-01-23 19:43:55,226] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 8.53 | optimizer_step: 16.71
      [2024-01-23 19:43:55,227] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 495.61 | backward_microstep: 1275.26 | backward_inner_microstep: 1252.79 | backward_allreduce_microstep: 22.22 | step_microstep: 61.68
      [2024-01-23 19:43:55,227] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 495.58 (forward_moe: 372.04, 1st alltoall: 1.32, 2nd alltoall: 1.35, top-k: 197.63)
      [2024-01-23 19:43:55,228] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 0.00 | backward: 1275.26 | backward_inner: 1252.81 | backward_allreduce: 22.22 | step: 61.69
      [2024-01-23 19:43:57,092] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 8.51 | optimizer_step: 16.70
      [2024-01-23 19:43:57,093] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 491.13 | backward_microstep: 1273.29 | backward_inner_microstep: 1250.94 | backward_allreduce_microstep: 22.10 | step_microstep: 63.19
      [2024-01-23 19:43:57,093] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 491.09 (forward_moe: 371.54, 1st alltoall: 1.32, 2nd alltoall: 1.35, top-k: 197.05)
      [2024-01-23 19:43:57,093] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 0.00 | backward: 1273.29 | backward_inner: 1250.96 | backward_allreduce: 22.10 | step: 63.22
      [2024-01-23 19:43:58,969] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 8.65 | optimizer_step: 16.69
      [2024-01-23 19:43:58,970] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 492.43 | backward_microstep: 1276.46 | backward_inner_microstep: 1253.49 | backward_allreduce_microstep: 22.63 | step_microstep: 70.64
      [2024-01-23 19:43:58,970] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 492.40 (forward_moe: 372.34, 1st alltoall: 1.32, 2nd alltoall: 1.35, top-k: 197.69)
      [2024-01-23 19:43:58,970] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 0.00 | backward: 1276.48 | backward_inner: 1253.54 | backward_allreduce: 22.67 | step: 70.64
      [2024-01-23 19:44:00,829] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 8.52 | optimizer_step: 16.69
      [2024-01-23 19:44:00,830] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 490.72 | backward_microstep: 1270.46 | backward_inner_microstep: 1251.14 | backward_allreduce_microstep: 19.08 | step_microstep: 61.69
      [2024-01-23 19:44:00,830] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 490.66 (forward_moe: 371.14, 1st alltoall: 1.32, 2nd alltoall: 1.35, top-k: 197.02)
      [2024-01-23 19:44:00,831] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 0.00 | backward: 1270.46 | backward_inner: 1251.15 | backward_allreduce: 19.09 | step: 61.70
      [2024-01-23 19:44:02,692] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 8.48 | optimizer_step: 16.72
      [2024-01-23 19:44:02,693] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 492.22 | backward_microstep: 1271.68 | backward_inner_microstep: 1249.68 | backward_allreduce_microstep: 21.75 | step_microstep: 61.85
      [2024-01-23 19:44:02,693] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 492.19 (forward_moe: 371.21, 1st alltoall: 1.32, 2nd alltoall: 1.35, top-k: 196.70)
      [2024-01-23 19:44:02,693] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 0.00 | backward: 1271.68 | backward_inner: 1249.70 | backward_allreduce: 21.75 | step: 61.85
      [2024-01-23 19:44:04,554] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 8.49 | optimizer_step: 16.70
      [2024-01-23 19:44:04,554] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 493.41 | backward_microstep: 1270.02 | backward_inner_microstep: 1250.86 | backward_allreduce_microstep: 18.91 | step_microstep: 61.47
      [2024-01-23 19:44:04,555] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 493.37 (forward_moe: 371.28, 1st alltoall: 1.32, 2nd alltoall: 1.35, top-k: 197.03)
      [2024-01-23 19:44:04,555] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 0.00 | backward: 1270.03 | backward_inner: 1250.87 | backward_allreduce: 18.92 | step: 61.48
      [2024-01-23 19:44:06,417] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.62 | optimizer_gradients: 8.51 | optimizer_step: 16.68
      [2024-01-23 19:44:06,417] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 492.15 | backward_microstep: 1271.31 | backward_inner_microstep: 1250.80 | backward_allreduce_microstep: 20.28 | step_microstep: 62.41
      [2024-01-23 19:44:06,418] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 492.12 (forward_moe: 371.15, 1st alltoall: 1.32, 2nd alltoall: 1.35, top-k: 196.96)
      [2024-01-23 19:44:06,418] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 0.00 | backward: 1271.31 | backward_inner: 1250.81 | backward_allreduce: 20.29 | step: 62.42
      [2024-01-23 19:44:08,283] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.61 | optimizer_gradients: 8.50 | optimizer_step: 16.69
      [2024-01-23 19:44:08,283] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=1, lr=[1.8562499999999997e-05, 1.8562499999999997e-05, 1.8562499999999997e-05, 1.8562499999999997e-05], mom=[[0.9, 0.95], [0.9, 0.95], [0.9, 0.95], [0.9, 0.95]]
      [2024-01-23 19:44:08,284] [INFO] [timer.py:215:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=1.7303383609839986, CurrSamplesPerSec=2.1486294003304676, MemAllocated=3.55GB, MaxMemAllocated=7.46GB
      [2024-01-23 19:44:08,285] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 493.25 | backward_microstep: 1274.53 | backward_inner_microstep: 1252.74 | backward_allreduce_microstep: 21.50 | step_microstep: 62.48
      [2024-01-23 19:44:08,285] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 493.22 (forward_moe: 371.95, 1st alltoall: 1.32, 2nd alltoall: 1.35, top-k: 197.59)
      [2024-01-23 19:44:08,285] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 0.00 | backward: 1274.51 | backward_inner: 1252.76 | backward_allreduce: 21.50 | step: 62.52
       samples/sec: 0.877 | iteration      100/  320000 | elapsed time per iteration (ms): 4563.5 | learning rate: 1.856E-05 | approx flops per GPU: 2.2TFLOPS | lm_loss: 9.289667E+00 | loss scale: 65536.0 | number of skipped iterations:   1 | number of nan iterations:   0 |
      
      • Base logs

        nexperts=1
        [2024-01-23 19:47:38,254] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.53 | overflow_check: 8.96 | unscale_and_clip: 2.70 | basic_step: 10.29 | update_fp16: 2.49
        [2024-01-23 19:47:38,958] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.56 | overflow_check: 8.93 | unscale_and_clip: 2.69 | basic_step: 10.31 | update_fp16: 2.49
        [2024-01-23 19:47:39,666] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.58 | overflow_check: 11.94 | unscale_and_clip: 2.70 | basic_step: 10.31 | update_fp16: 2.48
        [2024-01-23 19:47:40,374] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.56 | overflow_check: 9.63 | unscale_and_clip: 2.71 | basic_step: 10.29 | update_fp16: 2.49
        [2024-01-23 19:47:41,080] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.55 | overflow_check: 8.82 | unscale_and_clip: 2.68 | basic_step: 10.30 | update_fp16: 2.49
        [2024-01-23 19:47:41,782] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.55 | overflow_check: 8.83 | unscale_and_clip: 2.71 | basic_step: 10.31 | update_fp16: 2.49
        [2024-01-23 19:47:42,485] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.54 | overflow_check: 8.87 | unscale_and_clip: 2.72 | basic_step: 10.30 | update_fp16: 2.49
        [2024-01-23 19:47:43,186] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.54 | overflow_check: 8.89 | unscale_and_clip: 2.70 | basic_step: 10.31 | update_fp16: 2.49
        [2024-01-23 19:47:43,891] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.54 | overflow_check: 9.01 | unscale_and_clip: 2.69 | basic_step: 10.30 | update_fp16: 2.48
        [2024-01-23 19:47:43,891] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=0, lr=[1.4999999999999999e-05, 1.4999999999999999e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
        [2024-01-23 19:47:43,893] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 17.56 | forward_microstep: 1397.88 | backward_microstep: 4818.85 | backward_inner_microstep: 4815.64 | backward_allreduce_microstep: 0.72 | step_microstep: 322.56
        [2024-01-23 19:47:43,895] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 1397.43 | backward: 4818.82 | backward_inner: 4815.72 | backward_allreduce: 0.79 | step: 322.95
        steps: 80 loss: 7.7299 iter time (s): 0.699 samples/sec: 5.721
        [2024-01-23 19:47:43,897] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms)
        [2024-01-23 19:47:44,603] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.56 | overflow_check: 9.57 | unscale_and_clip: 2.70 | basic_step: 10.30 | update_fp16: 2.49
        [2024-01-23 19:47:45,313] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.60 | overflow_check: 9.62 | unscale_and_clip: 2.69 | basic_step: 10.31 | update_fp16: 2.49
        [2024-01-23 19:47:46,020] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.58 | overflow_check: 9.14 | unscale_and_clip: 2.69 | basic_step: 10.29 | update_fp16: 2.49
        [2024-01-23 19:47:46,721] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.54 | overflow_check: 8.90 | unscale_and_clip: 2.70 | basic_step: 10.31 | update_fp16: 2.49
        [2024-01-23 19:47:47,427] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.66 | overflow_check: 10.95 | unscale_and_clip: 2.71 | basic_step: 10.30 | update_fp16: 2.49
        [2024-01-23 19:47:48,129] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.54 | overflow_check: 8.98 | unscale_and_clip: 2.70 | basic_step: 10.30 | update_fp16: 2.48
        [2024-01-23 19:47:48,831] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.54 | overflow_check: 8.85 | unscale_and_clip: 2.71 | basic_step: 10.31 | update_fp16: 2.49
        [2024-01-23 19:47:49,540] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.60 | overflow_check: 9.61 | unscale_and_clip: 2.70 | basic_step: 10.31 | update_fp16: 2.48
        [2024-01-23 19:47:50,249] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.58 | overflow_check: 9.59 | unscale_and_clip: 2.70 | basic_step: 10.29 | update_fp16: 2.49
        [2024-01-23 19:47:50,967] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.55 | overflow_check: 9.10 | unscale_and_clip: 2.70 | basic_step: 10.30 | update_fp16: 2.49
        [2024-01-23 19:47:50,968] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=0, lr=[1.6874999999999997e-05, 1.6874999999999997e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
        [2024-01-23 19:47:50,970] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 23.14 | forward_microstep: 1399.06 | backward_microstep: 4830.62 | backward_inner_microstep: 4827.09 | backward_allreduce_microstep: 0.84 | step_microstep: 324.58
        [2024-01-23 19:47:50,972] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 1398.64 | backward: 4830.65 | backward_inner: 4827.24 | backward_allreduce: 0.89 | step: 325.01
        steps: 90 loss: 8.3211 iter time (s): 0.702 samples/sec: 5.700
        [2024-01-23 19:47:50,974] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms)
        [2024-01-23 19:47:57,543] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.57 | overflow_check: 8.96 | unscale_and_clip: 2.69 | basic_step: 10.30 | update_fp16: 2.49
        [2024-01-23 19:47:59,874] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.53 | overflow_check: 8.93 | unscale_and_clip: 2.70 | basic_step: 10.30 | update_fp16: 2.48
        [2024-01-23 19:48:02,275] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.55 | overflow_check: 8.93 | unscale_and_clip: 2.70 | basic_step: 10.30 | update_fp16: 2.49
        [2024-01-23 19:48:10,979] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.66 | overflow_check: 9.76 | unscale_and_clip: 2.70 | basic_step: 10.29 | update_fp16: 2.48
        [2024-01-23 19:48:32,687] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.65 | overflow_check: 9.86 | unscale_and_clip: 2.69 | basic_step: 10.29 | update_fp16: 2.49
        [2024-01-23 19:48:47,036] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.57 | overflow_check: 9.70 | unscale_and_clip: 2.70 | basic_step: 10.30 | update_fp16: 2.48
        [2024-01-23 19:48:50,073] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.56 | overflow_check: 8.90 | unscale_and_clip: 2.70 | basic_step: 10.28 | update_fp16: 2.48
        [2024-01-23 19:48:52,675] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.57 | overflow_check: 9.25 | unscale_and_clip: 2.69 | basic_step: 10.30 | update_fp16: 2.49
        [2024-01-23 19:48:55,803] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.52 | overflow_check: 8.83 | unscale_and_clip: 2.68 | basic_step: 10.29 | update_fp16: 2.49
        [2024-01-23 19:48:56,509] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | compute_norm: 1.72 | overflow_check: 9.53 | unscale_and_clip: 2.70 | basic_step: 10.29 | update_fp16: 2.49
        [2024-01-23 19:48:56,510] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=0, lr=[1.875e-05, 1.875e-05], mom=[[0.9, 0.95], [0.9, 0.95]]
        [2024-01-23 19:48:56,512] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 27.80 | forward_microstep: 1412.57 | backward_microstep: 4820.62 | backward_inner_microstep: 4816.06 | backward_allreduce_microstep: 1.06 | step_microstep: 323.80
        [2024-01-23 19:48:56,514] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms) | forward: 1411.98 | backward: 4820.60 | backward_inner: 4816.21 | backward_allreduce: 1.08 | step: 324.33
        steps: 100 loss: 7.9945 iter time (s): 0.703 samples/sec: 5.692
        [2024-01-23 19:48:56,515] [INFO] [logging.py:96:log_dist] [Rank 0] rank=0 time (ms)
         samples/sec: 2.822 | iteration      100/  320000 | elapsed time per iteration (ms): 1417.6 | learning rate: 1.875E-05 | approx flops per GPU: 7.0TFLOPS | lm_loss: 9.210952E+00 | loss scale: 65536.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
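
      • Working out the slowdown at step 100: base does 1417.6 ms/iter (2.822 samples/sec, 7.0 TFLOPS) vs 4563.5 ms/iter with nexperts=2 (0.877 samples/sec, 2.2 TFLOPS), so 4563.5 / 1417.6 ≈ 3.2x slower. Within the nexperts=2 forward (~492 ms), forward_moe is ~372 ms and ~197 ms of that is the top-k, while the two all2alls are only ~1.3 ms each.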
        
      • MTDS

        # moe, num-experts 1
        iteration        1/ 1716613 | consumed samples:          256 | consumed tokens:       524288 | elapsed time per iteration (ms): 34374.4 | learning rate: 0.000E+00 | global batch size:   256 | lm loss: 1.088943E+01 | loss scale: 2048.0 | grad norm: 11.603 | num zeros: 0.0 | actual seqlen:  2048 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 7.447 | tokens per gpu per second (tgs): 15252.280 | TFLOPs: 18.51 |
        [2024-01-23 21:49:23,182] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[6.291456e-07, 6.291456e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
        steps: 2 loss: 10.8885 iter time (s): 33.112 samples/sec: 7.731
         iteration        2/ 1716613 | consumed samples:          512 | consumed tokens:      1048576 | elapsed time per iteration (ms): 33130.1 | learning rate: 6.291E-07 | global batch size:   256 | lm loss: 1.088852E+01 | loss scale: 2048.0 | grad norm: 12.009 | num zeros: 0.0 | actual seqlen:  2048 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 7.727 | tokens per gpu per second (tgs): 15825.136 | TFLOPs: 19.20 |
        [Rank 0] (after 2 iterations) memory (MB) | allocated: 5231.03076171875 | max allocated: 8000.60400390625 | reserved: 9550.0 | max reserved: 9550.0
        [2024-01-23 21:49:56,220] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[1.2582912e-06, 1.2582912e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
        steps: 3 loss: 10.8943 iter time (s): 33.011 samples/sec: 7.755
         iteration        3/ 1716613 | consumed samples:          768 | consumed tokens:      1572864 | elapsed time per iteration (ms): 33037.9 | learning rate: 1.258E-06 | global batch size:   256 | lm loss: 1.089434E+01 | loss scale: 2048.0 | grad norm: 11.706 | num zeros: 0.0 | actual seqlen:  2048 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 7.749 | tokens per gpu per second (tgs): 15869.295 | TFLOPs: 19.25 |
        
        # moe, num-experts 2
        iteration        1/ 1716613 | consumed samples:          256 | consumed tokens:       524288 | elapsed time per iteration (ms): 109878.0 | learning rate: 0.000E+00 | global batch size:   256 | lm loss: 1.087590E+01 | moe loss: 6.700792E-02 | loss scale: 2048.0 | actual seqlen:  2048 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 2.330 | tokens per gpu per second (tgs): 4771.546 | TFLOPs: 5.79 |
        iteration        2/ 1716613 | consumed samples:          512 | consumed tokens:      1048576 | elapsed time per iteration (ms): 43840.7 | learning rate: 6.291E-07 | global batch size:   256 | lm loss: 1.087427E+01 | moe loss: 6.697765E-02 | loss scale: 2048.0 | actual seqlen:  2048 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 5.839 | tokens per gpu per second (tgs): 11958.931 | TFLOPs: 14.51 |
        [Rank 0] (after 2 iterations) memory (MB) | allocated: 2596.92578125 | max allocated: 6963.3564453125 | reserved: 9186.0 | max reserved: 9186.0
        # moe, num-experts 128
        
        
      • without vs with moe layers:

        DeepSpeedEngine(
          (module): SequentialWrapper(
            (sequential): Sequential(
              (0): EmbeddingPipe(
                (word_embeddings): VocabParallelEmbedding()
                (embedding_dropout): Dropout(p=0.0, inplace=False)
              )
              (1): Lambda()
              (2): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): ParallelMLP(
                  (dense_h_to_4h): ColumnParallelLinear()
                  (dense_4h_to_h): RowParallelLinear()
                )
              )
              (3): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): ParallelMLP(
                  (dense_h_to_4h): ColumnParallelLinear()
                  (dense_4h_to_h): RowParallelLinear()
                )
              )
              (4): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): ParallelMLP(
                  (dense_h_to_4h): ColumnParallelLinear()
                  (dense_4h_to_h): RowParallelLinear()
                )
              )
              (5): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): ParallelMLP(
                  (dense_h_to_4h): ColumnParallelLinear()
                  (dense_4h_to_h): RowParallelLinear()
                )
              )
              (6): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): ParallelMLP(
                  (dense_h_to_4h): ColumnParallelLinear()
                  (dense_4h_to_h): RowParallelLinear()
                )
              )
              (7): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): ParallelMLP(
                  (dense_h_to_4h): ColumnParallelLinear()
                  (dense_4h_to_h): RowParallelLinear()
                )
              )
              (8): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): ParallelMLP(
                  (dense_h_to_4h): ColumnParallelLinear()
                  (dense_4h_to_h): RowParallelLinear()
                )
              )
              (9): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): ParallelMLP(
                  (dense_h_to_4h): ColumnParallelLinear()
                  (dense_4h_to_h): RowParallelLinear()
                )
              )
              (10): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): ParallelMLP(
                  (dense_h_to_4h): ColumnParallelLinear()
                  (dense_4h_to_h): RowParallelLinear()
                )
              )
              (11): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): ParallelMLP(
                  (dense_h_to_4h): ColumnParallelLinear()
                  (dense_4h_to_h): RowParallelLinear()
                )
              )
              (12): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): ParallelMLP(
                  (dense_h_to_4h): ColumnParallelLinear()
                  (dense_4h_to_h): RowParallelLinear()
                )
              )
              (13): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): ParallelMLP(
                  (dense_h_to_4h): ColumnParallelLinear()
                  (dense_4h_to_h): RowParallelLinear()
                )
              )
              (14): Lambda()
              (15): NormPipe(
                (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              )
              (16): ParallelLinearPipe(
                (final_linear): ColumnParallelLinear()
              )
            )
          )
        )
        
        DeepSpeedEngine(
          (module): SequentialWrapper(
            (sequential): Sequential(
              (0): EmbeddingPipe(
                (word_embeddings): VocabParallelEmbedding()
                (embedding_dropout): Dropout(p=0.0, inplace=False)
              )
              (1): Lambda()
              (2): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MoE(
                  (deepspeed_moe): MOELayer(
                    (gate): TopKGate(
                      (wg): Linear(in_features=768, out_features=2, bias=False)
                    )
                    (experts): Experts(
                      (deepspeed_experts): ModuleList(
                        (0): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                        (1): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                      )
                    )
                  )
                )
              )
              (3): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MoE(
                  (deepspeed_moe): MOELayer(
                    (gate): TopKGate(
                      (wg): Linear(in_features=768, out_features=2, bias=False)
                    )
                    (experts): Experts(
                      (deepspeed_experts): ModuleList(
                        (0): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                        (1): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                      )
                    )
                  )
                )
              )
              (4): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MoE(
                  (deepspeed_moe): MOELayer(
                    (gate): TopKGate(
                      (wg): Linear(in_features=768, out_features=2, bias=False)
                    )
                    (experts): Experts(
                      (deepspeed_experts): ModuleList(
                        (0): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                        (1): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                      )
                    )
                  )
                )
              )
              (5): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MoE(
                  (deepspeed_moe): MOELayer(
                    (gate): TopKGate(
                      (wg): Linear(in_features=768, out_features=2, bias=False)
                    )
                    (experts): Experts(
                      (deepspeed_experts): ModuleList(
                        (0): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                        (1): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                      )
                    )
                  )
                )
              )
              (6): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MoE(
                  (deepspeed_moe): MOELayer(
                    (gate): TopKGate(
                      (wg): Linear(in_features=768, out_features=2, bias=False)
                    )
                    (experts): Experts(
                      (deepspeed_experts): ModuleList(
                        (0): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                        (1): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                      )
                    )
                  )
                )
              )
              (7): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MoE(
                  (deepspeed_moe): MOELayer(
                    (gate): TopKGate(
                      (wg): Linear(in_features=768, out_features=2, bias=False)
                    )
                    (experts): Experts(
                      (deepspeed_experts): ModuleList(
                        (0): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                        (1): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                      )
                    )
                  )
                )
              )
              (8): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MoE(
                  (deepspeed_moe): MOELayer(
                    (gate): TopKGate(
                      (wg): Linear(in_features=768, out_features=2, bias=False)
                    )
                    (experts): Experts(
                      (deepspeed_experts): ModuleList(
                        (0): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                        (1): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                      )
                    )
                  )
                )
              )
              (9): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MoE(
                  (deepspeed_moe): MOELayer(
                    (gate): TopKGate(
                      (wg): Linear(in_features=768, out_features=2, bias=False)
                    )
                    (experts): Experts(
                      (deepspeed_experts): ModuleList(
                        (0): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                        (1): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                      )
                    )
                  )
                )
              )
              (10): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MoE(
                  (deepspeed_moe): MOELayer(
                    (gate): TopKGate(
                      (wg): Linear(in_features=768, out_features=2, bias=False)
                    )
                    (experts): Experts(
                      (deepspeed_experts): ModuleList(
                        (0): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                        (1): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                      )
                    )
                  )
                )
              )
              (11): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MoE(
                  (deepspeed_moe): MOELayer(
                    (gate): TopKGate(
                      (wg): Linear(in_features=768, out_features=2, bias=False)
                    )
                    (experts): Experts(
                      (deepspeed_experts): ModuleList(
                        (0): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                        (1): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                      )
                    )
                  )
                )
              )
              (12): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MoE(
                  (deepspeed_moe): MOELayer(
                    (gate): TopKGate(
                      (wg): Linear(in_features=768, out_features=2, bias=False)
                    )
                    (experts): Experts(
                      (deepspeed_experts): ModuleList(
                        (0): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                        (1): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                      )
                    )
                  )
                )
              )
              (13): ParallelTransformerLayerPipe(
                (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (attention): ParallelSelfAttention(
                  (query_key_value): ColumnParallelLinear()
                  (rotary_emb): RotaryEmbedding()
                  (scale_mask_softmax): FusedScaleMaskSoftmax()
                  (attention_dropout): Dropout(p=0.0, inplace=False)
                  (dense): RowParallelLinear()
                )
                (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
                (mlp): MoE(
                  (deepspeed_moe): MOELayer(
                    (gate): TopKGate(
                      (wg): Linear(in_features=768, out_features=2, bias=False)
                    )
                    (experts): Experts(
                      (deepspeed_experts): ModuleList(
                        (0): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                        (1): ParallelMLP(
                          (dense_h_to_4h): ColumnParallelLinear()
                          (dense_4h_to_h): RowParallelLinear()
                        )
                      )
                    )
                  )
                )
              )
              (14): Lambda()
              (15): NormPipe(
                (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              )
              (16): ParallelLinearPipe(
                (final_linear): ColumnParallelLinear()
              )
            )
          )
        )
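
      • the only structural difference is the mlp: ParallelMLP gets swapped for a DeepSpeed MoE wrapping num_experts copies of ParallelMLP plus a TopKGate router (the Linear(768, 2) is hidden_size → num_experts). Roughly how that wrapper is built, as a sketch against deepspeed.moe.layer.MoE with a stand-in expert module (argument names beyond hidden_size / expert / num_experts / k not double-checked here):

        import torch.nn as nn
        from deepspeed.moe.layer import MoE

        hidden_size = 768

        # Stand-in for gpt-neox's ParallelMLP; the wrapper replicates it per expert.
        expert = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

        moe_mlp = MoE(
            hidden_size=hidden_size,
            expert=expert,
            num_experts=2,   # gate wg becomes Linear(hidden_size, num_experts)
            k=1,             # top-k routing; the k used in the run isn't visible in the dump
        )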
        
  • 2024-01-22

    • gpt-neox moe

      'PipelineParallelGrid' object has no attribute 'get_tensor_model_parallel_world_size'
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/moe/mappings.py", line 103, in drop_tokens
          if mpu is None or mpu.get_tensor_model_parallel_world_size() == 1:
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/moe/sharded_moe.py", line 507, in forward
          dispatched_input = drop_tokens(dispatched_input, dim=1)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
          return forward_call(*input, **kwargs)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/moe/layer.py", line 115, in forward
          output = self.deepspeed_moe(hidden_states, used_token)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
          return forward_call(*input, **kwargs)
        File "/home/ubuntu/gpt-neox/megatron/model/transformer.py", line 983, in forward
          mlp_output, moe_loss, _ = self.mlp(layernorm_output)
        File "/home/ubuntu/gpt-neox/megatron/model/transformer.py", line 1010, in forward
          return super().forward(hidden_states, attention_mask)[0], attention_mask
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
          return forward_call(*input, **kwargs)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 327, in exec_func
          inputs = layer(inputs)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 555, in forward
          outputs = run_function(*inputs_cuda)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 713, in checkpoint
          CheckpointFunction.apply(function, all_outputs, *args)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 348, in forward
          x = self.activation_checkpoint_func(exec_range_func(start_idx, end_idx), *x)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
          return forward_call(*input, **kwargs)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1731, in forward
          loss = self.module(*inputs, **kwargs)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
          ret_val = func(*args, **kwargs)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 627, in _exec_forward_pass
          outputs = super().forward(inputs)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule
          self._exec_instr(**cmd.kwargs)
        File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch
          self._exec_schedule(sched)
        File "/home/ubuntu/gpt-neox/megatron/training.py", line 787, in train_step_pipe
          loss = model.train_batch(data_iter=data_iterator)
        File "/home/ubuntu/gpt-neox/megatron/training.py", line 736, in train_step
          reduced_loss = train_step_pipe(
        File "/home/ubuntu/gpt-neox/megatron/training.py", line 831, in train
          loss_dict, skipped_iter = train_step(
        File "/home/ubuntu/gpt-neox/megatron/training.py", line 228, in pretrain
          iteration = train(
        File "/home/ubuntu/gpt-neox/train.py", line 78, in main
          pretrain(neox_args=neox_args)
        File "/home/ubuntu/gpt-neox/train.py", line 82, in <module> (Current frame)
          main()
      AttributeError: 'PipelineParallelGrid' object has no attribute 'get_tensor_model_parallel_world_size'
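
      • A possible (untested) workaround: alias the Megatron-style accessor that deepspeed.moe.mappings.drop_tokens asks for onto whatever object gpt-neox hands DeepSpeed as mpu. Sketch only; the older accessor names used below are assumptions about the installed gpt-neox/DeepSpeed versions, not verified.

        # Hypothetical shim: call on the object DeepSpeed receives as `mpu` before
        # the MoE layers run. It only adds aliases when an older accessor exists;
        # the group alias is speculative, in case other MoE helpers ask for it too.
        def patch_mpu_for_moe(mpu):
            if not hasattr(mpu, "get_tensor_model_parallel_world_size"):
                if hasattr(mpu, "get_model_parallel_world_size"):
                    mpu.get_tensor_model_parallel_world_size = mpu.get_model_parallel_world_size
                elif hasattr(mpu, "get_slice_parallel_world_size"):
                    mpu.get_tensor_model_parallel_world_size = mpu.get_slice_parallel_world_size
            if not hasattr(mpu, "get_tensor_model_parallel_group"):
                if hasattr(mpu, "get_model_parallel_group"):
                    mpu.get_tensor_model_parallel_group = mpu.get_model_parallel_group
                elif hasattr(mpu, "get_slice_parallel_group"):
                    mpu.get_tensor_model_parallel_group = mpu.get_slice_parallel_group
            return mpu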
      
    • Minor

      • Can’t get master_port/MASTER_PORT to work in gpt-neox
  • 2024-01-18

    • TinyLlama pretrain

      • Had to download tokenizer separately following lit-gpt instructions for obtaining checkpoints. Fetched the TinyLlama intermediate checkpoint.

      • Never finished building flash-attn's fused layer-norm extension:

        ubuntu@ip-172-31-13-251:~/TinyLlama/flash-attention/csrc/layer_norm$ MAX_JOBS=4 pip install .
        Processing /home/ubuntu/TinyLlama/flash-attention/csrc/layer_norm
          Preparing metadata (setup.py) ... done
        Building wheels for collected packages: dropout-layer-norm
          Building wheel for dropout-layer-norm (setup.py) ... -
        
    • Megatron-DeepSpeed

      • Both the dense and MoE configs run, I think after fixing DATA_PATH

      • Inference works

        ubuntu@ip-172-31-13-251:~/Megatron-DeepSpeed$ python tools/text_generation_cli.py localhost:5000
        Enter prompt: Hello my name is
        Enter number of tokens to generate: 32
        Megatron Response:
        Hello my name is perennlington<|endoftext|>
        Enter prompt: Traceback (most recent call last):
          File "tools/text_generation_cli.py", line 13, in <module>
            sentence = input("Enter prompt: ")
        EOFError
        
    • gpt-neox

      • Training MoE fails with the following error (a possible param-group fix is sketched after the trace)

        Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
        ninja: no work to do.
        Loading extension module utils...
        Time to load utils op: 1.1272997856140137 seconds
        KeyError('moe_params_with_weight_decay')
        AttributeError("'Tee' object has no attribute 'isatty'")
        Traceback (most recent call last):
          File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/ipdb/__main__.py", line 232, in launch_ipdb_on_exception
            yield
          File "/home/ubuntu/gpt-neox/megatron/training.py", line 194, in pretrain
            model, optimizer, lr_scheduler = setup_model_and_optimizer(
          File "/home/ubuntu/gpt-neox/megatron/training.py", line 652, in setup_model_and_optimizer
            model, optimizer, _, lr_scheduler = deepspeed.initialize(
          File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/__init__.py", line 180, in initialize
            engine = PipelineEngine(args=args,
          File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 55, in __init__
            super().__init__(*super_args, **super_kwargs)
          File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
            self._configure_optimizer(optimizer, model_parameters)
          File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1180, in _configure_optimizer
            self.optimizer = self._configure_zero_optimizer(basic_optimizer)
          File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1415, in _configure_zero_optimizer
            optimizer = DeepSpeedZeroOptimizer(
          File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 197, in __init__
            self._configure_moe_settings()
          File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 556, in _configure_moe_settings
            self.real_dp_process_group[i] = self.expert_dp_process_group[group['name']]
        KeyError: 'moe_params_with_weight_decay'
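
      • Hedged reading of the KeyError: ZeRO stage 1/2 looks up expert_dp_process_group by each MoE param group's 'name', and a custom name like 'moe_params_with_weight_decay' is not among the keys it built. One thing to try, assuming the installed DeepSpeed ships this helper (the toy parameter lists below are placeholders, not gpt-neox code), is letting DeepSpeed split the optimizer param groups itself so the names line up:

        # Sketch only: have DeepSpeed create the MoE param groups so their 'name'
        # fields match the keys ZeRO uses for expert_dp_process_group.
        import torch
        from deepspeed.moe.utils import split_params_into_different_moe_groups_for_optimizer

        # Toy stand-ins for the weight-decay / no-weight-decay lists gpt-neox builds.
        model = torch.nn.Linear(8, 8)
        decay_params = [p for n, p in model.named_parameters() if "bias" not in n]
        no_decay_params = [p for n, p in model.named_parameters() if "bias" in n]

        param_groups = [
            {"params": decay_params, "weight_decay": 0.01, "name": "params_with_weight_decay"},
            {"params": no_decay_params, "weight_decay": 0.0, "name": "params_without_weight_decay"},
        ]
        # Expert parameters (if any are marked as MoE params) get pulled out into
        # extra groups whose names match what ZeRO expects to find.
        param_groups = split_params_into_different_moe_groups_for_optimizer(param_groups)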
        
      • Inference fails: detokenizing the generated token IDs raises a TypeError because one of the tokens comes back as None instead of a str (a defensive workaround is sketched after the trace)

        Generating samples unconditionally and saving results to sample_output.txt
        generate_samples_unconditional() generating...
        TypeError('sequence item 67: expected str instance, NoneType found')
        Traceback (most recent call last):
          File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/ipdb/__main__.py", line 232, in launch_ipdb_on_exception
            yield
          File "generate.py", line 46, in main
            generate_samples_unconditional(
          File "/home/ubuntu/gpt-neox/megatron/text_generation_utils.py", line 680, in generate_samples_unconditional
            generated_texts = generate_samples_from_prompt(
          File "/home/ubuntu/gpt-neox/megatron/text_generation_utils.py", line 519, in generate_samples_from_prompt
            generated_text = neox_args.tokenizer.detokenize(generated_tokens)
          File "/home/ubuntu/gpt-neox/megatron/tokenizer/tokenizer.py", line 177, in detokenize
            return self.tokenizer.decode(token_ids)
          File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3750, in decode
            return self._decode(
          File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 1024, in _decode
            sub_texts.append(self.convert_tokens_to_string(current_sub_text))
          File "/home/ubuntu/gpt-neox/.direnv/python-3.8/lib/python3.8/site-packages/transformers/models/gpt2/tokenization_gpt2.py", line 322, in convert_tokens_
        to_string
            text = "".join(tokens)
        TypeError: sequence item 67: expected str instance, NoneType found
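
      • Hedged guess at the cause: the model's padded vocabulary is larger than the HF tokenizer's, so a sampled ID past the tokenizer's range maps to None during decoding and the final "".join fails. A defensive workaround sketch (the function name is hypothetical, not gpt-neox API):

        # Sketch: drop token IDs the HF tokenizer does not know before decoding,
        # so an out-of-vocabulary ID cannot surface as None inside decode().
        def safe_detokenize(hf_tokenizer, token_ids):
            known = [int(t) for t in token_ids if 0 <= int(t) < len(hf_tokenizer)]
            return hf_tokenizer.decode(known)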