When llama uses bf16 training, there is an abnormal loss #947

Closed
suolyer opened this issue May 22, 2023 · 1 comment
Labels: bug (Something isn't working)

Comments


suolyer commented May 22, 2023

Describe the bug
I use the following configuration to enable bf16:

"precision": "bfloat16",
  "bf16": {
   "enabled": true
    },

When loading llama's weights and training, I get an abnormal lm_loss: it looks normal on the first iteration but jumps to roughly 11 from the second iteration onward:

samples/sec: 1.702 | iteration        1/   55000 | elapsed time per iteration (ms): 18806.7 | learning rate: 1.212E-09 | approx flops per GPU: 5.7TFLOPS | lm_loss: 2.275809E+00 | number of skipped iterations:   0 | number of nan iterations:   0 |
after 1 iterations memory (MB) | allocated: 19278.6044921875 | max allocated: 27906.7255859375 | reserved: 28864.0 | max reserved: 35216.0
time (ms) | forward: 6617.48 | backward: 10201.60 | backward-backward: 10201.28 | backward-allreduce: 0.00 | optimizer: 1884.36 | batch generator: 3237.32
[2023-05-22 21:15:09,288] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[2.4242424242424244e-09, 2.4242424242424244e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
 samples/sec: 2.721 | iteration        2/   55000 | elapsed time per iteration (ms): 11761.2 | learning rate: 2.424E-09 | approx flops per GPU: 9.1TFLOPS | lm_loss: 1.105449E+01 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 1897.36 | backward: 8367.41 | backward-backward: 8367.18 | backward-allreduce: 0.00 | optimizer: 1450.71 | batch generator: 1350.31
[2023-05-22 21:15:19,225] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[3.636363636363637e-09, 3.636363636363637e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:15:20,097] [INFO] [timer.py:215:stop] epoch=0/micro_step=24/global_step=3, RunningAvgSamplesPerSec=4.700705741206184, CurrSamplesPerSec=4.700705741206184, MemAllocated=18.83GB, MaxMemAllocated=27.25GB
 samples/sec: 3.216 | iteration        3/   55000 | elapsed time per iteration (ms): 9950.8 | learning rate: 3.636E-09 | approx flops per GPU: 10.8TFLOPS | lm_loss: 1.107767E+01 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 3641.26 | backward: 5170.31 | backward-backward: 5170.03 | backward-allreduce: 0.00 | optimizer: 1055.48 | batch generator: 3058.58
[2023-05-22 21:15:27,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[4.848484848484849e-09, 4.848484848484849e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:15:28,856] [INFO] [timer.py:215:stop] epoch=0/micro_step=32/global_step=4, RunningAvgSamplesPerSec=5.395925629281533, CurrSamplesPerSec=6.332479882523952, MemAllocated=18.83GB, MaxMemAllocated=27.25GB
 samples/sec: 3.663 | iteration        4/   55000 | elapsed time per iteration (ms): 8735.1 | learning rate: 4.848E-09 | approx flops per GPU: 12.3TFLOPS | lm_loss: 1.103901E+01 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 4174.30 | backward: 3369.20 | backward-backward: 3368.98 | backward-allreduce: 0.00 | optimizer: 1131.46 | batch generator: 3620.85
[2023-05-22 21:15:42,624] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[6.060606060606061e-09, 6.060606060606061e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:15:43,470] [INFO] [timer.py:215:stop] epoch=0/micro_step=40/global_step=5, RunningAvgSamplesPerSec=5.865954850001298, CurrSamplesPerSec=7.103500795915698, MemAllocated=18.83GB, MaxMemAllocated=27.26GB
 samples/sec: 2.183 | iteration        5/   55000 | elapsed time per iteration (ms): 14660.6 | learning rate: 6.061E-09 | approx flops per GPU: 7.3TFLOPS | lm_loss: 1.106879E+01 | number of skipped iterations:   0 | number of nan iterations:   0 |

When we replace bf16 with fp16, using the following configuration:

"precision": "float16",
  "fp16": {
    "enabled": true,

    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1
    },

I get a normal lm_loss that stays around 2.2, as follows (the loss scale of 4096.0 in the log is the configured initial scale, 2^12):

samples/sec: 2.930 | iteration        1/   55000 | elapsed time per iteration (ms): 10921.0 | learning rate: 1.212E-09 | approx flops per GPU: 9.8TFLOPS | lm_loss: 2.276005E+00 | loss scale: 4096.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
after 1 iterations memory (MB) | allocated: 19278.6044921875 | max allocated: 27906.3505859375 | reserved: 30812.0 | max reserved: 37120.0
time (ms) | forward: 4691.33 | backward: 4295.19 | backward-backward: 4294.82 | backward-allreduce: 0.00 | optimizer: 1907.59 | batch generator: 2369.79
[2023-05-22 21:52:36,066] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[2.4242424242424244e-09, 2.4242424242424244e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
 samples/sec: 4.904 | iteration        2/   55000 | elapsed time per iteration (ms): 6524.7 | learning rate: 2.424E-09 | approx flops per GPU: 16.4TFLOPS | lm_loss: 2.175701E+00 | loss scale: 4096.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 663.76 | backward: 4369.18 | backward-backward: 4368.26 | backward-allreduce: 0.00 | optimizer: 1418.27 | batch generator: 84.57
[2023-05-22 21:52:41,634] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[3.636363636363637e-09, 3.636363636363637e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:52:42,592] [INFO] [timer.py:215:stop] epoch=0/micro_step=24/global_step=3, RunningAvgSamplesPerSec=6.647539009386997, CurrSamplesPerSec=6.647539009386997, MemAllocated=18.83GB, MaxMemAllocated=27.25GB
 samples/sec: 5.774 | iteration        3/   55000 | elapsed time per iteration (ms): 5542.2 | learning rate: 3.636E-09 | approx flops per GPU: 19.3TFLOPS | lm_loss: 2.267862E+00 | loss scale: 4096.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 1316.84 | backward: 3117.71 | backward-backward: 3117.28 | backward-allreduce: 0.00 | optimizer: 1101.99 | batch generator: 720.53
[2023-05-22 21:52:47,019] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[4.848484848484849e-09, 4.848484848484849e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:52:47,970] [INFO] [timer.py:215:stop] epoch=0/micro_step=32/global_step=4, RunningAvgSamplesPerSec=6.6534217939786755, CurrSamplesPerSec=6.6593149998127, MemAllocated=18.83GB, MaxMemAllocated=27.25GB
 samples/sec: 5.906 | iteration        4/   55000 | elapsed time per iteration (ms): 5418.1 | learning rate: 4.848E-09 | approx flops per GPU: 19.8TFLOPS | lm_loss: 2.194994E+00 | loss scale: 4096.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 1229.05 | backward: 3074.19 | backward-backward: 3073.77 | backward-allreduce: 0.00 | optimizer: 1069.12 | batch generator: 565.65
[2023-05-22 21:52:52,115] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[6.060606060606061e-09, 6.060606060606061e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:52:53,075] [INFO] [timer.py:215:stop] epoch=0/micro_step=40/global_step=5, RunningAvgSamplesPerSec=6.554320719250964, CurrSamplesPerSec=6.364718886823974, MemAllocated=18.83GB, MaxMemAllocated=27.25GB
 samples/sec: 6.319 | iteration        5/   55000 | elapsed time per iteration (ms): 5064.1 | learning rate: 6.061E-09 | approx flops per GPU: 21.1TFLOPS | lm_loss: 2.281892E+00 | loss scale: 4096.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 742.19 | backward: 3176.28 | backward-backward: 3175.64 | backward-allreduce: 0.00 | optimizer: 1140.68 | batch generator: 29.57
[2023-05-22 21:52:57,429] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[7.272727272727274e-09, 7.272727272727274e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:52:58,397] [INFO] [timer.py:215:stop] epoch=0/micro_step=48/global_step=6, RunningAvgSamplesPerSec=6.418855053096874, CurrSamplesPerSec=6.0440943555750195, MemAllocated=18.83GB, MaxMemAllocated=27.25GB
 samples/sec: 6.013 | iteration        6/   55000 | elapsed time per iteration (ms): 5321.5 | learning rate: 7.273E-09 | approx flops per GPU: 20.1TFLOPS | lm_loss: 2.202072E+00 | loss scale: 4096.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 637.17 | backward: 3661.38 | backward-backward: 3660.93 | backward-allreduce: 0.00 | optimizer: 1018.83 | batch generator: 21.43

I tried several of the methods mentioned in this issue and got the same result.

suolyer added the "bug" label on May 22, 2023
github-merge-queue bot pushed a commit to deepspeedai/DeepSpeed that referenced this issue Jan 18, 2024
…ia `load_module_only` (#4141)

This PR makes some fixes to the case where we want to resume training
from a DeepSpeed ZeRO checkpoint and initialize a new optimizer, while
not using the old optimizer in the checkpoint or relying on its
existence at all.

In this situation, despite passing `load_module_only=True` and `load_optimizer_states=False` to `load_checkpoint()`, the previous behavior was that:
- `self._load_zero_checkpoint` would still be called, which attempts to
load from the (in this case, nonexistent) checkpoint files. This PR
stops this function from being called if using `load_module_only=True`
and `load_optimizer_states=False`. Alternatively, calling this function
may be alright if `"load_from_fp32_weights": true` is set in the
DeepSpeed ZeRO config (reference:
https://github.com/microsoft/DeepSpeed/blob/ff7d5275f2aa916cb5f320e0d817154e96f9cdb6/deepspeed/runtime/engine.py#L733)
but this parameter does not seem to be documented in the docs for ZeRO
config dicts.
- In `_load_checkpoint`, the following code block:
```
if self.optimizer is not None and self.fp16_enabled():
    self.optimizer.refresh_fp32_params()
```
results in `self.optimizer.refresh_fp32_params()` being called only when using FP16. As a result, under bf16 the FP32 optimizer state is never initialized from the 16-bit model weights. This PR removes the fp16-specific condition; a sketch of the resulting behavior follows below.
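
To make the effect of that change concrete, here is a minimal sketch of the behavior described above, assuming the same surrounding engine code as the quoted block; the exact guard in the merged DeepSpeed change may differ:

```
# Illustrative sketch only, not the exact merged diff. With the fp16-specific
# guard removed, the optimizer's fp32 master copies are rebuilt from the
# just-loaded 16-bit module weights for bf16 runs as well, not only fp16 runs.
if self.optimizer is not None:
    self.optimizer.refresh_fp32_params()
```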


Previously reported in:
EleutherAI/gpt-neox#947
EleutherAI/gpt-neox#843

Should also close:
#4017

Fixes: #4944 and #4017

This caused problems for a freshly converted LLaMA checkpoint, which did not contain optimizer states, when trying to train with that model as the initialization. I have confirmed that these fixes prevent this behavior.

cc @Quentin-Anthony @zhangir-azerbayev

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Quentin-Anthony (Member) commented:

This should be resolved by deepspeedai/DeepSpeed#4141
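
For context, the code path that fix targets is the module-only checkpoint load used when training from converted LLaMA weights that carry no optimizer states. The sketch below is a hypothetical illustration of that call, not taken from the gpt-neox code; the tiny model, config values, and checkpoint path are placeholder assumptions:

```
import torch
import deepspeed

# Stand-ins for the real model and config; an actual run would use the
# converted LLaMA model and the bf16 config shown at the top of this issue.
model = torch.nn.Linear(16, 16)
ds_config = {
    "train_batch_size": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
}
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
)

# Module-only load: the checkpoint directory holds weights but no optimizer
# states. Before deepspeedai/DeepSpeed#4141, this path did not refresh the
# fp32 master weights from the loaded bf16 weights, which is consistent with
# the loss blow-up reported above.
load_path, client_state = model_engine.load_checkpoint(
    "checkpoints/llama-converted",   # hypothetical checkpoint directory
    load_module_only=True,
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
)
```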
