When llama uses bf16 training, there is an abnormal loss #947

Closed
suolyer opened this issue May 22, 2023 · 1 comment
Labels: bug (Something isn't working)

Comments


suolyer commented May 22, 2023

Describe the bug
I use the following configuration to enable bf16:

"precision": "bfloat16",
  "bf16": {
   "enabled": true
    },

When loading llama's weights and training, I get an abnormal lm_loss: it looks normal on the first iteration but jumps to roughly 11 from the second iteration onward:

samples/sec: 1.702 | iteration        1/   55000 | elapsed time per iteration (ms): 18806.7 | learning rate: 1.212E-09 | approx flops per GPU: 5.7TFLOPS | lm_loss: 2.275809E+00 | number of skipped iterations:   0 | number of nan iterations:   0 |
after 1 iterations memory (MB) | allocated: 19278.6044921875 | max allocated: 27906.7255859375 | reserved: 28864.0 | max reserved: 35216.0
time (ms) | forward: 6617.48 | backward: 10201.60 | backward-backward: 10201.28 | backward-allreduce: 0.00 | optimizer: 1884.36 | batch generator: 3237.32
[2023-05-22 21:15:09,288] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[2.4242424242424244e-09, 2.4242424242424244e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
 samples/sec: 2.721 | iteration        2/   55000 | elapsed time per iteration (ms): 11761.2 | learning rate: 2.424E-09 | approx flops per GPU: 9.1TFLOPS | lm_loss: 1.105449E+01 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 1897.36 | backward: 8367.41 | backward-backward: 8367.18 | backward-allreduce: 0.00 | optimizer: 1450.71 | batch generator: 1350.31
[2023-05-22 21:15:19,225] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[3.636363636363637e-09, 3.636363636363637e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:15:20,097] [INFO] [timer.py:215:stop] epoch=0/micro_step=24/global_step=3, RunningAvgSamplesPerSec=4.700705741206184, CurrSamplesPerSec=4.700705741206184, MemAllocated=18.83GB, MaxMemAllocated=27.25GB
 samples/sec: 3.216 | iteration        3/   55000 | elapsed time per iteration (ms): 9950.8 | learning rate: 3.636E-09 | approx flops per GPU: 10.8TFLOPS | lm_loss: 1.107767E+01 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 3641.26 | backward: 5170.31 | backward-backward: 5170.03 | backward-allreduce: 0.00 | optimizer: 1055.48 | batch generator: 3058.58
[2023-05-22 21:15:27,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[4.848484848484849e-09, 4.848484848484849e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:15:28,856] [INFO] [timer.py:215:stop] epoch=0/micro_step=32/global_step=4, RunningAvgSamplesPerSec=5.395925629281533, CurrSamplesPerSec=6.332479882523952, MemAllocated=18.83GB, MaxMemAllocated=27.25GB
 samples/sec: 3.663 | iteration        4/   55000 | elapsed time per iteration (ms): 8735.1 | learning rate: 4.848E-09 | approx flops per GPU: 12.3TFLOPS | lm_loss: 1.103901E+01 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 4174.30 | backward: 3369.20 | backward-backward: 3368.98 | backward-allreduce: 0.00 | optimizer: 1131.46 | batch generator: 3620.85
[2023-05-22 21:15:42,624] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[6.060606060606061e-09, 6.060606060606061e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:15:43,470] [INFO] [timer.py:215:stop] epoch=0/micro_step=40/global_step=5, RunningAvgSamplesPerSec=5.865954850001298, CurrSamplesPerSec=7.103500795915698, MemAllocated=18.83GB, MaxMemAllocated=27.26GB
 samples/sec: 2.183 | iteration        5/   55000 | elapsed time per iteration (ms): 14660.6 | learning rate: 6.061E-09 | approx flops per GPU: 7.3TFLOPS | lm_loss: 1.106879E+01 | number of skipped iterations:   0 | number of nan iterations:   0 |

When we replace bf16 with fp16, using the following configuration:

"precision": "float16",
  "fp16": {
    "enabled": true,

    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1
    },

I get a normal lm_loss that stays around 2.2, as follows (the loss scale of 4096.0 in the log is the configured initial scale, 2^12):

samples/sec: 2.930 | iteration        1/   55000 | elapsed time per iteration (ms): 10921.0 | learning rate: 1.212E-09 | approx flops per GPU: 9.8TFLOPS | lm_loss: 2.276005E+00 | loss scale: 4096.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
after 1 iterations memory (MB) | allocated: 19278.6044921875 | max allocated: 27906.3505859375 | reserved: 30812.0 | max reserved: 37120.0
time (ms) | forward: 4691.33 | backward: 4295.19 | backward-backward: 4294.82 | backward-allreduce: 0.00 | optimizer: 1907.59 | batch generator: 2369.79
[2023-05-22 21:52:36,066] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[2.4242424242424244e-09, 2.4242424242424244e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
 samples/sec: 4.904 | iteration        2/   55000 | elapsed time per iteration (ms): 6524.7 | learning rate: 2.424E-09 | approx flops per GPU: 16.4TFLOPS | lm_loss: 2.175701E+00 | loss scale: 4096.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 663.76 | backward: 4369.18 | backward-backward: 4368.26 | backward-allreduce: 0.00 | optimizer: 1418.27 | batch generator: 84.57
[2023-05-22 21:52:41,634] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[3.636363636363637e-09, 3.636363636363637e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:52:42,592] [INFO] [timer.py:215:stop] epoch=0/micro_step=24/global_step=3, RunningAvgSamplesPerSec=6.647539009386997, CurrSamplesPerSec=6.647539009386997, MemAllocated=18.83GB, MaxMemAllocated=27.25GB
 samples/sec: 5.774 | iteration        3/   55000 | elapsed time per iteration (ms): 5542.2 | learning rate: 3.636E-09 | approx flops per GPU: 19.3TFLOPS | lm_loss: 2.267862E+00 | loss scale: 4096.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 1316.84 | backward: 3117.71 | backward-backward: 3117.28 | backward-allreduce: 0.00 | optimizer: 1101.99 | batch generator: 720.53
[2023-05-22 21:52:47,019] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[4.848484848484849e-09, 4.848484848484849e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:52:47,970] [INFO] [timer.py:215:stop] epoch=0/micro_step=32/global_step=4, RunningAvgSamplesPerSec=6.6534217939786755, CurrSamplesPerSec=6.6593149998127, MemAllocated=18.83GB, MaxMemAllocated=27.25GB
 samples/sec: 5.906 | iteration        4/   55000 | elapsed time per iteration (ms): 5418.1 | learning rate: 4.848E-09 | approx flops per GPU: 19.8TFLOPS | lm_loss: 2.194994E+00 | loss scale: 4096.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 1229.05 | backward: 3074.19 | backward-backward: 3073.77 | backward-allreduce: 0.00 | optimizer: 1069.12 | batch generator: 565.65
[2023-05-22 21:52:52,115] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[6.060606060606061e-09, 6.060606060606061e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:52:53,075] [INFO] [timer.py:215:stop] epoch=0/micro_step=40/global_step=5, RunningAvgSamplesPerSec=6.554320719250964, CurrSamplesPerSec=6.364718886823974, MemAllocated=18.83GB, MaxMemAllocated=27.25GB
 samples/sec: 6.319 | iteration        5/   55000 | elapsed time per iteration (ms): 5064.1 | learning rate: 6.061E-09 | approx flops per GPU: 21.1TFLOPS | lm_loss: 2.281892E+00 | loss scale: 4096.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 742.19 | backward: 3176.28 | backward-backward: 3175.64 | backward-allreduce: 0.00 | optimizer: 1140.68 | batch generator: 29.57
[2023-05-22 21:52:57,429] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[7.272727272727274e-09, 7.272727272727274e-09], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-05-22 21:52:58,397] [INFO] [timer.py:215:stop] epoch=0/micro_step=48/global_step=6, RunningAvgSamplesPerSec=6.418855053096874, CurrSamplesPerSec=6.0440943555750195, MemAllocated=18.83GB, MaxMemAllocated=27.25GB
 samples/sec: 6.013 | iteration        6/   55000 | elapsed time per iteration (ms): 5321.5 | learning rate: 7.273E-09 | approx flops per GPU: 20.1TFLOPS | lm_loss: 2.202072E+00 | loss scale: 4096.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 637.17 | backward: 3661.38 | backward-backward: 3660.93 | backward-allreduce: 0.00 | optimizer: 1018.83 | batch generator: 21.43

I tried several of the methods mentioned in this issue and got the same result.

suolyer added the "bug" label on May 22, 2023
github-merge-queue bot pushed a commit to deepspeedai/DeepSpeed that referenced this issue Jan 18, 2024
…ia `load_module_only` (#4141)

This PR makes some fixes to the case where we want to resume training
from a DeepSpeed ZeRO checkpoint and initialize a new optimizer, while
not using the old optimizer in the checkpoint or relying on its
existence at all.

In this situation, despite passing `load_module_only=True` and `load_optimizer_states=False` to `load_checkpoint()`, the previous behavior was that:
- `self._load_zero_checkpoint` would still be called, which attempts to
load from the (in this case, nonexistent) checkpoint files. This PR
stops this function from being called if using `load_module_only=True`
and `load_optimizer_states=False`. Alternatively, calling this function
may be alright if `"load_from_fp32_weights": true` is set in the
DeepSpeed ZeRO config (reference:
https://github.com/microsoft/DeepSpeed/blob/ff7d5275f2aa916cb5f320e0d817154e96f9cdb6/deepspeed/runtime/engine.py#L733)
but this parameter does not seem to be documented in the docs for ZeRO
config dicts.
- In `_load_checkpoint`, the following code block:
```
if self.optimizer is not None and self.fp16_enabled():
    self.optimizer.refresh_fp32_params()
```
results in `self.optimizer.refresh_fp32_params()` being called only when using FP16. As a result, under bf16 the FP32 optimizer state is never initialized from the 16-bit model weights. This PR removes the fp16-specific condition; a sketch of the resulting behavior follows below.
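
To make the effect of that change concrete, here is a minimal sketch of the behavior described above, assuming the same surrounding engine code as the quoted block; the exact guard in the merged DeepSpeed change may differ:

```
# Illustrative sketch only, not the exact merged diff. With the fp16-specific
# guard removed, the optimizer's fp32 master copies are rebuilt from the
# just-loaded 16-bit module weights for bf16 runs as well, not only fp16 runs.
if self.optimizer is not None:
    self.optimizer.refresh_fp32_params()
```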


Previously reported in:
EleutherAI/gpt-neox#947
EleutherAI/gpt-neox#843

Should also close:
#4017

Fixes: #4944 and #4017

This caused problems for a freshly converted LLaMA checkpoint, which did not contain optimizer states, when trying to train with that model as the initialization. I have confirmed that these fixes prevent this behavior.

cc @Quentin-Anthony @zhangir-azerbayev

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Quentin-Anthony (Member) commented:

This should be resolved by deepspeedai/DeepSpeed#4141
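
For context, the code path that fix targets is the module-only checkpoint load used when training from converted LLaMA weights that carry no optimizer states. The sketch below is a hypothetical illustration of that call, not taken from the gpt-neox code; the tiny model, config values, and checkpoint path are placeholder assumptions:

```
import torch
import deepspeed

# Stand-ins for the real model and config; an actual run would use the
# converted LLaMA model and the bf16 config shown at the top of this issue.
model = torch.nn.Linear(16, 16)
ds_config = {
    "train_batch_size": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
}
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
)

# Module-only load: the checkpoint directory holds weights but no optimizer
# states. Before deepspeedai/DeepSpeed#4141, this path did not refresh the
# fp32 master weights from the loaded bf16 weights, which is consistent with
# the loss blow-up reported above.
load_path, client_state = model_engine.load_checkpoint(
    "checkpoints/llama-converted",   # hypothetical checkpoint directory
    load_module_only=True,
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
)
```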
