When llama uses bf16 training, there is an abnormal loss #947
Labels: bug
github-merge-queue bot pushed a commit to deepspeedai/DeepSpeed that referenced this issue on Jan 18, 2024:
…ia `load_module_only` (#4141)

This PR makes some fixes to the case where we want to resume training from a DeepSpeed ZeRO checkpoint and initialize a new optimizer, while not using the old optimizer in the checkpoint or relying on its existence at all.

In this situation, despite passing `load_module_only=True` and `load_optimizer_states=False` to `load_checkpoint()`, the previous behavior was that:

- `self._load_zero_checkpoint` would still be called, which attempts to load from the (in this case, nonexistent) checkpoint files. This PR stops that function from being called when `load_module_only=True` and `load_optimizer_states=False` are used. Alternatively, calling it may be acceptable if `"load_from_fp32_weights": true` is set in the DeepSpeed ZeRO config (reference: https://github.com/microsoft/DeepSpeed/blob/ff7d5275f2aa916cb5f320e0d817154e96f9cdb6/deepspeed/runtime/engine.py#L733), but this parameter does not seem to be documented in the docs for ZeRO config dicts.
- In `_load_checkpoint`, the following code block:

  ```
  if self.optimizer is not None and self.fp16_enabled():
      self.optimizer.refresh_fp32_params()
  ```

  results in `self.optimizer.refresh_fp32_params()` being called only when FP16 is used. As a result, the FP32 optimizer state is never initialized from the 16-bit model weights. This PR removes the fp16-specific condition.

Previously reported in: EleutherAI/gpt-neox#947, EleutherAI/gpt-neox#843

Should also close: #4017
Fixes: #4944 and #4017

This caused problems for a freshly converted Llama checkpoint, which did not contain optimizer states, when trying to train with this model as initialization. I have confirmed that the following fixes prevent this behavior.

cc @Quentin-Anthony @zhangir-azerbayev

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
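For orientation, the resume pattern this PR targets looks roughly like the sketch below. The `load_checkpoint()` flags come from the discussion above; the `engine` variable and the checkpoint path are placeholders, not taken from the issue.

```python
# Sketch of the scenario the PR describes: reload only the module weights from
# a ZeRO checkpoint and start with a fresh optimizer, ignoring any (possibly
# nonexistent) optimizer states in the checkpoint.
# `engine` is assumed to be a DeepSpeed engine returned by deepspeed.initialize();
# the checkpoint path is a placeholder.
load_path, _ = engine.load_checkpoint(
    "checkpoints/llama-converted",
    load_module_only=True,        # restore model weights only
    load_optimizer_states=False,  # do not touch optimizer states
)
# With the fix, the optimizer's FP32 master weights are refreshed from the
# freshly loaded 16-bit weights for bf16 runs as well, not only for fp16.
```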
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this issue on Feb 17, 2024 (same commit message as above).
This should be resolved by deepspeedai/DeepSpeed#4141.
Describe the bug
I use the following configuration to enable bf16
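As a rough illustration only (not the exact settings from this report), a DeepSpeed-style config enabling bf16 typically looks like the sketch below; the batch sizes and ZeRO stage here are assumed values.

```python
# Illustrative DeepSpeed config dict enabling bf16 training; the batch sizes
# and ZeRO stage below are assumptions, not the settings used in this report.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},   # for the fp16 comparison run, this block is
                                 # replaced by {"fp16": {"enabled": True}}
    "zero_optimization": {"stage": 1},
}
```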
When loading llama's weights and training, I get an abnormal lm_loss. When we replace bf16 with fp16, I get the normal lm_loss. I tried several methods mentioned in this issue and got the same result.