QLoRA + DeepSpeed ZeRO-3 #1637
AnirudhVIyer started this conversation in General
Replies: 1 comment
-
It's very hard to say what the issue could be based on the info you've provided. This thread contains a couple of tips from users who encountered the same issue. Most notably, if you're training in float16 (aka fp16, half precision), that is a likely culprit. Otherwise, you would have to try some of the other tips or test out different hyperparameters.
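For example, if your hardware supports bfloat16, switching the whole run from fp16 to bf16 sidesteps the dynamic loss scaler entirely. A minimal sketch of the precision-related pieces, assuming Accelerate + DeepSpeed as in your traceback (not your exact script, and the argument values are placeholders):

# Minimal sketch, not the exact script from this thread: the precision-related
# pieces of a QLoRA + ZeRO-3 run switched from fp16 to bf16.
# Assumes a GPU that supports bfloat16 (Ampere or newer).
import torch
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin
from transformers import BitsAndBytesConfig

# bf16 has the same exponent range as fp32, so DeepSpeed's dynamic loss
# scaler (the code that raises the "cannot decrease scale" exception)
# is never used.
accelerator = Accelerator(
    mixed_precision="bf16",
    deepspeed_plugin=DeepSpeedPlugin(zero_stage=3),
)

# Keep the 4-bit compute dtype consistent with the training precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

If you drive the run from a DeepSpeed JSON config instead, the equivalent change is to remove the fp16 block and set "bf16": {"enabled": true}.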
-
While running a fine-tuning job with QLoRA + ZeRO-3, my training begins but then stops abruptly with the following error:
[2024-04-09 19:38:43,632] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2, reducing to 1
9%|▉ | 128/1402 [03:23<29:15, 1.38s/it]
9%|▉ | 129/1402 [03:25<29:16, 1.38s/it]
9%|▉ | 130/1402 [03:26<28:59, 1.37s/it]
9%|▉ | 131/1402 [03:29<33:51, 1.60s/it]
Traceback (most recent call last):
File "/opt/ml/code/train.py", line 397, in
main(args)
File "/opt/ml/code/train.py", line 318, in main
accelerator.backward(loss)
File "/opt/conda/lib/python3.9/site-packages/accelerate/accelerator.py", line 1995, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
self.engine.step()
File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2169, in step
self._take_model_step(lr_kwargs)
File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
self.optimizer.step()
File "/opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2024, in step
if self._overflow_check_and_loss_scale_update():
File "/opt/conda/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1972, in _overflow_check_and_loss_scale_update
self._update_scale(self.overflow)
File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2359, in _update_scale
self.loss_scaler.update_scale(has_overflow)
File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
Has anyone faced this issue? I am using Python 3.9 and torch 2.2, with the following packages:
torch==2.2.2
transformers==4.39.0
accelerate==0.28.0
deepspeed==0.14.0
datasets==2.10.1
bitsandbytes==0.43.0
trl==0.8.1
nltk
evaluate
peft==0.10.0
rouge-score
tensorboard