
You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. #722

Closed
bosmart opened this issue Aug 31, 2023 · 8 comments


bosmart commented Aug 31, 2023

Getting the above error (You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on.) when trying to run the Llama2 SFT example:

accelerate launch sft_llama2.py --output_dir="sft"

My accelerate config file:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Library versions:

accelerate: 0.23.0.dev0
peft: 0.6.0.dev0
transformers: 4.33.0.dev0
trl: 0.7.2.dev0

I have a dual 3090 machine.

younesbelkada (Contributor) commented:

Hi @bosmart
Thanks a lot for the issue!
Can you please have a look at my comment here: huggingface/accelerate#1840 (comment) to understand how to fix the issue and let me know how it goes?
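
For context, a minimal sketch of the fix described in that linked comment: load the quantized model onto each process's own GPU instead of hard-coding device 0. The checkpoint name and 8-bit settings below are illustrative, not the exact arguments of the example script.

from accelerate import Accelerator
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Each process loads the model onto its own GPU rather than all onto cuda:0.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative checkpoint
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map={"": Accelerator().local_process_index},
)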


bosmart commented Sep 1, 2023

Thanks @younesbelkada, makes a lot of sense now - device_map={"": 0} did make me a bit uneasy 😃

I am getting a new error now however:

File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1837, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2693, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1923, in backward
    loss.backward(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 127 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.


bosmart commented Sep 2, 2023

@lewtun I'm not even using RewardTrainer, getting the error with SFTTrainer. Disabling checkpointing helps to an extent - now getting CUDA out of memory instead 🤦‍♂️

Is disabling checkpointing just a workaround or is there a reason why peft+ddp+4bit can't work with checkpointing enabled?


github-actions bot commented Nov 1, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

younesbelkada (Contributor) commented:

Hi @bosmart
This should now be fixed on TRL + PEFT + transformers main; please refer to my comment here: #891 (comment).
The trick is to pass use_reentrant=False when enabling gradient checkpointing.
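
As an illustration, a sketch of how that flag can be passed, assuming a transformers version recent enough to accept gradient_checkpointing_kwargs in TrainingArguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sft",
    gradient_checkpointing=True,
    # Non-reentrant checkpointing avoids the DDP "marked a variable ready
    # only once" error when combining PEFT + DDP + gradient checkpointing.
    gradient_checkpointing_kwargs={"use_reentrant": False},
)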


bbouldin commented Nov 2, 2023

For anyone getting the "You can't train" error in dpo_llama2.py, you can fix it by adding the following to the configs for the model and model-ref:
device_map={"": Accelerator().local_process_index},
and also adding the import:
from accelerate import Accelerator
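
In context, this presumably means passing the per-process device map to both from_pretrained calls in the script; a rough sketch (the model name and other arguments are illustrative, not the script's exact configuration):

from accelerate import Accelerator
from transformers import AutoModelForCausalLM

device_map = {"": Accelerator().local_process_index}

# The policy model and the reference model each get the per-process device map.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map=device_map)
model_ref = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map=device_map)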

github-actions bot commented:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

joann-alvarez commented:

@bbouldin

For anyone getting the "You can't train" error in dpo_llama2.py, you can fix it by adding the following to the configs for the model and model-ref: device_map={"": Accelerator().local_process_index}, and also adding the import: from accelerate import Accelerator

Sorry, what do you mean by configs for the model and model-ref?

I know I can include device_map as an argument to AutoModelForCausalLM.from_pretrained(), but I'm not sure where else it needs to be specified. Thanks!
