
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index' #1218

Closed
TianhaoFu opened this issue Jul 12, 2021 · 29 comments

@TianhaoFu commented Jul 12, 2021

Hi,
I want to use DeepSpeed to speed up my transformer model, and I ran into the following problem:

  File "main.py", line 460, in <module>
    main(args)
  File "main.py", line 392, in main
    train_stats = train_one_epoch(
  File "/opt/ml/code/deepspeed/engine.py", line 57, in train_one_epoch
    loss_scaler(loss, optimizer, clip_grad=clip_grad, clip_mode=clip_mode,
  File "/usr/local/lib/python3.8/dist-packages/timm/utils/cuda.py", line 43, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph)
  File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 661, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 1104, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 724, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

My config.json is as follows:

{
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu":1,
  "steps_per_print": 100,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00001,
      "weight_decay": 1e-2
    }
  },
  "flops_profiler": {
    "enabled": false,
    "profile_step": 100,
    "module_depth": -1,
    "top_modules": 3,
    "detailed": true
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 18,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
      "stage": 1,
      "cpu_offload": false,
      "contiguous_gradients": true,
      "overlap_comm": true,
      "reduce_scatter": true,
      "reduce_bucket_size":1e8,
      "allgather_bucket_size": 5e8

  },
  "activation_checkpointing": {
      "partition_activations": false,
      "contiguous_memory_optimization": false,
      "cpu_checkpointing": false
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "zero_allow_untested_optimizer": true
}
@jeffra jeffra self-assigned this Jul 14, 2021
@jeffra (Collaborator) commented Jul 14, 2021

Hi @TianhaoFu, can you share your ds_report output with me? I'm curious which DeepSpeed version or commit hash you are on; I'm trying to reproduce your issue.

Also, if this issue is quick to reproduce, can you try with "stage": 2 as well?

@chrjxj commented Jul 23, 2021

config:

{
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": true
  },

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  "train_batch_size": 8,
  "steps_per_print": 4000,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "adam_w_mode": true,
      "betas": [
        0.8,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 10000,
      "total_num_steps": 100000
    }
  },
  "wall_clock_breakdown": false
}



I get this error:

  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 251, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 146, in backward
    Variable._execution_engine.run_backward(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 664, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1109, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 726, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

Env:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.6/site-packages/torch']
torch version .................... 1.8.0a0+17f8c32
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.4.4+6ba9628, 6ba9628, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1

@jeffra (Collaborator) commented Jul 23, 2021

Hi @chrjxj, can you try changing "stage": 1 in your config JSON to "stage": 2? I want to confirm whether your issue occurs with both ZeRO stages. I have not been able to reproduce the error on my side yet.

@jeffra (Collaborator) commented Jul 23, 2021

Actually, @chrjxj, can you set both of these to false in your config? I suspect this will fix your issue:

      "contiguous_gradients": false,
      "overlap_comm": false,

@chrjxj commented Jul 25, 2021

@jeffra thanks. It still doesn't work and throws a new error message.

@jeffra (Collaborator) commented Jul 26, 2021

@chrjxj, can you provide the stack trace for the new error message?

@ant-louis

Hi @chrjxj, did you find a solution?

@jeffra (Collaborator) commented Jun 1, 2022

@antoiloui, are you also seeing this error? Can you share the deepspeed version you are using and the stack trace? Did you also try turning off contiguous_gradients and overlap_comm?

@ant-louis

Hi @jeffra, yes I'm experiencing the same issue. Here is the error I get:

  File "/root/envs/star/lib/python3.8/site-packages/grad_cache/grad_cache.py", line 242, in forward_backward
    surrogate.backward()
  File "/root/envs/star/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/envs/star/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 769, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1250, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 826, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

And here is my config file:

{
    "zero_optimization": {
       "stage": 2,
       "offload_optimizer": {
           "device": "cpu",
           "pin_memory": true
       },
       "allgather_partitions": true,
       "allgather_bucket_size": 2e8,
       "reduce_scatter": true,
       "reduce_bucket_size": 2e8,
       "overlap_comm": false,
       "contiguous_gradients": false
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

@jeffra (Collaborator) commented Jun 1, 2022

Gotcha, I see. Thank you @antoiloui. What version of deepspeed are you running?

Is it possible to provide a repro for this error that you're seeing?

@chrjxj commented Jun 20, 2022

Hi @chrjxj, did you find a solution?

No, I switched to other tasks.

@heojeongyun commented Mar 8, 2023

Hasn't this problem been solved? I'm currently facing a similar error. I'm using FusedAdam as the optimizer, so I'm not using the FP16 option, but the error is similar.

Here is the error I get:

Traceback (most recent call last):
  File "/root/QuickDraw/train.py", line 244, in <module>
    train(opt)
  File "/root/QuickDraw/train.py", line 165, in train
    torch.autograd.backward(loss)
  File "/project/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 857, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1349, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 902, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index' 

This is my deepspeed_config file:

{
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true
    },


    "steps_per_print": 1,

    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    }
}

@heojeongyun commented

"stage": 2 > "stage":1
Solved
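
As a rough sketch, that change against the config above would look like this (Python-dict form, with ds_config as a placeholder name; the offload sections are omitted here since offload support differs by stage and version, and the remaining keys are assumed unchanged):

# Sketch only: the config above with ZeRO switched from stage 2 to stage 1.
ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 1,                  # was 2
        "overlap_comm": True,
        "contiguous_gradients": True
    },
    "steps_per_print": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 0.001}}
}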

@AlexKay28

Well, let me join this thread too. I have the same issue as described above.

The code I run can be found here:
https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/train.py

The configuration I use:

{
    "zero_allow_untested_optimizer": True,
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "allgather_partitions": True,
        "reduce_scatter": True,
        "allgather_bucket_size": 200000000,
        "reduce_bucket_size": 200000000,
        "sub_group_size": 1000000000000,
    },
    "activation_checkpointing": {
        "partition_activations": False,
        "cpu_checkpointing": False,
        "contiguous_memory_optimization": False,
        "synchronize_checkpoint_boundary": False,
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
        "thread_count": 1,
    },
    "gradient_clipping": 1.0,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
}

Traceback:

Traceback (most recent call last):
  File "train.py", line 367, in <module>
    trainer.run(m_cfg, train_dataset, None, tconf)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 433, in _run_impl
    return self._strategy.launcher.launch(run_method, *args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 443, in _run_with_setup
    return run_method(*args, **kwargs)
  File "/home/alexkay28/RWKV-LM/RWKV-v4/src/trainer.py", line 177, in run
    run_epoch('train')
  File "/home/alexkay28/RWKV-LM/RWKV-v4/src/trainer.py", line 129, in run_epoch
    self.backward(loss)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 260, in backward
    self._precision.backward(tensor, module, *args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/plugins/precision/precision.py", line 68, in backward
    tensor.backward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 482, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 804, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1252, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 847, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

@heojeongyun commented

Try changing "stage" from 2 to 1 in the configuration. Does the same problem still occur? My understanding is that higher stages improve efficiency for large models by partitioning more state, but in my case this change solved the problem.

The official documentation describes the stage setting as follows:

Chooses different stages of ZeRO Optimizer. Stage 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, optimizer+gradient state partitioning, and optimizer+gradient+parameter partitioning, respectively.

@AlexKay28

I solved my problem by choosing the right combination of Python version and package versions.
In case someone is interested:

  • python 3.8
  • torch==2.0.0
  • deepspeed==0.9.1
  • pytorch-lightning==1.9.1

You can see in my traceback that I was running DeepSpeed through the pytorch-lightning interface. I also experimented with the predefined configurations from Lightning, like "deepspeed_strategy_2" and "deepspeed_strategy_3", and I got the same error every time, so I guess I just had a version-compatibility problem.

@Chain-Mao

This method doesn't solve my problem. I am also studying RWKV. Can you help me?
My problem is:
Traceback (most recent call last):
  File "/data1/RWKV-LM/RWKV-v4/train.py", line 280, in <module>
    trainer.run(m_cfg, train_dataset, None, tconf)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/fabric.py", line 628, in _run_impl
    return self._strategy.launcher.launch(run_method, *args, **kwargs)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/fabric.py", line 638, in _run_with_setup
    return run_function(*args, **kwargs)
  File "/data1/RWKV-LM/RWKV-v4/src/trainer.py", line 177, in run
    run_epoch('train')
  File "/data1/RWKV-LM/RWKV-v4/src/trainer.py", line 129, in run_epoch
    self.backward(loss)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/fabric.py", line 359, in backward
    self._precision.backward(tensor, module, *args, **kwargs)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/plugins/precision/precision.py", line 73, in backward
    tensor.backward(*args, **kwargs)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 804, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1252, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 847, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

@AlexKay28

@maomao279 Have you tried v4neo? Also, are you sure you used the same versions during the run, and which CUDA version do you use? (Not sure the last one is important; just want to know.)

@dabney777 commented Jun 26, 2023

I got the same issue, but fixed it by removing a redundant backward call.

        outputs = model_engine(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # loss.backward()   # remove this line
        model_engine.backward(loss)
        model_engine.step()

And this code came from ChatGPT, so it is excusable.
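
A likely explanation, based on the stage_1_and_2.py frames in the tracebacks above (not verified against every DeepSpeed release): model_engine.backward(loss) first has the ZeRO optimizer allocate its ipg_buffer and set ipg_index, and the gradient-reduction hooks then index into that buffer, so a bare loss.backward() fires those hooks before that setup has happened and trips over the missing ipg_index. A minimal sketch of a training step that lets the engine own backward and step, assuming model_engine comes from deepspeed.initialize and a HuggingFace-style model whose output carries a .loss field (the dataloader and batch fields are placeholders):

# Sketch only: the engine drives backward and step; no direct loss.backward().
for input_ids, attention_mask, labels in train_loader:   # placeholder dataloader
    outputs = model_engine(input_ids=input_ids,
                           attention_mask=attention_mask,
                           labels=labels)
    loss = outputs.loss
    model_engine.backward(loss)   # sets up ZeRO's ipg buffers, then runs autograd
    model_engine.step()           # optimizer step; gradient zeroing is handled by the engine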

@SophieOstmeier

Has anybody found a solution other than using different package versions or switching to stage 1? Unfortunately I need stage 2 to work and cannot downgrade the package versions due to dependencies.
Help is really appreciated.

@catqaq commented Sep 17, 2023

Any ideas? I got a similar bug: AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

@workingloong commented Dec 13, 2023

I solved it by using DeepSpeedEngine.backward(loss) and DeepSpeedEngine.step() instead of the native torch loss.backward() and optimizer.step().

@tjruwase (Contributor)

I solved it by using DeepSpeedEngine.backward(loss) and DeepSpeedEngine.step() instead of the native torch loss.backward() and optimizer.step().

Thanks for sharing this update. Can you clarify that you were seeing the same error as the original post?

Also, was your code following this guide for model porting: https://www.deepspeed.ai/getting-started/#writing-deepspeed-models
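
For context, this is roughly the pattern that guide describes, as a sketch; cmd_args, model, and trainset are placeholders, and exact argument names may differ slightly between DeepSpeed versions:

import deepspeed

# Sketch only: initialize wraps the model in an engine and builds a distributed dataloader.
model_engine, optimizer, train_loader, _ = deepspeed.initialize(
    args=cmd_args,                        # placeholder: parsed args carrying the deepspeed config
    model=model,                          # placeholder: your torch.nn.Module
    model_parameters=model.parameters(),
    training_data=trainset                # placeholder: your torch Dataset
)

for step, batch in enumerate(train_loader):
    loss = model_engine(batch)            # forward through the engine
    model_engine.backward(loss)           # use the engine's backward, not loss.backward()
    model_engine.step()                   # use the engine's step, not optimizer.step()

The key point, per that guide, is that backward and step go through the engine rather than the raw loss/optimizer.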

@sneha4948 commented Oct 2, 2024

@jeffra I'm getting a similar error:

Traceback (most recent call last):
  File "/opt/program/user_scripts/entry_script.py", line 655, in <module>
    result = trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 578, in safe_patch_function
    patch_function(call_original, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/mlflow/transformers/__init__.py", line 2725, in train
    return original(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 559, in call_original
    return call_original_fn_with_event_logging(_original_fn, og_args, og_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 494, in call_original_fn_with_event_logging
    original_fn_result = original_fn(*og_args, **og_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 556, in _original_fn
    original_result = original(*_og_args, **_og_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2388, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3477, in training_step
    self.optimizer.train()
  File "/opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py", line 128, in train
    return self.optimizer.train()
AttributeError: 'DeepSpeedZeroOptimizer_Stage3' object has no attribute 'train'

However, the same code was working a few weeks ago but throws this error now. I have checked with previous versions of deepspeed as well, but I keep getting this error. The first time I got it was when I tried to call loss.backward() manually, but even after removing those lines and going back to the previous version of the code, I am still getting it.

@ryuzakace

@sneha4948 any success?

@tjruwase (Contributor) commented Oct 7, 2024

@sneha4948 and @ryuzakace thanks for reporting this problem.

  1. Can you please confirm that the above fixes, here and here, do not apply to your case?
  2. I think it is better to create a new ticket and close this one. The reason is that this ticket is very old, the original problem seems to be fixed, and it is unclear which repro should be used for investigation.

@sneha4948

@sneha4948 any success?

Hey yes, it was perhaps due to a version incompatibility. On 25th September, newer versions of transformers were released. Setting transformers==4.44.2 resolved the deepspeed issue.

@tjruwase (Contributor) commented Oct 7, 2024

@sneha4948, thanks for the response. I am closing this issue now. Please open a new ticket if needed.

@tjruwase tjruwase closed this as completed Oct 7, 2024
@yxteo2 commented Oct 29, 2024

I solved it by using DeepSpeedEngine.backward(loss) and DeepSpeedEngine.step() instead of the native torch loss.backward() and optimizer.step().

This solved my problem as well.
Thanks a lot for the help!
