
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index' #1218

Closed
TianhaoFu opened this issue Jul 12, 2021 · 29 comments

@TianhaoFu commented Jul 12, 2021

Hi,
I want to use DeepSpeed to speed up my transformer model, and I ran into the following problem:

  File "main.py", line 460, in <module>
    main(args)
  File "main.py", line 392, in main
    train_stats = train_one_epoch(
  File "/opt/ml/code/deepspeed/engine.py", line 57, in train_one_epoch
    loss_scaler(loss, optimizer, clip_grad=clip_grad, clip_mode=clip_mode,
  File "/usr/local/lib/python3.8/dist-packages/timm/utils/cuda.py", line 43, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph)
  File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 661, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 1104, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 724, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

My config.json is as follows:

{
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu":1,
  "steps_per_print": 100,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00001,
      "weight_decay": 1e-2
    }
  },
  "flops_profiler": {
    "enabled": false,
    "profile_step": 100,
    "module_depth": -1,
    "top_modules": 3,
    "detailed": true
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 18,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
      "stage": 1,
      "cpu_offload": false,
      "contiguous_gradients": true,
      "overlap_comm": true,
      "reduce_scatter": true,
      "reduce_bucket_size":1e8,
      "allgather_bucket_size": 5e8

  },
  "activation_checkpointing": {
      "partition_activations": false,
      "contiguous_memory_optimization": false,
      "cpu_checkpointing": false
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "zero_allow_untested_optimizer": true
}
@jeffra jeffra self-assigned this Jul 14, 2021
@jeffra (Collaborator) commented Jul 14, 2021

Hi @TianhaoFu, can you share your ds_report output with me? I'm curious which DeepSpeed version or commit hash you are on; I'm trying to reproduce your issue.

Also, if this issue is quick to reproduce, can you try with "stage": 2 as well?

@chrjxj commented Jul 23, 2021

config:

{
  "zero_optimization": {
    "stage": 1,
    "overlap_comm": true
  },

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  "train_batch_size": 8,
  "steps_per_print": 4000,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001,
      "adam_w_mode": true,
      "betas": [
        0.8,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.001,
      "warmup_num_steps": 10000,
      "total_num_steps": 100000
    }
  },
  "wall_clock_breakdown": false
}



I get this error:

  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 251, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 146, in backward
    Variable._execution_engine.run_backward(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 664, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1109, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 726, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

Env:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing. Can be fixed by: `apt install libaio-dev`.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.6/site-packages/torch']
torch version .................... 1.8.0a0+17f8c32
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/opt/conda/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.4.4+6ba9628, 6ba9628, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1

@jeffra (Collaborator) commented Jul 23, 2021

Hi @chrjxj, can you try changing "stage": 1 in your config JSON to "stage": 2? I want to confirm whether your issue occurs with both ZeRO stages. I have not been able to reproduce the error on my side yet.

@jeffra (Collaborator) commented Jul 23, 2021

Actually, @chrjxj, can you set both of these to false in your config? I suspect this will fix your issue:

      "contiguous_gradients": false,
      "overlap_comm": false,

@chrjxj commented Jul 25, 2021

@jeffra thanks. It still doesn't work and throws a new error message.

@jeffra (Collaborator) commented Jul 26, 2021

@chrjxj, can you provide the stack trace for the new error message?

@ant-louis

Hi @chrjxj, did you find a solution?

@jeffra (Collaborator) commented Jun 1, 2022

@antoiloui, are you also seeing this error? Can you share the deepspeed version you are using and the stack trace? Did you also try turning off contiguous_gradients and overlap_comm?

@ant-louis

Hi @jeffra, yes I'm experiencing the same issue. Here is the error I get:

  File "/root/envs/star/lib/python3.8/site-packages/grad_cache/grad_cache.py", line 242, in forward_backward
    surrogate.backward()
  File "/root/envs/star/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/envs/star/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 769, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1250, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 826, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

And here is my config file:

{
    "zero_optimization": {
       "stage": 2,
       "offload_optimizer": {
           "device": "cpu",
           "pin_memory": true
       },
       "allgather_partitions": true,
       "allgather_bucket_size": 2e8,
       "reduce_scatter": true,
       "reduce_bucket_size": 2e8,
       "overlap_comm": false,
       "contiguous_gradients": false
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

@jeffra (Collaborator) commented Jun 1, 2022

Gotcha, I see. Thank you @antoiloui. What version of deepspeed are you running?

Is it possible to provide a repro for this error that you're seeing?

@chrjxj commented Jun 20, 2022

Hi @chrjxj, did you find a solution?

No, I switched to other tasks.

@heojeongyun commented Mar 8, 2023

Hasn't this problem been solved? I'm currently facing a similar error. I'm using FusedAdam as the optimizer, so I'm not using the FP16 option, but the error is similar.

Here is the error I get:

Traceback (most recent call last):
  File "/root/QuickDraw/train.py", line 244, in <module>
    train(opt)
  File "/root/QuickDraw/train.py", line 165, in train
    torch.autograd.backward(loss)
  File "/project/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 857, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1349, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 902, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index' 

This is my deepspeed_config file:

{
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true
    },


    "steps_per_print": 1,

    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    }
}

@heojeongyun commented

"stage": 2 > "stage":1
Solved
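
As a rough sketch, that change against the config above would look like this (Python-dict form, with ds_config as a placeholder name; the offload sections are omitted here since offload support differs by stage and version, and the remaining keys are assumed unchanged):

# Sketch only: the config above with ZeRO switched from stage 2 to stage 1.
ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 1,                  # was 2
        "overlap_comm": True,
        "contiguous_gradients": True
    },
    "steps_per_print": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 0.001}}
}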

@AlexKay28

Well, let me join this thread too. I have the same issue as described above.

The code I run can be found here:
https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/train.py

The configuration I use:

{
    "zero_allow_untested_optimizer": True,
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "allgather_partitions": True,
        "reduce_scatter": True,
        "allgather_bucket_size": 200000000,
        "reduce_bucket_size": 200000000,
        "sub_group_size": 1000000000000,
    },
    "activation_checkpointing": {
        "partition_activations": False,
        "cpu_checkpointing": False,
        "contiguous_memory_optimization": False,
        "synchronize_checkpoint_boundary": False,
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
        "thread_count": 1,
    },
    "gradient_clipping": 1.0,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
}

Traceback:

Traceback (most recent call last):
  File "train.py", line 367, in <module>
    trainer.run(m_cfg, train_dataset, None, tconf)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 433, in _run_impl
    return self._strategy.launcher.launch(run_method, *args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 443, in _run_with_setup
    return run_method(*args, **kwargs)
  File "/home/alexkay28/RWKV-LM/RWKV-v4/src/trainer.py", line 177, in run
    run_epoch('train')
  File "/home/alexkay28/RWKV-LM/RWKV-v4/src/trainer.py", line 129, in run_epoch
    self.backward(loss)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 260, in backward
    self._precision.backward(tensor, module, *args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/plugins/precision/precision.py", line 68, in backward
    tensor.backward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 482, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 804, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1252, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 847, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

@heojeongyun commented

Try changing "stage" from 2 to 1 in the configuration. Does the same problem still occur? My understanding is that higher stages improve efficiency for large models by partitioning more state, but in my case this change solved the problem.

The official documentation describes the stage setting as follows:

Chooses different stages of ZeRO Optimizer. Stage 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, optimizer+gradient state partitioning, and optimizer+gradient+parameter partitioning, respectively.

@AlexKay28

I solved my problem by choosing the right combination of Python version and package versions.
In case someone is interested:

  • python 3.8
  • torch==2.0.0
  • deepspeed==0.9.1
  • pytorch-lightning==1.9.1

You can see in my traceback that I was running DeepSpeed through the pytorch-lightning interface. I also experimented with the predefined configurations from Lightning, like "deepspeed_strategy_2" and "deepspeed_strategy_3", and I got the same error every time, so I guess I just had a version-compatibility problem.

@Chain-Mao

This method doesn't solve my problem. I am also studying RWKV. Can you help me?
My problem is:
Traceback (most recent call last):
  File "/data1/RWKV-LM/RWKV-v4/train.py", line 280, in <module>
    trainer.run(m_cfg, train_dataset, None, tconf)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/fabric.py", line 628, in _run_impl
    return self._strategy.launcher.launch(run_method, *args, **kwargs)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/fabric.py", line 638, in _run_with_setup
    return run_function(*args, **kwargs)
  File "/data1/RWKV-LM/RWKV-v4/src/trainer.py", line 177, in run
    run_epoch('train')
  File "/data1/RWKV-LM/RWKV-v4/src/trainer.py", line 129, in run_epoch
    self.backward(loss)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/fabric.py", line 359, in backward
    self._precision.backward(tensor, module, *args, **kwargs)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/lightning_fabric/plugins/precision/precision.py", line 73, in backward
    tensor.backward(*args, **kwargs)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 804, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1252, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/opt/miniconda3/envs/rwkb_py38/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 847, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

@AlexKay28

@maomao279 Have you tried v4neo? Also, are you sure you used the same versions during the run, and which CUDA version do you use? (Not sure the last one is important; just want to know.)

@dabney777 commented Jun 26, 2023

I got the same issue, but fixed it by removing a redundant backward call.

        outputs = model_engine(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # loss.backward()   # remove this line
        model_engine.backward(loss)
        model_engine.step()

And this code came from ChatGPT, so it is excusable.
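
A likely explanation, based on the stage_1_and_2.py frames in the tracebacks above (not verified against every DeepSpeed release): model_engine.backward(loss) first has the ZeRO optimizer allocate its ipg_buffer and set ipg_index, and the gradient-reduction hooks then index into that buffer, so a bare loss.backward() fires those hooks before that setup has happened and trips over the missing ipg_index. A minimal sketch of a training step that lets the engine own backward and step, assuming model_engine comes from deepspeed.initialize and a HuggingFace-style model whose output carries a .loss field (the dataloader and batch fields are placeholders):

# Sketch only: the engine drives backward and step; no direct loss.backward().
for input_ids, attention_mask, labels in train_loader:   # placeholder dataloader
    outputs = model_engine(input_ids=input_ids,
                           attention_mask=attention_mask,
                           labels=labels)
    loss = outputs.loss
    model_engine.backward(loss)   # sets up ZeRO's ipg buffers, then runs autograd
    model_engine.step()           # optimizer step; gradient zeroing is handled by the engine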

@SophieOstmeier

Has anybody found a solution other than using different package versions or switching to stage 1? Unfortunately I need stage 2 to work and cannot downgrade the package versions due to dependencies.
Help is really appreciated.

@catqaq commented Sep 17, 2023

Any ideas? I got a similar bug: AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

@workingloong commented Dec 13, 2023

I solved it by using DeepSpeedEngine.backward(loss) and DeepSpeedEngine.step() instead of the native torch loss.backward() and optimizer.step().

@tjruwase (Contributor)

I solved it by using DeepSpeedEngine.backward(loss) and DeepSpeedEngine.step() instead of the native torch loss.backward() and optimizer.step().

Thanks for sharing this update. Can you clarify that you were seeing the same error as the original post?

Also, was your code following this guide for model porting: https://www.deepspeed.ai/getting-started/#writing-deepspeed-models
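
For context, this is roughly the pattern that guide describes, as a sketch; cmd_args, model, and trainset are placeholders, and exact argument names may differ slightly between DeepSpeed versions:

import deepspeed

# Sketch only: initialize wraps the model in an engine and builds a distributed dataloader.
model_engine, optimizer, train_loader, _ = deepspeed.initialize(
    args=cmd_args,                        # placeholder: parsed args carrying the deepspeed config
    model=model,                          # placeholder: your torch.nn.Module
    model_parameters=model.parameters(),
    training_data=trainset                # placeholder: your torch Dataset
)

for step, batch in enumerate(train_loader):
    loss = model_engine(batch)            # forward through the engine
    model_engine.backward(loss)           # use the engine's backward, not loss.backward()
    model_engine.step()                   # use the engine's step, not optimizer.step()

The key point, per that guide, is that backward and step go through the engine rather than the raw loss/optimizer.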

@sneha4948 commented Oct 2, 2024

@jeffra I'm getting a similar error:

Traceback (most recent call last):
  File "/opt/program/user_scripts/entry_script.py", line 655, in <module>
    result = trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 578, in safe_patch_function
    patch_function(call_original, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/mlflow/transformers/__init__.py", line 2725, in train
    return original(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 559, in call_original
    return call_original_fn_with_event_logging(_original_fn, og_args, og_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 494, in call_original_fn_with_event_logging
    original_fn_result = original_fn(*og_args, **og_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 556, in _original_fn
    original_result = original(*_og_args, **_og_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2388, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3477, in training_step
    self.optimizer.train()
  File "/opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py", line 128, in train
    return self.optimizer.train()
AttributeError: 'DeepSpeedZeroOptimizer_Stage3' object has no attribute 'train'

However, the same code was working a few weeks ago but throws this error now. I have checked with previous versions of deepspeed as well, but I keep getting this error. The first time I got it was when I tried to call loss.backward() manually, but even after removing those lines and going back to the previous version of the code, I am still getting it.

@ryuzakace

@sneha4948 any success?

@tjruwase (Contributor) commented Oct 7, 2024

@sneha4948 and @ryuzakace thanks for reporting this problem.

  1. Can you please confirm that the above fixes, here and here, do not apply to your case?
  2. I think it is better to create a new ticket and close this one. The reason is that this ticket is very old, the original problem seems to be fixed, and it is unclear which repro should be used for investigation.

@sneha4948

@sneha4948 any success?

Hey yes, it was perhaps due to a version incompatibility. On 25th September, newer versions of transformers were released. Setting transformers==4.44.2 resolved the deepspeed issue.

@tjruwase (Contributor) commented Oct 7, 2024

@sneha4948, thanks for the response. I am closing this issue now. Please open a new ticket if needed.

@tjruwase tjruwase closed this as completed Oct 7, 2024
@yxteo2 commented Oct 29, 2024

I solved it by using DeepSpeedEngine.backward(loss) and DeepSpeedEngine.step() instead of the native torch loss.backward() and optimizer.step().

This solved my problem as well.
Thanks a lot for the help!
