RuntimeError: expected scalar type Float but found Half #1233

Closed
griff4692 opened this issue Jul 14, 2021 · 17 comments

@griff4692

Hi - I'm trying to use the DeepSpeed plugin with PyTorch Lightning. My code worked before, but adding the following to the Trainer

plugins='deepspeed_stage_3_offload'

causes the error in the title. I've tried casting parameters and variables to float and half, but the error persists.

Any suggestions would be much appreciated, as I'm really looking forward to seeing what DeepSpeed can do.

I should note that the error is happening in a call to a pytorch_geometric method (if that changes anything).

deepspeed==0.4.3
pytorch-lightning==1.3.8
torch==1.9.0
torch-cluster==1.5.9
torch-geometric==1.7.1
torch-scatter==2.0.7
torch-sparse==0.6.10
torch-spline-conv==1.2.1
torchmetrics==0.3.2
torchvision==0.10.0

@tjruwase
Contributor

@griff4692, can you share the log and stack trace?

Also, can you check whether the same error happens with the deepspeed_stage_2_offload plugin?

@griff4692
Author

Yes, the same error persists regardless of stage (1, 2, or 3).

(sauce) griffin@lambda-dual:~/kabupra/graph$ python pretrain.py -debug
Num GPUs --> 1
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Starting training...
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Loaded 100 examples
Enabling DeepSpeed FP16.
You have not specified an optimizer or scheduler within the DeepSpeed config.Using configure_optimizers to define optimizer and scheduler.
Using /home/griffin/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/griffin/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.305880069732666 seconds
Using /home/griffin/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0002627372741699219 seconds
wandb: W&B syncing is set to offline in this directory. Run wandb online or set WANDB_MODE=online to enable cloud syncing.

| Name | Type | Params

0 | sent_encoder | SentBERT | 199
1 | node_encoder | NodeEncoder | 7
2 | mask_embed | Embedding | 1
3 | entity_graph_encoder | EntityGraphEncoder | 8
4 | dropout | Dropout | 0
5 | label_loss | CrossEntropyLoss | 0
6 | cui_mask_output | Linear | 2
7 | tui_mask_output | Linear | 2
8 | sg_mask_output | Linear | 2
9 | sec_mask_output | Linear | 2
10 | sent_pool_score | Linear | 2
11 | cui_proj | Linear | 2
12 | sent_proj | Linear | 2

30 Trainable params
199 Non-trainable params
229 Total params
0.000 Total estimated model params size (MB)
Loaded 100 examples
Loaded 100 examples
/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/callbacks/lr_monitor.py:97: RuntimeWarning: You are using LearningRateMonitor callback with models that have no learning rate schedulers. Please see documentation for configure_optimizers method.
rank_zero_warn(
Epoch 0: 0%| | 0/100 [00:00<?, ?it/s]Traceback (most recent call last):
File "pretrain.py", line 29, in
run(args, ReSAUCE)
File "/home/griffin/kabupra/graph/main.py", line 119, in run
trainer.fit(model)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
self._run(model)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
self.dispatch()
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
self.accelerator.start_training(self)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
return self.run_train()
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
self.train_loop.run_training_epoch()
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 434, in optimizer_step
model_ref.optimizer_step(
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 325, in optimizer_step
make_optimizer_step = self.precision_plugin.pre_optimizer_step(
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 46, in pre_optimizer_step
lambda_closure()
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 732, in train_step_and_backward_closure
result = self.training_step_and_backward(
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
training_step_output = self.trainer.accelerator.training_step(args)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
return self.training_type_plugin.training_step(*args)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 337, in training_step
return self.model(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/griffin/sauce/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1105, in forward
loss = self.module(*inputs, **kwargs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 62, in forward
return super().forward(*inputs, **kwargs)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 46, in forward
output = self.module.training_step(*inputs, **kwargs)
File "/home/griffin/kabupra/graph/models/resauce.py", line 124, in training_step
return self.shared_step(batch, is_train=True)
File "/home/griffin/kabupra/graph/models/resauce.py", line 109, in shared_step
cui_loss, cui_acc, tui_loss, tui_acc, sg_loss, sg_acc, sec_loss, sec_acc = self(**batch)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/griffin/kabupra/graph/models/resauce.py", line 83, in forward
node_states = self.dropout(self.entity_graph_encoder(node_graph_input, edge_index))
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/griffin/kabupra/graph/models/model_utils.py", line 132, in forward
x = self.elu(self.conv1(x, edge_index))
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/griffin/.local/lib/python3.8/site-packages/torch_geometric/nn/conv/gat_conv.py", line 124, in forward
x_l = x_r = self.lin_l(x).view(-1, H, C)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/linear.py", line 96, in forward
return F.linear(input, self.weight, self.bias)
File "/usr/lib/python3/dist-packages/torch/cuda/amp/autocast_mode.py", line 211, in decorate_fwd
return fwd(*args, **kwargs)
File "/home/griffin/sauce/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 60, in forward
output = input.matmul(weight.t())
RuntimeError: expected scalar type Float but found Half

wandb: Waiting for W&B process to finish, PID 318426
wandb: Program failed with code 1.
wandb: Find user logs for this run at: /home/griffin/weights/graph/default/wandb/offline-run-20210714_200232-8ub8504m/logs/debug.log
wandb: Find internal logs for this run at: /home/griffin/weights/graph/default/wandb/offline-run-20210714_200232-8ub8504m/logs/debug-internal.log
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/griffin/weights/graph/default/wandb/offline-run-20210714_200232-8ub8504m

@tjruwase
Contributor

Thanks for sharing these details. I would like to repro this problem. Can you please share the steps for me to do this?

@griff4692
Author

griff4692 commented Jul 15, 2021

Hi - yes, let me try with a toy example!

import torch
import pytorch_lightning as pl
from torch_geometric.nn import GATConv

class MyGAT(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.gat = GATConv(in_channels=5, out_channels=5)  # placeholder sizes

    def train_dataloader(self):
        pass  # TODO: return a dummy DataLoader

    def training_step(self, batch, batch_idx):
        node_features, edge_index = batch
        return self.gat(node_features, edge_index)

trainer = pl.Trainer(precision=16, gpus=torch.cuda.device_count(), plugins='deepspeed_stage_3_offload')
model = MyGAT()
trainer.fit(model)

I'm tied up right now, but a simple forward pass with GATConv should hopefully reproduce the error. It just needs a dummy dataloader for model.train_dataloader and a model.training_step(self, batch, batch_idx) that calls self.gat(dummy_node_features, dummy_edge_index).

@tjruwase
Contributor

@griff4692, got it. No rush, I can wait for your complete toy example. Thanks!

@griff4692
Author

import pytorch_lightning as pl
import torch
from torch_geometric.nn import GATConv

import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data import Dataset


class DummyDataset(Dataset):
    def __len__(self):
        return 1

    def __getitem__(self, idx):
        return {
            'h': torch.zeros(size=[10, 5]),
            'edge_index': torch.zeros(size=[2, 10]).long(),
            'y': torch.ones(size=[1,])
        }


class MyGAT(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 5)
        self.gat = GATConv(in_channels=5, out_channels=5)
        self.output = nn.Linear(5, 1)
        self.loss = nn.MSELoss()

    def shared_step(self, h, edge_index, y):
        h_proj = self.linear(h[0, :, :])
        h_conv = self.gat(h_proj, edge_index[0, :, :])
        y_pred = self.output(h_conv[0, :])
        return self.loss(y_pred, y)

    def validation_step(self, batch, batch_idx):
        return self.shared_step(**batch)

    def training_step(self, batch, batch_idx):
        return self.shared_step(**batch)

    def train_dataloader(self):
        return DataLoader(DummyDataset(), batch_size=1)

    def val_dataloader(self):
        return DataLoader(DummyDataset(), batch_size=1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


trainer = pl.Trainer(precision=16, gpus=torch.cuda.device_count(), plugins='deepspeed_stage_3_offload')
model = MyGAT()
trainer.fit(model)

@griff4692
Author

Running this actually produces a different error:

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/griffin/sauce/lib/python3.8/site-packages/torch_geometric/utils/softmax.py", line 41, in softmax
    elif index is not None:
        N = maybe_num_nodes(index, num_nodes)
        src_max = scatter(src, index, dim, dim_size=N, reduce='max')
                  ~~~~~~~ <--- HERE
        src_max = src_max.index_select(dim, index)
        out = (src - src_max).exp()
  File "/home/griffin/sauce/lib/python3.8/site-packages/torch_scatter/scatter.py", line 161, in scatter
        return scatter_min(src, index, dim, out, dim_size)[0]
    elif reduce == 'max':
        return scatter_max(src, index, dim, out, dim_size)[0]
               ~~~~~~~~~~~ <--- HERE
    else:
        raise ValueError
  File "/home/griffin/sauce/lib/python3.8/site-packages/torch_scatter/scatter.py", line 73, in scatter_max
        out: Optional[torch.Tensor] = None,
        dim_size: Optional[int] = None) -> Tuple[torch.Tensor, torch.Tensor]:
    return torch.ops.torch_scatter.scatter_max(src, index, dim, out, dim_size)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: "scatter" not implemented for 'Half'

I've been able to reproduce this on my more complex model just by calling half() on the inputs to the GATConv model, and I've confirmed that this behavior is expected:

pyg-team/pytorch_geometric#2866

Maybe I just can't use deepspeed for this particular application or need to downgrade to earlier versions of either deepspeed or geometric for compatibility.
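
For completeness, here is a minimal standalone repro of the underlying torch_scatter limitation (a sketch assuming a CUDA device; it matches the trace above and the behavior confirmed in the pyg issue):

import torch
from torch_scatter import scatter_max

src = torch.randn(6, device='cuda').half()
index = torch.tensor([0, 0, 1, 1, 2, 2], device='cuda')

# Fails with: RuntimeError: "scatter" not implemented for 'Half'
out, argmax = scatter_max(src, index)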

@tjruwase
Contributor

I don't have much experience with TorchScript, but I am curious whether the original issue can be repro'd without TorchScript. I suspect that in that case an appropriate cast would be the fix, but we can't know for sure unless we get a repro.

@griff4692
Author

Does DeepSpeed need everything to be half? If that's the case, it seems incompatible with torch_scatter.

@tjruwase
Contributor

@griff4692, not at all. DeepSpeed can work with fp32 or fp16, depending on the configuration. The problem here is that DeepSpeed does not attempt any automatic casting in the case of mixed-precision training. For example, can you try running in full fp32 by disabling fp16 in your deepspeed config? You can see docs for fp16 configuration here.
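
For reference, the relevant knob is just the fp16 section of the config. As a Python dict (a minimal sketch; train_batch_size and the ZeRO stage are placeholders for whatever you are already using):

ds_config = {
    "train_batch_size": 1,
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": False},  # run the engine in full fp32
}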

@griff4692
Author

In PyTorch Lightning, if you use fp32 with the DeepSpeed plugin, you get the following error:

Traceback (most recent call last):
  File "finetune.py", line 27, in <module>
    run(args, PureSAUCE)
  File "/home/griffin/kabupra/graph/main.py", line 119, in run
    trainer.fit(model)
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self._run(model)
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 755, in _run
    self.pre_dispatch()
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 780, in pre_dispatch
    self.accelerator.pre_dispatch(self)
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 108, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 251, in pre_dispatch
    self.init_deepspeed()
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 256, in init_deepspeed
    self._format_config()
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 413, in _format_config
    self._format_precision_config()
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 461, in _format_precision_config
    raise MisconfigurationException("To use DeepSpeed ZeRO Optimization, you must set precision=16.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: To use DeepSpeed ZeRO Optimization, you must set precision=16.

I'll look into circumventing this.

@griff4692
Author

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/plugins/training_type/deepspeed.py

It looks like you can rewrite the config and it may work, but I'm not sure how it interacts with all the other settings. A rough sketch of what I mean:
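
(This is an untested sketch: DeepSpeedPlugin takes a config argument in Lightning 1.3, but the _format_precision_config check above may still reject precision=32 with ZeRO.)

import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
    },
    "fp16": {"enabled": False},  # full fp32 so torch_scatter never sees Half
}

trainer = pl.Trainer(gpus=1, plugins=[DeepSpeedPlugin(config=ds_config)])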

@SeanNaren
Contributor

Thanks for checking this out @tjruwase, appreciate it!

Lightning should be updated to allow FP32 support; let me try to make a branch for us to try, @griff4692!

@griff4692
Author

Hi - I was able to manually call half() on a few tensors and got things working. However, I now get the following error:

RuntimeError: Function LinearFunctionForZeroStage3Backward returned an invalid gradient at index 0 - got [14, 96] but expected shape compatible with [14, 96, 768]

This error only occurs with DeepSpeed, so maybe something fishy is going on with all my manual interventions.

@SeanNaren
Contributor

We have a PR for getting FP32 support on the Lightning side: Lightning-AI/pytorch-lightning#8462

@griff4692 I'll sync with you offline about the issue with half() not being called appropriately on your tensors!

@tjruwase
Contributor

@SeanNaren, thanks for helping out on the Lightning side. Can you both please keep me in the loop if there are any issues to fix on the DeepSpeed side in order to close this? Thanks!

@tjruwase
Contributor

Closing as this seems to have been fixed on the Lightning side.
