RuntimeError: expected scalar type Float but found Half #1233

Closed
griff4692 opened this issue Jul 14, 2021 · 17 comments

@griff4692

Hi - I'm trying to use the DeepSpeed plugin with PyTorch Lightning. My code worked before, but adding the following to the Trainer

plugins='deepspeed_stage_3_offload'

causes the error in the title. I've tried casting parameters and variables to float and half, but the error persists.

Any suggestions would be much appreciated, as I'm really looking forward to seeing what DeepSpeed can do.

I should note that the error is happening in a call to a pytorch_geometric method (if that changes anything).

deepspeed==0.4.3
pytorch-lightning==1.3.8
torch==1.9.0
torch-cluster==1.5.9
torch-geometric==1.7.1
torch-scatter==2.0.7
torch-sparse==0.6.10
torch-spline-conv==1.2.1
torchmetrics==0.3.2
torchvision==0.10.0

@tjruwase
Contributor

@griff4692, can you share the log and stack trace?

Also, can you check whether the same error happens with the deepspeed_stage_2_offload plugin?

@griff4692
Author

Yes, the same error persists regardless of stage (1, 2, or 3).

(sauce) griffin@lambda-dual:~/kabupra/graph$ python pretrain.py -debug
Num GPUs --> 1
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Starting training...
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Loaded 100 examples
Enabling DeepSpeed FP16.
You have not specified an optimizer or scheduler within the DeepSpeed config.Using configure_optimizers to define optimizer and scheduler.
Using /home/griffin/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/griffin/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.305880069732666 seconds
Using /home/griffin/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0002627372741699219 seconds
wandb: W&B syncing is set to offline in this directory. Run wandb online or set WANDB_MODE=online to enable cloud syncing.

| Name | Type | Params

0 | sent_encoder | SentBERT | 199
1 | node_encoder | NodeEncoder | 7
2 | mask_embed | Embedding | 1
3 | entity_graph_encoder | EntityGraphEncoder | 8
4 | dropout | Dropout | 0
5 | label_loss | CrossEntropyLoss | 0
6 | cui_mask_output | Linear | 2
7 | tui_mask_output | Linear | 2
8 | sg_mask_output | Linear | 2
9 | sec_mask_output | Linear | 2
10 | sent_pool_score | Linear | 2
11 | cui_proj | Linear | 2
12 | sent_proj | Linear | 2

30 Trainable params
199 Non-trainable params
229 Total params
0.000 Total estimated model params size (MB)
Loaded 100 examples
Loaded 100 examples
/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/callbacks/lr_monitor.py:97: RuntimeWarning: You are using LearningRateMonitor callback with models that have no learning rate schedulers. Please see documentation for configure_optimizers method.
rank_zero_warn(
Epoch 0: 0%| | 0/100 [00:00<?, ?it/s]Traceback (most recent call last):
File "pretrain.py", line 29, in
run(args, ReSAUCE)
File "/home/griffin/kabupra/graph/main.py", line 119, in run
trainer.fit(model)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
self._run(model)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
self.dispatch()
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
self.accelerator.start_training(self)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
return self.run_train()
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
self.train_loop.run_training_epoch()
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 434, in optimizer_step
model_ref.optimizer_step(
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 325, in optimizer_step
make_optimizer_step = self.precision_plugin.pre_optimizer_step(
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 46, in pre_optimizer_step
lambda_closure()
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 732, in train_step_and_backward_closure
result = self.training_step_and_backward(
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
training_step_output = self.trainer.accelerator.training_step(args)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
return self.training_type_plugin.training_step(*args)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 337, in training_step
return self.model(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/griffin/sauce/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1105, in forward
loss = self.module(*inputs, **kwargs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 62, in forward
return super().forward(*inputs, **kwargs)
File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 46, in forward
output = self.module.training_step(*inputs, **kwargs)
File "/home/griffin/kabupra/graph/models/resauce.py", line 124, in training_step
return self.shared_step(batch, is_train=True)
File "/home/griffin/kabupra/graph/models/resauce.py", line 109, in shared_step
cui_loss, cui_acc, tui_loss, tui_acc, sg_loss, sg_acc, sec_loss, sec_acc = self(**batch)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/griffin/kabupra/graph/models/resauce.py", line 83, in forward
node_states = self.dropout(self.entity_graph_encoder(node_graph_input, edge_index))
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/griffin/kabupra/graph/models/model_utils.py", line 132, in forward
x = self.elu(self.conv1(x, edge_index))
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/griffin/.local/lib/python3.8/site-packages/torch_geometric/nn/conv/gat_conv.py", line 124, in forward
x_l = x_r = self.lin_l(x).view(-1, H, C)
File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "/usr/lib/python3/dist-packages/torch/nn/modules/linear.py", line 96, in forward
return F.linear(input, self.weight, self.bias)
File "/usr/lib/python3/dist-packages/torch/cuda/amp/autocast_mode.py", line 211, in decorate_fwd
return fwd(*args, **kwargs)
File "/home/griffin/sauce/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 60, in forward
output = input.matmul(weight.t())
RuntimeError: expected scalar type Float but found Half

wandb: Waiting for W&B process to finish, PID 318426
wandb: Program failed with code 1.
wandb: Find user logs for this run at: /home/griffin/weights/graph/default/wandb/offline-run-20210714_200232-8ub8504m/logs/debug.log
wandb: Find internal logs for this run at: /home/griffin/weights/graph/default/wandb/offline-run-20210714_200232-8ub8504m/logs/debug-internal.log
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/griffin/weights/graph/default/wandb/offline-run-20210714_200232-8ub8504m

@tjruwase
Contributor

Thanks for sharing these details. I would like to repro this problem. Can you please share the steps for me to do this?

@griff4692
Author

griff4692 commented Jul 15, 2021

Hi - yes, let me try with a toy example!

import torch
import pytorch_lightning as pl
from torch_geometric.nn import GATConv

class MyGAT(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.gat = GATConv(in_channels=5, out_channels=5)  # placeholder sizes

    def train_dataloader(self):
        pass  # TODO: return a dummy DataLoader

    def training_step(self, batch, batch_idx):
        node_features, edge_index = batch
        return self.gat(node_features, edge_index)

trainer = pl.Trainer(precision=16, gpus=torch.cuda.device_count(), plugins='deepspeed_stage_3_offload')
model = MyGAT()
trainer.fit(model)

I'm tied up right now, but a simple forward pass with GATConv should hopefully reproduce the error. It just needs a dummy dataloader for model.train_dataloader and a model.training_step(self, batch, batch_idx) that calls self.gat(dummy_node_features, dummy_edge_index).

@tjruwase
Contributor

@griff4692, got it. No rush, I can wait for your complete toy example. Thanks!

@griff4692
Author

import pytorch_lightning as pl
import torch
from torch_geometric.nn import GATConv

import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data import Dataset


class DummyDataset(Dataset):
    def __len__(self):
        return 1

    def __getitem__(self, idx):
        return {
            'h': torch.zeros(size=[10, 5]),
            'edge_index': torch.zeros(size=[2, 10]).long(),
            'y': torch.ones(size=[1,])
        }


class MyGAT(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 5)
        self.gat = GATConv(in_channels=5, out_channels=5)
        self.output = nn.Linear(5, 1)
        self.loss = nn.MSELoss()

    def shared_step(self, h, edge_index, y):
        h_proj = self.linear(h[0, :, :])
        h_conv = self.gat(h_proj, edge_index[0, :, :])
        y_pred = self.output(h_conv[0, :])
        return self.loss(y_pred, y)

    def validation_step(self, batch, batch_idx):
        return self.shared_step(**batch)

    def training_step(self, batch, batch_idx):
        return self.shared_step(**batch)

    def train_dataloader(self):
        return DataLoader(DummyDataset(), batch_size=1)

    def val_dataloader(self):
        return DataLoader(DummyDataset(), batch_size=1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


trainer = pl.Trainer(precision=16, gpus=torch.cuda.device_count(), plugins='deepspeed_stage_3_offload')
model = MyGAT()
trainer.fit(model)

@griff4692
Author

Running this actually produces a different error:

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/griffin/sauce/lib/python3.8/site-packages/torch_geometric/utils/softmax.py", line 41, in softmax
    elif index is not None:
        N = maybe_num_nodes(index, num_nodes)
        src_max = scatter(src, index, dim, dim_size=N, reduce='max')
                  ~~~~~~~ <--- HERE
        src_max = src_max.index_select(dim, index)
        out = (src - src_max).exp()
  File "/home/griffin/sauce/lib/python3.8/site-packages/torch_scatter/scatter.py", line 161, in scatter
        return scatter_min(src, index, dim, out, dim_size)[0]
    elif reduce == 'max':
        return scatter_max(src, index, dim, out, dim_size)[0]
               ~~~~~~~~~~~ <--- HERE
    else:
        raise ValueError
  File "/home/griffin/sauce/lib/python3.8/site-packages/torch_scatter/scatter.py", line 73, in scatter_max
        out: Optional[torch.Tensor] = None,
        dim_size: Optional[int] = None) -> Tuple[torch.Tensor, torch.Tensor]:
    return torch.ops.torch_scatter.scatter_max(src, index, dim, out, dim_size)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: "scatter" not implemented for 'Half'

I've been able to reproduce this on my more complex model just by calling half() on the inputs to the GATConv model, and I've confirmed that this behavior is expected:

pyg-team/pytorch_geometric#2866

Maybe I just can't use deepspeed for this particular application or need to downgrade to earlier versions of either deepspeed or geometric for compatibility.
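
For completeness, here is a minimal standalone repro of the underlying torch_scatter limitation (a sketch assuming a CUDA device; it matches the trace above and the behavior confirmed in the pyg issue):

import torch
from torch_scatter import scatter_max

src = torch.randn(6, device='cuda').half()
index = torch.tensor([0, 0, 1, 1, 2, 2], device='cuda')

# Fails with: RuntimeError: "scatter" not implemented for 'Half'
out, argmax = scatter_max(src, index)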

@tjruwase
Contributor

I don't have much experience with TorchScript, but I am curious whether the original issue can be repro'd without TorchScript. I suspect that in that case an appropriate cast would be the fix, but we can't know for sure unless we get a repro.

@griff4692
Author

Does DeepSpeed need everything to be half? If that's the case, it seems incompatible with torch_scatter.

@tjruwase
Contributor

@griff4692, not at all. DeepSpeed can work with fp32 or fp16, depending on the configuration. The problem here is that DeepSpeed does not attempt any automatic casting in the case of mixed-precision training. For example, can you try running in full fp32 by disabling fp16 in your deepspeed config? You can see docs for fp16 configuration here.
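
For reference, the relevant knob is just the fp16 section of the config. As a Python dict (a minimal sketch; train_batch_size and the ZeRO stage are placeholders for whatever you are already using):

ds_config = {
    "train_batch_size": 1,
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": False},  # run the engine in full fp32
}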

@griff4692
Author

In PyTorch Lightning, if you use fp32 with the DeepSpeed plugin, you get the following error:

Traceback (most recent call last):
  File "finetune.py", line 27, in <module>
    run(args, PureSAUCE)
  File "/home/griffin/kabupra/graph/main.py", line 119, in run
    trainer.fit(model)
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self._run(model)
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 755, in _run
    self.pre_dispatch()
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 780, in pre_dispatch
    self.accelerator.pre_dispatch(self)
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 108, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 251, in pre_dispatch
    self.init_deepspeed()
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 256, in init_deepspeed
    self._format_config()
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 413, in _format_config
    self._format_precision_config()
  File "/home/griffin/sauce/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 461, in _format_precision_config
    raise MisconfigurationException("To use DeepSpeed ZeRO Optimization, you must set precision=16.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: To use DeepSpeed ZeRO Optimization, you must set precision=16.

I'll look into circumventing this.

@griff4692
Author

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/plugins/training_type/deepspeed.py

It looks like you can rewrite the config and it may work, but I'm not sure how it interacts with all the other settings. A rough sketch of what I mean:
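
(This is an untested sketch: DeepSpeedPlugin takes a config argument in Lightning 1.3, but the _format_precision_config check above may still reject precision=32 with ZeRO.)

import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
    },
    "fp16": {"enabled": False},  # full fp32 so torch_scatter never sees Half
}

trainer = pl.Trainer(gpus=1, plugins=[DeepSpeedPlugin(config=ds_config)])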

@SeanNaren
Contributor

Thanks for checking this out @tjruwase, appreciate it!

Lightning should be updated to allow FP32 support; let me try to make a branch for us to try, @griff4692!

@griff4692
Author

Hi - I was able to manually call half() on a few tensors and got things working. However, I now get the following error:

RuntimeError: Function LinearFunctionForZeroStage3Backward returned an invalid gradient at index 0 - got [14, 96] but expected shape compatible with [14, 96, 768]

This error only occurs with DeepSpeed, so maybe something fishy is going on with all my manual interventions.

@SeanNaren
Contributor

We have a PR for getting FP32 support on the Lightning side: Lightning-AI/pytorch-lightning#8462

@griff4692 I'll sync with you offline about the issue with half() not being called appropriately on your tensors!

@tjruwase
Contributor

@SeanNaren, thanks for helping out on the Lightning side. Can you both please keep me in the loop if there are any issues to fix on the DeepSpeed side in order to close this? Thanks!

@tjruwase
Contributor

Closing as this seems to have been fixed on the Lightning side.
