
ValueError: only one element tensors can be converted to Python scalars #1218

Closed
Ir1d opened this issue Mar 23, 2020 · 14 comments
@Ir1d
Contributor

Ir1d commented Mar 23, 2020

🐛 Bug

This happens in the training loop.

ValueError: only one element tensors can be converted to Python scalars

To Reproduce

From my observation, this happens when the batch size is not divisible by the number of GPUs: for example, on the last batch of each epoch, or when you have 4 GPUs but set the batch size to 2.

Additional context

I think it would be nice to use only some of the GPUs the user specified, while printing a message telling them that the GPUs are not configured correctly. The current implementation simply throws an unfriendly error.
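
For reference, a minimal sketch of the kind of setup that hits this condition. The module, dataset, and sizes below are made up for illustration (not taken from my code), and the Trainer arguments follow the 0.7-era API; treat it as a sketch rather than a verified reproduction:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.mse_loss(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        # batch_size=2 is smaller than the 4 GPUs requested below
        data = TensorDataset(torch.randn(10, 8), torch.randn(10, 1))
        return DataLoader(data, batch_size=2)

# 4 GPUs with DataParallel; every batch is split across fewer GPUs than requested
trainer = pl.Trainer(gpus=4, distributed_backend='dp', max_epochs=1)
trainer.fit(ToyModule())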

@Ir1d Ir1d added bug Something isn't working help wanted Open to be worked on labels Mar 23, 2020
@awaelchli
Contributor

awaelchli commented Mar 24, 2020

Do I understand correctly that this happens in every Lightning training run where batch_size < num_gpus?
Then I would also like to see a warning message like you described, and automatically set num_gpus to the batch size.
However, we still have the problem where batch_size > num_gpus for all batches except the last, when drop_last=False is specified in the dataloader. What do we do then?

Would the same error occur in torch.nn.DataParallel (without PL)?
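
For illustration, with drop_last=False the final batch contains len(dataset) % batch_size samples whenever that remainder is non-zero, so it can fall below the number of GPUs even when the configured batch size is larger. Plain PyTorch, with numbers chosen only for the example:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(50, 8))          # 50 samples
loader = DataLoader(dataset, batch_size=12, drop_last=False)

# Four full batches of 12, then a final batch of 2 -- smaller than 4 GPUs.
print([len(batch[0]) for batch in loader])           # [12, 12, 12, 12, 2]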

@Ir1d
Contributor Author

Ir1d commented Mar 24, 2020

The last batch seems a bit tricky. I'm no expert on this, but I wonder if it's possible to send the data to only part of the specified GPUs (for example, with 4 GPUs in total, send 3 sub-batches to the first 3 GPUs). I remember there's some send_batch_to_gpu function in PL.

@Borda
Member

Borda commented Mar 25, 2020

@neggert @jeffling pls ^^

@awaelchli
Contributor

@Ir1d btw the person in #1236 is getting the same error but doesn't have a GPU.

@Borda Borda added the priority: 0 High priority task label Apr 8, 2020
@Borda Borda added this to the 0.7.3 milestone Apr 8, 2020
@Borda
Member

Borda commented Apr 16, 2020

@Ir1d mind sending a PR or providing an example to replicate it?

@Ir1d
Contributor Author

Ir1d commented Apr 16, 2020

I can provide an example next week. I'm currently not sure how to fix this.

@Richarizardd

Hi @Borda, I've reproduced the same issue here:

https://github.com/Richarizardd/pl_image_classification

It's basic image classification with PL using the MNIST, CIFAR10, and ImageFolder datasets from torchvision. If you run the mnist_gpu1.yaml config file, you will get the same issue as @Ir1d.

@Borda Borda assigned awaelchli and unassigned jeffling Apr 25, 2020
@awaelchli
Contributor

Hi @Richarizardd
I looked at your code and found that in your validation_epoch_end you don't reduce the outputs properly. PL does not do this for you. This is intentional, right @williamFalcon?
So, in your validation_epoch_end, instead of

for output in outputs:
    metric_total += output[metric_name]
tqdm_dict[metric_name] = metric_total / len(outputs)

you should do the following:

tqdm_dict[metric_name] = torch.stack([output[metric_name] for output in outputs]).mean()

I tested this by adding it to your code and it worked (no error).
As far as I can tell, this is not a bug in PL. However, we could print a better error message.

@Ir1d you probably had the same mistake.
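
In context, the corrected reduction sits inside validation_epoch_end roughly like this (the metric names and return keys below are illustrative, following the 0.7-era hook conventions, not copied from the repo):

import torch

def validation_epoch_end(self, outputs):
    # `outputs` is a list of dicts, one per validation_step call;
    # stack the per-step tensors and reduce them here -- PL does not do it for you.
    tqdm_dict = {}
    for metric_name in ('val_loss', 'val_acc'):
        tqdm_dict[metric_name] = torch.stack(
            [output[metric_name] for output in outputs]
        ).mean()
    return {'progress_bar': tqdm_dict, 'log': tqdm_dict}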

@Ir1d
Contributor Author

Ir1d commented Apr 26, 2020

[screenshot of Ir1d's metric reduction code]

@awaelchli
Contributor

In @Richarizardd's case, the error was thrown in validation_epoch_end. Could you post your stack trace here so I can check if it's the same?

@awaelchli
Contributor

awaelchli commented Apr 26, 2020

Your metric reduction looks fine to me.

@Ir1d
Contributor Author

Ir1d commented Apr 26, 2020

I tried again and this issue still occurs in PL v0.7.3.

Here's the whole log for a recent run. I set bs=3 on 4 GPUs and told PL to use all 4 GPUs, and the error happens on the first batch. (In another case, where drop_last is not set and bs=12 on 4 GPUs, it happens on the last batch of the epoch, which suggests it occurs when bs < num_gpus.)
[screenshot of the error log]

I'll try to put together minimal reproduction code after I finish my midterm exams. Currently my code is available at https://github.com/ir1d/AFN, but the data is a bit large and might be hard to run.

@awaelchli
Contributor

@Ir1d I found the bug in the trainer code. It does not reduce the outputs if the output size of the training step does not equal num_gpus. I will make a PR to fix it.
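
Schematically, the fix means reducing however many outputs the gather actually returns instead of assuming exactly one per GPU. A simplified sketch, not the actual trainer code (the helper name here is hypothetical):

import torch

def reduce_gathered_output(output):
    # Simplified sketch: average any multi-element tensor in the gathered output,
    # regardless of whether its length equals the number of GPUs.
    if isinstance(output, dict):
        return {k: reduce_gathered_output(v) for k, v in output.items()}
    if isinstance(output, torch.Tensor) and output.numel() > 1:
        return output.float().mean()
    return output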

@Borda Borda modified the milestones: 0.7.4, 0.7.5 Apr 26, 2020
@awaelchli
Contributor

@Ir1d The fix got merged. Kindly verify it with the latest master branch. Closing for now.
