
ValueError: only one element tensors can be converted to Python scalars #1218

Closed
Ir1d opened this issue Mar 23, 2020 · 14 comments
@Ir1d
Contributor

Ir1d commented Mar 23, 2020

🐛 Bug

This happens in the training loop.

ValueError: only one element tensors can be converted to Python scalars

To Reproduce

From my observation, this happens when the batch size is not divisible by the number of GPUs: for example, on the last batch of each epoch, or when you have 4 GPUs but set the batch size to 2.

Additional context

I think it would be nice to use only some of the GPUs the user specified, while printing a message telling them that the GPUs are not configured correctly. The current implementation simply throws an unfriendly error.
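
For reference, a minimal sketch of the kind of setup that hits this condition. The module, dataset, and sizes below are made up for illustration (not taken from my code), and the Trainer arguments follow the 0.7-era API; treat it as a sketch rather than a verified reproduction:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.mse_loss(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        # batch_size=2 is smaller than the 4 GPUs requested below
        data = TensorDataset(torch.randn(10, 8), torch.randn(10, 1))
        return DataLoader(data, batch_size=2)

# 4 GPUs with DataParallel; every batch is split across fewer GPUs than requested
trainer = pl.Trainer(gpus=4, distributed_backend='dp', max_epochs=1)
trainer.fit(ToyModule())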

@Ir1d Ir1d added bug Something isn't working help wanted Open to be worked on labels Mar 23, 2020
@awaelchli
Contributor

awaelchli commented Mar 24, 2020

Do I understand correctly that this happens in every Lightning training run where batch_size < num_gpus?
Then I would also like to see a warning message like you described, and automatically set num_gpus to the batch size.
However, we still have the problem where batch_size > num_gpus for all batches except the last, when drop_last=False is specified in the dataloader. What do we do then?

Would the same error occur in torch.nn.DataParallel (without PL)?
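
For illustration, with drop_last=False the final batch contains len(dataset) % batch_size samples whenever that remainder is non-zero, so it can fall below the number of GPUs even when the configured batch size is larger. Plain PyTorch, with numbers chosen only for the example:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(50, 8))          # 50 samples
loader = DataLoader(dataset, batch_size=12, drop_last=False)

# Four full batches of 12, then a final batch of 2 -- smaller than 4 GPUs.
print([len(batch[0]) for batch in loader])           # [12, 12, 12, 12, 2]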

@Ir1d
Contributor Author

Ir1d commented Mar 24, 2020

The last batch seems a bit tricky. I'm no expert on this, but I wonder if it's possible to send the data to only part of the specified GPUs (for example, with 4 GPUs in total, send 3 sub-batches to the first 3 GPUs). I remember there's some send_batch_to_gpu function in PL.

@Borda
Member

Borda commented Mar 25, 2020

@neggert @jeffling pls ^^

@awaelchli
Contributor

@Ir1d btw the person in #1236 is getting the same error but doesn't have a GPU.

@Borda Borda added the priority: 0 High priority task label Apr 8, 2020
@Borda Borda added this to the 0.7.3 milestone Apr 8, 2020
@Borda
Member

Borda commented Apr 16, 2020

@Ir1d mind sending a PR or providing an example to replicate it?

@Ir1d
Contributor Author

Ir1d commented Apr 16, 2020

I can provide an example next week. I'm currently not sure how to fix this.

@Richarizardd

Hi @Borda, I've reproduced the same issue here:

https://github.com/Richarizardd/pl_image_classification

It's basic image classification with PL using the MNIST, CIFAR10, and ImageFolder datasets from torchvision. If you run the mnist_gpu1.yaml config file, you will get the same issue as @Ir1d.

@Borda Borda assigned awaelchli and unassigned jeffling Apr 25, 2020
@awaelchli
Contributor

Hi @Richarizardd
I looked at your code and found that in your validation_epoch_end you don't reduce the outputs properly. PL does not do this for you. This is intentional, right @williamFalcon?
So, in your validation_epoch_end, instead of

for output in outputs:
    metric_total += output[metric_name]
tqdm_dict[metric_name] = metric_total / len(outputs)

you should do the following:

tqdm_dict[metric_name] = torch.stack([output[metric_name] for output in outputs]).mean()

I tested this by adding it to your code and it worked (no error).
As far as I can tell, this is not a bug in PL. However, we could print a better error message.

@Ir1d you probably had the same mistake.
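
In context, the corrected reduction sits inside validation_epoch_end roughly like this (the metric names and return keys below are illustrative, following the 0.7-era hook conventions, not copied from the repo):

import torch

def validation_epoch_end(self, outputs):
    # `outputs` is a list of dicts, one per validation_step call;
    # stack the per-step tensors and reduce them here -- PL does not do it for you.
    tqdm_dict = {}
    for metric_name in ('val_loss', 'val_acc'):
        tqdm_dict[metric_name] = torch.stack(
            [output[metric_name] for output in outputs]
        ).mean()
    return {'progress_bar': tqdm_dict, 'log': tqdm_dict}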

@Ir1d
Contributor Author

Ir1d commented Apr 26, 2020

[screenshot of Ir1d's metric reduction code]

@awaelchli
Contributor

In @Richarizardd's case, the error was thrown in validation_epoch_end. Could you post your stack trace here so I can check if it's the same?

@awaelchli
Contributor

awaelchli commented Apr 26, 2020

Your metric reduction looks fine to me.

@Ir1d
Contributor Author

Ir1d commented Apr 26, 2020

I tried again and this issue still occurs in PL v0.7.3.

Here's the whole log for a recent run. I set bs=3 on 4 GPUs and told PL to use all 4 GPUs, and the error happens on the first batch. (In another case, where drop_last is not set and bs=12 on 4 GPUs, it happens on the last batch of the epoch, which suggests it occurs when bs < num_gpus.)
[screenshot of the error log]

I'll try to put together minimal reproduction code after I finish my midterm exams. Currently my code is available at https://github.com/ir1d/AFN, but the data is a bit large and might be hard to run.

@awaelchli
Contributor

@Ir1d I found the bug in the trainer code. It does not reduce the outputs if the output size of the training step does not equal num_gpus. I will make a PR to fix it.
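
Schematically, the fix means reducing however many outputs the gather actually returns instead of assuming exactly one per GPU. A simplified sketch, not the actual trainer code (the helper name here is hypothetical):

import torch

def reduce_gathered_output(output):
    # Simplified sketch: average any multi-element tensor in the gathered output,
    # regardless of whether its length equals the number of GPUs.
    if isinstance(output, dict):
        return {k: reduce_gathered_output(v) for k, v in output.items()}
    if isinstance(output, torch.Tensor) and output.numel() > 1:
        return output.float().mean()
    return output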

@Borda Borda modified the milestones: 0.7.4, 0.7.5 Apr 26, 2020
@awaelchli
Contributor

@Ir1d The fix got merged. Kindly verify it with the latest master branch. Closing for now.
