ValueError: only one element tensors can be converted to Python scalars #1218
Comments
Do I understand correctly that this happens with every Lightning training run where the batch size is not divisible by the number of GPUs? Would the same error occur in …?
The last batch seems a bit tricky. I'm no expert on this, but I wonder if it's possible to send the data to only part of the specified GPUs (for example, with 4 GPUs in total, send 3 batches to the first 3 GPUs). I remember there's some send_batch_to_gpu function in PL.
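For what it's worth, plain PyTorch dp scatter already behaves roughly like that: when a batch has fewer samples than devices, only the first few GPUs receive a chunk. A hedged illustration (not Lightning code; it assumes a machine with at least 4 CUDA devices, and the exact chunking is the standard DataParallel scatter behaviour):

```python
import torch
from torch.nn.parallel import scatter

# A 3-sample batch scattered across 4 GPUs lands on only the first 3 devices,
# so per-step outputs can have fewer elements than the number of requested GPUs.
batch = torch.randn(3, 8)
chunks = scatter(batch, target_gpus=[0, 1, 2, 3])
print(len(chunks))  # 3, not 4
```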
@Ir1d mind sending a PR or providing an example to replicate it?
I can provide an example next week. Currently I'm not sure how to fix this.
Hi @Borda, I've reproduced the same issue here: https://github.com/Richarizardd/pl_image_classification. It is basic image classification with PL using the MNIST, CIFAR10, and ImageFolder datasets from torchvision. If you run the mnist_gpu1.yaml config file, you will get the same issue as @Ir1d.
Hi @Richarizardd, instead of

```python
for output in outputs:
    metric_total += output[metric_name]
tqdm_dict[metric_name] = metric_total / len(outputs)
```

you should do the following:

```python
tqdm_dict[metric_name] = torch.stack([output[metric_name] for output in outputs]).mean()
```

I tested this by adding it to your code and it worked (no error). @Ir1d you probably had the same mistake.
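Under dp, each entry of `outputs` can hold a tensor with one value per GPU rather than a Python scalar, so accumulating a total and converting it with `float()` trips the ValueError in the title. A minimal sketch of the difference; the shape of `outputs` here is an assumption about how dp returns per-GPU values, not copied from the repository above:

```python
import torch

# Hypothetical shape of `outputs` after a dp epoch: each dict holds a tensor
# with one value per GPU that received a chunk of the batch.
outputs = [
    {"val_acc": torch.tensor([0.90, 0.92])},  # batch split across 2 GPUs
    {"val_acc": torch.tensor([0.88, 0.91])},
]

# Failing pattern: converting a multi-element tensor with float()/item() raises
# "ValueError: only one element tensors can be converted to Python scalars".
# metric_total = sum(o["val_acc"] for o in outputs)
# bad = float(metric_total / len(outputs))

# Pattern suggested above: stack and reduce with tensor ops, then convert.
val_acc = torch.stack([o["val_acc"] for o in outputs]).mean()
print(val_acc.item())  # a single scalar, safe to convert
```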
In @Richarizardd's case, the error was thrown in …
Your metric reduction looks fine to me.
I tried again and this issue is still present in pl v0.7.3. Here's the whole log for a recent run. I set bs=3 on 4 GPUs and told PL to use all 4 GPUs, and this happens for the first batch. (In another case, when drop_last is not set and bs=12 on 4 GPUs, it happens for the last batch of the epoch, which suggests it happens whenever bs < num GPUs.) I'll try to put together minimal code for reproduction after I finish my midterm tests. Currently my code is available at https://github.com/ir1d/AFN, but the data is a bit large and it might be hard to run.
@Ir1d I found the bug in the trainer code. It does not reduce the outputs if the output size of the training step does not equal num_gpus. I will make a PR to fix it.
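The gist of that reduction, as a hedged sketch rather than the actual Lightning trainer code (the function name and structure here are illustrative):

```python
import torch

def reduce_dp_output(value):
    # Collapse a per-GPU tensor to its mean regardless of how many GPUs
    # actually produced an element; scalars pass through untouched.
    if isinstance(value, torch.Tensor) and value.dim() > 0:
        return value.mean()
    return value

# A 3-sample batch scattered across 4 requested GPUs yields 3 values, not 4;
# reducing unconditionally avoids the one-element-tensor ValueError later on.
loss_from_three_gpus = torch.tensor([0.51, 0.48, 0.50])
print(reduce_dp_output(loss_from_three_gpus))  # tensor(0.4967)
```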
@Ir1d The fix got merged. Kindly verify it against the latest master branch. Closing for now.
🐛 Bug
This happens in the training loop.
To Reproduce
From my observation, I believe this happens when the batch size can't be divided evenly by the number of GPUs, for example on the last batch of each epoch, or when you have 4 GPUs but set the batch size to 2.
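A hedged, minimal reproduction sketch along those lines. The module, data, and Trainer arguments are illustrative and follow the 0.7.x-era Lightning API used in this thread (e.g. `distributed_backend="dp"`); reproducing the error requires a machine with 4 GPUs and an affected version:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self(x), y)
        return {"loss": loss, "progress_bar": {"train_loss": loss}}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        x, y = torch.randn(16, 8), torch.randn(16, 1)
        # batch_size=2 with gpus=4: each batch is scattered to only 2 devices.
        return DataLoader(TensorDataset(x, y), batch_size=2)

# dp splits each batch across the visible GPUs; when the batch is smaller than
# the GPU count, the per-step output list is shorter than num_gpus, which is
# what triggered the ValueError on affected versions.
trainer = pl.Trainer(gpus=4, distributed_backend="dp", max_epochs=1)
trainer.fit(TinyModel())
```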
Additional context
I think it would be nice to use only some of the GPUs the user specified, while printing a message telling them that the GPUs are not configured correctly. The current implementation simply throws an unfriendly error.
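As an illustration of that suggestion, a possible check might look like the sketch below. The helper name and fallback behaviour are hypothetical, not part of Lightning:

```python
import warnings

def resolve_usable_gpus(batch_size: int, requested_gpus: int) -> int:
    # Hypothetical helper, not Lightning code: warn and fall back to fewer
    # GPUs instead of letting the metric reduction fail with an opaque
    # "only one element tensors..." ValueError.
    if requested_gpus > 1 and batch_size < requested_gpus:
        warnings.warn(
            f"batch_size={batch_size} is smaller than the {requested_gpus} "
            f"requested GPUs; only {batch_size} of them will receive data."
        )
        return batch_size
    return requested_gpus

print(resolve_usable_gpus(batch_size=2, requested_gpus=4))  # warns, prints 2
```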