[WIP] Reduction when batch size < num gpus #1609
Conversation
@williamFalcon what are your thoughts on this one?
Codecov Report
@@           Coverage Diff           @@
##           master   #1609   +/-   ##
=======================================
- Coverage      88%      88%    -0%
=======================================
  Files          69       69
  Lines        4133     4132     -1
=======================================
- Hits         3656     3655     -1
  Misses        477      477
it would be nice to verify the bug by a test, do you have a minimal example for it?
I verified that the test I added here fails on master but passes with the fix. However, to demonstrate it one needs at least 3 gpus, so CI won't help much :)
LGTM 🦝
This pull request is now in conflict... :(
num_gpus = 3
batch_size = 3

class CurrentTestModel(
@awaelchli @Borda this needs the new test syntax not mixins...
use latest test syntax please
@williamFalcon I was not aware of any new syntax. I will have a look.
elif output[k].size(0) == num_gpus:
    reduced = torch.mean(output[k])
    output[k] = reduced
# do not reduce metrics that have batch size > num gpus
batch size has nothing to do with dp.... why is this fix even needed?
size(0) should be the number of GPUs in DP... NOT batch size
simple example: batch_size = 2, num_gpus = 3. Lightning will forward the batch with only 2 gpus, so the number of outputs is 2, so size(0) = 2. Therefore Lightning will not reduce the output and we get a problem later when the progress bar metrics call .item on that tensor.
Is my explanation correct or not?
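To make the failure mode concrete, here is a minimal CPU-only sketch (no GPUs or Lightning involved; the tensor stands in for the gathered per-device outputs, and all names are illustrative):

import torch

num_gpus = 3
# the last batch had only 2 samples, so only 2 devices produced an output
gathered = {'some_val': torch.tensor([0.5, 0.7])}

for k, v in gathered.items():
    # old behaviour: reduce only when size(0) == num_gpus
    if isinstance(v, torch.Tensor) and v.size(0) == num_gpus:
        gathered[k] = v.mean()

# the progress bar later calls .item(), which needs a single-element tensor
try:
    gathered['some_val'].item()
except (ValueError, RuntimeError) as err:
    print(f'crash reproduced: {err}')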
split_batch = ...
outs = []
for batch_split, model_copy in zip(split_batch, model_copies):
    out = model_copy(batch_split)
    outs.append(out)
outs = torch.stack(outs, dim=0)
# size(0) is the number of GPUs, NOT the batch size...

please debug this live to convince yourself. this has been working forever... can you give me a colab or an example where this is a problem?
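For reference, a runnable CPU-only version of the sketch above, with the GPU replicas simulated by copies of a tiny module (all names here are illustrative, not the actual DP internals):

import copy

import torch
import torch.nn as nn

num_gpus = 3
model = nn.Linear(4, 1)
replicas = [copy.deepcopy(model) for _ in range(num_gpus)]  # stand-ins for per-GPU copies

batch = torch.randn(6, 4)              # full batch
splits = batch.chunk(num_gpus, dim=0)  # DP scatters the batch across devices

outs = []
for batch_split, model_copy in zip(splits, replicas):
    outs.append(model_copy(batch_split).sum())  # each replica returns a scalar

outs = torch.stack(outs, dim=0)
print(outs.size(0))  # 3 == num_gpus: size(0) counts replicas, not samples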
Ok thanks.
@williamFalcon Here is the minimal example.

from argparse import Namespace
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader, TensorDataset

from pytorch_lightning import Trainer
from pytorch_lightning.core import LightningModule


class LightningTemplateModel(LightningModule):

    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams
        self.layer = nn.Linear(32, 32)

    def forward(self, x):
        # dummy forward
        return self.layer(x).sum()

    def loss(self, labels, logits):
        nll = F.nll_loss(logits, labels)
        return nll

    def training_step(self, batch, batch_idx):
        loss = self.forward(*batch)
        output = OrderedDict({
            'loss': loss,
            'progress_bar': {'some_val': loss},  # you see, if we comment this line, everything works
        })
        return output

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters())
        return optimizer

    def train_dataloader(self):
        num_gpus = self.hparams.num_gpus
        batch_size = self.hparams.batch_size
        # construct a dataset with a size that is not divisible by num_gpus
        # therefore the last batch will have a size < num_gpus
        size = num_gpus * batch_size + (num_gpus - 1)
        print(size)
        dataset = TensorDataset(torch.zeros(size, 32))
        loader = DataLoader(
            dataset=dataset,
            batch_size=self.hparams.batch_size,
            drop_last=False,
        )
        return loader


if __name__ == '__main__':
    hparams = Namespace(num_gpus=3, batch_size=3)
    model = LightningTemplateModel(hparams)
    trainer = Trainer(
        checkpoint_callback=False,
        early_stop_callback=False,
        gpus=hparams.num_gpus,
    )
    trainer.fit(model)  # this will crash

The progress_bar metrics don't get reduced in the case where batch_size < num_gpus.
sorry but colab won't let me run with 3 gpus, and to show this bug we need at least 3
ok i see. this makes sense.
Thanks! big relief.
Can someone explain the logic of this? I fail to see why it should matter and it causes inconsistent behavior, depending on your batch size.
@lolaclinton I gave an example in this comment: #1609 (comment). All of this logic only applies to DP. In data parallel, we need to reduce the output returned from all gpus, even when the batch size is smaller than the num gpus. A batch size smaller than num gpus is not optimal, but it can happen, for example, when the dataset size is not evenly divisible by the batch size and we don't set drop_last in the dataloader.
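For illustration, here is a sketch of the reduction rule described above, assuming per-device scalar outputs are stacked along dim 0 (an illustrative helper, not the exact Lightning code):

import torch

def reduce_dp_output(output, num_gpus):
    """Mean-reduce per-device outputs, even if fewer than num_gpus devices produced one."""
    for k, v in output.items():
        if isinstance(v, dict):
            output[k] = reduce_dp_output(v, num_gpus)
        elif isinstance(v, torch.Tensor) and v.dim() == 1 and v.size(0) <= num_gpus:
            # with drop_last=False the last batch can be smaller than num_gpus,
            # so fewer outputs are gathered; reduce those as well
            output[k] = v.mean()
    return output

out = {'loss': torch.tensor([0.3, 0.4]),
       'progress_bar': {'some_val': torch.tensor([0.3, 0.4])}}
print(reduce_dp_output(out, num_gpus=3))  # every entry reduced to a 0-dim tensor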
Before submitting
What does this PR do?
This fixes a problem where the metrics don't get reduced if batch size < num gpus (#1218). However, I don't know if this is the correct thing to do, since PL explicitly checks that batch_size == num_gpus. Is there a reason for this? This bug also happens when the batch size dynamically changes during training, or when drop_last = False and the last batch happens to be smaller than num_gpus.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃