Steps not incremented correctly with accumulate gradients #2242
@mRcSchwering can you try 0.8.0?

Just tried on 0.8.1 (hope that's as good). Issue remains. E.g.

@mRcSchwering mind checking it with our latest master? 🐰
I guess this is solved by #2853

```python
import os
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl

pl.seed_everything(666)


class MyModule(pl.LightningModule):

    def __init__(self, hparams: dict):
        super().__init__()
        self.hparams = hparams
        self.model = nn.Linear(28 * 28, 10)

    def training_step_end(self, outputs: dict):
        print(f'Epoch: {self.current_epoch} Step: {self.global_step} Batch size: {len(outputs["logits"])}')
        return outputs

    def on_before_zero_grad(self, optimizer: torch.optim.Optimizer):
        current_lr = [d['lr'] for d in optimizer.param_groups][0]
        print(f'Step: {self.global_step} LR: {current_lr:.4e}')

    def train_dataloader(self):
        return DataLoader(
            MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()),
            batch_size=1171, shuffle=False)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx: int) -> dict:
        inputs, targets = batch
        logits = self.forward(inputs.view(inputs.size(0), -1))
        loss = F.cross_entropy(logits, targets)
        return {'loss': loss, 'logits': logits}

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=3e-4)

    def optimizer_step(self, epoch, batch_idx, optimizer, opt_idx, lambda_closure, using_native_amp, using_lbfgs):
        # modify learning rate...
        optimizer.step()
        self.on_before_zero_grad(optimizer)
        optimizer.zero_grad()


trainer = pl.Trainer(
    max_steps=100,
    max_epochs=int(1e6),
    gpus=-1,
    num_sanity_val_steps=0,
    progress_bar_refresh_rate=0,
    accumulate_grad_batches=7,
    early_stop_callback=False)

model = MyModule({})
trainer.fit(model)
```

Output: the LR is printed after every 7 accumulated batches and also for the last batch of the epoch.

```
Epoch: 1 Step: 8 Batch size: 1171
Epoch: 1 Step: 8 Batch size: 1171
Epoch: 1 Step: 8 Batch size: 1171
Epoch: 1 Step: 8 Batch size: 1171
Epoch: 1 Step: 8 Batch size: 1171
Epoch: 1 Step: 8 Batch size: 1171
Epoch: 1 Step: 8 Batch size: 1171
Step: 8 LR: 3.0000e-04
Step: 8 LR: 3.0000e-04
Epoch: 1 Step: 9 Batch size: 1171
Epoch: 1 Step: 9 Batch size: 1171
Epoch: 1 Step: 9 Batch size: 1171
Epoch: 1 Step: 9 Batch size: 1171
Epoch: 1 Step: 9 Batch size: 1171
Epoch: 1 Step: 9 Batch size: 1171
Epoch: 1 Step: 9 Batch size: 1171
Step: 9 LR: 3.0000e-04
Step: 9 LR: 3.0000e-04
Epoch: 1 Step: 10 Batch size: 1171
Epoch: 1 Step: 10 Batch size: 1171
Epoch: 1 Step: 10 Batch size: 1171
Epoch: 1 Step: 10 Batch size: 1171
Epoch: 1 Step: 10 Batch size: 1171
Epoch: 1 Step: 10 Batch size: 1171
Epoch: 1 Step: 10 Batch size: 1171
Step: 10 LR: 3.0000e-04
Step: 10 LR: 3.0000e-04
Epoch: 1 Step: 11 Batch size: 1171
Epoch: 1 Step: 11 Batch size: 1171
Epoch: 1 Step: 11 Batch size: 1171
Epoch: 1 Step: 11 Batch size: 1171
Epoch: 1 Step: 11 Batch size: 1171
Epoch: 1 Step: 11 Batch size: 1171
Epoch: 1 Step: 11 Batch size: 1171
Step: 11 LR: 3.0000e-04
Step: 11 LR: 3.0000e-04
Epoch: 1 Step: 12 Batch size: 1171
Epoch: 1 Step: 12 Batch size: 1171
Epoch: 1 Step: 12 Batch size: 1171
Epoch: 1 Step: 12 Batch size: 1171
Epoch: 1 Step: 12 Batch size: 1171
Epoch: 1 Step: 12 Batch size: 1171
Epoch: 1 Step: 12 Batch size: 1171
Step: 12 LR: 3.0000e-04
Step: 12 LR: 3.0000e-04
Epoch: 1 Step: 13 Batch size: 1171
Epoch: 1 Step: 13 Batch size: 1171
Epoch: 1 Step: 13 Batch size: 1171
Epoch: 1 Step: 13 Batch size: 1171
Epoch: 1 Step: 13 Batch size: 1171
Epoch: 1 Step: 13 Batch size: 1171
Epoch: 1 Step: 13 Batch size: 1171
Step: 13 LR: 3.0000e-04
Step: 13 LR: 3.0000e-04
Epoch: 1 Step: 14 Batch size: 1171
Epoch: 1 Step: 14 Batch size: 1171
Epoch: 1 Step: 14 Batch size: 1171
Epoch: 1 Step: 14 Batch size: 1171
Epoch: 1 Step: 14 Batch size: 1171
Epoch: 1 Step: 14 Batch size: 1171
Epoch: 1 Step: 14 Batch size: 1171
Step: 14 LR: 3.0000e-04
Step: 14 LR: 3.0000e-04
Epoch: 1 Step: 15 Batch size: 1171
Epoch: 1 Step: 15 Batch size: 1171
Epoch: 1 Step: 15 Batch size: 279
Step: 15 LR: 3.0000e-04
Step: 15 LR: 3.0000e-04
Epoch: 2 Step: 16 Batch size: 1171
Epoch: 2 Step: 16 Batch size: 1171
Epoch: 2 Step: 16 Batch size: 1171
Epoch: 2 Step: 16 Batch size: 1171
Epoch: 2 Step: 16 Batch size: 1171
Epoch: 2 Step: 16 Batch size: 1171
Epoch: 2 Step: 16 Batch size: 1171
Step: 16 LR: 3.0000e-04
Step: 16 LR: 3.0000e-04
```
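For reference, the step counts in the output above follow from simple arithmetic (a sketch; the dataset size, batch size, and accumulation factor are taken from the repro script, and it assumes the trainer performs one optimizer step per completed accumulation window plus one for any leftover batches at epoch end):

```python
import math

def optimizer_steps_per_epoch(num_samples, batch_size, accumulate_grad_batches):
    # Batches per epoch, counting the final partial batch.
    batches = math.ceil(num_samples / batch_size)
    # Optimizer steps per epoch, counting the final partial window.
    return math.ceil(batches / accumulate_grad_batches)

# MNIST train set: 60000 samples, batch_size=1171 -> 52 batches per epoch;
# accumulate_grad_batches=7 -> 8 optimizer steps per epoch, which matches
# the output above (epoch 1 covers global steps 8..15).
print(optimizer_steps_per_epoch(60000, 1171, 7))  # → 8
```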
Cool, thx. And I learned something new.

So I guess this could be closed.
🐛 Bug

`global_step` and `current_epoch` do not match up anymore after more than one epoch when accumulate gradients is set greater than 1. I think at the end of each epoch, `optimizer_step` (and `on_before_zero_grad`) is not called in that case.

To Reproduce

1. Create a `pl.LightningModule` that logs `current_epoch` and `global_step` in every `training_step_end`.
2. Set `accumulate_grad_batches=7` in the trainer.

Expected behavior

When `current_epoch` gets incremented, `global_step` gets incremented as well.

Actual behavior

`global_step` increments with every batch, but not when `current_epoch` gets incremented; `global_step` is basically missing every 3rd increment.

Code sample

Below is basically what I have. I am adjusting the learning rate with every global step. The learning rate adjustment and each `training_step_end` call get printed. Below is some sample output. You can see the end of the epoch where the last 93 samples are processed. Then, `current_epoch` increases, but `global_step` does not. Additionally, the learning rate print is missing, so `on_before_zero_grad` was not called.
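The suspected cause can be sketched as follows (a simplified, hypothetical model of the accumulation logic, not Lightning's actual source): if the trainer only calls `optimizer_step` when a full accumulation window completes, the leftover batches at the end of an epoch never trigger a step, so `global_step` falls behind.

```python
def should_step(batch_idx, num_batches, accumulate_grad_batches):
    # Step when a full accumulation window of batches completes ...
    window_done = (batch_idx + 1) % accumulate_grad_batches == 0
    # ... and also on the last batch of the epoch, so the leftover
    # accumulated gradients are applied. Dropping this second check
    # reproduces the reported behavior (no step at epoch end).
    epoch_done = (batch_idx + 1) == num_batches
    return window_done or epoch_done

# 52 batches per epoch, accumulating over 7: optimizer steps happen at
# batch_idx 6, 13, ..., 48 and at the final batch_idx 51 -> 8 steps.
print(len([i for i in range(52) if should_step(i, 52, 7)]))  # → 8
```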
Environment