
Large Memory Differences with DP vs. DDP accelerator #8826

Closed · anshulcgm opened this issue Aug 10, 2021 · 4 comments
Labels
distributed (Generic distributed-related topic) · help wanted (Open to be worked on) · question (Further information is requested) · strategy: dp (removed in pl) (DataParallel) · waiting on author (Waiting on user action, correction, or update) · won't fix (This will not be worked on)

Comments


anshulcgm commented Aug 10, 2021

🐛 Bug

I am running a training loop with a Transformer model in PyTorch Lightning and trying to use DDP as the accelerator. I run into CUDA OOM errors due to the large memory requirement of the multihead attention module; however, I do not run into this issue when using DP as the accelerator. When tracking GPU memory usage, DP runs through a batch using about 25 GB of memory, whereas DDP needs more than 45 GB.
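
For reference, a minimal sketch of how such per-GPU usage can be checked from the CUDA allocator (illustrative only; this is not the exact tracking code used for the numbers above):

# Illustrative only: log per-GPU allocator stats, e.g. at the end of training_step.
import torch

def log_cuda_memory(tag: str) -> None:
    for device in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(device) / 2**30
        reserved = torch.cuda.memory_reserved(device) / 2**30
        print(f"[{tag}] cuda:{device} allocated={allocated:.2f} GiB reserved={reserved:.2f} GiB")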

To Reproduce

import torch
import torch.nn as nn
from torch.nn import MultiheadAttention, LayerNorm
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
import pdb
import copy
BATCH_SIZE = 16
TENSOR_SHAPE = [32, 2752, 256]
NUM_GPUS = 2
class testDataset(Dataset):
    def __init__(self):
        self.test_tensor = torch.rand(size = TENSOR_SHAPE)
    def __len__(self):
        return 32
    def __getitem__(self, idx):
        return self.test_tensor[idx]

class testDataModule(pl.LightningDataModule):
    def __init__(self):
        super().__init__()
    def setup(self, stage = None):
        self.train_set = testDataset()
    def train_dataloader(self):
        return DataLoader(self.train_set, BATCH_SIZE)
class AttentionModule(pl.LightningModule):
    def __init__(self, d_model = 256, nhead = 4, dropout = 0.2, num_layers = 12):
        super(AttentionModule, self).__init__()
        self.stacked_attn = nn.ModuleList([copy.deepcopy(MultiheadAttention(d_model, nhead, dropout)) for i in range(num_layers)])
        self.num_layers = num_layers
    def forward(self, tgt):
        tgt_mask = torch.randint(high = 2, size = [tgt.shape[0], tgt.shape[0]], device = self.device)
        tgt_mask = (tgt_mask.float().masked_fill(tgt_mask == 0, float("-inf")).masked_fill(tgt_mask == 1, float(0.0)))
        output = tgt
        #pdb.set_trace()
        for i, mod in enumerate(self.stacked_attn):
            output = mod(output, output, output, attn_mask = tgt_mask)[0]
        return output
    
    def training_step(self, batch, batch_idx):
        result = self(batch.transpose(0, 1))
        return torch.sum(result)
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr = 5e-4)

def run_trainer(accelerator = 'dp'):
    attention_model = AttentionModule()
    trainer = pl.Trainer(accelerator = accelerator, gpus = NUM_GPUS, fast_dev_run = True)
    data_module = testDataModule()
    trainer.fit(attention_model, data_module)
if __name__ == "__main__":
    run_trainer(accelerator = 'dp')
    run_trainer(accelerator = 'ddp')

Console Output:

-----------DP Output-----------
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Running in fast_dev_run mode: will run a full train, val and test loop using 1 batch(es).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Set SLURM handle signals.

  | Name         | Type       | Params
--------------------------------------------
0 | stacked_attn | ModuleList | 3.2 M 
--------------------------------------------
3.2 M     Trainable params
0         Non-trainable params
3.2 M     Total params
12.632    Total estimated model params size (MB)
/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py:103: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  f'The dataloader, {name}, does not have many workers which may be a bottleneck.'
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.95s/it, loss=2.94, v_num=]
-----------DDP Output-----------
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Running in fast_dev_run mode: will run a full train, val and test loop using 1 batch(es).
-----------DP Output-----------
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Running in fast_dev_run mode: will run a full train, val and test loop using 1 batch(es).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Set SLURM handle signals.

  | Name         | Type       | Params
--------------------------------------------
0 | stacked_attn | ModuleList | 3.2 M 
--------------------------------------------
3.2 M     Trainable params
0         Non-trainable params
3.2 M     Total params
12.632    Total estimated model params size (MB)
/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py:103: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 128 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  f'The dataloader, {name}, does not have many workers which may be a bottleneck.'
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.32s/it, loss=5.98, v_num=]
-----------DDP Output-----------
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 2 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Set SLURM handle signals.
Set SLURM handle signals.

  | Name         | Type       | Params
--------------------------------------------
0 | stacked_attn | ModuleList | 3.2 M 
--------------------------------------------
3.2 M     Trainable params
0         Non-trainable params
3.2 M     Total params
12.632    Total estimated model params size (MB)
Epoch 0:   0%|                                                                                                                                             | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
Traceback (most recent call last):
  File "polygen/training/reproducible_error.py", line 55, in <module>
  File "/coc/pskynet2/aahluwalia30/refactored-polygen/polygen/training/reproducible_error.py", line 55, in <module>
    run_trainer(accelerator = 'ddp')
  File "polygen/training/reproducible_error.py", line 50, in run_trainer
    trainer.fit(attention_model, data_module)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    run_trainer(accelerator = 'ddp')
  File "/coc/pskynet2/aahluwalia30/refactored-polygen/polygen/training/reproducible_error.py", line 50, in run_trainer
    self._run(model)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
    trainer.fit(attention_model, data_module)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self.dispatch()
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
    self._run(model)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
    self.accelerator.start_training(self)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
    self.dispatch()
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
    self.accelerator.start_training(self)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    return self.run_train()
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
    self.training_type_plugin.start_training(trainer)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self.train_loop.run_training_epoch()
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
    self._results = trainer.run_stage()
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
    return self.run_train()
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
    self.train_loop.run_training_epoch()
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 442, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
    optimizer.step(closure=optimizer_closure)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 442, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 329, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 193, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/optim/adam.py", line 66, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
    loss = closure()
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 733, in train_step_and_backward_closure
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 329, in optimizer_step
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in training_step_and_backward
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 193, in optimizer_step
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    training_step_output = self.trainer.accelerator.training_step(args)
    return func(*args, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/optim/adam.py", line 66, in step
    loss = closure()
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 733, in train_step_and_backward_closure
    return self.training_type_plugin.training_step(*args)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 337, in training_step
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in training_step_and_backward
    return self.model(*args, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
    output = self.module(*inputs[0], **kwargs[0])
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    training_step_output = self.trainer.accelerator.training_step(args)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
    result = self.forward(*input, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 46, in forward
    return self.training_type_plugin.training_step(*args)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 337, in training_step
    output = self.module.training_step(*inputs, **kwargs)
  File "polygen/training/reproducible_error.py", line 41, in training_step
    result = self(batch.transpose(0, 1))
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    return self.model(*args, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "polygen/training/reproducible_error.py", line 37, in forward
    output = mod(output, output, output, attn_mask = tgt_mask)[0]
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    result = self.forward(*input, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 985, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    attn_mask=attn_mask)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/functional.py", line 4314, in multi_head_attention_forward
    result = self.forward(*input, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 46, in forward
    output = self.module.training_step(*inputs, **kwargs)
  File "/coc/pskynet2/aahluwalia30/refactored-polygen/polygen/training/reproducible_error.py", line 41, in training_step
    result = self(batch.transpose(0, 1))
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/coc/pskynet2/aahluwalia30/refactored-polygen/polygen/training/reproducible_error.py", line 37, in forward
    output = mod(output, output, output, attn_mask = tgt_mask)[0]
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    attn_output_weights = dropout(attn_output_weights, p=dropout_p, training=training)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/functional.py", line 983, in dropout
    result = self.forward(*input, **kwargs)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 985, in forward
    else _VF.dropout(input, p, training))
RuntimeError: CUDA out of memory. Tried to allocate 1.81 GiB (GPU 0; 44.56 GiB total capacity; 41.27 GiB already allocated; 353.31 MiB free; 41.64 GiB reserved in total by PyTorch)
    attn_mask=attn_mask)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/functional.py", line 4314, in multi_head_attention_forward
    attn_output_weights = dropout(attn_output_weights, p=dropout_p, training=training)
  File "/nethome/aahluwalia30/anaconda3/envs/polygen-env/lib/python3.7/site-packages/torch/nn/functional.py", line 983, in dropout
    else _VF.dropout(input, p, training))
RuntimeError: CUDA out of memory. Tried to allocate 1.81 GiB (GPU 1; 44.56 GiB total capacity; 41.25 GiB already allocated; 407.31 MiB free; 41.62 GiB reserved in total by PyTorch)

Save this script as memory_error.py and run python memory_error.py on any machine with 2+ GPUs, each with more than 40 GB of memory. The GPU model I am using is the NVIDIA A40, which has roughly 45 GB of memory.

Expected behavior

Both DP and DDP should use similar amounts of memory to run this training loop, yet DDP uses significantly more.

Environment

  • PyTorch Lightning Version (e.g., 1.3.0): 1.3.8
  • PyTorch Version (e.g., 1.8): 1.7.1
  • Python version: 3.7.10
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: NVIDIA A40 GPUs
  • How you installed PyTorch (conda, pip, source): conda
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

@anshulcgm anshulcgm added bug Something isn't working help wanted Open to be worked on labels Aug 10, 2021
@Borda Borda added distributed Generic distributed-related topic strategy: dp (removed in pl) DataParallel labels Aug 10, 2021
@awaelchli (Contributor)

Hey @anshulcgm, when you run with DDP you need to divide the batch size by NUM_GPUS:

if __name__ == "__main__":
    # in DP, the batch will be split across the GPUs and then sent to each GPU
    run_trainer(accelerator = 'dp')

    # in DDP, the batch size is PER GPU; batches get processed by each worker individually
    BATCH_SIZE = BATCH_SIZE // NUM_GPUS
    run_trainer(accelerator = 'ddp')
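
As a rough sanity check on the memory gap (a back-of-the-envelope sketch based on the shapes in the reproduction script, not measured output): MultiheadAttention materializes an attention-weight tensor of shape [batch * nhead, seq, seq]. With the full per-GPU batch of 16 that DDP sees here, that single tensor matches the ~1.81 GiB allocation reported in the OOM, while DP's per-GPU split of 8 needs half as much:

# Back-of-the-envelope size of one attention-weight tensor (float32),
# using the shapes from the script above (nhead = 4, seq = 2752).
def attn_weights_gib(batch, nhead = 4, seq = 2752, bytes_per_elem = 4):
    return batch * nhead * seq * seq * bytes_per_elem / 2**30

print(attn_weights_gib(batch = 16))  # DDP: full batch per GPU      -> ~1.81 GiB per layer
print(attn_weights_gib(batch = 8))   # DP: batch of 16 split 2 ways -> ~0.90 GiB per layer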

@awaelchli awaelchli added question Further information is requested waiting on author Waiting on user action, correction, or update and removed bug Something isn't working labels Aug 11, 2021
@johnwlambert

Thanks for raising this issue Anshul, and thanks @awaelchli for the response.

In my humble opinion, I think most people in optimization or machine learning would think of the batch size in SGD to be the number of total examples used for a single gradient update.

This is also how DDP treats the batch_size param in torch.utils.data.DataLoader.

So I think this would be a good thing to consider changing in the next Lightning release :-)

@awaelchli (Contributor)

In my humble opinion, I think most people in optimization or machine learning would think of the batch size in SGD to be the number of total examples used for a single gradient update.

That's unfortunately not how it is implemented in PyTorch. There is a dataloader per process, and the distributed sampler partitions the samples. The batch size is per process, and the model's forward() gets exactly a batch of that size. I think your argument can also easily be turned around in favor of the current behavior with respect to scaling models to many GPUs.

Also note that DDP and its variants are not the outliers in this behavior. It's actually DP that is the only parallel plugin behaving the way you describe.
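
To make the per-process semantics concrete, here is a minimal standalone sketch in plain PyTorch (my own illustration, not from this thread), passing rank and world size explicitly so it runs without spawning processes:

# Each DDP process builds its own DataLoader; DistributedSampler hands every
# rank a disjoint shard of the dataset, so batch_size is per process and the
# effective global batch is batch_size * world_size.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(32))

def loader_for(rank, world_size = 2, batch_size = 16):
    sampler = DistributedSampler(dataset, num_replicas = world_size, rank = rank, shuffle = False)
    return DataLoader(dataset, batch_size = batch_size, sampler = sampler)

for rank in range(2):
    batch = next(iter(loader_for(rank)))[0]
    print(f"rank {rank}: {batch.numel()} samples")  # 16 samples each -> 32 per optimizer step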

this is also what how DDP treats the batch_size param in torch.utils.data.DataLoader.

Can you elaborate on this part? I don't understand what you mean.

@stale

stale bot commented Sep 12, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Sep 12, 2021
@stale stale bot closed this as completed Sep 19, 2021