Please allow different batch sizes per gpu in ddp #15573
Comments
Can I ask a question: these days, if I set batch_size=1 with strategy="ddp" and gpus=4 in the Trainer, will the total batch size be 4 or 1?
4
Hey @mosheliv, technically nothing prevents you from doing this directly within your dataloader by providing a different batch_size. However, you would need a custom distributed sampler to take this into account. The distributed sampler would need to ensure that each batch is seen by exactly one process and that the total number of batches across all machines is the same. Best, T.C
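For reference, a minimal sketch of what such a custom distributed sampler could look like, assuming a hypothetical per_rank_batch_size mapping from rank to batch size (e.g. {0: 6, 1: 10}); this is not Lightning's built-in DistributedSampler, just one way to give each rank a disjoint, differently sized slice of every shuffled global batch:

    import torch
    from torch.utils.data import Sampler

    class UnevenDistributedSampler(Sampler):
        # Sketch only: each rank yields a disjoint slice of every shuffled
        # "global batch", and all ranks run the same number of batches.
        def __init__(self, dataset, per_rank_batch_size, rank, seed=0):
            self.dataset = dataset
            self.sizes = per_rank_batch_size       # hypothetical, e.g. {0: 6, 1: 10}
            self.rank = rank
            self.seed = seed
            self.epoch = 0
            self.global_batch = sum(self.sizes.values())           # e.g. 16
            self.num_batches = len(dataset) // self.global_batch   # same on every rank

        def set_epoch(self, epoch):
            # meant to be called once per epoch so every rank shuffles identically
            self.epoch = epoch

        def __iter__(self):
            g = torch.Generator().manual_seed(self.seed + self.epoch)
            indices = torch.randperm(len(self.dataset), generator=g).tolist()
            # offset of this rank's slice inside each global batch
            offset = sum(self.sizes[r] for r in sorted(self.sizes) if r < self.rank)
            for b in range(self.num_batches):
                start = b * self.global_batch + offset
                yield from indices[start:start + self.sizes[self.rank]]

        def __len__(self):
            return self.num_batches * self.sizes[self.rank]

Each rank would then build its DataLoader with batch_size=per_rank_batch_size[rank] and sampler=UnevenDistributedSampler(...) (no shuffle flag), so every process runs the same number of batches per epoch while no sample is shared between ranks.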
Thank you for your prompt reply, much appreciated.
I think this is a pretty basic operation, so an example should be in place in the "training on multi GPU" docs, or at least a hint for the uninitiated. I was honestly surprised that this was the behaviour. A bit more clarification about the batch size is also needed: I have seen at least four people ask the same question, namely if I specify batch size 1 and I have 4 GPUs with DDP, will the batch size be 1 or 4? Perhaps a few informative messages?
Specifically for me, we are talking about one machine with two GPUs.
If I understand correctly (and I probably don't), currently each GPU gets its own dataloader. How does the current implementation solve the uniqueness problem? What is the difference between having batches with the same size or not in this respect?
Hey @mosheliv,
When using DDP, each process associated with its GPU loads its own batches, and therefore the batch_size is local. The total batch size is batch_size * world_size. This is the default behaviour in PyTorch and we kept it this way.
When using DP, however, as there is a single process, it loads a single batch and scatters it across all GPUs.
Regarding "one machine with two GPUs": yes, for your use case with DDP there are 2 processes running, as you have 2 GPUs.
If you provide a different batch_size between ranks, the ranks are going to see duplicated data. But it might not be a problem from a convergence point of view if your dataset is large and you use the same batch size in validation and test.
I hope it helps a bit.
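To make the arithmetic above concrete, a small illustration (an assumption-free restatement of the reply, not part of the original thread; numbers match the question at the top):

    # Under DDP the batch_size given to each DataLoader is per process, so with
    # Trainer(strategy="ddp", gpus=4) and DataLoader(..., batch_size=1) every
    # optimizer step effectively covers batch_size * world_size samples.
    world_size = 4        # one DDP process per GPU
    batch_size = 1        # local batch size loaded by each process
    effective_batch_size = batch_size * world_size
    print(effective_batch_size)  # 4, matching the answer "4" earlier in the thread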
Just to understand: so it can happen (although unlikely) that the same image appears twice in a batch on DDP?
DP is causing me errors with gathering statistics that I can't understand, and it seems to be heavily discouraged by Lightning, so I stopped trying. It also seems to scatter the batch across multiple GPUs symmetrically.
Is there any other strategy that supports what I am looking for? I couldn't find out whether Bagua has this option.
I am currently losing a substantial chunk of GPU RAM because of this, which is a shame.
On a more pragmatic note, assuming this is not implemented in any strategy, can you please help with some of the details of the implementation and tell me if I am doing something horribly wrong?
So, if I have code like this:

    class LitModel(LightningModule):
        def train_dataloader(self):
            loader = DataLoader(train_ds, batch_size=self.BS, shuffle=True, num_workers=14)
            return loader

I could, for example, change it to:

    class LitModel(LightningModule):
        def train_dataloader(self):
            # my_gpu() would return the id of the GPU attached to this process
            loader = DataLoader(train_ds, batch_size=self.BS[my_gpu()], shuffle=True, num_workers=14)
            return loader

and my dataset will return the relevant part of the batch, i.e. the overall batch will be 16, but my dataset, according to the GPU it is on, will return the first 6 samples of the batch for gpu0 and the other 10 for gpu1. I will probably need the shuffling to be synchronized, but that is doable.
Questions:
Will this work? Are the gradients accumulated from all processes and then processed, or are they processed in every process and then somehow averaged (an average of averages)?
How do I find which GPU was allocated to the current process in DDP?
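The thread leaves these questions unanswered, but as a hedged aside: PyTorch's DistributedDataParallel averages gradients across all processes with an all-reduce after each backward pass, and inside a LightningModule the current process's rank is exposed as self.global_rank (with self.local_rank giving the GPU index on the node). A sketch of the dataloader above using that property follows; self.BS and train_ds are the hypothetical names from the snippet, and a custom sampler such as the one sketched earlier would still be needed to keep the per-rank shards disjoint:

    from torch.utils.data import DataLoader
    from pytorch_lightning import LightningModule

    class LitModel(LightningModule):
        def train_dataloader(self):
            # self.BS is assumed to map rank -> batch size, e.g. {0: 6, 1: 10}
            per_rank_bs = self.BS[self.global_rank]
            # shuffle is dropped: a custom sampler (see sketch further up) would
            # control both shuffling and which indices this rank sees
            return DataLoader(train_ds, batch_size=per_rank_bs, num_workers=14)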
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
🚀 Feature
I propose accepting, instead of a single batch size, a dictionary with a batch size per GPU, for example {"cuda0": 4, "cuda1": 6}.
Motivation
I have a GV100 (32 GB) and a 3090 (24 GB). Using the current multi-GPU strategies, I can only use 24 GB of memory on the GV100.
Pitch
explained above
Alternatives
Automatic batch size scaling for multi-GPU would be really nice as well, with a different batch size on different GPUs.
Additional context