Please allow different batch sizes per gpu in ddp #15573
Comments
Can I ask a question: these days, if I set batch_size=1 with strategy="ddp" and gpus=4 in the Trainer, will the total batch size be 4 or 1?
4
Hey @mosheliv, technically nothing prevents you from doing this directly within your dataloader by providing a different batch_size. However, you would need a custom distributed sampler to take this into account. The distributed sampler would need to ensure that each batch is seen by exactly one process and that the total number of batches across all machines is the same. Best, T.C
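For reference, a minimal sketch of what such a custom distributed sampler could look like, assuming a hypothetical per_rank_batch_size mapping from rank to batch size (e.g. {0: 6, 1: 10}); this is not Lightning's built-in DistributedSampler, just one way to give each rank a disjoint, differently sized slice of every shuffled global batch:

    import torch
    from torch.utils.data import Sampler

    class UnevenDistributedSampler(Sampler):
        # Sketch only: each rank yields a disjoint slice of every shuffled
        # "global batch", and all ranks run the same number of batches.
        def __init__(self, dataset, per_rank_batch_size, rank, seed=0):
            self.dataset = dataset
            self.sizes = per_rank_batch_size       # hypothetical, e.g. {0: 6, 1: 10}
            self.rank = rank
            self.seed = seed
            self.epoch = 0
            self.global_batch = sum(self.sizes.values())           # e.g. 16
            self.num_batches = len(dataset) // self.global_batch   # same on every rank

        def set_epoch(self, epoch):
            # meant to be called once per epoch so every rank shuffles identically
            self.epoch = epoch

        def __iter__(self):
            g = torch.Generator().manual_seed(self.seed + self.epoch)
            indices = torch.randperm(len(self.dataset), generator=g).tolist()
            # offset of this rank's slice inside each global batch
            offset = sum(self.sizes[r] for r in sorted(self.sizes) if r < self.rank)
            for b in range(self.num_batches):
                start = b * self.global_batch + offset
                yield from indices[start:start + self.sizes[self.rank]]

        def __len__(self):
            return self.num_batches * self.sizes[self.rank]

Each rank would then build its DataLoader with batch_size=per_rank_batch_size[rank] and sampler=UnevenDistributedSampler(...) (no shuffle flag), so every process runs the same number of batches per epoch while no sample is shared between ranks.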
Thank you for your prompt reply, much appreciated.
I think this is a pretty basic operation, so an example should be in place in the "training on multi GPU" docs, or at least a hint for the uninitiated. I was honestly surprised that this was the behaviour. A bit more clarification about the batch size is also needed: I have seen at least four people ask the same question, namely if I specify batch size 1 and I have 4 GPUs with DDP, will the batch size be 1 or 4? Perhaps a few informative messages?
Specifically for me, we are talking about one machine with two GPUs.
If I understand correctly (and I probably don't), currently each GPU gets its own dataloader. How does the current implementation solve the uniqueness problem? What is the difference between having batches with the same size or not in this respect?
Hey @mosheliv,
When using DDP, each process associated with its GPU loads its own batches, and therefore the batch_size is local. The total batch size is batch_size * world_size. This is the default behaviour in PyTorch and we kept it this way.
When using DP, however, as there is a single process, it loads a single batch and scatters it across all GPUs.
Regarding "one machine with two GPUs": yes, for your use case with DDP there are 2 processes running, as you have 2 GPUs.
If you provide a different batch_size between ranks, the ranks are going to see duplicated data. But it might not be a problem from a convergence point of view if your dataset is large and you use the same batch size in validation and test.
I hope it helps a bit.
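To make the arithmetic above concrete, a small illustration (an assumption-free restatement of the reply, not part of the original thread; numbers match the question at the top):

    # Under DDP the batch_size given to each DataLoader is per process, so with
    # Trainer(strategy="ddp", gpus=4) and DataLoader(..., batch_size=1) every
    # optimizer step effectively covers batch_size * world_size samples.
    world_size = 4        # one DDP process per GPU
    batch_size = 1        # local batch size loaded by each process
    effective_batch_size = batch_size * world_size
    print(effective_batch_size)  # 4, matching the answer "4" earlier in the thread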
Just to understand: so it can happen (although unlikely) that the same image appears twice in a batch on DDP?
DP is causing me errors with gathering statistics that I can't understand, and it seems to be heavily discouraged by Lightning, so I stopped trying. It also seems to scatter the batch across multiple GPUs symmetrically.
Is there any other strategy that supports what I am looking for? I couldn't find out whether Bagua has this option.
I am currently losing a substantial chunk of GPU RAM because of this, which is a shame.
On a more pragmatic note, assuming this is not implemented in any strategy, can you please help with some of the details of the implementation and tell me if I am doing something horribly wrong?
So, if I have code like this:

    class LitModel(LightningModule):
        def train_dataloader(self):
            loader = DataLoader(train_ds, batch_size=self.BS, shuffle=True, num_workers=14)
            return loader

I could, for example, change it to:

    class LitModel(LightningModule):
        def train_dataloader(self):
            # my_gpu() would return the id of the GPU attached to this process
            loader = DataLoader(train_ds, batch_size=self.BS[my_gpu()], shuffle=True, num_workers=14)
            return loader

and my dataset will return the relevant part of the batch, i.e. the overall batch will be 16, but my dataset, according to the GPU it is on, will return the first 6 samples of the batch for gpu0 and the other 10 for gpu1. I will probably need the shuffling to be synchronized, but that is doable.
Questions:
Will this work? Are the gradients accumulated from all processes and then processed, or are they processed in every process and then somehow averaged (an average of averages)?
How do I find which GPU was allocated to the current process in DDP?
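The thread leaves these questions unanswered, but as a hedged aside: PyTorch's DistributedDataParallel averages gradients across all processes with an all-reduce after each backward pass, and inside a LightningModule the current process's rank is exposed as self.global_rank (with self.local_rank giving the GPU index on the node). A sketch of the dataloader above using that property follows; self.BS and train_ds are the hypothetical names from the snippet, and a custom sampler such as the one sketched earlier would still be needed to keep the per-rank shards disjoint:

    from torch.utils.data import DataLoader
    from pytorch_lightning import LightningModule

    class LitModel(LightningModule):
        def train_dataloader(self):
            # self.BS is assumed to map rank -> batch size, e.g. {0: 6, 1: 10}
            per_rank_bs = self.BS[self.global_rank]
            # shuffle is dropped: a custom sampler (see sketch further up) would
            # control both shuffling and which indices this rank sees
            return DataLoader(train_ds, batch_size=per_rank_bs, num_workers=14)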
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
🚀 Feature
I propose accepting, instead of a single batch size, a dictionary with a batch size per GPU, for example {"cuda0": 4, "cuda1": 6}.
Motivation
I have a GV100 (32 GB) and a 3090 (24 GB). Using the current multi-GPU strategies, I can only use 24 GB of memory on the GV100.
Pitch
explained above
Alternatives
Automatic batch size scaling for multi-GPU would be really nice as well, with a different batch size on different GPUs.
Additional context