Multi GPU training (ddp) gets very slow when using list of tensors in Dataset #1925
Comments
Did you try to compare the dataloaders from each version? Do they have the same settings?
Yes, the dataloaders are the same. They are plain PyTorch dataloaders and use the same number of workers etc. The issue seems to be that the distributed multiprocessing opens a file pointer for each tensor in the list, and accessing all those 20,000 file pointers degrades performance. This is also the reason why you need to increase `ulimit -n` when running the list version. You can see that in the list version starting each epoch takes a long time, so the data loading just takes much longer.
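As a way to check this file-pointer hypothesis (a sketch of mine, not code from the linked repository), you can count the process's open file descriptors before and after moving a list of tensors into shared memory, which is what torch.multiprocessing does when tensors are passed between processes:

```python
import os

import torch
import torch.multiprocessing as mp


def open_fds():
    # Linux-only: each entry in /proc/self/fd is one open file descriptor.
    return len(os.listdir("/proc/self/fd"))


print("sharing strategy:", mp.get_sharing_strategy())  # usually "file_descriptor"
print("fds before:", open_fds())

# Moving each tensor of a list into shared memory typically keeps one
# descriptor open per shared storage under the "file_descriptor" strategy,
# which is why a 20,000-element list can run into `ulimit -n`.
data = [torch.rand(3, 224, 224) for _ in range(1000)]
for t in data:
    t.share_memory_()

print("fds after:", open_fds())
```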
Thanks for bringing this up. We'll take a look at that example! Just at a glance, you have not disabled the logger in Lightning and are not using a logger in the PT version. Take a look at our parity test in the meantime.
Thanks for taking a look. You are right that I did not disable the logging, and PyTorch Lightning is also saving a checkpoint which my other code does not. The important thing for me, however, is not that my custom implementation is marginally faster, but that the PyTorch Lightning version gets very slow when I change nothing except the dataset being a list of tensors rather than one big tensor. The biggest notable difference in the implementation is that in my custom implementation I start 4 separate processes (one per GPU) manually, while PyTorch Lightning uses PyTorch multiprocessing spawn here. This in turn uses shared memory for tensors, which in the case of a list of tensors creates many memory file pointers (one for each tensor), and that seems to be the reason for this big slowdown. I thus suspect it could be a general fallout of PyTorch multiprocessing and be unrelated to Lightning, in which case I can also create an issue over there. Thanks for helping me out!
So, you start 1 process per GPU individually and then run the script in each? What are you using to train? Yeah, probably better to ask at PyTorch about this then.
I am using ddp, but instead of starting all the processes with spawn, I start them manually: `python custom.py --world_size 4 --rank 0`
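The custom.py referenced here lives in the repository linked below and is not reproduced in this thread; as a rough sketch, a manually launched per-GPU worker typically looks like the following (argument names and the address/port are illustrative, not the actual script):

```python
import argparse

import torch
import torch.distributed as dist

# Illustrative sketch of a manually launched DDP worker: one copy of this
# script is started per GPU, each with its own --rank.
parser = argparse.ArgumentParser()
parser.add_argument("--world_size", type=int, required=True)
parser.add_argument("--rank", type=int, required=True)
args = parser.parse_args()

# All ranks rendezvous at the same address/port instead of being spawned
# from a single parent process.
dist.init_process_group(
    backend="nccl",
    init_method="tcp://127.0.0.1:23456",
    world_size=args.world_size,
    rank=args.rank,
)

torch.cuda.set_device(args.rank)  # one GPU per process on a single machine
model = torch.nn.Linear(10, 10).cuda(args.rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.rank])
# ... build dataloaders with a DistributedSampler and run the training loop ...
```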
I observe the same when using multiple workers. But this is a common problem with PyTorch when you use ddp + dataloader workers. Try your script again?
Hi William, in all scenarios I am using ddp + multiple dataloader workers. The only change is in the Dataset:

```python
import torch
from torch.utils.data import Dataset


class FakeDataset(Dataset):
    def __init__(self, use_lists=False):
        self.length = 20480
        if use_lists:
            # 20480 separate tensors held in a Python list
            self.list = []
            for i in range(self.length):
                self.list.append(torch.rand((3, 224, 224)))
        else:
            # one big tensor holding all 20480 samples
            self.list = torch.rand((self.length, 3, 224, 224))
```

With `use_lists=True` the run is much slower than with `use_lists=False`.
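For context, a self-contained sketch of how such a dataset is consumed in the ddp + workers setup being discussed (my own illustration; the `__len__`/`__getitem__` methods and the loader settings are assumptions, since only `__init__` is shown above):

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler


class ListDataset(Dataset):
    """Toy stand-in for FakeDataset(use_lists=True) with the assumed accessors."""

    def __init__(self, length=20480):
        self.items = [torch.rand(3, 224, 224) for _ in range(length)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]


def make_loader(dataset):
    # Called inside each DDP process after init_process_group: every process
    # samples its own shard, and its DataLoader workers read the in-memory
    # tensors; this is the access pattern where the per-tensor file pointers
    # show up.
    sampler = DistributedSampler(dataset)
    return DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=4)
```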
I see, I need to run your script. But yeah, I think the problem is that spawn slows everything down. However, you can still use Lightning the way you are doing by running each script individually and setting MASTER_ADDR etc. yourself. Is this on SLURM? GCP? AWS?
This is on custom hardware, an Ubuntu server with 4 x GeForce RTX 2080 Ti. I tried it with num_workers = 0 on 2 GPUs, and in that case the difference between lists and no lists is non-existent. However, in real life my dataloaders do a lot more work, and having no workers slows things down considerably. So by using these environment variables I can start 4 independent processes, like in my custom.py implementation?
Yeah, that's what I thought. The lists aren't the problem. The problem seems to be that PyTorch spawn + num_workers does not play well together... Set those vars and start the processes yourself. I ask because on SLURM clusters submitted with Lightning we set it up properly, so each script runs in its own process allocated by the SLURM scheduler. The resolution from this PR is to add a note about .spawn + num_workers. (PS: PL adds about 600-1000 ms of overhead per epoch.)
Can you give me some pointers regarding the environment variables? I couldn't find much in the docs. I used this: `MASTER_ADDR=127.0.0.1 MASTER_PORT=8899 LOCAL_RANK=0 NODE_RANK=0 WORLD_SIZE=2 python minimal.py` But it starts training without waiting for me to start a second process, and as far as I understand, training on multiple GPUs with one process will be much slower than having one process for each GPU, right? I would expect that passing WORLD_SIZE=2 (while only passing a single GPU) would indicate that I will start another process which trains on the second GPU?
Local rank is the process index (1 per GPU). In your case (1 machine, 4 GPUs): world size = 4.
OK, I would start the script like this:
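A sketch of the launch pattern being described, assuming one process per local GPU that shares MASTER_ADDR/MASTER_PORT/WORLD_SIZE and differs only in LOCAL_RANK (the small launcher below is illustrative, not the exact commands from this comment):

```python
import os
import subprocess
import sys

# Illustrative launcher: start one training process per local GPU, each with
# its own LOCAL_RANK, all sharing MASTER_ADDR / MASTER_PORT / WORLD_SIZE.
WORLD_SIZE = 4
procs = []
for local_rank in range(WORLD_SIZE):
    env = dict(
        os.environ,
        MASTER_ADDR="127.0.0.1",
        MASTER_PORT="8899",
        NODE_RANK="0",
        LOCAL_RANK=str(local_rank),
        WORLD_SIZE=str(WORLD_SIZE),
    )
    procs.append(subprocess.Popen([sys.executable, "minimal.py"], env=env))

for p in procs:
    p.wait()
```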
Thanks, that's what I thought with the ranks and world size, just how I use it in my custom setup without Lightning, but unfortunately it doesn't work like this in the Lightning context. When I enter the first command (note that I don't need to specify distributed_backend, as it's already set to ddp in my script), it already starts training with 4 GPUs. So I cannot even start the next command (with LOCAL_RANK=1), as the first one is not waiting for another process. The same happens if I limit the GPUs. Any further ideas on this?
Fixed on master.
Will re-open if still a problem |
This is working well now. The only thing is that it's not using shared memory anymore, so the memory usage (not of the GPU, but of the system itself) is 4 times higher (duplicated for each process). Thank you for taking the time to fix this!
Relevant PR: #2029
🐛 Bug
We are migrating to PyTorch Lightning from a custom implementation that previously used Torchbearer.
Our dataset stores a list of PyTorch tensors in memory, because the tensors all have different dimensions. After migrating to PyTorch Lightning, training in the multi-GPU setup slows down very significantly (it takes twice as long as before!).
To Reproduce
I built a repository with just random data and a straightforward architecture to reproduce this, both with PyTorch Lightning (minimal.py) and without it (custom.py).
The repository and further details are located here: https://github.com/mpaepper/pytorch_lightning_multi_gpu_speed_analysis
When training with PyTorch Lightning for only 10 epochs, it takes 105 seconds when using a big PyTorch tensor (without a list), but it increases to 310 seconds (3x slower) when using a list of tensors.
The data size and model are exactly the same; the only difference is how the data is stored.
When using my custom implementation, no such effect is observed (it takes 97-98 seconds with or without lists).
To run the PyTorch Lightning version, use minimal.py from the repository above.
One important thing to note: when using the list approach, it seems that every tensor of that list is stored behind a separate in-memory file pointer, so you might need to increase your open-file limit:
ulimit -n 99999
It seems that this is the issue: the DataLoader gets very slow because it needs to read so many files?
Is there a way around this?
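Two possible workarounds (suggestions based on PyTorch's multiprocessing documentation, not something confirmed in this issue): raise the open-file limit programmatically instead of via the shell, or switch the tensor sharing strategy so that shared tensors are backed by named files rather than one open descriptor each:

```python
import resource

import torch.multiprocessing as mp

# Option 1: raise the soft open-file limit up to the hard limit from inside
# the script, the programmatic equivalent of `ulimit -n`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Option 2: back shared tensors with named shared-memory files instead of
# keeping one file descriptor open per tensor (watch out for leaked files
# if a process dies without cleaning up).
mp.set_sharing_strategy("file_system")
```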
Code sample
See https://github.com/mpaepper/pytorch_lightning_multi_gpu_speed_analysis
Expected behavior
I would expect the same dataset stored as a list of tensors to also train quickly.
Environment