
Run locally on multiple GPUs #29

Open
maximotus opened this issue Jan 9, 2024 · 3 comments

Comments

@maximotus

Hello,

great work! What are the minimal adaptations I need to apply to the code so that I can run the narrator on multiple GPUs locally?
nn.DataParallel is not optimal since I would need to adapt the model classes.

Cheers,
Max

@zhaoyue-zephyrus
Contributor

Hi @maximotus

Could you make the question clearer? If you are referring to running "inference", then you don't need parallelism at all. If you mean running a "training" job, we use torch.nn.parallel.DistributedDataParallel.
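
For training, a minimal DistributedDataParallel setup looks roughly like this (sketch only, with a placeholder model and loop, not our actual training code):

```python
# Minimal DistributedDataParallel sketch (placeholder model/data).
# Launch with: torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda(local_rank)  # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(10):  # stand-in training loop
        x = torch.randn(8, 512, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```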

@maximotus
Author

maximotus commented Jan 11, 2024

Hi @zhaoyue-zephyrus,

Sure. I was trying to run your demo script.
My goal is to produce captions for short video clips of 1 second.
python demo_narrator.py --video-path "../path/to/my/video"
If I do so without the --cuda flag, it works, but it needs about 70 seconds of inference time per clip with nucleus sampling (k=10) on my device.
So I wanted to speed this up using GPU(s).

But if I pass the --cuda flag, so the command becomes python demo_narrator.py --cuda --video-path "../path/to/my/video", my 10 GB GPU is not enough and I get a RuntimeError: CUDA error: out of memory.

However, I thought enabling parallelism could solve this since I have 4 GPUs with 10 GB each available.
But I could not manage to make this work with your code easily.

So now I am wondering how I can run the inference on more than one GPU so that I do not get a RuntimeError: CUDA error: out of memory.

I tried wrapping your model with torch.nn.DataParallel after line 57. In this case, I can observe that the model weights are being distributed across 2 GPUs, but when it comes to the specific function calls in line 74 and line 75, it fails, since models wrapped with torch.nn.DataParallel can then only call the default forward method (compare https://discuss.pytorch.org/t/dataparallel-model-with-custom-functions/75053/10).

So I was thinking about adapting your code for this (e.g., routing the custom methods like encode_image and generate through the default forward method, with a case selection inside forward).
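
Roughly, the adaptation I have in mind looks like this (just a sketch, assuming the model exposes encode_image and generate as in the demo script):

```python
import torch
import torch.nn as nn

class ParallelWrapper(nn.Module):
    """Routes custom methods through forward() so nn.DataParallel can scatter the inputs."""

    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model

    def forward(self, *args, mode="encode_image", **kwargs):
        # nn.DataParallel only replicates forward(), so dispatch on a keyword here.
        if mode == "encode_image":
            return self.model.encode_image(*args, **kwargs)
        if mode == "generate":
            return self.model.generate(*args, **kwargs)
        raise ValueError(f"unknown mode: {mode}")

# Hypothetical usage, mirroring the demo script:
# model = nn.DataParallel(ParallelWrapper(model), device_ids=[0, 1, 2, 3]).cuda()
# image_features = model(frames, mode="encode_image")
# tokens = model(image_features, mode="generate")
```

Though I realize DataParallel still keeps a full copy of the model on every GPU, so this alone would probably not fix the out-of-memory error.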

However, I thought it would be good to ask you about this issue first, since I may have overlooked a simpler solution.

Cheers,
Max

@Anirudh257

Hi @maximotus, did you figure this out?

You will need to use at least a 20 GB GPU.

If not, I think you will need model parallelism. Look into https://www.deepspeed.ai/tutorials/pipeline/ for pipeline-based model parallelism.
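
To illustrate the idea (plain PyTorch, not the DeepSpeed API, and not the actual LaViLa model): model parallelism means placing different parts of the model on different GPUs and moving activations between them, e.g.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy two-stage model split across cuda:0 and cuda:1."""

    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(512, 2048), nn.GELU()).to("cuda:0")
        self.stage1 = nn.Linear(2048, 512).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        x = self.stage1(x.to("cuda:1"))  # hand activations off to the second GPU
        return x

model = TwoGPUModel()
out = model(torch.randn(4, 512))
print(out.device)  # cuda:1
```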
