-
Another issue is that I run four different Spark applications on three GPUs. By reducing the GPU memory allocated to each application (spark.rapids.memory.gpu.allocFraction=0.25), the executors of some applications end up assigned to the same GPU and the tasks still succeed. The runtime goes up, but it is still a good gain compared to running the applications in parallel on CPUs.
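For reference, a minimal sketch of what that setup might look like when building the session in Scala. Only spark.rapids.memory.gpu.allocFraction comes from the comment above; the application name is a placeholder and the value 0.25 is the one used here, not a general recommendation.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the sharing setup described above: shrink the RAPIDS memory pool so
// that executors from several applications can coexist on one GPU.
val spark = SparkSession.builder()
  .appName("shared-gpu-app")                               // hypothetical name
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   // enable the RAPIDS Accelerator
  // 0.25 is the fraction used above so four applications fit on three GPUs;
  // tune it to however many applications must share a single device.
  .config("spark.rapids.memory.gpu.allocFraction", "0.25")
  .getOrCreate()
```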
-
It is possible to have multiple executors share a GPU. The technical limitations vary from cluster to cluster and are mostly about resource constraints. We do it as part of our integration tests in local mode to speed up testing. We can get away with it because the amount of memory each application needs is very small, and because we know exactly what will be using the GPU and take steps to avoid overloading it.

As https://nvidia.github.io/spark-rapids/docs/FAQ.html#why-are-multiple-executors-per-gpu-not-supported points out, Spark does not support scheduling partial GPUs. This means that if you want to use the GPU without asking Spark to hand GPUs out to your processes, you need to disable GPU scheduling. That can work in local mode or even in standalone mode, as long as you can restrict which processes use the GPU in some other way. YARN and Kubernetes have similar limitations, and there is no good way to work around the issue with them.

If you tie some fraction of the GPU to a set amount of CPU cores or CPU memory, it should work out okay, but all of the applications have to agree on how that split works. All of this is rather brittle, so we don't officially support it. If you wanted to work with Apache Spark and us to add partial GPU scheduling, that is something we could officially support.

The best way to work around these issues is to use MIG. It is not perfect, but it is a currently supported way to handle the kind of thing you want to do.
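A rough sketch of the "no GPU scheduling" variant described above, assuming local mode (as in the integration tests) and that the visible GPU is restricted outside of Spark, for example by exporting CUDA_VISIBLE_DEVICES before launch. The config values are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Spark's GPU scheduling is deliberately left off: spark.executor.resource.gpu.amount
// is never set, so Spark does not try to hand out whole GPUs. Which physical GPU the
// process sees is controlled externally, e.g. via CUDA_VISIBLE_DEVICES in the
// environment that launches the application (an assumption about the setup).
val spark = SparkSession.builder()
  .master("local[*]")                                      // local mode, as in the integration tests
  .appName("gpu-sharing-sketch")                           // hypothetical name
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   // enable the RAPIDS Accelerator
  // Keep the memory pool small so several concurrent applications do not overload the GPU.
  .config("spark.rapids.memory.gpu.allocFraction", "0.1")
  .getOrCreate()
```

As noted in the reply, this only holds together if every application on the node agrees on its share of the GPU; MIG remains the supported way to partition a device.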