-
Another issue is that I run four different Spark applications on three GPUs. By reducing the GPU memory allocated to each application (spark.rapids.memory.gpu.allocFraction=0.25), the executors of some applications end up assigned to the same GPU and the tasks still succeed. The runtime goes up, but it is still a good gain compared to running the applications in parallel on CPUs.
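For reference, a minimal sketch of what that setup might look like when building the session in Scala. Only spark.rapids.memory.gpu.allocFraction comes from the comment above; the application name is a placeholder and the value 0.25 is the one used here, not a general recommendation.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the sharing setup described above: shrink the RAPIDS memory pool so
// that executors from several applications can coexist on one GPU.
val spark = SparkSession.builder()
  .appName("shared-gpu-app")                               // hypothetical name
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   // enable the RAPIDS Accelerator
  // 0.25 is the fraction used above so four applications fit on three GPUs;
  // tune it to however many applications must share a single device.
  .config("spark.rapids.memory.gpu.allocFraction", "0.25")
  .getOrCreate()
```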
-
It is possible to have multiple executors share a GPU. The technical limitations vary from cluster to cluster and are mostly about resource constraints. We do it as part of our integration tests in local mode to speed up testing. We can get away with it because the amount of memory each application needs is very small, and because we know exactly what will be using the GPU and take steps to avoid overloading it.

As https://nvidia.github.io/spark-rapids/docs/FAQ.html#why-are-multiple-executors-per-gpu-not-supported points out, Spark does not support scheduling partial GPUs. This means that if you want to use the GPU without asking Spark to hand GPUs out to your processes, you need to disable GPU scheduling. That can work in local mode or even in standalone mode, as long as you can restrict which processes use the GPU in some other way. YARN and Kubernetes have similar limitations, and there is no good way to work around the issue with them.

If you tie some fraction of the GPU to a set amount of CPU cores or CPU memory, it should work out okay, but all of the applications have to agree on how that split works. All of this is rather brittle, so we don't officially support it. If you wanted to work with Apache Spark and us to add partial GPU scheduling, that is something we could officially support.

The best way to work around these issues is to use MIG. It is not perfect, but it is a currently supported way to handle the kind of thing you want to do.
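A rough sketch of the "no GPU scheduling" variant described above, assuming local mode (as in the integration tests) and that the visible GPU is restricted outside of Spark, for example by exporting CUDA_VISIBLE_DEVICES before launch. The config values are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Spark's GPU scheduling is deliberately left off: spark.executor.resource.gpu.amount
// is never set, so Spark does not try to hand out whole GPUs. Which physical GPU the
// process sees is controlled externally, e.g. via CUDA_VISIBLE_DEVICES in the
// environment that launches the application (an assumption about the setup).
val spark = SparkSession.builder()
  .master("local[*]")                                      // local mode, as in the integration tests
  .appName("gpu-sharing-sketch")                           // hypothetical name
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   // enable the RAPIDS Accelerator
  // Keep the memory pool small so several concurrent applications do not overload the GPU.
  .config("spark.rapids.memory.gpu.allocFraction", "0.1")
  .getOrCreate()
```

As noted in the reply, this only holds together if every application on the node agrees on its share of the GPU; MIG remains the supported way to partition a device.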