[QST] Working configuration for multi executor - multi GPU environment on Spark Standalone cluster #5366
-
Hello. The question is: what is the correct way to run a multi-GPU worker on a Spark Standalone cluster? As soon as I define, for example, a max cores configuration so that more than one executor can be started on the worker, all executors start failing with out-of-memory errors:
My SC configuration:
My worker configuration:
When I run the SC without the max cores config:
Only one executor (taking all the cores) is created and the job runs correctly.
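For context, a hedged sketch of the kind of setting being described (these are standard Spark standalone properties; the values are illustrative, not the original configuration from this report):

```
# Illustrative application settings, not the ones from this report.
# With spark.executor.cores set, a standalone worker can launch several executors
# for one application (up to spark.cores.max cores in total); each executor then
# needs its own spark.executor.memory, which is where out-of-memory can appear.
spark.cores.max        16
spark.executor.cores   4
spark.executor.memory  16g
```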
Replies: 6 comments
-
Did you follow the instructions at https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster ? I mostly want to know whether we need to update the docs because they are not clear, or whether you didn't see them. I think your problem is that Spark is somehow assigning multiple executors to a single GPU, so GPU scheduling is not working as expected. My guess is that you didn't request any GPUs for your executors; you need to set the executor GPU resource amount. Another thing to do is to go to the Spark cluster UI and look at the resources per worker to be sure that Spark sees the GPUs. You can also look at the application UI page to be sure that the application has been assigned GPUs.
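For reference, a minimal sketch of the GPU-related settings that the getting-started page covers, assuming one worker per node and one GPU per executor (the paths and amounts here are illustrative assumptions, not taken from this thread):

```
# Worker side (e.g. via SPARK_WORKER_OPTS in spark-env.sh, or a worker properties file):
# advertise the GPUs and tell the worker how to discover their addresses.
spark.worker.resource.gpu.amount           8
spark.worker.resource.gpu.discoveryScript  /opt/sparkRapidsPlugin/getGpusResources.sh

# Application side (spark-defaults.conf or --conf on spark-submit):
# request GPUs so that each executor is assigned exactly one.
spark.executor.resource.gpu.amount  1
# let several tasks share the executor's GPU; tune to match spark.executor.cores.
spark.task.resource.gpu.amount      0.125
```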
-
Thank you for the reply.
I checked the job configuration described on that website; thanks, it may get me a little closer to solving the issue. It seems that making the worker see the GPUs should resolve it. The question now is how to do that. As I wrote above, I set the needed config options for the worker from that page.
-
The worker should be able to figure it out from the discovery script. Could you try to run it and see what it reports?
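For example, run it by hand on the worker node (the script path here is the one mentioned later in this thread):

```
# Spark expects the discovery script to print a single JSON object of the form
# {"name": "gpu", "addresses": ["0", "1", ...]} listing the GPU addresses it can assign.
/spark/spark-3.0.1-bin-hadoop2.7/bin/getGpuResources.sh
```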
-
Are you starting 8 workers on the same node or on different nodes?
-
This is the output of my script: {"name": "gpu", "addresses":["0","1","2","3","4","5","6","7"]}
I am starting one worker on the DGX and configuring jobs to run more than one executor. I will let you know if I find a solution to this problem.
-
I found the problem. It was a very dumb thing: my spark-env file was not being loaded correctly. Instead, I now load the config with the --properties-file option when starting the worker, with spark.worker.resource.gpu.discoveryScript /spark/spark-3.0.1-bin-hadoop2.7/bin/getGpuResources.sh set in that file, and it works. It was my fault. Thank you for the help.
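In case it helps anyone else, a sketch of that arrangement (the start script name, master URL, and GPU count are assumptions for a Spark 3.0.x standalone setup; the discovery-script path is the one from this thread):

```
# worker.properties -- loaded explicitly instead of relying on spark-env.sh.
# The gpu.amount value is an assumption matching the 8 GPUs reported above.
spark.worker.resource.gpu.amount           8
spark.worker.resource.gpu.discoveryScript  /spark/spark-3.0.1-bin-hadoop2.7/bin/getGpuResources.sh

# Start the standalone worker with that file (Spark 3.0.x script name; master URL is a placeholder):
#   $SPARK_HOME/sbin/start-slave.sh spark://<master-host>:7077 --properties-file /path/to/worker.properties
```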