-
If I only want to use a certain number of GPUs in Spark standalone mode, what is the correct method to do so? As of now, I am using the following method:
It works, but it completely disables the GPU, and sometimes Spark still tries to create an extra executor that fails when it finds that its GPU index is not listed in nvidia-smi (although this doesn't affect the execution of the whole application). Is there a way to use only a certain number of GPUs without disabling the GPU entirely, and without switching to a mode that supports isolation like YARN, if it is indeed possible to do this with modes that support isolation?
-
So for standalone mode, the easiest way is likely to change the discovery script. Ideally, setting the number of GPUs per worker would have done it, but the Spark community decided it gets overridden by what the discovery script actually finds.
Normally in Spark standalone, when you launch your worker, you do something like:
SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=4 -Dspark.worker.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh"
The getGpusResources.sh script is what returns the GPU indices available on the node that are used by the Spark worker. So if you want to limit it to, say, 1 GPU, you could modify that script to change the lookup to be something like the following.
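As a minimal sketch, assuming the discovery script is the stock getGpusResources.sh example that ships with Apache Spark (which queries nvidia-smi for the GPU indices and formats them as JSON), the modified lookup could look like this:

#!/usr/bin/env bash
# Sketch of a modified getGpusResources.sh: the "| head -n 1" is the addition
# described below and limits the reported addresses to the first GPU index.
# The rest of the pipeline is assumed from Spark's example discovery script.
ADDRS=`nvidia-smi --query-gpu=index --format=csv,noheader | head -n 1 | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/","/g'`
echo {\"name\": \"gpu\", \"addresses\":[\"$ADDRS\"]}

The worker runs the discovery script when it starts, so restart the worker after editing it for the change to take effect.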
Specifically, in that command I added " | head -n 1", which truncates the output to 1 GPU index. So change 1 to be the number of GPUs you want to show up to the worker. If this doesn't work or isn't ideal for your situation, let me know.
Note that you should be able to go to the Spark master UI and see that your worker has a certain set of resources free, for example:
gpu: Free: [0, 1] / Used: []
If that isn't showing the limited number, then something isn't set up quite right.
The other thing to keep in mind is to make sure you specify the number of GPUs per executor and per task. So after starting your standalone cluster, when you launch your Spark application, use something like:
--conf spark.executor.resource.gpu.amount=1
You generally want the task resource.gpu.amount to be 1 divided by the number of executor cores you allocated. If you have questions on that, please take a look at our documentation: https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster
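To illustrate those two settings together, here is a hypothetical launch against a standalone cluster, assuming 4 cores per executor (the master URL and application jar are placeholders):

# Hypothetical example: adjust the master URL, core count, and jar for your cluster.
spark-submit \
  --master spark://master-host:7077 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  your-app.jar

With 4 executor cores, spark.task.resource.gpu.amount=0.25 (1/4) lets four concurrent tasks share the single GPU assigned to each executor, which matches the 1 / executor-cores guidance above.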