-
If I only want to use a certain number of GPUs in Spark standalone mode, what is the correct method to do so? As of now, I am using the following method:
It works, but it completely disables the GPU, and sometimes Spark still tries to create an extra executor that fails when it finds that its GPU index is not listed in nvidia-smi (although this doesn't affect the execution of the whole application). Is there a way to use only a certain number of GPUs without disabling the GPU entirely, and without switching to a mode that supports isolation like YARN, if it is indeed possible to do this with modes that support isolation?
-
So for standalone mode, the easiest way is likely to change the discovery script. Ideally, setting the number of GPUs per worker would have done it, but the Spark community decided it gets overridden by what the discovery script actually finds.
Normally in Spark standalone, when you launch your worker, you do something like:
SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=4 -Dspark.worker.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh"
The getGpusResources.sh script is what returns the GPU indices available on the node that are used by the Spark worker. So if you want to limit it to, say, 1 GPU, you could modify that script to change the lookup to be something like the following.
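As a minimal sketch, assuming the discovery script is the stock getGpusResources.sh example that ships with Apache Spark (which queries nvidia-smi for the GPU indices and formats them as JSON), the modified lookup could look like this:

#!/usr/bin/env bash
# Sketch of a modified getGpusResources.sh: the "| head -n 1" is the addition
# described below and limits the reported addresses to the first GPU index.
# The rest of the pipeline is assumed from Spark's example discovery script.
ADDRS=`nvidia-smi --query-gpu=index --format=csv,noheader | head -n 1 | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/","/g'`
echo {\"name\": \"gpu\", \"addresses\":[\"$ADDRS\"]}

The worker runs the discovery script when it starts, so restart the worker after editing it for the change to take effect.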
Specifically, in that command I added " | head -n 1", which truncates the output to 1 GPU index. So change 1 to be the number of GPUs you want to show up to the worker. If this doesn't work or isn't ideal for your situation, let me know.
Note that you should be able to go to the Spark master UI and see that your worker has a certain set of resources free, for example:
gpu: Free: [0, 1] / Used: []
If that isn't showing the limited number, then something isn't set up quite right.
The other thing to keep in mind is to make sure you specify the number of GPUs per executor and per task. So after starting your standalone cluster, when you launch your Spark application, use something like:
--conf spark.executor.resource.gpu.amount=1
You generally want the task resource.gpu.amount to be 1 divided by the number of executor cores you allocated. If you have questions on that, please take a look at our documentation: https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster
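To illustrate those two settings together, here is a hypothetical launch against a standalone cluster, assuming 4 cores per executor (the master URL and application jar are placeholders):

# Hypothetical example: adjust the master URL, core count, and jar for your cluster.
spark-submit \
  --master spark://master-host:7077 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  your-app.jar

With 4 executor cores, spark.task.resource.gpu.amount=0.25 (1/4) lets four concurrent tasks share the single GPU assigned to each executor, which matches the 1 / executor-cores guidance above.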