How do I set the rapids parameter so that all GPUs are used? #11177

onefanwu · 2024-07-12T12:12:48Z

onefanwu
Jul 12, 2024

Hello, I have 4 GPUs, but when I execute Spark Rapids, I only see GPU 0 being utilized. Could this be due to an error in my PySpark parameter settings?

python file:

# Initialize Spark session
spark = SparkSession.builder \
    .appName(experiment_name) \
    .config("spark.executor.memory", "80g") \
    .config("spark.driver.memory", "80g") \
    .config("spark.executor.cores", 4) \
    .config("spark.executor.instances", 32) \
    .config("spark.default.parallelism", 128) \
    .config("spark.cores.max", 128) \
    .config("spark.executor.resource.gpu.discoveryScript", gpu_script_path) \
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", 10000) \
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin") \
    .config("spark.rapids.sql.enabled", "true") \
    .config("spark.rapids.sql.explain", "ALL") \
    .config("spark.executor.resource.gpu.amount", 4) \
    .config("spark.rapids.sql.concurrentGpuTasks", 2) \
    .config("spark.rapids.memory.gpu.maxAllocFraction", 1) \
    .config("spark.rapids.memory.gpu.allocFraction", 0.2) \
    .config("spark.rapids.memory.gpu.minAllocFraction", 0.1) \
    .config("spark.rapids.sql.multiThreadedRead.numThreads", 128) \
    .config("spark.executor.extraClassPath", rapids_jar_path) \
    .config("spark.driver.extraClassPath", rapids_jar_path) \
    .getOrCreate()

getGpusResources.sh:

NUM_GPUS=4
ADDRS=$(seq -s ',' 0 $((NUM_GPUS - 1)) | sed 's/,/","/g')
echo '{"name": "gpu", "addresses":["'"$ADDRS"'"]}'

output:

{"name": "gpu", "addresses":["0","1","2","3"]}

Answered by revans2

Jul 12, 2024

    .config("spark.executor.resource.gpu.amount", 4) \

Should be 1 instead. We only support running with 1 GPU per executor. You also have not configured spark.task.resource.gpu.amount

https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/overview.html#spark-gpu-scheduling-overview

Also just to be clear. We do not support using multiple GPUs in local mode. This is because we only support a single GPU per process right now and in local mode everything runs in a single process.

View full answer

revans2 · 2024-07-12T13:14:41Z

revans2
Jul 12, 2024
Maintainer

    .config("spark.executor.resource.gpu.amount", 4) \

Should be 1 instead. We only support running with 1 GPU per executor. You also have not configured spark.task.resource.gpu.amount

https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/overview.html#spark-gpu-scheduling-overview

Also just to be clear. We do not support using multiple GPUs in local mode. This is because we only support a single GPU per process right now and in local mode everything runs in a single process.

1 reply

onefanwu Jul 12, 2024
Author

Thank you for your response. I realize that I have been using local mode. I will follow your advice to modify the parameters and switch to standalone mode. I have just found a configuration scheme for the system setup at the following link, which should work:
https://docs.nvidia.com/spark-rapids/user-guide/24.06/getting-started/on-premise.html#spark-standalone-cluster

revans2 · 2024-07-12T13:47:34Z

revans2
Jul 12, 2024
Maintainer

@onefanwu sounds good let me know if you run into any other problems.

0 replies

onefanwu · 2024-07-23T12:33:46Z

onefanwu
Jul 23, 2024
Author

@revans2 Hello, I followed your previous advice to connect PySpark with the worker in Spark Standalone mode. The worker is equipped with 3 GPUs and 128 CPU cores. However, I noticed that RAPIDS is still only using GPU 0 when executing my SQL query, and GPUs 1 and 2 are not being utilized. Could you please advise on how to make the remaining two GPUs also be used?

PySpark Configuration

max_threads = 128
vector_size = 10000

rapids_jar_path = "/workdir/AiQ-dev/spark-rapids-AiQ/dist/target/rapids-4-spark_2.12-24.06.0-cuda11.jar"
getGpusResources =  '/workdir/AiQ-dev/AiQ-benchmark/baseline/spark-RAPIDS/getGpusResources.sh'

# Function to stop the current Spark session
def stop_spark_session(spark):
    spark.stop()

# Function to create a new Spark session
def create_spark_session():    
    return SparkSession.builder \
        .appName(experiment_name) \
        .master("spark://localhost:7077") \
        .config("spark.executor.memory", "80g") \
        .config("spark.driver.memory", "80g") \
        .config("spark.worker.resource.gpu.amount", 3)\
        .config("spark.executor.resource.gpu.amount", 1) \
        .config("spark.task.resource.gpu.amount", 1/4)\
        .config("spark.executor.cores", 4) \
        .config("spark.executor.instances", 32) \
        .config("spark.default.parallelism", max_threads) \
        .config("spark.cores.max", max_threads) \
        .config("spark.sql.execution.arrow.maxRecordsPerBatch", vector_size) \
        .config("spark.plugins", "com.nvidia.spark.SQLPlugin") \
        .config("spark.rapids.sql.enabled", "true") \
        .config("spark.rapids.sql.explain", "ALL") \
        .config("spark.dynamicAllocation.enabled", "false") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.rapids.sql.concurrentGpuTasks", 2) \
        .config("spark.rapids.memory.gpu.maxAllocFraction", 1) \
        .config("spark.rapids.memory.gpu.allocFraction", 0.2) \
        .config("spark.rapids.memory.gpu.minAllocFraction", 0.1) \
        .config("spark.rapids.sql.multiThreadedRead.numThreads", max_threads) \
        .config("spark.executor.extraClassPath", rapids_jar_path) \
        .config("spark.driver.extraClassPath", rapids_jar_path) \
        .config("spark.worker.resource.gpu.discoveryScript", getGpusResources) \
        .getOrCreate()

My SQL Query

query = f"""
WITH base_text AS
( SELECT text_embedding('OLAP Database') AS embedding )
SELECT
h.id,
h.title,
L2Distance(t.embedding, text_embedding(h.title)) AS distance
FROM {dataset} AS h, base_text AS t
ORDER BY distance ASC
LIMIT 10
"""

getGpusResources.sh:

# copy from https://github.com/apache/spark/blob/master/examples/src/main/scripts/getGpusResources.sh
ADDRS=`nvidia-smi --query-gpu=index --format=csv,noheader | sed -e ':a' -e 'N' -e'$!ba' -e 's/\n/","/g'`
echo {\"name\": \"gpu\", \"addresses\":[\"$ADDRS\"]}

output:
{"name": "gpu", "addresses":["0","1","2"]}

Snapshot

only the process (PID = 1901141) is used by Spark RAPIDS

0 replies

revans2 · 2024-07-23T15:12:11Z

revans2
Jul 23, 2024
Maintainer

Please look at the Spark UI while it is running. Not the job UI typically on port 4040, but the master UI on port 8080 by default. It should show which GPUs are assigned to your application along with which ones are free. It should help us see what the limiting factor is, because it looks like only a single worker process was launched, which might indicate that Spark thinks it is out of host memory or CPU cores so it didn't launch more workers.

0 replies

onefanwu · 2024-07-24T02:15:18Z

onefanwu
Jul 24, 2024
Author

@revans2 Hello, thank you very much for your guidance and suggestions. Your analysis was very accurate. Initially, I set the worker memory limit to 80GB. Also, I put the executor memory to 80GB, which resulted in only one executor being launched and, thus, only one GPU being utilized, as shown in the first image.

Then, I changed the worker memory limit to 256GB and kept the executor memory at 80 GB. I successfully launched three executors, and each executor utilized one GPU, effectively using all of my GPUs, as shown in the second and third images.

Thank you so much. You are truly very professional.

The first image:

The second image:

The third image:

0 replies

revans2 · 2024-07-30T13:34:37Z

revans2
Jul 30, 2024
Maintainer

@onefanwu happy to help.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How do I set the rapids parameter so that all GPUs are used? #11177

{{title}}

Replies: 6 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

How do I set the rapids parameter so that all GPUs are used? #11177

onefanwu Jul 12, 2024

Replies: 6 comments · 1 reply

revans2 Jul 12, 2024 Maintainer

onefanwu Jul 12, 2024 Author

revans2 Jul 12, 2024 Maintainer

onefanwu Jul 23, 2024 Author

PySpark Configuration

My SQL Query

getGpusResources.sh:

Snapshot

revans2 Jul 23, 2024 Maintainer

onefanwu Jul 24, 2024 Author

revans2 Jul 30, 2024 Maintainer

onefanwu
Jul 12, 2024

Replies: 6 comments 1 reply

revans2
Jul 12, 2024
Maintainer

onefanwu Jul 12, 2024
Author

revans2
Jul 12, 2024
Maintainer

onefanwu
Jul 23, 2024
Author

revans2
Jul 23, 2024
Maintainer

onefanwu
Jul 24, 2024
Author

revans2
Jul 30, 2024
Maintainer