Support for Spark DL notebooks with PyTriton on Databricks/Dataproc #483

rishic3 · 2025-01-16T23:58:40Z

Support for running DL Inference notebooks on CSP environments.

Refactored Triton sections to use PyTriton, a Python API for the Triton Inference Server that avoids Docker. Once this PR is merged, Triton sections no longer need to be skipped in the CI pipeline, @YanxuanLiu.
Updated notebooks with instructions to run on Databricks/Dataproc.
Updated Torch notebooks with best practices for ahead-of-time TensorRT compilation.
Cleaned up the README, removing instructions to start Jupyter with PySpark (we need a cell to attach to standalone for CI/CD anyway, so this should reduce confusion for users).

Notebook outputs are saved from running locally, but all notebooks were tested on Databricks/Dataproc.

Signed-off-by: Rishi Chandra <[email protected]>

eordentlich

Looks good overall. A few comments.

As a future optimization, we can look at something like https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_cudashm_client.py, or its counterpart for regular shm, to reduce data copies (if I'm interpreting these correctly).

eordentlich · 2025-01-24T20:58:13Z

examples/ML+DL-Examples/Spark-DL/dl_inference/databricks/setup/init_spark_dl_tf.sh

+sudo /databricks/python3/bin/pip3 install --upgrade --force-reinstall -r temp_requirements.txt
+rm temp_requirements.txt
+
+set +x


Add a newline at the end of the last line in every file where this symbol appears.

Deleted, also merged the tf/torch scripts into one for convenience.

eordentlich · 2025-01-24T22:06:22Z

examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/conditional_generation_tf.ipynb

-    "df = spark.read.parquet(\"imdb_test\").limit(100).cache()"
+    "def _use_stage_level_scheduling(spark, rdd):\n",
+    "\n",
+    "    if spark.version < \"3.4.0\":\n",


This check is probably not needed, since predict_batch_udf is also not available in Spark < 3.4.
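If a version guard like the one in this hunk is kept anywhere, note that comparing version strings with `<` is lexicographic, so a hypothetical "3.10.0" would sort below "3.4.0". A minimal sketch of a numeric comparison follows; the helper names `parse_version` and `at_least` are illustrative and not part of the notebooks.

```python
def parse_version(version: str) -> tuple:
    """Turn '3.4.0' or '3.5.1-SNAPSHOT' into a 3-tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as '-SNAPSHOT'
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad short versions such as '3.4'
    return tuple(parts)


def at_least(version: str, minimum: tuple = (3, 4, 0)) -> bool:
    """Return True if `version` is numerically >= `minimum`."""
    return parse_version(version) >= minimum
```

With this, the guard would read `if not at_least(spark.version): ...` instead of a string comparison.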

eordentlich · 2025-01-24T23:05:44Z

examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/conditional_generation_tf.ipynb

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = spark.read.parquet(data_path).limit(256).repartition(8)"


Are limit and repartition both needed? Is this the right order? And why these numbers? A comment might be in order. Propagate any changes to the other notebooks.

This was intended to test the minimal scenario of 1 batch per task. With TensorFlow especially, too many rows can be really slow (>1 min). (Previous versions limited the data to 100 rows: https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.06/examples/ML%2BDL-Examples/Spark-DL/dl_inference/huggingface/conditional_generation.ipynb?short_path=d3949f8#L1208)
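The arithmetic behind "1 batch per task" can be sketched as below. The batch size of 32 is an assumption for illustration (it is not stated in this thread), and the sketch assumes `repartition` spreads rows roughly evenly across partitions.

```python
import math


def batches_per_task(num_rows: int, num_partitions: int, batch_size: int) -> int:
    """Estimate inference batches per Spark task, assuming even partitions."""
    rows_per_partition = math.ceil(num_rows / num_partitions)
    return math.ceil(rows_per_partition / batch_size)


# limit(256).repartition(8) with a hypothetical batch size of 32
print(batches_per_task(256, 8, 32))
```

On the ordering question: `limit` typically collapses the result into a single partition, so calling `repartition` afterwards is what spreads the rows across tasks again.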

eordentlich · 2025-01-24T23:12:40Z

examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/conditional_generation_tf.ipynb

   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 56,
+   "execution_count": null,


FYI, spark.stop() below might be bad for Databricks: it puts the cluster in a bad state (at least in older runtime versions like 13.3, from what I've seen).

Yup, issue persists on latest runtime - addressed
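One way to guard the call is sketched below. It is not necessarily how the notebooks addressed the issue; it assumes the `DATABRICKS_RUNTIME_VERSION` environment variable is set on Databricks clusters, and the helper names are hypothetical.

```python
import os


def on_databricks(env=None) -> bool:
    """Heuristic: Databricks runtimes export DATABRICKS_RUNTIME_VERSION."""
    env = os.environ if env is None else env
    return "DATABRICKS_RUNTIME_VERSION" in env


def stop_spark_if_safe(spark, env=None) -> bool:
    """Call spark.stop() only outside Databricks; return True if stopped."""
    if on_databricks(env):
        return False  # leave the shared cluster session alone
    spark.stop()
    return True
```

In a notebook, the final cell would then call `stop_spark_if_safe(spark)` rather than `spark.stop()` directly.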

eordentlich · 2025-01-25T01:37:09Z

examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/conditional_generation_tf.ipynb

-    "def stop_triton(it):\n",
-    "    import docker\n",
-    "    import time\n",
+    "def stop_triton(pids):\n",


Can this, along with all the other Triton-related code that is common across the notebooks, be moved to a single Python file, triton_utils.py, that gets shipped via pyfiles with each Spark job and then imported in the notebooks? That would avoid a lot of repetition.
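A minimal sketch of that suggestion, using `SparkContext.addPyFile` to distribute the shared module. The helper name `ship_triton_utils` and the default path are illustrative assumptions, not the PR's actual code.

```python
def ship_triton_utils(spark, path: str = "triton_utils.py") -> str:
    """Register the shared module so executors can `import triton_utils`.

    `spark` is an existing SparkSession; `path` may be a local file or a
    URI that Spark can fetch.
    """
    spark.sparkContext.addPyFile(path)
    return path
```

After this call, functions that run on executors can `import triton_utils` instead of redefining the same helpers in every notebook.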

rishic3 · 2025-01-27T17:20:07Z

> Looks good overall. A few comments.
>
> In a future optimization we can look at something like https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_cudashm_client.py or for regular shm to reduce data copy (if I'm interpreting these correctly).

Good idea, will definitely follow up with this improvement. Note that, per the PyTriton team, with shm there will still be an additional inter-process data copy (until the Triton 3 release):
shm -> python backend -> (copy input) -> pytriton server -> (copy output) -> python backend -> shm
Per their benchmarks, though, this adds only a few ms of latency for ~4MB inputs; with larger inputs it might be more significant, but still likely within the range of noise.
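For readers unfamiliar with the idea, here is a generic illustration of passing data through a named shared-memory region using only the Python standard library. It is not the Triton client API and not code from this PR; it only shows the producer/consumer pattern that avoids sending the payload over a socket.

```python
from multiprocessing import shared_memory


def roundtrip_via_shm(payload: bytes) -> bytes:
    """Write a non-empty payload to shared memory and read it back by name."""
    size = len(payload)
    writer = shared_memory.SharedMemory(create=True, size=size)
    try:
        writer.buf[:size] = payload  # producer fills the region in place
        # A consumer (normally another process) attaches using the region name.
        reader = shared_memory.SharedMemory(name=writer.name)
        try:
            result = bytes(reader.buf[:size])
        finally:
            reader.close()
    finally:
        writer.close()
        writer.unlink()  # free the region once both sides are done
    return result
```

In a real deployment the consumer would be a separate process that receives only the region name and size, which is what removes the network copy of the payload itself.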

rishic3 added 4 commits January 16, 2025 15:52

Add notebooks with runs

1152443

Add image, update readme/requirements

4efa8e4

Add dataproc instructions

5f6268f

Add databricks instructions

7e0c1e3

Signed-off-by: Rishi Chandra <[email protected]>

rishic3 marked this pull request as ready for review January 17, 2025 00:36

eordentlich reviewed Jan 25, 2025

rishic3 added 4 commits January 27, 2025 11:57

Combine init scripts for databricks

b076e26

Move common PyTriton funcs to utils

916d05d

Use https path to pyfile

aeb3a36

cleanup

8088c58
