The Spark XGBoost examples here showcase the need for GPU acceleration across the full ETL + training pipeline. The Scala-based XGBoost examples use DMLC's version; the PySpark-based XGBoost examples require installing RAPIDS via pip. Most data scientists spend a lot of time not only training models but also processing the large amounts of data needed to train them. As shown below, PySpark + XGBoost training on GPUs can be up to 13X faster, and when data processing is also accelerated with the RAPIDS Accelerator, the end-to-end speed-up reaches 11X on GPU compared to CPU. In the public cloud, better performance can lead to significantly lower costs, as demonstrated in this blog.
Note that the training result above is based on 4 years of Fannie Mae Single-Family Loan Performance Data on a cluster with 8 A100 GPUs and 1024 CPU vcores; performance is affected by many factors, including data size and GPU type.
In this folder, there are three blueprints for users to learn about using Spark XGBoost and the RAPIDS Accelerator on GPUs:
- Mortgage Prediction
- Agaricus Classification
- Taxi Fare Prediction
For each of these examples we have prepared a sample dataset in this folder for testing. These datasets are provided only for convenience; to test for performance, please download the larger datasets from their respective sources.
This README is organized into three sections. The first section lists the notebooks that can be run on Jupyter with Python or Scala (Spylon kernel or Apache Toree kernel).
The second section provides sample jar files and source code for users who would like to build and run these as Scala or PySpark Spark-XGBoost applications.
The last section provides basic "Getting Started" guides for setting up GPU Spark-XGBoost in different environments based on the Apache Spark scheduler, such as YARN, Standalone, or Kubernetes.
- Mortgage Notebooks
- Agaricus Notebooks
- Taxi Notebooks
  - Python
  - Scala
The first step to build a Spark application is preparing the packages and datasets needed to build the jars. Please use the instructions below for building the example jars.
In addition, we have the source code for building the reference applications for each of the example Spark jobs.
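For orientation only, here is a minimal, hedged sketch of what such a PySpark XGBoost training job could look like. It assumes the `xgboost.spark` estimators shipped with recent XGBoost releases and uses placeholder paths and column names; it is not the bundled example source, which may use a different API.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBRegressor  # available in XGBoost >= 1.7

spark = SparkSession.builder.appName("xgboost-gpu-sketch").getOrCreate()

# Hypothetical Parquet input with a numeric label column named "fare_amount".
df = spark.read.parquet("/data/taxi/train.parquet")
feature_cols = [c for c in df.columns if c != "fare_amount"]
train_df = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

regressor = SparkXGBRegressor(
    features_col="features",
    label_col="fare_amount",
    num_workers=2,      # one distributed training task per GPU
    device="cuda",      # GPU training in XGBoost >= 2.0; older releases use use_gpu=True
)
model = regressor.fit(train_df)
model.transform(train_df).select("prediction").show(5)
```

The bundled Scala examples follow the same assemble-features-then-fit pattern, but use DMLC's XGBoost4J-Spark API instead.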
Please follow the steps below to run the example Spark jobs in different Spark environments:
- Getting started on on-premises clusters
- Getting started on cloud service providers
- Amazon AWS
- Databricks
- GCP
Please follow the steps below to run the example notebooks in different notebook environments:
- Getting started for Jupyter Notebook applications
Note:
Updating the default value of spark.sql.execution.arrow.maxRecordsPerBatch to a larger number (such as 200000) will significantly improve performance by accelerating data transfer between the JVM and the Python process.
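As a sketch, the setting can be changed at runtime on an existing session (or passed as a `--conf` at submit time); 200000 is simply the value suggested above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Raise the Arrow batch size used when transferring data between the JVM and Python.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "200000")
```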
For the CrossValidator job, we need to set spark.task.resource.gpu.amount=1 so that only one training task runs on each GPU (executor); otherwise the customized CrossValidator may schedule more than one XGBoost training task onto the same executor simultaneously and trigger issue-131.
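A sketch of applying this when the session is created; the executor-level GPU setting shown alongside it is an assumption and depends on how your cluster is provisioned:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # One GPU per executor (assumed cluster setup) and one task per GPU,
    # so the CrossValidator cannot pack multiple training tasks onto one executor.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .getOrCreate()
)
```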
For an XGBoost job, if the number of tasks in the shuffle stage before training is less than num_worker, the training tasks will be scheduled onto only a subset of the nodes rather than all of them, due to Spark's data locality feature. The workaround is to increase the number of partitions in that stage by setting spark.sql.files.maxPartitionBytes=RightNum.
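A sketch of lowering the setting so file scans produce more input partitions; "32m" is purely an illustrative value standing in for RightNum:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# A smaller max partition size yields more input partitions, so the stage before
# training has at least num_worker tasks. "32m" is only an illustrative value.
spark.conf.set("spark.sql.files.maxPartitionBytes", "32m")
```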
If you are running the XGBoost Scala notebooks on Dataproc, please make sure to update the configs below to avoid job failures:
spark.dynamicAllocation.enabled=false
spark.task.resource.gpu.amount=1