Merge pull request #223 from NVIDIA/branch-22.08
merge branch-22.08 to main branch
nvliyuan authored Aug 31, 2022
2 parents d056339 + da21bae commit 998abfb
Showing 63 changed files with 3,235 additions and 1,664 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/auto-merge.yml
@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
on:
  pull_request_target:
    branches:
-     - branch-22.06
+     - branch-22.08
    types: [closed]

jobs:
@@ -29,13 +29,13 @@ jobs:
  steps:
    - uses: actions/checkout@v2
      with:
-       ref: branch-22.06 # force to fetch from latest upstream instead of PR ref
+       ref: branch-22.08 # force to fetch from latest upstream instead of PR ref

    - name: auto-merge job
      uses: ./.github/workflows/auto-merge
      env:
        OWNER: NVIDIA
        REPO_NAME: spark-rapids-examples
-       HEAD: branch-22.06
-       BASE: branch-22.08
+       HEAD: branch-22.08
+       BASE: branch-22.10
        AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR
Binary file removed datasets/mortgage-small.tar.gz
@@ -17,7 +17,8 @@ Two files are required by PySpark:

+ *samples.zip*

- the package including all example code
+ the package including all example code.
+ Executing the above build commands generates the samples.zip file in the 'spark-rapids-examples/examples/XGBoost-Examples' folder.

+ *main.py*

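Both files are then passed to Spark at submission time. A minimal sketch, assuming both files sit in the current directory (the jar and GPU configuration flags shown in the launch guides are omitted here):

``` bash
# Sketch only: samples.zip is shipped to the executors, main.py drives the application
${SPARK_HOME}/bin/spark-submit \
  --py-files samples.zip \
  main.py
```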
@@ -49,7 +49,7 @@ cluster.

- [Databricks 10.4 LTS
ML](https://docs.databricks.com/release-notes/runtime/9.1ml.html#system-environment) has CUDA 11
- installed. Users will need to use 22.06.0 or later on Databricks 10.4 LTS ML. In this case use
+ installed. Users will need to use 22.04.0 or later on Databricks 10.4 LTS ML. In this case use
[generate-init-script-10.4.ipynb](generate-init-script-10.4.ipynb) which will install
the RAPIDS Spark plugin.

@@ -24,7 +24,7 @@
"source": [
"%sh\n",
"cd ../../dbfs/FileStore/jars/\n",
"sudo wget -O rapids-4-spark_2.12-22.06.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar\n",
"sudo wget -O rapids-4-spark_2.12-22.08.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.08.0/rapids-4-spark_2.12-22.08.0.jar\n",
"sudo wget -O xgboost4j-gpu_2.12-1.6.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.6.1/xgboost4j-gpu_2.12-1.6.1.jar\n",
"sudo wget -O xgboost4j-spark-gpu_2.12-1.6.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.6.1/xgboost4j-spark-gpu_2.12-1.6.1.jar\n",
"ls -ltr\n",
@@ -60,7 +60,7 @@
"sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar\n",
"\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.6.1.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.06.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.08.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar /databricks/jars/\"\"\", True)"
]
},
@@ -133,7 +133,7 @@
"1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
"2. Reboot the cluster\n",
"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.06/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.08/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"5. Inside the mortgage example notebook, update the data paths\n",
" `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
" `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"
@@ -24,7 +24,7 @@
"source": [
"%sh\n",
"cd ../../dbfs/FileStore/jars/\n",
"sudo wget -O rapids-4-spark_2.12-22.06.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar\n",
"sudo wget -O rapids-4-spark_2.12-22.08.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.08.0/rapids-4-spark_2.12-22.08.0.jar\n",
"sudo wget -O xgboost4j-gpu_2.12-1.6.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.6.1/xgboost4j-gpu_2.12-1.6.1.jar\n",
"sudo wget -O xgboost4j-spark-gpu_2.12-1.6.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.6.1/xgboost4j-spark-gpu_2.12-1.6.1.jar\n",
"ls -ltr\n",
@@ -60,7 +60,7 @@
"sudo rm -f /databricks/jars/spark--maven-trees--ml--9.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.4.1.jar\n",
"\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-gpu_2.12-1.6.1.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.06.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/rapids-4-spark_2.12-22.08.0.jar /databricks/jars/\n",
"sudo cp /dbfs/FileStore/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar /databricks/jars/\"\"\", True)"
]
},
@@ -133,7 +133,7 @@
"1. Edit your cluster, adding an initialization script from `dbfs:/databricks/init_scripts/init.sh` in the \"Advanced Options\" under \"Init Scripts\" tab\n",
"2. Reboot the cluster\n",
"3. Go to \"Libraries\" tab under your cluster and install `dbfs:/FileStore/jars/xgboost4j-spark-gpu_2.12-1.6.1.jar` in your cluster by selecting the \"DBFS\" option for installing jars\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.06/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"4. Import the mortgage example notebook from `https://github.com/NVIDIA/spark-rapids-examples/blob/branch-22.08/examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb`\n",
"5. Inside the mortgage example notebook, update the data paths\n",
" `train_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-train.csv')`\n",
" `trans_data = reader.schema(schema).option('header', True).csv('/data/mortgage/csv/small-trans.csv')`"
22 changes: 22 additions & 0 deletions docs/get-started/xgboost-examples/dataset/mortgage.md
@@ -0,0 +1,22 @@
# How to download the Mortgage dataset

## Steps to download the data

1. Go to the [Fannie Mae](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data) website
2. Click on [Single-Family Loan Performance Data](https://datadynamics.fanniemae.com/data-dynamics/?&_ga=2.181456292.2043790680.1657122341-289272350.1655822609#/reportMenu;category=HP)
* Register as a new user if you are using the website for the first time
* Use the credentials to log in
3. Select [HP](https://datadynamics.fanniemae.com/data-dynamics/#/reportMenu;category=HP)
4. Click on **Download Data** and choose *Single-Family Loan Performance Data*
5. You will find a tabular list of 'Acquisition and Performance' files sorted by year and quarter. Click a file to download it, e.g. `2017Q1.zip`
6. Unzip the downloaded file to extract the CSV file, e.g. `2017Q1.csv`
7. Copy only the CSV files to a new folder for the ETL to read (see the shell sketch below)
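A minimal shell sketch of steps 6 and 7, assuming the archive was downloaded to the current directory; the `/data/mortgage/input` staging folder is illustrative:

``` bash
# Extract the quarterly archive and stage only the CSV files for the ETL to read
unzip 2017Q1.zip -d 2017Q1
mkdir -p /data/mortgage/input
cp 2017Q1/*.csv /data/mortgage/input/
```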

## Notes
1. Refer to the [Loan Performance Data Tutorial](https://capitalmarkets.fanniemae.com/media/9066/display) for more details.
2. Note that *Single-Family Loan Performance Data* has two components. However, the Mortgage ETL requires only the first one (the primary dataset):
* Primary Dataset: Acquisition and Performance Files
* HARP Dataset
3. See the [Resources](https://datadynamics.fanniemae.com/data-dynamics/#/resources/HP) section to learn more about the dataset
30 changes: 29 additions & 1 deletion docs/get-started/xgboost-examples/notebook/python-notebook.md
@@ -20,6 +20,10 @@ and the home directory for Apache Spark respectively.

3. Launch the notebook:

Note: for ETL jobs, set `spark.task.resource.gpu.amount` to `1/spark.executor.cores`; with `spark.executor.cores=10` as in the command below, that is `0.1`.

For ETL:

``` bash
PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS=notebook \
@@ -28,14 +32,38 @@ and the home directory for Apache Spark respectively.
--jars ${RAPIDS_JAR},${XGBOOST4J_JAR},${XGBOOST4J_SPARK_JAR} \
--py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.memory.gpu.pooling.enabled=false \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.executor.cores=10 \
--conf spark.task.resource.gpu.amount=0.1 \
--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
--conf spark.rapids.sql.hasNans=false \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh
```

For XGBoost:

``` bash
PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS=notebook \
pyspark \
--master ${SPARK_MASTER} \
--jars ${RAPIDS_JAR},${XGBOOST4J_JAR},${XGBOOST4J_SPARK_JAR} \
--py-files ${XGBOOST4J_SPARK_JAR},${SAMPLE_ZIP} \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.memory.gpu.pool=NONE \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.executor.cores=10 \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.rapids.sql.hasNans=false \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh
```



4. Launch ETL Part

- Mortgage ETL Notebook: [Python](../../../../examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL.ipynb)
- Taxi ETL Notebook: [Python](../../../../examples/XGBoost-Examples/taxi/notebooks/python/taxi-ETL.ipynb)
- Note: Agaricus does not have an ETL part.
25 changes: 23 additions & 2 deletions docs/get-started/xgboost-examples/notebook/toree.md
@@ -29,18 +29,39 @@ and the home directory for Apache Spark respectively.

4. Install a new kernel with GPU support enabled and launch the notebook:

Note: for ETL jobs, set `spark.task.resource.gpu.amount` to `1/spark.executor.cores`.

For ETL:
``` bash
jupyter toree install \
--spark_home=${SPARK_HOME} \
--user \
--toree_opts='--nosparkcontext' \
--kernel_name="XGBoost4j-Spark" \
--kernel_name="ETL-Spark" \
--spark_opts='--master ${SPARK_MASTER} \
--jars ${RAPIDS_JAR},${SAMPLE_JAR} \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.executor.extraClassPath=${RAPIDS_JAR} \
--conf spark.executor.cores=10 \
--conf spark.task.resource.gpu.amount=0.1 \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh'
```
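Before installing the XGBoost kernel below, you can check that the new kernel registered (the spec name is derived from the value passed to `--kernel_name`):

``` bash
# List installed Jupyter kernels; the newly installed kernel should appear
jupyter kernelspec list
```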

For XGBoost:
``` bash
jupyter toree install \
--spark_home=${SPARK_HOME} \
--user \
--toree_opts='--nosparkcontext' \
--kernel_name="XGBoost-Spark" \
--spark_opts='--master ${SPARK_MASTER} \
--jars ${RAPIDS_JAR},${SAMPLE_JAR} \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.executor.extraClassPath=${RAPIDS_JAR} \
- --conf spark.rapids.memory.gpu.pooling.enabled=false \
+ --conf spark.rapids.memory.gpu.pool=NONE \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.executor.cores=10 \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh'
@@ -40,7 +40,7 @@ export SPARK_DOCKER_IMAGE=<gpu spark docker image repo and name>
export SPARK_DOCKER_TAG=<spark docker image tag>

pushd ${SPARK_HOME}
- wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-22.06/dockerfile/Dockerfile
+ wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-22.08/dockerfile/Dockerfile

# Optionally install additional jars into ${SPARK_HOME}/jars/

@@ -60,9 +60,10 @@ on cluster filesystems like HDFS, or in [object stores like S3 and GCS](https://
Note that using [application dependencies](https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management) from
the submission client’s local file system is not yet supported.
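As a hedged sketch, one way to stage the application jar and data is HDFS; the paths are illustrative, and `SAMPLE_JAR` / `SPARK_XGBOOST_DIR` are the variables used by the commands below:

``` bash
# Stage the sample jar and raw CSV input on HDFS so executors can fetch them
hdfs dfs -mkdir -p ${SPARK_XGBOOST_DIR}/mortgage/input
hdfs dfs -put ${SAMPLE_JAR} ${SPARK_XGBOOST_DIR}/
# ./mortgage-csv is a placeholder for the folder holding your downloaded CSVs
hdfs dfs -put ./mortgage-csv/*.csv ${SPARK_XGBOOST_DIR}/mortgage/input/
```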

- Note: the `mortgage_eval_merged.csv` and `mortgage_train_merged.csv` are not Mortgage raw data;
- they are the data produced by the Mortgage ETL job. To use larger Mortgage data, please refer to [Launch ETL job](#etl).
- The Taxi ETL job is the same, but Agaricus does not have an ETL process; it is combined with XGBoost as there is just a filter operation.
+ #### Note:
+ 1. The Mortgage and Taxi jobs have ETLs to generate the processed data.
+ 2. For convenience, a subset of the [Taxi](/datasets/) dataset is made available in this repo and can be readily used to launch the XGBoost job. Use the [ETL](#etl) to generate larger datasets for training and testing.
+ 3. Agaricus does not have an ETL process; it is combined with XGBoost, as there is just a filter operation.

Save Kubernetes Template Resources
----------------------------------
@@ -89,35 +90,41 @@ to execute using a GPU which is already in use -- causing undefined behavior and crashes
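A minimal sketch of such a pod template, assuming one GPU per executor pod; `TEMPLATE_PATH` is the path the `spark-submit` commands below pass via `spark.kubernetes.executor.podTemplateFile`:

``` bash
# Write a minimal GPU executor pod template and export its path for spark-submit
cat > gpu_executor_template.yaml <<EOF
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: executor
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
export TEMPLATE_PATH=$(pwd)/gpu_executor_template.yaml
```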

<span id="etl">Launch Mortgage or Taxi ETL Part</span>
---------------------------
Use the ETL app to process the raw Mortgage data. You can either split the ETL output into training and evaluation sets, or run the ETL on different subsets of the dataset to produce training and evaluation datasets.

Note: for ETL jobs, set `spark.task.resource.gpu.amount` to `1/spark.executor.cores`.

Run spark-submit:

``` bash
${SPARK_HOME}/bin/spark-submit \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.memory.gpu.pooling.enabled=false \
--conf spark.executor.resource.gpu.amount=1 \
- --conf spark.task.resource.gpu.amount=1 \
+ --conf spark.executor.cores=10 \
+ --conf spark.task.resource.gpu.amount=0.1 \
--conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
--conf spark.rapids.sql.csv.read.double.enabled=true \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
--conf spark.rapids.sql.hasNans=false \
--files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh \
--jars ${RAPIDS_JAR} \
--master <k8s://ip:port or k8s://URL> \
--deploy-mode ${SPARK_DEPLOY_MODE} \
--num-executors ${SPARK_NUM_EXECUTORS} \
--driver-memory ${SPARK_DRIVER_MEMORY} \
--executor-memory ${SPARK_EXECUTOR_MEMORY} \
- --class ${EXAMPLE_CLASS} \
+ --class com.nvidia.spark.examples.mortgage.ETLMain \
$SAMPLE_JAR \
-format=csv \
-dataPath="perf::${SPARK_XGBOOST_DIR}/mortgage/perf-train/" \
-dataPath="acq::${SPARK_XGBOOST_DIR}/mortgage/acq-train/" \
-dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/out/train/"

# if generating eval data, change the data path to eval as well as the corresponding perf-eval and acq-eval data
# -dataPath="perf::${SPARK_XGBOOST_DIR}/mortgage/perf-eval"
# -dataPath="acq::${SPARK_XGBOOST_DIR}/mortgage/acq-eval"
# -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/out/eval/"
-dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
-dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/" \
-dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"

# if generating eval data, change the data path to eval
# -dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# -dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"
# if running Taxi ETL benchmark, change the class and data path params to
# -class com.nvidia.spark.examples.taxi.ETLMain
# -dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
@@ -163,9 +170,9 @@ export SPARK_DRIVER_MEMORY=4g
export SPARK_EXECUTOR_MEMORY=8g

# example class to use
- export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.GPUMain
- # or change to com.nvidia.spark.examples.taxi.GPUMain to run Taxi Xgboost benchmark
- # or change to com.nvidia.spark.examples.agaricus.GPUMain to run Agaricus Xgboost benchmark
+ export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.Main
+ # or change to com.nvidia.spark.examples.taxi.Main to run the Taxi XGBoost benchmark
+ # or change to com.nvidia.spark.examples.agaricus.Main to run the Agaricus XGBoost benchmark

# tree construction algorithm
export TREE_METHOD=gpu_hist
@@ -176,9 +183,10 @@ Run spark-submit:
``` bash
${SPARK_HOME}/bin/spark-submit \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
- --conf spark.rapids.memory.gpu.pooling.enabled=false \
+ --conf spark.rapids.memory.gpu.pool=NONE \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.rapids.sql.hasNans=false \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh \
--jars ${RAPIDS_JAR} \
Expand All @@ -192,9 +200,9 @@ ${SPARK_HOME}/bin/spark-submit
--conf spark.kubernetes.executor.podTemplateFile=${TEMPLATE_PATH} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
${SAMPLE_JAR} \
- -dataPath=train::${DATA_PATH}/mortgage/csv/train/mortgage_train_merged.csv \
- -dataPath=trans::${DATA_PATH}/mortgage/csv/test/mortgage_eval_merged.csv \
- -format=csv \
+ -dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/ \
+ -dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/output/eval/ \
+ -format=parquet \
-numWorkers=${SPARK_NUM_EXECUTORS} \
-treeMethod=${TREE_METHOD} \
-numRound=100 \