
Adding XGBoost on GPU quickstart materials #45

Open · wants to merge 4 commits into main

Conversation

sfc-gh-ebotwick

No description provided.

"name": "Intro",
"collapsed": false
},
"source": "# GPU Based XGBoost Training\n## In the following notebook we will leverage Snowpark Container Services (SPCS) to run a notebook within Snowflake on a series of GPUs\n\n### * Workflow* \n- Inspect GPU resources available - for this exercise we will use four NVIDIA A10G GPUs\n- Load in data from Snowflake table\n- Set up data for modeling\n- Train two XGBoost models - one trained with CPUs and one leveraging our GPU cluster\n- Compare runtimes and results of our models\n\n\n### * Key Takeaways* \n- SPCS allows users to run notebook workloads that execute on containers, rather than virtual warehouses in Snowflake\n- GPUs can greatly speed up model training jobs 🔥\n- Bringing in third party python libraries offers flexibility to leverage great contirbutions to the OSS ecosystem\n\n\n### Note - In order to successfully run !pip installs make sure you have enabled the external access integration with pypi\n- Do so by clicking on the drop down of the 🟢 Active kernel settings button, clicking Edit Compute Settings, then turning on the PYPI_ACCESS_INTEGRATION radio button in the external access tab"


PYPI_ACCESS_INTEGRATION is not a preset network integration; it has to be created by an account admin. So it would be better to link to instructions on how that's done.

Author

Done so in setup.sql

"codeCollapsed": false,
"collapsed": false
},
"source": "#Load in data from Snowflake table into a Snowpark dataframe\ntable = \"XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_TABLE\"\ndf = session.table(table)\ndf.count(), len(df.columns)",


XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_TABLE would not exist in a customer account by default.
Could we make the notebook self-contained by adding a cell that downloads the dataset and generates the Snowflake table? It could go in an appendix so it doesn't disrupt the flow.

Author

Done in setup.sql


@sfc-gh-halu sfc-gh-halu left a comment


Looks good overall.
Left two comments.

Comment on lines +11 to +16
CREATE OR REPLACE DATABASE XGB_GPU_DATABASE;
CREATE OR REPLACE SCHEMA XGB_GPU_SCHEMA;

-- create external stage with the csv format to stage the dataset
CREATE STAGE IF NOT EXISTS XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES
URL = 's3://sfquickstarts/misc/demos/vehicles.csv';


There's a mix of dataset setup SQL and SPCS/network integration setup here.

Could we consolidate the data preparation steps so they run sequentially (create db/schema, create stage, create table, COPY INTO)? That would make it easier for people to follow and to selectively copy and apply the setup (e.g., someone who already has everything else set up only needs to apply the data prep SQL).
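The consolidation suggested here could be sketched as an ordered list of statements executed one by one via Snowpark's `session.sql(...).collect()`. The list below reuses the SQL shown in the diff; the CREATE TABLE and COPY INTO steps are noted only in a comment since their definitions aren't shown here, and `run_setup` is illustrative rather than part of setup.sql.

```python
# Data-prep statements in dependency order (taken from the diff above;
# the CREATE TABLE / COPY INTO steps would follow but aren't shown here):
SETUP_SQL = [
    "CREATE OR REPLACE DATABASE XGB_GPU_DATABASE",
    "CREATE OR REPLACE SCHEMA XGB_GPU_SCHEMA",
    "CREATE STAGE IF NOT EXISTS XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES "
    "URL = 's3://sfquickstarts/misc/demos/vehicles.csv'",
]

def run_setup(session, statements):
    """Execute each statement in order; Snowpark's session.sql returns a
    DataFrame, and collect() triggers execution on the server."""
    for stmt in statements:
        session.sql(stmt).collect()
```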

"name": "model_training_takeaways",
"collapsed": false
},
"source": "## While results aren't entirely determinstic, you should have seen a 3-4x speedup in model training from CPU to GPU training. \n### Investigate in the logs from the two above cells where you see the message *[RayXGBoost] Finished XGBoost training* and look to the end of the line to see the pure training time for that model"


Do we want to keep "Investigate in the logs from the two above cells where you see the message [RayXGBoost] Finished XGBoost training and look to the end of the line to see the pure training time for that model"?
