Adding XGBoost on GPU quickstart materials #45
Conversation
"name": "Intro", | ||
"collapsed": false | ||
}, | ||
"source": "# GPU Based XGBoost Training\n## In the following notebook we will leverage Snowpark Container Services (SPCS) to run a notebook within Snowflake on a series of GPUs\n\n### * Workflow* \n- Inspect GPU resources available - for this exercise we will use four NVIDIA A10G GPUs\n- Load in data from Snowflake table\n- Set up data for modeling\n- Train two XGBoost models - one trained with CPUs and one leveraging our GPU cluster\n- Compare runtimes and results of our models\n\n\n### * Key Takeaways* \n- SPCS allows users to run notebook workloads that execute on containers, rather than virtual warehouses in Snowflake\n- GPUs can greatly speed up model training jobs 🔥\n- Bringing in third party python libraries offers flexibility to leverage great contirbutions to the OSS ecosystem\n\n\n### Note - In order to successfully run !pip installs make sure you have enabled the external access integration with pypi\n- Do so by clicking on the drop down of the 🟢 Active kernel settings button, clicking Edit Compute Settings, then turning on the PYPI_ACCESS_INTEGRATION radio button in the external access tab" |
PYPI_ACCESS_INTEGRATION
is not a preset integration; it has to be created by an account admin. So it would be better to link to instructions for setting it up.
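For reference, a minimal sketch of the admin-side setup, assuming the integration keeps the name PYPI_ACCESS_INTEGRATION (the network rule and role names below are illustrative, not necessarily what setup.sql uses):

-- Run as ACCOUNTADMIN (or a role with CREATE INTEGRATION privileges).
USE DATABASE XGB_GPU_DATABASE;
USE SCHEMA XGB_GPU_SCHEMA;

-- Network rule allowing egress to the PyPI hosts (rule name is illustrative).
CREATE OR REPLACE NETWORK RULE PYPI_NETWORK_RULE
    MODE = EGRESS
    TYPE = HOST_PORT
    VALUE_LIST = ('pypi.org', 'pypi.python.org', 'pythonhosted.org', 'files.pythonhosted.org');

-- External access integration that the notebook's kernel settings toggle on.
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION PYPI_ACCESS_INTEGRATION
    ALLOWED_NETWORK_RULES = (PYPI_NETWORK_RULE)
    ENABLED = TRUE;

-- Grant usage to the role that runs the notebook (role name is an assumption).
GRANT USAGE ON INTEGRATION PYPI_ACCESS_INTEGRATION TO ROLE SYSADMIN;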
Done so in setup.sql
"codeCollapsed": false, | ||
"collapsed": false | ||
}, | ||
"source": "#Load in data from Snowflake table into a Snowpark dataframe\ntable = \"XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_TABLE\"\ndf = session.table(table)\ndf.count(), len(df.columns)", |
XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_TABLE
would not exist in a customer account by default.
Could we make this self-contained with a cell that downloads the dataset and generates the Snowflake table? It could go in an appendix so it doesn't disrupt the flow.
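For example, a self-contained setup could generate the table straight from the staged CSV. A sketch, assuming the VEHICLES stage that setup.sql creates (the file format name here is made up; INFER_SCHEMA avoids hardcoding the column list):

-- Hypothetical file format; PARSE_HEADER is required for INFER_SCHEMA on CSVs.
CREATE OR REPLACE FILE FORMAT XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_CSV_FF
    TYPE = CSV
    PARSE_HEADER = TRUE;

-- Create the table with columns inferred from the staged file.
CREATE OR REPLACE TABLE XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_TABLE
    USING TEMPLATE (
        SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
        FROM TABLE(INFER_SCHEMA(
            LOCATION => '@XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES',
            FILE_FORMAT => 'XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_CSV_FF'
        ))
    );

-- Load the staged file, mapping columns by header name.
COPY INTO XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_TABLE
    FROM @XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES
    FILE_FORMAT = (FORMAT_NAME = 'XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_CSV_FF')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;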
Done in setup.sql
Looks good overall.
Left two comments.
CREATE OR REPLACE DATABASE XGB_GPU_DATABASE;
CREATE OR REPLACE SCHEMA XGB_GPU_SCHEMA;

-- create external stage with the csv format to stage the dataset
CREATE STAGE IF NOT EXISTS XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES
    URL = 's3://sfquickstarts/misc/demos/vehicles.csv';
There's a mix of dataset setup SQL and SPCS / network integration setup here.
Could we consolidate the data preparation steps so they run sequentially (create db/schema, create stage, create table, COPY INTO)? That would make it easier to follow and to selectively copy for one's own setup (e.g., someone who already has everything else configured would only need the data prep SQL).
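For instance, the data-prep portion could read as one sequential block, like this sketch (the column list in step 3 is only a placeholder to keep the example runnable; the real vehicles.csv schema differs, and the INFER_SCHEMA variant sketched earlier could replace steps 3 and 4):

-- 1. Database and schema
CREATE OR REPLACE DATABASE XGB_GPU_DATABASE;
CREATE OR REPLACE SCHEMA XGB_GPU_DATABASE.XGB_GPU_SCHEMA;

-- 2. External stage pointing at the public dataset
CREATE STAGE IF NOT EXISTS XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES
    URL = 's3://sfquickstarts/misc/demos/vehicles.csv';

-- 3. Target table (placeholder columns, for illustration only)
CREATE OR REPLACE TABLE XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_TABLE (
    PRICE NUMBER,
    YEAR NUMBER,
    ODOMETER NUMBER
);

-- 4. Load the staged file into the table
COPY INTO XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES_TABLE
    FROM @XGB_GPU_DATABASE.XGB_GPU_SCHEMA.VEHICLES
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);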
"name": "model_training_takeaways", | ||
"collapsed": false | ||
}, | ||
"source": "## While results aren't entirely determinstic, you should have seen a 3-4x speedup in model training from CPU to GPU training. \n### Investigate in the logs from the two above cells where you see the message *[RayXGBoost] Finished XGBoost training* and look to the end of the line to see the pure training time for that model" |
Do we want to keep "In the logs of the two cells above, find the message [RayXGBoost] Finished XGBoost training and look at the end of that line for the pure training time of each model"?