Add an ML pipeline tutorial #16632
base: main
---

In the [Extract data from websites](/v3/tutorials/scraping) tutorial, you learned how to handle data dependencies and ingest large amounts of data.
Now, you'll learn how to train a machine learning model using your data.
I think we should have an overview of the steps we're going to go through here, like a table of contents, so that we can orient people.
The right nav effectively serves as a table of contents already, so how do you feel about showing a sequence diagram here instead?
Co-authored-by: Brendan O'Leary <[email protected]>
i like the idea of the tutorial, i have a couple prelim questions
docs/v3/tutorials/ml.mdx
Outdated
with open("train.py", "w") as f:
    f.write(training_script)

@task(cache_policy=None)
the builtin python `None` is not actually a valid cache policy, i think you want `cache_policies.NONE`
docs/v3/tutorials/ml.mdx
Outdated
f"""Create the training script dynamically"""
training_script = """import argparse
import boto3
import os
import json
import pandas as pd
import numpy as np
import xgboost as xgb
on first read, i don't like the idea of writing a script as a string. can you explain why we're doing this? I suspect there's a more first-class way to define this script as a normal python file
@zzstoatzz How do you feel about putting the template in a separate file instead? I agree it doesn't need to be inline.
6ff3d1b#diff-d52c09e66611f9f6f96e69993797b019afe33be9425b8081bbf2613f1b3bde36R116-R126
The only constraint is that this script is run by SageMaker, not Prefect, so we need to be able to give it a Python script to use as an entry point.
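As a rough sketch of how both points could be satisfied: keep the training code in an ordinary Python file, and have the flow copy it to wherever SageMaker expects its entry point. The names `train_template.py` and `write_entry_point` are hypothetical and not part of this PR; only the `train.py` file name comes from the diff under review.

```python
from pathlib import Path


def write_entry_point(template_path: Path, dest_dir: Path) -> Path:
    """Copy a standalone training script into dest_dir as train.py.

    Hypothetical helper: the script lives in a normal .py file (so it can be
    linted and tested) and is only copied at run time for SageMaker to use.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    entry_point = dest_dir / "train.py"
    # Copy the file contents verbatim; no string templating is involved.
    entry_point.write_text(
        template_path.read_text(encoding="utf-8"), encoding="utf-8"
    )
    return entry_point
```

The returned path could then be passed to whatever launches the training job, so the script is never embedded as a string literal in the flow.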
6a922c2 to 9f76aec
Good stuff! Minor nits.
}
}
}
```
```
We need this to close the JSON code block starting on line 469.
Co-authored-by: Jeff Hale <[email protected]>
Closes https://linear.app/prefect/issue/DOC-118/create-an-author-tutorial-for-model-training
Depends on PrefectHQ/demos#4 (before merging this PR, replace the inlined code with imports from the demo repo so we're not duplicating code)
This is the final tutorial in the data engineering tutorial series. It shows how to use Prefect webhooks and automations to automatically train a model with SageMaker whenever training data is uploaded to S3.
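For orientation, a minimal sketch of the first step in that chain: pulling the bucket and object key out of an S3 event notification so they can be handed to a training flow. It assumes the standard S3 notification shape (`Records[].s3.bucket.name` and `Records[].s3.object.key`). The payload that actually reaches a Prefect webhook depends on the webhook template used in the tutorial, and `uploaded_objects` is a hypothetical helper name.

```python
import json
from urllib.parse import unquote_plus


def uploaded_objects(event_body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification JSON string."""
    event = json.loads(event_body)
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # S3 URL-encodes object keys in notifications (spaces arrive as "+").
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs
```

Decoding the key matters because a training file uploaded with spaces or special characters in its name would otherwise not match the object stored in the bucket.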
Preview: https://prefect-bd373955-add_an_ml_tutorial.mintlify.app/v3/tutorials/ml
Checklist
- <link to issue>
- If this pull request adds new functionality, it includes unit tests that cover the changes
- If this pull request removes docs files, it includes redirect settings in mint.json.
- If this pull request adds functions or classes, it includes helpful docstrings.