Add an ML pipeline tutorial #16632

Open: wants to merge 35 commits into base: main

Contributor

@daniel-prefect daniel-prefect commented Jan 7, 2025

Closes https://linear.app/prefect/issue/DOC-118/create-an-author-tutorial-for-model-training

Depends on PrefectHQ/demos#4 (before merging this PR, replace the inlined code with imports from the demo repo so we're not duplicating code)

This is the final tutorial in the data engineering tutorial series. It shows how to use Prefect webhooks and automations to automatically train a model with SageMaker whenever training data is uploaded to S3.

Preview: https://prefect-bd373955-add_an_ml_tutorial.mintlify.app/v3/tutorials/ml

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
  • If this pull request adds new functionality, it includes unit tests that cover the changes
  • If this pull request removes docs files, it includes redirect settings in mint.json.
  • If this pull request adds functions or classes, it includes helpful docstrings.

@daniel-prefect daniel-prefect self-assigned this Jan 7, 2025
@daniel-prefect daniel-prefect marked this pull request as ready for review January 7, 2025 23:43
---

In the [Extract data from websites](/v3/tutorials/scraping) tutorial, you learned how to handle data dependencies and ingest large amounts of data.
Now, you'll learn how to train a machine learning model using your data.
Contributor

I think we should have an overview of the steps we're going to go through here, like a table of contents, so that we can orient people.

Contributor Author

@daniel-prefect daniel-prefect Jan 8, 2025

The right nav effectively serves as a table of contents already, so how do you feel about showing a sequence diagram here instead?

9f76aec

Collaborator

@zzstoatzz zzstoatzz left a comment

i like the idea of the tutorial, i have a couple prelim questions

with open("train.py", "w") as f:
    f.write(training_script)

@task(cache_policy=None)
Collaborator

the builtin Python `None` is not actually a valid cache policy; i think you want `cache_policies.NONE`

Comment on lines 91 to 98
f"""Create the training script dynamically"""
training_script = """import argparse
import boto3
import os
import json
import pandas as pd
import numpy as np
import xgboost as xgb
Collaborator

on first read, i don't like the idea of writing a script as a string. can you explain why we're doing this? I suspect there's a more first-class way to define this script as a normal Python file

Contributor Author

@zzstoatzz How do you feel about putting the template in a separate file instead? I agree it doesn't need to be inline.

6ff3d1b#diff-d52c09e66611f9f6f96e69993797b019afe33be9425b8081bbf2613f1b3bde36R116-R126

Contributor Author

The only constraint is that this script is run by SageMaker, not Prefect, so we need to be able to give it a Python script to use as an entry point.

https://github.com/PrefectHQ/prefect/pull/16632/files#diff-d52c09e66611f9f6f96e69993797b019afe33be9425b8081bbf2613f1b3bde36R145
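
For illustration only, here is a minimal sketch of the "separate file" approach discussed in this thread. It assumes the training script lives in its own template file and is copied to `train.py`, so that a real file path can be handed to SageMaker as the entry point. The helper name `write_entry_point` and the file name `train_template.py` are hypothetical and not taken from the PR; the actual change is in the commit linked above.

```python
from pathlib import Path


def write_entry_point(template_path: Path, output_path: Path) -> Path:
    """Copy a training script template to the path used as the entry point.

    Hypothetical helper: the names and paths here are assumptions made for
    this sketch, not the tutorial's actual code.
    """
    # Read the script as plain text. It is never imported or executed here,
    # because the training job, not the orchestrating process, runs it.
    script = template_path.read_text(encoding="utf-8")
    output_path.write_text(script, encoding="utf-8")
    return output_path
```

A caller would do something like `write_entry_point(Path("train_template.py"), Path("train.py"))` and pass the returned path on as the entry point. Keeping the script in a normal `.py` file also means editors and linters can check it, which is harder with a string literal.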

Contributor

@discdiver discdiver left a comment

Good stuff! Minor nits.

}
}
}
```
Contributor

Suggested change
```

Contributor Author

@daniel-prefect daniel-prefect Jan 9, 2025

We need this to close the JSON code block starting on line 469.

5 participants