Add an ML pipeline tutorial #16632

daniel-prefect · 2025-01-07T18:41:00Z

Closes https://linear.app/prefect/issue/DOC-118/create-an-author-tutorial-for-model-training

Depends on PrefectHQ/demos#4 (before merging this PR, replace the inlined code with imports from the demo repo so we're not duplicating code)

This is the final tutorial in the data engineering tutorial series. It shows how to use Prefect webhooks and automations to train a model with SageMaker whenever training data is uploaded to S3.

Preview: https://prefect-bd373955-add_an_ml_tutorial.mintlify.app/v3/tutorials/ml

Checklist

This pull request references any related issue by including "closes <link to issue>"
~~If this pull request adds new functionality, it includes unit tests that cover the changes~~
~~If this pull request removes docs files, it includes redirect settings in mint.json.~~
~~If this pull request adds functions or classes, it includes helpful docstrings.~~

olearycrew · 2025-01-08T14:19:50Z

docs/v3/tutorials/ml.mdx

+---
+
+In the [Extract data from websites](/v3/tutorials/scraping) tutorial, you learned how to handle data dependencies and ingest large amounts of data.
+Now, you'll learn how to train a machine learning model using your data.


I think we should have an overview of the steps we're going to go through here. Like a table of contents so that we can orient people

The right nav effectively serves as a table of contents already, so how do you feel about showing a sequence diagram here instead?

9f76aec

zzstoatzz

i like the idea of the tutorial, i have a couple prelim questions

zzstoatzz · 2025-01-08T17:54:00Z

docs/v3/tutorials/ml.mdx

+    with open("train.py", "w") as f:
+        f.write(training_script)
+
+@task(cache_policy=None)


the builtin python None is not actually a valid cache policy, i think you want cache_policies.NONE

zzstoatzz · 2025-01-08T17:56:11Z

docs/v3/tutorials/ml.mdx

+    f"""Create the training script dynamically"""
+    training_script = """import argparse
+import boto3
+import os
+import json
+import pandas as pd
+import numpy as np
+import xgboost as xgb


on first read, i don't like the idea of writing a script as a string. can you explain why we're doing this? I suspect there's a more first class way to define this script as a normal python file

@zzstoatzz How do you feel about putting the template in a separate file instead? I agree it doesn't need to be inline.

6ff3d1b#diff-d52c09e66611f9f6f96e69993797b019afe33be9425b8081bbf2613f1b3bde36R116-R126

The only constraint is that this script is run by SageMaker, not Prefect, so we need to be able to give it a Python script to use as an entry point.

https://github.com/PrefectHQ/prefect/pull/16632/files#diff-d52c09e66611f9f6f96e69993797b019afe33be9425b8081bbf2613f1b3bde36R145
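For illustration, here is a minimal sketch of the separate-file approach discussed above: the training script lives in its own template file, and the flow copies it to `train.py` so that SageMaker has a Python entry point to run. The file name `sagemaker_train_template.py` and the helper `write_entry_point` are hypothetical, not names from this PR, and the sketch uses only the standard library rather than any Prefect or SageMaker API.

```python
from pathlib import Path
import tempfile

# Hypothetical file names chosen for this sketch, not taken from the PR.
TEMPLATE_NAME = "sagemaker_train_template.py"
ENTRY_POINT_NAME = "train.py"


def write_entry_point(template_dir: Path, work_dir: Path) -> Path:
    """Copy the training script template into work_dir as train.py."""
    script = (template_dir / TEMPLATE_NAME).read_text(encoding="utf-8")
    entry_point = work_dir / ENTRY_POINT_NAME
    entry_point.write_text(script, encoding="utf-8")
    return entry_point


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        # Stand-in template so the sketch runs on its own.
        (tmp_path / TEMPLATE_NAME).write_text(
            "print('training')\n", encoding="utf-8"
        )
        print(write_entry_point(tmp_path, tmp_path).name)
```

Keeping the script in a regular `.py` file means it can be linted and edited like any other module, while the flow still only needs to hand SageMaker a path to the written entry point.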

discdiver

Good stuff! Minor nits.

discdiver · 2025-01-08T20:46:46Z

docs/v3/tutorials/ml.mdx

+    }
+  }
+}
+```


Suggested change: add a closing triple-backtick fence on this line.

We need this to close the JSON code block starting on line 469.

djsauble and others added 9 commits December 11, 2024 15:13

Add skeleton for the machine learning tutorial

26322c6

Merge branch 'main' into add_an_ml_tutorial

26af78c

Add roughed-in instructions for setting up S3, EventBridge, and webhooks

5c27566

Merge branch 'main' into add_an_ml_tutorial

07245b0

Add more instructions for how to configure a GPU-enabled work pool

bdf4aff

Tweak instructions

218a28d

Merge branch 'main' into add_an_ml_tutorial

0d66477

Merge branch 'main' into add_an_ml_tutorial

def4ba1

Update instructions to match what actually ended up working

b92875c

daniel-prefect self-assigned this Jan 7, 2025

Merge branch 'main' into add_an_ml_tutorial

e22d9b8

github-actions bot added the docs label Jan 7, 2025

mintlify bot deployed to staging January 7, 2025 18:42

daniel-prefect added 5 commits January 7, 2025 10:43

Include a link to the Iris dataset

b559ce4

Fix case in expandable titles

0a5487f

Improve naming of Prefect resources

19348ef

Improve wording

fae0c5c

A few more formatting improvements

2a12fe6

daniel-prefect mentioned this pull request Jan 7, 2025

Add flows which train a model and run inference from the trained model PrefectHQ/demos#4

Open

daniel-prefect added 6 commits January 7, 2025 14:17

Relocate instructions for setting up a work pool and worker

388da26

Specify where you need to go to create webhooks and automations

3ed3a90

More fixes

a7145bb

Add a mild event suppression to automations

d51a580

Update code snippets

b3dc9e8

Clarify that the S3 Bucket block is for the model bucket not the data bucket

0d532ac

daniel-prefect marked this pull request as ready for review January 7, 2025 23:43

daniel-prefect requested review from discdiver, cicdw, desertaxle and zzstoatzz as code owners January 7, 2025 23:43

Merge branch 'main' into add_an_ml_tutorial

f816837

daniel-prefect requested review from olearycrew, kevingrismore and EmilRex January 7, 2025 23:43

olearycrew requested changes Jan 8, 2025

daniel-prefect and others added 2 commits January 8, 2025 08:12

Apply suggestions from code review

fd5ce04

Co-authored-by: Brendan O'Leary

Clarify that the ML tutorial doesn't use the data from the web scraping tutorial

1030df7

zzstoatzz reviewed Jan 8, 2025

Add a sequence diagram to show the pipeline you'll construct

9f76aec

daniel-prefect force-pushed the add_an_ml_tutorial branch from 6a922c2 to 9f76aec January 8, 2025 18:21

daniel-prefect added 2 commits January 8, 2025 10:45

Update the model training flow to use a separate template file for the SageMaker script

6ff3d1b

Use the correct cache policy object

07acafa

daniel-prefect requested review from olearycrew and zzstoatzz January 8, 2025 19:00

daniel-prefect added 5 commits January 8, 2025 11:01

Merge branch 'main' into add_an_ml_tutorial

7b8b864

Update title since the webhooks are not related to S3

08a9dea

Straight up say that this tutorial is about using webhooks

69f49ae

Position notes over the participant that they apply to

55c9a21

Remove the device parameter since it isn't used

1c01bef

discdiver reviewed Jan 8, 2025

daniel-prefect and others added 2 commits January 9, 2025 08:23

Apply suggestions from code review

6f07b94

Co-authored-by: Jeff Hale

Apply more suggestions from code review

e230862

daniel-prefect requested a review from discdiver January 9, 2025 16:26

Merge branch 'main' into add_an_ml_tutorial

d8e50c6
