
link_existing_data_as_artifact function #3053

Open · wants to merge 11 commits into develop

Conversation

@avishniakov (Contributor) commented on Oct 4, 2024

Describe changes

I implemented the link_existing_data_as_artifact function to support use cases where users want to link data that already exists within the artifact store scope and was saved manually by some other tool.

The docs section includes several examples of the usage in various scenarios.

Example use case: PyTorch Lightning can store checkpoints directly on a given remote storage, so there is no need to call save_artifact in an end-of-epoch callback or similar; the existing checkpoint can simply be linked so that it is available as a ZenML artifact for future use.
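
For illustration, a minimal sketch of the intended usage in the Lightning scenario; the import path and the parameter names (uri, name) are assumptions based on the examples discussed in the review below, not the confirmed final API.

from pytorch_lightning import Trainer

# Assumed import path and signature for the helper added in this PR.
from zenml.artifacts.utils import link_existing_data_as_artifact

# Define the model and fit it; Lightning writes checkpoints straight to remote storage.
model = ...  # your LightningModule (elided here, as in the docs example)
trainer = Trainer(default_root_dir="s3://my_bucket/my_model_data/")
trainer.fit(model)

# Instead of calling save_artifact, link the checkpoints Lightning already wrote
# as a new version of a ZenML artifact.
link_existing_data_as_artifact(
    uri="s3://my_bucket/my_model_data/ckpts",
    name="my_model_ckpts",
)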

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the contribution guide on rebasing your branch onto develop.
  • If my changes require changes to the dashboard, these changes are communicated/requested.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

coderabbitai bot commented on Oct 4, 2024

Review skipped: automatic reviews are disabled on this repository. Check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.



@github-actions bot added the internal (To filter out internal PRs and issues) and enhancement (New feature or request) labels on Oct 4, 2024
@htahir1 (Contributor) left a comment

Left some comments as most people are out today :-)


# Define the model and fit it
model = ...
trainer = Trainer(default_root_dir="s3://my_bucket/my_model_data/")
Contributor:

It might make sense to use some internal methods to get the bucket from the artifact store, like active_stack.artifact_store.uri, because then it can also work in the local case without hardcoding?
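
A sketch of what that could look like; the exact attribute name on the artifact store (path vs. uri) is an assumption here.

import os

from pytorch_lightning import Trainer
from zenml.client import Client

# Derive the checkpoint root from the active artifact store instead of
# hardcoding an S3 bucket, so the same code also works with a local store.
artifact_store_root = Client().active_stack.artifact_store.path
trainer = Trainer(default_root_dir=os.path.join(artifact_store_root, "my_model_data"))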

Contributor Author (@avishniakov):

Reworked.

finally:
    # We now link those checkpoints in ZenML as an artifact
    # This will create a new artifact version
    link_folder_as_artifact(folder_uri="s3://my_bucket/my_model_data/ckpts", name="my_model_ckpts")
Contributor:

Can't we also have callbacks to do this one by one, in case training gets interrupted? It might make sense to have a link_file_as_artifact for that.

Contributor:

How will versioning work in this case?

Contributor Author (@avishniakov):

In this case, versioning would not do much, but I added an extra example to this section.
In fact, link_file_as_artifact would mean having to move data around, since our materializer relies on folders, AFAIK. I would wait for @schustmi to confirm on Monday, but even with the current folder-based approach, I'm rather satisfied with how this experience could look inside a Lightning Callback directly.
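
For illustration, a rough sketch of how that could be wrapped in a Lightning Callback; the linking function and its keyword arguments follow the docs snippet quoted above, while the import path is an assumption.

from pytorch_lightning import Callback, LightningModule, Trainer

# Assumed import path for the helper added in this PR.
from zenml.artifacts.utils import link_folder_as_artifact


class LinkCheckpointsCallback(Callback):
    """Links the checkpoint folder as a ZenML artifact at the end of each epoch,
    so an interrupted training run still leaves a linked artifact version behind."""

    def __init__(self, ckpt_dir: str, artifact_name: str) -> None:
        self.ckpt_dir = ckpt_dir
        self.artifact_name = artifact_name

    def on_train_epoch_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        # Each call creates a new artifact version pointing at the same folder.
        link_folder_as_artifact(folder_uri=self.ckpt_dir, name=self.artifact_name)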

Contributor:

I'm guessing the callback will create files with the epoch name in the checkpoints directory? In that case, the repeated calls to this method would create artifact versions, but they would all point to the same directory, which will contain all the checkpoint files?

Currently, when ZenML creates an artifact, we usually point the path to a directory. However, we might be able to work around that by simply creating an artifact that points to a specific file, as long as the materializer handles it correctly? Not sure if there are additional checks and whether we get the exact checkpoint file URL inside the handler, but worth a try I guess.

If that's not possible, it's not that useful to actually call this at the end of every epoch, right? Just doing it once before the training call would be good enough, I assume?

Contributor Author (@avishniakov), Oct 7, 2024:

I showed how this is meant to be used in the docs example, but you are right: pointing to a file is more intuitive for the checkpointing case (and maybe for others too). I will try to switch gears here and point to the file or directory, depending on the input, instead.

Contributor Author (@avishniakov):

I made it available for single files as well.

Contributor Author (@avishniakov):

@wjayesh, since we had a word about this on Discord, you might want to look into this before we merge as well.

from zenml.materializers.base_materializer import BaseMaterializer


class LoaderDirectoryMaterializer(BaseMaterializer):
Contributor:

If I remember correctly, I implemented a DirectoryMaterializer which works to both save and load directories when an artifact is of type Path. Any specific reason we do not implement the save(...) method for this materializer as well?

I might be wrong here, but because this materializer has pathlib.Path as its associated type, it will even be registered for this type in the materializer registry in case it gets imported somehow? Which would then break all returns of type pathlib.Path.

Contributor Author (@avishniakov):

The save was intentionally not implemented, to protect against misuse of this specific materializer. And this implementation echoes the DirectoryMaterializer you created in projects.

Yes, it works with pathlib.Path. Shall I create a child class of pathlib.Path to protect against this case?

Contributor:

I see. I think my preferred option would be to somehow allow special materializers without associated types that don't get picked up as the default for anything in the registry. I'm not sure, however, how much effort that is.

Something I think we should do in any case, even if we keep it as-is: the NotImplementedError that gets raised in the save(...) method should explain that this is a special materializer and should not be used for regular artifacts.
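
A sketch of what such an error message could look like, reusing the class name from the diff; the associated type and the save(...) signature shown here are assumptions based on this thread.

from pathlib import Path
from typing import Any

from zenml.materializers.base_materializer import BaseMaterializer


class LoaderDirectoryMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (Path,)

    def save(self, data: Any) -> None:
        raise NotImplementedError(
            "LoaderDirectoryMaterializer is a special materializer intended only "
            "for linking data that already exists in the artifact store. It is not "
            "meant to save regular step output artifacts; use a standard "
            "materializer for those."
        )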

Contributor Author (@avishniakov):

I implemented it via a PreexistingArtifactPath class derived from Path; otherwise the input-artifact logic blows up. Please check the recent changes again 🙂
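
A minimal sketch of what such a marker type could look like; this is purely illustrative and the actual implementation in this PR may differ.

from pathlib import Path


# Subclass the concrete Path flavour so instances behave like regular paths,
# but carry a distinct type that only the special materializer is registered
# for; plain pathlib.Path returns stay unaffected.
class PreexistingArtifactPath(type(Path())):
    """Marks data that already exists in the artifact store and should only be
    linked as an artifact, never re-saved by the default Path materializer."""


# Usage: wrap the location of already-stored data before handing it to ZenML.
ckpts = PreexistingArtifactPath("/local/artifact_store/my_model_data/ckpts")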

@avishniakov changed the title from link_folder_as_artifact function to link_existing_data_as_artifact function on Oct 7, 2024
Labels: enhancement (New feature or request), internal (To filter out internal PRs and issues), run-slow-ci

3 participants