
link_existing_data_as_artifact function #3053

Open · wants to merge 11 commits into develop

Conversation

@avishniakov (Contributor) commented on Oct 4, 2024

Describe changes

I implemented the link_existing_data_as_artifact function to support use cases where users want to link data that already exists within the artifact store scope and was saved manually by some other tool.

The docs section includes several examples of the usage in various scenarios.

Example use case: PyTorch Lightning can store checkpoints directly on a given remote storage, so there is no need to call save_artifact in an end-of-epoch callback or similar; the existing checkpoint can simply be linked so that it is available as a ZenML artifact for future use.
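
For illustration, a minimal sketch of the intended usage in the Lightning scenario; the import path and the parameter names (uri, name) are assumptions based on the examples discussed in the review below, not the confirmed final API.

from pytorch_lightning import Trainer

# Assumed import path and signature for the helper added in this PR.
from zenml.artifacts.utils import link_existing_data_as_artifact

# Define the model and fit it; Lightning writes checkpoints straight to remote storage.
model = ...  # your LightningModule (elided here, as in the docs example)
trainer = Trainer(default_root_dir="s3://my_bucket/my_model_data/")
trainer.fit(model)

# Instead of calling save_artifact, link the checkpoints Lightning already wrote
# as a new version of a ZenML artifact.
link_existing_data_as_artifact(
    uri="s3://my_bucket/my_model_data/ckpts",
    name="my_model_ckpts",
)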

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the contribution guide on rebasing your branch onto develop.
  • If my changes require changes to the dashboard, these changes are communicated/requested.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

coderabbitai bot commented on Oct 4, 2024

Review skipped: automatic reviews are disabled on this repository. Check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.



@github-actions bot added the internal (To filter out internal PRs and issues) and enhancement (New feature or request) labels on Oct 4, 2024
@htahir1 (Contributor) left a comment

Left some comments as most people are out today :-)


# Define the model and fit it
model = ...
trainer = Trainer(default_root_dir="s3://my_bucket/my_model_data/")
Contributor:

It might make sense to use some internal methods to get the bucket from the artifact store, like active_stack.artifact_store.uri, because then it can also work in the local case without hardcoding?
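
A sketch of what that could look like; the exact attribute name on the artifact store (path vs. uri) is an assumption here.

import os

from pytorch_lightning import Trainer
from zenml.client import Client

# Derive the checkpoint root from the active artifact store instead of
# hardcoding an S3 bucket, so the same code also works with a local store.
artifact_store_root = Client().active_stack.artifact_store.path
trainer = Trainer(default_root_dir=os.path.join(artifact_store_root, "my_model_data"))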

Contributor Author (@avishniakov):

Reworked.

finally:
    # We now link those checkpoints in ZenML as an artifact
    # This will create a new artifact version
    link_folder_as_artifact(folder_uri="s3://my_bucket/my_model_data/ckpts", name="my_model_ckpts")
Contributor:

Can't we also have callbacks to do this one by one, in case training gets interrupted? It might make sense to have a link_file_as_artifact for that.

Contributor:

How will versioning work in this case?

Contributor Author (@avishniakov):

In this case, versioning would not do much, but I added an extra example to this section.
In fact, link_file_as_artifact would mean having to move data around, since our materializer relies on folders, AFAIK. I would wait for @schustmi to confirm on Monday, but even with the current folder-based approach, I'm rather satisfied with how this experience could look inside a Lightning Callback directly.
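
For illustration, a rough sketch of how that could be wrapped in a Lightning Callback; the linking function and its keyword arguments follow the docs snippet quoted above, while the import path is an assumption.

from pytorch_lightning import Callback, LightningModule, Trainer

# Assumed import path for the helper added in this PR.
from zenml.artifacts.utils import link_folder_as_artifact


class LinkCheckpointsCallback(Callback):
    """Links the checkpoint folder as a ZenML artifact at the end of each epoch,
    so an interrupted training run still leaves a linked artifact version behind."""

    def __init__(self, ckpt_dir: str, artifact_name: str) -> None:
        self.ckpt_dir = ckpt_dir
        self.artifact_name = artifact_name

    def on_train_epoch_end(self, trainer: Trainer, pl_module: LightningModule) -> None:
        # Each call creates a new artifact version pointing at the same folder.
        link_folder_as_artifact(folder_uri=self.ckpt_dir, name=self.artifact_name)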

Contributor:

I'm guessing the callback will create files with the epoch name in the checkpoints directory? In that case, the repeated calls to this method would create artifact versions, but they would all point to the same directory, which will contain all the checkpoint files?

Currently, when ZenML creates an artifact, we usually point the path to a directory. However, we might be able to work around that by simply creating an artifact that points to a specific file, as long as the materializer handles it correctly? Not sure if there are additional checks and whether we get the exact checkpoint file URL inside the handler, but worth a try I guess.

If that's not possible, it's not that useful to actually call this at the end of every epoch, right? Just doing it once before the training call would be good enough, I assume?

Contributor Author (@avishniakov), Oct 7, 2024:

I showed how this is meant to be used in the docs example, but you are right: pointing to a file is more intuitive for the checkpointing case (and maybe for others too). I will try to switch gears here and point to the file or directory, depending on the input, instead.

Contributor Author (@avishniakov):

I made it available for single files as well.

Contributor Author (@avishniakov):

@wjayesh, since we had a word about this on Discord, you might want to look into this before we merge as well.

from zenml.materializers.base_materializer import BaseMaterializer


class LoaderDirectoryMaterializer(BaseMaterializer):
Contributor:

If I remember correctly, I implemented a DirectoryMaterializer which works to both save and load directories when an artifact is of type Path. Any specific reason we do not implement the save(...) method for this materializer as well?

I might be wrong here, but because this materializer has pathlib.Path as its associated type, it will even be registered for this type in the materializer registry in case it gets imported somehow? Which would then break all returns of type pathlib.Path.

Contributor Author (@avishniakov):

The save was intentionally not implemented, to protect against misuse of this specific materializer. And this implementation echoes the DirectoryMaterializer you created in projects.

Yes, it works with pathlib.Path. Shall I create a child class of pathlib.Path to protect against this case?

Contributor:

I see. I think my preferred option would be to somehow allow special materializers without associated types that don't get picked up as the default for anything in the registry. I'm not sure, however, how much effort that is.

Something I think we should do in any case, even if we keep it as-is: the NotImplementedError that gets raised in the save(...) method should explain that this is a special materializer and should not be used for regular artifacts.
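
A sketch of what such an error message could look like, reusing the class name from the diff; the associated type and the save(...) signature shown here are assumptions based on this thread.

from pathlib import Path
from typing import Any

from zenml.materializers.base_materializer import BaseMaterializer


class LoaderDirectoryMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (Path,)

    def save(self, data: Any) -> None:
        raise NotImplementedError(
            "LoaderDirectoryMaterializer is a special materializer intended only "
            "for linking data that already exists in the artifact store. It is not "
            "meant to save regular step output artifacts; use a standard "
            "materializer for those."
        )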

Contributor Author (@avishniakov):

I implemented it via a PreexistingArtifactPath class derived from Path; otherwise the input-artifact logic blows up. Please check the recent changes again 🙂
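
A minimal sketch of what such a marker type could look like; this is purely illustrative and the actual implementation in this PR may differ.

from pathlib import Path


# Subclass the concrete Path flavour so instances behave like regular paths,
# but carry a distinct type that only the special materializer is registered
# for; plain pathlib.Path returns stay unaffected.
class PreexistingArtifactPath(type(Path())):
    """Marks data that already exists in the artifact store and should only be
    linked as an artifact, never re-saved by the default Path materializer."""


# Usage: wrap the location of already-stored data before handing it to ZenML.
ckpts = PreexistingArtifactPath("/local/artifact_store/my_model_data/ckpts")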

@avishniakov changed the title from link_folder_as_artifact function to link_existing_data_as_artifact function on Oct 7, 2024
Labels: enhancement (New feature or request), internal (To filter out internal PRs and issues), run-slow-ci

3 participants