Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor pipeline interface #901

Merged
merged 8 commits into from
Apr 5, 2024
Merged

Conversation

mrchtr
Copy link
Contributor

@mrchtr mrchtr commented Mar 11, 2024

Feature branch for refactoring the pipeline interface.

Going to create separate PRs to maintain an overview of all changes and avoid breaking the main branch.

@RobbeSneyders
Copy link
Member

FYI, the docs pipeline is failing because we reference the Pipeline class in the docs.

@mrchtr mrchtr mentioned this pull request Mar 27, 2024
Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mrchtr, great work!

I haven't reviewed the tests yet (the PR is quite big 😅), but already left my comments on the code.

docs/pipeline.md Show resolved Hide resolved
Comment on lines 908 to 909
f"Found multiple instantiated datasets in {module_str}. Use the first dataset to start "
f"the execution."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"Found multiple instantiated datasets in {module_str}. Use the first dataset to start "
f"the execution."
f"Found multiple instantiated datasets in {module_str}. Using the first dataset to start "
f"the execution."

Shouldn't we use the last dataset? Or should it not matter? And if it doesn't matter, should we even log this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked it again. We don't need it since we don't create a deepcopy of the graph. Every dataset will contain the same computational graph after applying a new operations. We are not able to create a deepcopy since the graph contains some mappingproxy objects. Shouldn't be a problem for now, if we just allow a single dataset object in a module.

For the case we allow something like this:

dataset_1 = Dataset.read(...)
dataset_2 = dataset_1.apply(...)
dataset_3 = dataset_2.apply(...)

we will run into issues. For now we could add a validation that every dataset of a module contains the same graph, wdyt?

src/fondant/component/data_io.py Outdated Show resolved Hide resolved
src/fondant/core/component_spec.py Outdated Show resolved Hide resolved
src/fondant/core/exceptions.py Outdated Show resolved Hide resolved
src/fondant/core/schema.py Outdated Show resolved Hide resolved
src/fondant/dataset/dataset.py Outdated Show resolved Hide resolved
src/fondant/dataset/dataset.py Outdated Show resolved Hide resolved
src/fondant/dataset/dataset.py Outdated Show resolved Hide resolved
src/fondant/dataset/dataset.py Outdated Show resolved Hide resolved
mrchtr and others added 4 commits April 4, 2024 11:06
First steps for the implementation of the new dataset interface: 
- Removed Pipeline class
- Added Workspace singleton to hold pipeline name, base_path, etc. ..
(shouldn't be the focus of this PR)
- Moved `Pipeline.read(..)` to Dataset class
Implements the new workflow including initialising datasets from
manifest.

- removes workspace again, using working_directory argument now to
enable something like `fondant run ... --working_directory gs://...`
- removes base_path from the manifest and changes the field location to
absolute paths
- changes the executor to generate absolute field locations
- add `Dataset.read()` for initialising dataset from manifest path
- adjust compiler code


Try to create sketch out the flow and changes. Still have to fix the
test cases and test the changes end to end for all runners.


![image](https://github.com/ml6team/fondant/assets/15777729/8c330daf-2076-493c-aea2-b7d3f85884b8)
@GeorgesLorre GeorgesLorre force-pushed the feature/refactore-pipeline-interface branch from 4ac7a0b to 59f6de3 Compare April 4, 2024 09:06
@mrchtr mrchtr force-pushed the feature/refactore-pipeline-interface branch from 59f6de3 to a6561be Compare April 5, 2024 06:32
@mrchtr mrchtr merged commit d9b5496 into main Apr 5, 2024
7 checks passed
@mrchtr mrchtr deleted the feature/refactore-pipeline-interface branch April 5, 2024 06:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants