Enable eager execution #830
I quickly went through your PoC code for the Christmas project. Here are a few thoughts from my end.

When it comes to notebooks, in my opinion, we should aim for super-fast execution of components. We should be able to execute a cell and see results immediately. I would execute the component in the interactive environment directly. I don't think we need to develop a new runner. If we apply some limitations to notebooks, only allow one Python version, and install all dependencies within the environment, we can come up with something else.

I considered constructing two additional classes, e.g., `dataset = Pipeline(...).read("some_component")`

In the case of lightweight components, we can directly execute the component code. We must ensure that all extra requirements are installed locally (if not, we can install them using subprocess calls). In the case of reusable components, we should be able to load the component code and execute it. Currently, the components are part of the source folder. The difference is that we would have to use the yaml specification to evaluate the schema. I'm unsure if this is scalable when we don't have all available reusable components in our repository, e.g., a community member pushing components to a different Docker Hub namespace, etc. However, this isn't possible at the moment either.

In both cases, the distinction from the existing classes would be executing the component immediately instead of just generating a `ComponentOp`.

I propose adding an option to limit the read operation. This would load only a single partition or a restricted size of the dataset, facilitating faster development iterations.

Another reason to use `InteractiveDataset` is that it allows us to contain a pandas dataframe and provide methods to share the dataset with others. We could even override some functions, such as:

```python
import dask.dataframe as dd

# `Dataset` refers to Fondant's existing dataset class


class InteractiveDataset(Dataset):
    def __init__(self, dataframe: dd.DataFrame, pipeline):
        self.dataframe = dataframe
        self.pipeline = pipeline

    def _repr_html_(self):
        # using compute() here for demo purposes
        # if we work with small dataframes in an interactive manner it should be fine
        # maybe we find another way to handle this
        return self.dataframe.compute()._repr_html_()

    def apply(self, ref):
        # build a ComponentOp and extend self.pipeline using the super methods
        # evolved_dataset = self._apply(...)
        # self.dataset = evolved_dataset
        component = ref()
        dataframe = component.transform(self.dataframe)
        return InteractiveDataset(dataframe=dataframe, pipeline=self.pipeline)

    def view(self):
        return self.dataframe.compute()
```

Using the […] The same strategy I would apply to the […] During the call of […]

A big downside of this approach is that the implementation differs from the real pipeline execution. We wouldn't write files to the […]. However, I would argue that this isn't so dramatic, since it is a development tool for notebook users. I expect people could use this with small datasets to build fast pipelines, execute specific steps sequentially (eliminating the need for caching), and utilize pandas features within the notebook environment (like the data explorer feature). We could recommend starting development in a notebook, testing on a small sample, and then scaling it out using Vertex or SageMaker.

Here are some small code snippets in a Colab to show what it could look like: https://colab.research.google.com/drive/1H3KbEkypUDyKyBx4zVuXjbpiCLEvjHkt?usp=sharing
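To make the flow concrete, here is a small usage sketch of the `InteractiveDataset` idea above. The toy component and the direct construction from an in-memory dataframe are made up for illustration and are not the actual Fondant API:

```python
import dask.dataframe as dd
import pandas as pd


class AddTextLength:
    """Toy stand-in for a lightweight component; all it needs here is a transform method."""

    def transform(self, dataframe: dd.DataFrame) -> dd.DataFrame:
        dataframe["text_length"] = dataframe["text"].str.len()
        return dataframe


# Start from a small in-memory dataframe instead of a real read component,
# so every step gives instant feedback in the notebook.
small_df = dd.from_pandas(pd.DataFrame({"text": ["hello", "fondant"]}), npartitions=1)
dataset = InteractiveDataset(dataframe=small_df, pipeline=None)  # pipeline wiring omitted

dataset = dataset.apply(AddTextLength)  # executes the component eagerly
print(dataset.view())                   # materialised pandas dataframe with the new column
```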
Thx @mrchtr! You are on the right track, but I would be careful about how we make everything possible while keeping our code clean and single-responsibility. I see 3 things we could tackle separately:
In order to further optimize the development cycle, eager execution will be a big feature. The idea is that you can run partial pipelines / single components easily and get instant feedback on how your data is moving through your pipeline.
This is an interactive feature which makes the most sense in a notebook-like environment that allows for partial code execution. I see a couple of blocks we need to solve:
Execution environment
Where will the code (eagerly) run? Can we use the runners for this, or is this too slow and will we need a virtual runner?
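If the code runs eagerly in the notebook's own Python environment rather than through a runner, a component's extra requirements need to be available there. A minimal sketch of installing them on the fly via subprocess, as suggested in the comment above; `ensure_requirements` is a hypothetical helper, not part of Fondant:

```python
import importlib.util
import subprocess
import sys


def ensure_requirements(requirements: list[str]) -> None:
    """Hypothetical helper: install a component's extra requirements into the
    current (notebook) environment so the component can be executed eagerly."""
    for requirement in requirements:
        # Naive check: derive a module name from the requirement string;
        # a real implementation would need proper requirement parsing.
        module_name = requirement.split("==")[0].replace("-", "_")
        if importlib.util.find_spec(module_name) is None:
            subprocess.check_call([sys.executable, "-m", "pip", "install", requirement])
```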
Interface
How will we design the interface to allow this feature? Ideally we do not disturb the current fondant pipeline definition code, i.e. we should be able to run parts of a pipeline eagerly while still preserving a full pipeline definition.
Some ideas
Some pseudo pipeline code (see the sketch below)
We could call execute() on the intermediate datasets
We could have a way to pass dummy data to the execution to avoid having to run all previous steps, and/or we can handle the dependencies smartly by leveraging caching.
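A rough sketch of what such pseudo pipeline code could look like. The `execute()` method and its `sample` / `input_data` arguments are placeholders for the idea, not the existing Fondant API, and the component names and exact `Pipeline` arguments are only illustrative:

```python
import pandas as pd

from fondant.pipeline import Pipeline

pipeline = Pipeline(name="demo_pipeline", base_path="./artifacts")

# The regular pipeline definition stays exactly as it is today
raw_data = pipeline.read("load_from_hf_hub", arguments={"dataset_name": "user/dataset"})
captioned = raw_data.apply("caption_images")

# Hypothetical eager hooks on the intermediate datasets:
raw_data.execute(sample=100)  # run only the read step, on a small sample

dummy = pd.DataFrame({"image": [b""], "caption": [""]})
captioned.execute(input_data=dummy)  # inject dummy data instead of running the upstream steps

captioned.execute()  # or run the upstream steps as well, leveraging caching
```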
I did some experimentation on this for the xmas project week (see here)
Note:
Tasks
Infer `consumes` and `produces` for Lightweight Python components #752