Add support for non-pure computations #37
Comments
That's a good observation. One problem with that is that we can/should not count on developers to annotate all pure functions in a graph. In the case of a hand rolled dask graph, I'm not even aware of a way to annotate the fact that a computation is pure. For those reasons, we optimistically assume that all computations are pure, and ask the user to pass those keys that they do not wish to cache. (EDIT: The choice of whether to cache a computation or not does not change the fact that graphchain assumes all computations are pure.) What we can do is add an option that, if set, automatically caches only pure computations by default. Would that be a satisfactory solution?
I think that if you are using graphchain on a manually constructed graph, then there's no way to know. But if I'm using dask smart constructors
The problem is, I don't know whether the dask graph that you get in the optimiser actually preserves the purity attribute. Does the HighLevelGraph preserve it?
I think the One question we need to address is what should happen if a pure function depends on the input of a non-pure function. In that case, we are in principle forced to compute the non-pure function and use its output as input for the pure function. With a

However, that is not compatible with graphchain's current caching mechanism: a computation's result is cached based on a chain of hashes of the preceding computations' source and the root inputs. If there is a single non-pure computation in between, then that breaks the chain of hashes and we can no longer determine the cache key to retrieve.

In effect, graphchain assumes the full dask graph is pure at this time. If we want to add support for non-pure computations, that would require some refactoring.
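To make the chain-of-hashes idea concrete, here is a minimal sketch in plain Python. It is not graphchain's actual implementation: `task_hash` is a hypothetical helper, function bytecode stands in for "source", and the graph is assumed to be a plain dict in the dask graph format with string keys. Each task's hash combines its own code with the hashes of its inputs, so the cache key of a result depends on everything upstream of it.

```python
import hashlib


def task_hash(graph, key, memo=None):
    """Hypothetical chained hash: a task's code plus the hashes of its inputs."""
    memo = {} if memo is None else memo
    if key in memo:
        return memo[key]
    task = graph[key]
    digest = hashlib.sha256()
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Assumption: bytecode and constants stand in for the function's source.
        digest.update(func.__code__.co_code)
        digest.update(repr(func.__code__.co_consts).encode())
        for arg in args:
            if isinstance(arg, str) and arg in graph:
                # The argument is another key: chain in the upstream hash.
                digest.update(task_hash(graph, arg, memo).encode())
            else:
                # The argument is a literal value.
                digest.update(repr(arg).encode())
    else:
        # A root input: hash its value directly.
        digest.update(repr(task).encode())
    memo[key] = digest.hexdigest()
    return memo[key]


def double(x):
    return 2 * x


def add(x, y):
    return x + y


graph_a = {"raw": 3, "doubled": (double, "raw"), "total": (add, "doubled", 10)}
graph_b = {**graph_a, "raw": 4}  # same tasks, different root input

same_key = task_hash(graph_a, "total") == task_hash(graph_a, "total")
changed_key = task_hash(graph_a, "total") != task_hash(graph_b, "total")
```

Under this scheme the hash is computed without running anything, which is exactly why a non-pure task is a problem: its output can change while its code and inputs, and therefore every downstream hash, stay the same.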
Impurity is infectious. If a pure function depends on an impure input, then it is also impure. I fully expect this. So if I put an impure computation at the beginning of a pipeline, I expect graphchain to treat pretty much the whole graph as impure. However, if I put the impure computation near the end, then only that computation should be recomputed, since its inputs are pure and cached.
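The propagation rule described above can be sketched as a small graph traversal. This is only an illustration, not part of graphchain: `tainted_keys` and the example pipeline are hypothetical, and the graph is assumed to be an acyclic plain dict in the dask graph format with string keys.

```python
import random


def tainted_keys(graph, impure_keys):
    """Return every key that is impure itself or depends on an impure key."""
    memo = {}

    def is_tainted(key):
        if key not in memo:
            task = graph[key]
            is_task = isinstance(task, tuple) and task and callable(task[0])
            args = task[1:] if is_task else ()
            # Arguments that name other keys are this task's dependencies.
            deps = [a for a in args if isinstance(a, str) and a in graph]
            memo[key] = key in impure_keys or any(is_tainted(d) for d in deps)
        return memo[key]

    return {key for key in graph if is_tainted(key)}


def load():
    return [1, 2, 3]


def clean(xs):
    return [2 * x for x in xs]


def summarise(xs):
    return sum(xs)


def jitter(total):
    # Impure: the result differs between calls.
    return total + random.random()


graph = {
    "load": (load,),
    "clean": (clean, "load"),
    "summary": (summarise, "clean"),
    "noisy": (jitter, "summary"),
}

late = tainted_keys(graph, {"noisy"})  # impure step at the end of the pipeline
early = tainted_keys(graph, {"load"})  # impure step at the start of the pipeline
```

With the impure step at the end, only `"noisy"` is tainted and the three upstream results stay cacheable. With the impure step at the start, every key in the graph is tainted.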
In Dask, whether or not you set pure to true affects the way the computational graph is constructed. In fact, enabling pure allows dask to optimise prior to constructing the graph.
I've noticed that graphchain appears to assume all computational nodes are pure, and caches them. But if some nodes are not pure, they shouldn't be cached.