Add support for non-pure computations #37

CMCDragonkai · 2019-04-30T02:23:50Z

In Dask, whether you set pure to true or not affects the way the computational graph is constructed. In fact, with pure=True, dask derives a node's key deterministically from the function and its arguments, so identical calls can be merged into a single node when the graph is built.

I've noticed that graphchain appears to assume all computational nodes are pure, and caches them. But if some nodes are not pure, they shouldn't be cached.

lsorber · 2019-04-30T19:11:55Z

That's a good observation. One problem is that we cannot, and should not, count on developers to annotate all pure functions in a graph. In the case of a hand-rolled dask graph, I'm not even aware of a way to annotate the fact that a computation is pure.

For those reasons, we optimistically assume that all computations are pure, and ask the user to pass those keys that they do not wish to cache. (EDIT: The choice of whether to cache a computation or not does not change the fact that graphchain assumes all computations are pure.)

What we can do is add an option that, if set, automatically caches only pure computations by default. Would that be a satisfactory solution?

CMCDragonkai · 2019-05-01T05:56:46Z

I think that if you are using graphchain on a manually constructed graph, then there's no way to know. But if I'm using dask's smart constructor dask.delayed, and I have set a node to be pure or not pure (by default dask sets pure to False, which means impure), then I expect the graphchain optimiser to respect this option and not cache the nodes that are impure.

CMCDragonkai · 2019-05-01T05:57:16Z

The problem is, I don't know whether the dask graph that you get in the optimiser actually preserves the purity attribute. Does the HighLevelGraph preserve it?

lsorber · 2019-05-04T08:03:19Z

I think the pure attribute is part of the Delayed computation, so yes, we would have access to it.

One question we need to address is what should happen if a pure function depends on the input of a non-pure function. In that case, we are in principle forced to compute the non-pure function and use its output as input for the pure function. With a joblib.Memory-style cache, we could hash the inputs of the pure function and see if we have the pure function's result cached.
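As a rough illustration of the joblib.Memory-style idea above, the sketch below keys the cache on a hash of the pure function's actual argument values, so the non-pure producer still runs every time while the pure consumer is looked up by value. The names `memoize_by_input`, `square`, and `roll_die` are hypothetical, and the in-memory dict stands in for a real cache store; this is not graphchain's or joblib's implementation.

```python
import hashlib
import pickle
import random

_cache = {}  # hypothetical in-memory store: input hash -> cached result


def memoize_by_input(func):
    """Cache func's result keyed on a hash of its actual argument values."""
    def wrapper(*args):
        payload = pickle.dumps((func.__qualname__, args))
        key = hashlib.sha256(payload).hexdigest()
        if key not in _cache:
            _cache[key] = func(*args)
        return _cache[key]
    return wrapper


calls = []  # records every real execution of the pure function


@memoize_by_input
def square(x):
    """Pure: the result depends only on x."""
    calls.append(x)
    return x * x


def roll_die():
    """Non-pure: has to be executed on every run."""
    return random.randint(1, 6)


# The non-pure function runs 100 times; the pure one runs at most once
# per distinct input value it receives.
results = [square(roll_die()) for _ in range(100)]
```

The cost of this approach is that the non-pure function's output must be materialised and hashed before the lookup can happen.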

However, that is not compatible with graphchain's current caching mechanism: a computation's result is cached based on a chain of hashes of the preceding computations' source and the root inputs. If there is a single non-pure computation in between, then that breaks the chain of hashes and we can no longer determine the cache key to retrieve.
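To make the chain-of-hashes idea concrete, here is a minimal sketch over a hand-rolled dict graph. It is an assumption-laden stand-in, not graphchain's actual code: `chain_key` is a hypothetical helper, the function's bytecode is hashed in place of its source, and only string keys are treated as references to other tasks.

```python
import hashlib


def chain_key(graph, key, memo=None):
    """Hash a task's code together with the chain keys of its dependencies."""
    memo = {} if memo is None else memo
    if key in memo:
        return memo[key]
    task = graph[key]
    h = hashlib.sha256()
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Stand-in for hashing the source; works for plain Python functions.
        h.update(func.__code__.co_code)
        for arg in args:
            if isinstance(arg, str) and arg in graph:
                # Dependency: fold in its chain key, not its computed value.
                h.update(chain_key(graph, arg, memo).encode())
            else:
                h.update(repr(arg).encode())
    else:
        # Root input: hash the literal value.
        h.update(repr(task).encode())
    memo[key] = h.hexdigest()
    return memo[key]


def load(n):
    return list(range(n))


def total(xs):
    return sum(xs)


g1 = {"n": 3, "data": (load, "n"), "sum": (total, "data")}
g2 = {"n": 4, "data": (load, "n"), "sum": (total, "data")}

same = chain_key(g1, "sum") == chain_key(g1, "sum")
different = chain_key(g1, "sum") != chain_key(g2, "sum")
```

Note that the key is derived without executing any task. A non-pure task's output is not determined by its code and inputs, so every key downstream of it would no longer identify a unique result.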

In effect, graphchain assumes the full dask graph is pure at this time. If we want to add support for non-pure computations, that would require some refactoring.

CMCDragonkai · 2019-05-04T08:04:28Z

Impurity is infectious. If a pure function depends on an impure input, then it is also impure. I fully expect this. So if I put an impure computation at the beginning of a pipeline, I expect graphchain to assume pretty much the whole graph is impure.

However, if I put the impure computation near the end, then only that should be recomputed, as its inputs are pure and cached.
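The propagation rule described here could be sketched as follows, assuming a hand-rolled dict graph with string keys and a user-supplied set of impure keys. `dependencies` and `tainted_keys` are hypothetical helpers for illustration, not part of graphchain or dask.

```python
def dependencies(graph, key):
    """Return the graph keys that the task stored under `key` refers to."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        return [a for a in task[1:] if isinstance(a, str) and a in graph]
    return []


def tainted_keys(graph, impure):
    """Return keys that are impure or transitively depend on an impure key."""
    memo = {}

    def is_tainted(key):
        if key not in memo:
            memo[key] = key in impure or any(
                is_tainted(dep) for dep in dependencies(graph, key)
            )
        return memo[key]

    return {key for key in graph if is_tainted(key)}


def step(x):
    return x


pipeline = {
    "raw": (step, 1),
    "clean": (step, "raw"),
    "model": (step, "clean"),
    "report": (step, "model"),
}

# Impure at the start: everything downstream must be recomputed.
early = tainted_keys(pipeline, {"raw"})
# Impure at the end: only the last node is affected; the rest stays cacheable.
late = tainted_keys(pipeline, {"report"})
```

Keys outside the returned set remain safe to cache under the pure-graph assumption.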

lsorber changed the title ~~Does graphchain optimiser automatically assume all computational nodes are pure?~~ Add support for non-pure computations May 4, 2019

lsorber added the "enhancement" (New feature or request) label May 4, 2019
