Development and dispatch for genetics methods over PyData libraries #698
Replies: 3 comments
(Posted by @eric-czech) More along the lines of developing the methods, I've also found this process to be helpful:
I wanted to mention that @alimanfoo since you alluded to it on our last call. I have no allegiances to that idea but I thought I'd share it as one possibility. If nothing else, keeping the signatures as simple as possible in the user-facing API seems like a good idea since it's likely to be implemented by several different things and keeping the docs in sync for all of them is a headache. |
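As a sketch of that idea (the names here are illustrative, not a committed API): the signature and docstring live in one frontend, and every backend only has to satisfy the same minimal contract.

```python
from typing import Protocol

import numpy as np

class LDPruneBackend(Protocol):
    # The one contract every implementation (dask, numba, networkx, ...)
    # would have to satisfy. Hypothetical signature for illustration.
    def ld_prune(self, gt: np.ndarray, window: int, threshold: float) -> np.ndarray: ...

def ld_prune(gt: np.ndarray, window: int, threshold: float, backend: LDPruneBackend) -> np.ndarray:
    """Remove variants in LD above `threshold` within `window`.

    Documented once here; backend implementations share this signature.
    """
    return backend.ld_prune(gt, window, threshold)
```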
(Posted by @alimanfoo) Hi @eric-czech, just to say that your last comment resonates. I had been wondering about two API layers, where the lower-level API comprises functions that operate on ndarray and/or scalar inputs and return ndarray and/or scalar outputs, and a higher-level API operates on xarray Datasets. The work I did on the scikit-allel v2 prototype was, I think, much closer to this lower-level API concept. At that level I liked having a purely functional API (no classes), where inputs and outputs are all vanilla ndarray-like objects and/or scalars. Certainly good for unit testing. Would be great to explore this some more.
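A minimal sketch of those two layers (the function and variable names are illustrative): a pure ndarray-in/ndarray-out function that is trivial to unit test, plus a thin wrapper that adapts it to an xarray Dataset.

```python
import numpy as np
import xarray as xr

def allele_count(gt: np.ndarray, n_alleles: int = 2) -> np.ndarray:
    """Lower level: pure function over a (variants, samples, ploidy) array."""
    return np.stack([(gt == a).sum(axis=(1, 2)) for a in range(n_alleles)], axis=1)

def dataset_allele_count(ds: xr.Dataset) -> xr.Dataset:
    """Higher level: unpack the Dataset, call the pure function, repack."""
    ac = allele_count(ds["call_genotype"].values)
    return ds.assign(variant_allele_count=(("variants", "alleles"), ac))
```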
(Posted by @alimanfoo) Just revisiting this, and one thought possibly worth noting is that we might want to be wary of "macro-like" functions in the API, at least initially. They obviously introduce complexity, particularly around dispatching. In general I think we can afford to push a little of the complexity towards the user, in order to maintain a code base that is simpler and easier to maintain. We can also mitigate complexity with examples, tutorials and documentation.
(Posted by @eric-czech)
I thought it would be a good idea to start a thread on how to organize development of genetics methods over different array backends or other PyData tools.
Some things to figure out are:

- How to organize the code (e.g. a `dask_backend.py` file for genetics methods, or collections of those methods that need it)

For some methods, like `pc_relate`, the method is just a bunch of linear algebra operators. In this case we don't really have to worry about the backends, b/c the "dispatching" should be handled by `__array_function__` in theory, or if necessary we could potentially use something like uarray to coerce to a target underlying array type.
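To illustrate (a toy, not `pc_relate` itself): a function written purely against the NumPy API runs unchanged on Dask arrays, because NumPy functions defer to the input's `__array_function__` implementation.

```python
import numpy as np
import dask.array as da

def gram(x):
    # Only NumPy API calls here; no backend-specific code.
    return np.matmul(x.T, x) / x.shape[0]

x_np = np.random.random((1000, 25))
x_da = da.from_array(x_np, chunks=(100, 25))

print(type(gram(x_np)))  # numpy.ndarray, computed eagerly
print(type(gram(x_da)))  # dask Array, still lazy
np.testing.assert_allclose(gram(x_da).compute(), gram(x_np))
```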
A good example of this is the `ld_prune` function I've been working on. It's essentially a composition of three steps, sketched below.
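A minimal, self-contained sketch of that composition (each implementation below is a simplified stand-in, not one of the real backends):

```python
import numpy as np

def axis_intervals(n_variants: int, window: int) -> np.ndarray:
    """(start, stop) index pairs bounding each variant's comparison window."""
    starts = np.arange(n_variants)
    stops = np.minimum(starts + window + 1, n_variants)
    return np.stack([starts, stops], axis=1)

def ld_matrix(gt: np.ndarray, intervals: np.ndarray, threshold: float) -> np.ndarray:
    """Edge list (i, j) for variant pairs with squared correlation > threshold."""
    edges = [
        (i, j)
        for i, (_, stop) in enumerate(intervals)
        for j in range(i + 1, stop)
        if np.corrcoef(gt[i], gt[j])[0, 1] ** 2 > threshold
    ]
    return np.array(edges, dtype=np.int64).reshape(-1, 2)

def maximal_independent_set(edges: np.ndarray, n_variants: int) -> np.ndarray:
    """Greedy independent set: drop the later variant of each conflicting pair."""
    keep = np.ones(n_variants, dtype=bool)
    for i, j in edges:
        if keep[i] and keep[j]:
            keep[j] = False
    return keep

def ld_prune(gt: np.ndarray, window: int, threshold: float = 0.2) -> np.ndarray:
    intervals = axis_intervals(gt.shape[0], window)
    edges = ld_matrix(gt, intervals, threshold)
    return gt[maximal_independent_set(edges, gt.shape[0])]

pruned = ld_prune(np.random.randint(0, 3, size=(50, 20)).astype(float), window=10)
```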
All of `axis_intervals`, `ld_matrix`, and `maximal_independent_set` are functions with different backends determining whether they should be chunked via Dask, run on a CPU, or run on a GPU. Some best practices for calling this would be:

- Only computing `intervals` when using a bp `window`
- Choosing algorithms per chunk: I wrote a `networkx_backend` to handle one chunk; it doesn't produce the same results as scikit-allel and the "numba_backend", but it finds a different maximal independent set and meets the same contract. This conflates the idea of different implementations of the same algorithm with entirely different algorithms a bit, but that is still reasonable in the context of LD pruning IMO, given that it's a heuristic.

A proposal I'd like to make is that we make it possible to transparently handle at least some of those decisions above. That adds a good deal of complexity, since it detaches the "frontend" of a function from the backend implementations and something needs to sit in the middle to decide everything, but uarray and multipledispatch do the same thing. The loss of static analysis this implies seems pretty huge, though.
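To make that concrete, here is a toy of such a middle layer using multipledispatch (the function name is illustrative, not part of any proposed API): the user-facing name is bound to an implementation at call time based on the argument's array type.

```python
import numpy as np
import dask.array as da
from multipledispatch import dispatch

@dispatch(np.ndarray)
def allele_frequency(gt):
    # Eager, in-memory implementation.
    return gt.mean(axis=1) / 2

@dispatch(da.Array)
def allele_frequency(gt):
    # Lazy, chunked implementation; a real backend might differ more
    # substantially (e.g. numba kernels vs. dask map_blocks).
    return gt.mean(axis=1) / 2

gt = np.random.randint(0, 3, size=(100, 30)).astype(float)
print(type(allele_frequency(gt)))                                  # numpy.ndarray
print(type(allele_frequency(da.from_array(gt, chunks=(50, 30)))))  # dask Array
```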
Working with `xr.Dataset` makes this somewhat more complicated, but I think it would be reasonable to dispatch to methods using logic like this: if a function takes `contig` and `pos` vectors, and a Dataset is given in which any one of those is a Dask array, dispatch to some backend function that uses Dask APIs (see the sketch at the end of this post).

One sweeping simplification I can see for all of this would be to make the "frontend" functions, which are only connected to actual implementations at runtime by uarray, multipledispatch, or what I'm proposing, into methods on ABCs with individual Dask, CUDA, sparse, etc. subclasses that users invoke directly. This would make a lot more static analysis possible and make our lives easier. The obvious downside is that we'd then lose any ability to make decisions about backends automatically. Perhaps there is a middle ground where users can always invoke backend methods directly if they want, but we still keep them disconnected so that we have hooks for improving the more "macro-like" functions.
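A sketch of that dispatch rule (all function names below are hypothetical, and the backends are stand-ins):

```python
import dask.array as da
import xarray as xr

def _any_dask(ds: xr.Dataset, *names: str) -> bool:
    """True if any named variable in the Dataset is backed by a Dask array."""
    return any(isinstance(ds[name].data, da.Array) for name in names)

def _ld_prune_numpy(ds: xr.Dataset) -> xr.Dataset:
    return ds  # stand-in for the eager in-memory implementation

def _ld_prune_dask(ds: xr.Dataset) -> xr.Dataset:
    return ds  # stand-in for the implementation built on dask APIs

def ld_prune(ds: xr.Dataset) -> xr.Dataset:
    # Route on the variables this method actually consumes.
    if _any_dask(ds, "contig", "pos"):
        return _ld_prune_dask(ds)
    return _ld_prune_numpy(ds)
```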