Development and dispatch for genetics methods over PyData libraries #698
Replies: 3 comments
(Posted by @eric-czech) More along the lines of developing the methods, I've also found this process to be helpful:
I wanted to mention that @alimanfoo since you alluded to it on our last call. I have no allegiances to that idea but I thought I'd share it as one possibility. If nothing else, keeping the signatures as simple as possible in the user-facing API seems like a good idea since it's likely to be implemented by several different things and keeping the docs in sync for all of them is a headache. |
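As a sketch of that idea (the names here are illustrative, not a committed API): the signature and docstring live in one frontend, and every backend only has to satisfy the same minimal contract.

```python
from typing import Protocol

import numpy as np

class LDPruneBackend(Protocol):
    # The one contract every implementation (dask, numba, networkx, ...)
    # would have to satisfy. Hypothetical signature for illustration.
    def ld_prune(self, gt: np.ndarray, window: int, threshold: float) -> np.ndarray: ...

def ld_prune(gt: np.ndarray, window: int, threshold: float, backend: LDPruneBackend) -> np.ndarray:
    """Remove variants in LD above `threshold` within `window`.

    Documented once here; backend implementations share this signature.
    """
    return backend.ld_prune(gt, window, threshold)
```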
(Posted by @alimanfoo) Hi @eric-czech, just to say that your last comment resonates. I had been wondering about two API layers, where the lower-level API comprises functions that operate on ndarray and/or scalar inputs and return ndarray and/or scalar outputs, and a higher-level API operates on xarray Datasets. The work I did on the scikit-allel v2 prototype was, I think, much closer to this lower-level API concept. At that level I liked having a purely functional API (no classes), where inputs and outputs are all vanilla ndarray-like objects and/or scalars. Certainly good for unit testing. Would be great to explore this some more.
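A minimal sketch of those two layers (the function and variable names are illustrative): a pure ndarray-in/ndarray-out function that is trivial to unit test, plus a thin wrapper that adapts it to an xarray Dataset.

```python
import numpy as np
import xarray as xr

def allele_count(gt: np.ndarray, n_alleles: int = 2) -> np.ndarray:
    """Lower level: pure function over a (variants, samples, ploidy) array."""
    return np.stack([(gt == a).sum(axis=(1, 2)) for a in range(n_alleles)], axis=1)

def dataset_allele_count(ds: xr.Dataset) -> xr.Dataset:
    """Higher level: unpack the Dataset, call the pure function, repack."""
    ac = allele_count(ds["call_genotype"].values)
    return ds.assign(variant_allele_count=(("variants", "alleles"), ac))
```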
(Posted by @alimanfoo) Just revisiting this, and one thought possibly worth noting is that we might want to be wary of "macro-like" functions in the API, at least initially. They obviously introduce complexity, particularly around dispatching. In general I think we can afford to push a little of the complexity towards the user, in order to maintain a code base that is simpler and easier to maintain. We can also mitigate complexity with examples, tutorials and documentation.
(Posted by @eric-czech)
I thought it would be a good idea to start a thread on how to organize development of genetics methods over different array backends or other PyData tools.
Some things to figure out are:

- How to organize the code (e.g. a `dask_backend.py` file for genetics methods, or collections of those methods that need it)

For some methods, like `pc_relate`, the method is just a bunch of linear algebra operators. In this case we don't really have to worry about the backends, b/c the "dispatching" should be handled by `__array_function__` in theory, or if necessary we could potentially use something like uarray to coerce to a target underlying array type.
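To illustrate (a toy, not `pc_relate` itself): a function written purely against the NumPy API runs unchanged on Dask arrays, because NumPy functions defer to the input's `__array_function__` implementation.

```python
import numpy as np
import dask.array as da

def gram(x):
    # Only NumPy API calls here; no backend-specific code.
    return np.matmul(x.T, x) / x.shape[0]

x_np = np.random.random((1000, 25))
x_da = da.from_array(x_np, chunks=(100, 25))

print(type(gram(x_np)))  # numpy.ndarray, computed eagerly
print(type(gram(x_da)))  # dask Array, still lazy
np.testing.assert_allclose(gram(x_da).compute(), gram(x_np))
```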
A good example of this is the `ld_prune` function I've been working on. It's essentially a composition of three steps, sketched below.
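A minimal, self-contained sketch of that composition (each implementation below is a simplified stand-in, not one of the real backends):

```python
import numpy as np

def axis_intervals(n_variants: int, window: int) -> np.ndarray:
    """(start, stop) index pairs bounding each variant's comparison window."""
    starts = np.arange(n_variants)
    stops = np.minimum(starts + window + 1, n_variants)
    return np.stack([starts, stops], axis=1)

def ld_matrix(gt: np.ndarray, intervals: np.ndarray, threshold: float) -> np.ndarray:
    """Edge list (i, j) for variant pairs with squared correlation > threshold."""
    edges = [
        (i, j)
        for i, (_, stop) in enumerate(intervals)
        for j in range(i + 1, stop)
        if np.corrcoef(gt[i], gt[j])[0, 1] ** 2 > threshold
    ]
    return np.array(edges, dtype=np.int64).reshape(-1, 2)

def maximal_independent_set(edges: np.ndarray, n_variants: int) -> np.ndarray:
    """Greedy independent set: drop the later variant of each conflicting pair."""
    keep = np.ones(n_variants, dtype=bool)
    for i, j in edges:
        if keep[i] and keep[j]:
            keep[j] = False
    return keep

def ld_prune(gt: np.ndarray, window: int, threshold: float = 0.2) -> np.ndarray:
    intervals = axis_intervals(gt.shape[0], window)
    edges = ld_matrix(gt, intervals, threshold)
    return gt[maximal_independent_set(edges, gt.shape[0])]

pruned = ld_prune(np.random.randint(0, 3, size=(50, 20)).astype(float), window=10)
```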
All of `axis_intervals`, `ld_matrix`, and `maximal_independent_set` are functions with different backends determining whether they should be chunked via Dask, run on a CPU, or run on a GPU. Some best practices for calling this would be:

- Only computing `intervals` when using a bp `window`
- Choosing algorithms per chunk: I wrote a `networkx_backend` to handle one chunk; it doesn't produce the same results as scikit-allel and the "numba_backend", but it finds a different maximal independent set and meets the same contract. This conflates the idea of different implementations of the same algorithm with entirely different algorithms a bit, but that is still reasonable in the context of LD pruning IMO, given that it's a heuristic.

A proposal I'd like to make is that we make it possible to transparently handle at least some of those decisions above. That adds a good deal of complexity, since it detaches the "frontend" of a function from the backend implementations and something needs to sit in the middle to decide everything, but uarray and multipledispatch do the same thing. The loss of static analysis this implies seems pretty huge, though.
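To make that concrete, here is a toy of such a middle layer using multipledispatch (the function name is illustrative, not part of any proposed API): the user-facing name is bound to an implementation at call time based on the argument's array type.

```python
import numpy as np
import dask.array as da
from multipledispatch import dispatch

@dispatch(np.ndarray)
def allele_frequency(gt):
    # Eager, in-memory implementation.
    return gt.mean(axis=1) / 2

@dispatch(da.Array)
def allele_frequency(gt):
    # Lazy, chunked implementation; a real backend might differ more
    # substantially (e.g. numba kernels vs. dask map_blocks).
    return gt.mean(axis=1) / 2

gt = np.random.randint(0, 3, size=(100, 30)).astype(float)
print(type(allele_frequency(gt)))                                  # numpy.ndarray
print(type(allele_frequency(da.from_array(gt, chunks=(50, 30)))))  # dask Array
```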
Working with `xr.Dataset` makes this somewhat more complicated, but I think it would be reasonable to dispatch to methods using logic like this: if a function takes `contig` and `pos` vectors, and a Dataset is given in which any one of those is a Dask array, dispatch to some backend function that uses Dask APIs (see the sketch at the end of this post).

One sweeping simplification I can see for all of this would be to make the "frontend" functions, which are only connected to actual implementations at runtime by uarray, multipledispatch, or what I'm proposing, into methods on ABCs with individual Dask, CUDA, sparse, etc. subclasses that users invoke directly. This would make a lot more static analysis possible and make our lives easier. The obvious downside is that we'd then lose any ability to make decisions about backends automatically. Perhaps there is a middle ground where users can always invoke backend methods directly if they want, but we still keep them disconnected so that we have hooks for improving the more "macro-like" functions.
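A sketch of that dispatch rule (all function names below are hypothetical, and the backends are stand-ins):

```python
import dask.array as da
import xarray as xr

def _any_dask(ds: xr.Dataset, *names: str) -> bool:
    """True if any named variable in the Dataset is backed by a Dask array."""
    return any(isinstance(ds[name].data, da.Array) for name in names)

def _ld_prune_numpy(ds: xr.Dataset) -> xr.Dataset:
    return ds  # stand-in for the eager in-memory implementation

def _ld_prune_dask(ds: xr.Dataset) -> xr.Dataset:
    return ds  # stand-in for the implementation built on dask APIs

def ld_prune(ds: xr.Dataset) -> xr.Dataset:
    # Route on the variables this method actually consumes.
    if _any_dask(ds, "contig", "pos"):
        return _ld_prune_dask(ds)
    return _ld_prune_numpy(ds)
```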