Problems with NumPy #700
Replies: 6 comments
-
(Posted by @alimanfoo) Thanks @hammer, very interesting. A few anecdotes...

Re missing data: we've so far avoided numpy.ma, preferring to use special missing values. E.g., for genotype call arrays we use the convention that negative integers represent a missing allele call. This is workable, but it does require the convention to be clearly documented and communicated.

Re strings: this is relevant when storing alleles, particularly if there are indel alleles; variant annotations are also often strings. The NumPy approach of object arrays containing pointers to Python strings is awkward and inefficient to process in memory. I would love to have a first-class string array type. It also causes challenges for storage: e.g., in zarr there is currently no single standard approach to encoding arrays of strings, rather a choice of several codecs. I think this is an area that needs to be addressed in the zarr v3.0 protocol, probably by defining a standard UTF-8 string data type and memory layout (probably following Arrow).

Nested types are relevant, I think, particularly for ragged arrays, i.e., arrays with one or more fixed-length dimensions followed by a variable-length dimension. E.g., genotype likelihood arrays have a variable number of values depending on the number of alleles. NumPy support for this is again via object arrays containing pointers to NumPy arrays, which is not ideal, especially for large arrays (we've avoided them so far). Another project with this need that is looking to provide better solutions is awkward-array. More complex nested types could also be relevant for some awkward variant annotations, e.g., SnpEff annotations.

Thanks for the pointer to the Arrow C data interface, very relevant to current discussions of the zarr v3.0 core protocol (xref zarr-specs#61).
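The conventions above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not code from any project in this thread; the array shapes, the choice of -1 as the sentinel, and all variable names are assumptions.

```python
# Illustrative sketch (not the project's actual code) of the sentinel-value
# convention described above; shapes, values, and the choice of -1 are assumed.
import numpy as np

# Genotype calls shaped (variants, samples, ploidy); -1 marks a missing
# allele call, in place of a numpy.ma masked array.
genotypes = np.array(
    [[[0, 1], [0, 0]],
     [[1, 1], [-1, -1]]],  # second variant, second sample is missing
    dtype=np.int8,
)

# The mask is implicit in the values, so every consumer must apply the
# convention explicitly before computing anything.
is_missing = genotypes < 0
alt_alleles = np.where(is_missing, 0, genotypes)
per_variant_alt = (alt_alleles > 0).sum(axis=(1, 2))  # non-ref alleles per variant

# Strings: an object array stores pointers to Python str objects, while a
# fixed-width dtype pads every element to the longest allele, which gets
# wasteful when indel alleles are long.
alleles_obj = np.array(["A", "T", "ATTTG"], dtype=object)
alleles_fixed = np.asarray(alleles_obj, dtype="U5")  # 4 bytes per character
```

The trade-off versus numpy.ma is visible here: nothing in the array itself says that -1 means missing, so the convention has to travel with the data as documentation.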
-
(Posted by @mrocklin) Note that the NumPy devs at UC Berkeley BIDS are currently adding extensible dtypes. I would expect them to have something released in 12-18 months. It should handle some of these issues, like missing values and categoricals. Historically, Wes has been burned by the NumPy community and is generally fairly negative about them. That's not unwarranted: there was a decade where NumPy didn't move at all. They're a bit better now, but still slow.
-
Thanks @mrocklin! What's the best place to start reading about extensible dtypes? I see NEP 40 and NEP 41. Any other documents to read over? It looks like Sebastian Berg is the person to follow for this work; is anyone else working on it? Something expected to land in 12-18 months I just assume will take at least 3 years, so unfortunately it will be a bit out of scope for us, but it will be interesting to follow along with the work. I also see that NEP 12 and the related missing-values NEPs have all been deferred. Do you know anything about why that happened?
-
(Posted by @mrocklin)
I'm already factoring in delays here. I would expect Sebastian to have something by the fall, and for it to be in a release early next year with probability about 60%. There is still a solid 20% chance that nothing ever happens; that chance is always present with NumPy, unfortunately. They seem pretty engaged on this topic, though. As for why the missing-values NEPs were deferred, I'm not sure; I'll ping Sebastian.
It looks like that was in 2011, which fell in the dark years for NumPy when there wasn't really any leadership or ability to make decisions. This is also the time when Wes was around trying to get them to do things. Things are slightly better now: there is a core set of people who seem to feel comfortable making at least mildly disruptive changes.
-
(Posted by @mrocklin) Response from Sebastian. Given this I'd say a year is still a decent time estimate, but it might also be slightly optimistic. They're more active when engaged though. It might make sense for someone like Eric to sit in on one of the Numpy maintainer meetings. It's a surprisingly friendly community to influence socially. I get the sense that most folks are scared off by Numpy as a project, and so don't engage as much as they should with the individuals.
-
@ravwojdyla @eczech: related to your discussion of how to handle missing values in our genomics toolkit, I wanted to point out that Sebastian Berg has been regularly updating the NEP 42 PR. @mrocklin claims this work will handle some issues related to missing values, but I do not see any explicit commentary about that in the NEP 42 PR. Perhaps you could check it out and see if there's anything there that could be useful to us, and if not, comment on the PR with what we might want from this NEP?
-
There's an interesting conversation on the OSS Data Discourse that Wes McKinney set up recently that may have relevance for us.
On the "A dataframe protocol for the PyData ecosystem" topic, Wes made two interesting comments about the problems he has faced trying to build a dataframe library on top of NumPy:
March 9 comment
March 10 comment
I don't think there's action to take right now, but it may be interesting to think about the missing values and categorical data issues he considers. I'm not sure we care as much about strings, nested data, and memory mapping, but I could be wrong.
He also references The Arrow C data interface. I know Zarr has migrated from a Python library to a specification with implementations in multiple languages, including C. I don't think this level of standardization and optimization is important in the near future for our work, but it's worth keeping in mind that projects that gain adoption tend to follow this path.
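The Arrow-style memory layout for variable-length strings that comes up in this thread (a contiguous UTF-8 buffer plus an offsets array, instead of an object array of pointers) can be sketched in plain NumPy. This is a simplified illustration of the layout idea only; Arrow's real format also carries a validity bitmap and other metadata, and all names here are hypothetical.

```python
# Sketch of an Arrow-style variable-length string layout: one contiguous
# UTF-8 data buffer, plus an int32 offsets array where offsets[i:i+2]
# delimits string i. Simplified for illustration; real Arrow adds a
# validity bitmap for nulls.
import numpy as np

alleles = ["A", "T", "ATTTG", ""]

encoded = [s.encode("utf-8") for s in alleles]
offsets = np.zeros(len(alleles) + 1, dtype=np.int32)
offsets[1:] = np.cumsum([len(b) for b in encoded])
data = np.frombuffer(b"".join(encoded), dtype=np.uint8)

def get(i: int) -> str:
    """Decode string i by slicing the shared buffer."""
    return data[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8")
```

Compared with an object array, both buffers are contiguous and fixed-dtype, so they can be stored, mmapped, or shipped across processes without chasing per-element pointers.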