Problems with NumPy #700
Replies: 6 comments
-
(Posted by @alimanfoo) Thanks @hammer, very interesting. A few anecdotes...

Re missing data: we've so far avoided numpy.ma, preferring to use special missing values. E.g., for genotype call arrays we use the convention that negative integers represent a missing allele call. This is workable, but it does require the convention to be clearly documented and communicated.

Re strings: this is relevant when storing alleles, particularly if there are indel alleles; variant annotations are also often strings. The NumPy approach of object arrays containing pointers to Python strings is awkward and inefficient to process in memory. I would love to have a first-class string array type. It also causes challenges for storage: e.g., in zarr there is currently no single standard approach to encoding arrays of strings, rather a choice of several codecs. I think this is an area that needs to be addressed in the zarr v3.0 protocol, probably by defining a standard UTF-8 string data type and memory layout (probably following Arrow).

Nested types are relevant, I think, particularly for ragged arrays, i.e., arrays with one or more fixed-length dimensions followed by a variable-length dimension. E.g., genotype likelihood arrays have a variable number of values depending on the number of alleles. NumPy support for this is again via object arrays containing pointers to NumPy arrays, which is not ideal, especially for large arrays (we've avoided them so far). Another project with this need that is looking to provide better solutions is awkward-array. More complex nested types could also be relevant for some awkward variant annotations, e.g., SnpEff annotations.

Thanks for the pointer to the Arrow C data interface, very relevant to current discussions of the zarr v3.0 core protocol (xref zarr-specs#61).
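The conventions above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not code from any project in this thread; the array shapes, the choice of -1 as the sentinel, and all variable names are assumptions.

```python
# Illustrative sketch (not the project's actual code) of the sentinel-value
# convention described above; shapes, values, and the choice of -1 are assumed.
import numpy as np

# Genotype calls shaped (variants, samples, ploidy); -1 marks a missing
# allele call, in place of a numpy.ma masked array.
genotypes = np.array(
    [[[0, 1], [0, 0]],
     [[1, 1], [-1, -1]]],  # second variant, second sample is missing
    dtype=np.int8,
)

# The mask is implicit in the values, so every consumer must apply the
# convention explicitly before computing anything.
is_missing = genotypes < 0
alt_alleles = np.where(is_missing, 0, genotypes)
per_variant_alt = (alt_alleles > 0).sum(axis=(1, 2))  # non-ref alleles per variant

# Strings: an object array stores pointers to Python str objects, while a
# fixed-width dtype pads every element to the longest allele, which gets
# wasteful when indel alleles are long.
alleles_obj = np.array(["A", "T", "ATTTG"], dtype=object)
alleles_fixed = np.asarray(alleles_obj, dtype="U5")  # 4 bytes per character
```

The trade-off versus numpy.ma is visible here: nothing in the array itself says that -1 means missing, so the convention has to travel with the data as documentation.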
-
(Posted by @mrocklin) Note that the NumPy devs at UC Berkeley BIDS are currently adding extensible dtypes. I would expect them to have something released in 12-18 months. It should handle some of these issues, like missing values and categoricals. Historically, Wes has been burned by the NumPy community and is generally fairly negative about them. That's not unwarranted: there was a decade where NumPy didn't move at all. They're a bit better now, but still slow.
-
Thanks @mrocklin! What's the best place to start reading about extensible dtypes? I see NEP 40 and NEP 41. Any other documents to read over? It looks like Sebastian Berg is the person to follow for this work; is anyone else working on it? Something expected to land in 12-18 months I just assume will take at least 3 years, so unfortunately it will be a bit out of scope for us, but it will be interesting to follow along with the work. I also see that NEP 12 and the related missing-values NEPs have all been deferred. Do you know anything about why that happened?
-
(Posted by @mrocklin)
I'm already factoring in delays here. I would expect Sebastian to have something by the fall, and for it to be in a release early next year with probability about 60%. There is still a solid 20% chance that nothing ever happens; that chance is always present with NumPy, unfortunately. They seem pretty engaged on this topic, though. As for why the missing-values NEPs were deferred, I'm not sure; I'll ping Sebastian.
It looks like that was in 2011, which fell in the dark years for NumPy when there wasn't really any leadership or ability to make decisions. This is also the time when Wes was around trying to get them to do things. Things are slightly better now: there is a core set of people who seem to feel comfortable making at least mildly disruptive changes.
-
(Posted by @mrocklin) Response from Sebastian. Given this I'd say a year is still a decent time estimate, but it might also be slightly optimistic. They're more active when engaged though. It might make sense for someone like Eric to sit in on one of the Numpy maintainer meetings. It's a surprisingly friendly community to influence socially. I get the sense that most folks are scared off by Numpy as a project, and so don't engage as much as they should with the individuals.
-
@ravwojdyla @eczech: related to your discussion of how to handle missing values in our genomics toolkit, I wanted to point out that Sebastian Berg has been regularly updating the NEP 42 PR. @mrocklin claims this work will handle some issues related to missing values, but I do not see any explicit commentary about that in the NEP 42 PR. Perhaps you could check it out and see if there's anything there that could be useful to us, and if not, comment on the PR with what we might want from this NEP?
-
There's an interesting conversation on the OSS Data Discourse that Wes McKinney set up recently that may have relevance for us.
On the "A dataframe protocol for the PyData ecosystem" topic, Wes made two interesting comments about the problems he has faced trying to build a dataframe library on top of NumPy:
March 9 comment
March 10 comment
I don't think there's action to take right now, but it may be interesting to think about the missing values and categorical data issues he considers. I'm not sure we care as much about strings, nested data, and memory mapping, but I could be wrong.
He also references The Arrow C data interface. I know Zarr has migrated from a Python library to a specification with implementations in multiple languages, including C. I don't think this level of standardization and optimization is important in the near future for our work, but it's worth keeping in mind that projects that gain adoption tend to follow this path.
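The Arrow-style memory layout for variable-length strings that comes up in this thread (a contiguous UTF-8 buffer plus an offsets array, instead of an object array of pointers) can be sketched in plain NumPy. This is a simplified illustration of the layout idea only; Arrow's real format also carries a validity bitmap and other metadata, and all names here are hypothetical.

```python
# Sketch of an Arrow-style variable-length string layout: one contiguous
# UTF-8 data buffer, plus an int32 offsets array where offsets[i:i+2]
# delimits string i. Simplified for illustration; real Arrow adds a
# validity bitmap for nulls.
import numpy as np

alleles = ["A", "T", "ATTTG", ""]

encoded = [s.encode("utf-8") for s in alleles]
offsets = np.zeros(len(alleles) + 1, dtype=np.int32)
offsets[1:] = np.cumsum([len(b) for b in encoded])
data = np.frombuffer(b"".join(encoded), dtype=np.uint8)

def get(i: int) -> str:
    """Decode string i by slicing the shared buffer."""
    return data[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8")
```

Compared with an object array, both buffers are contiguous and fixed-dtype, so they can be stored, mmapped, or shipped across processes without chasing per-element pointers.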