Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(feat): [WIP] support physical quantities via Pint #1481

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mruffalo
Copy link

@mruffalo mruffalo commented Apr 24, 2024

Hello!

The HuBMAP Consortium makes extensive use of AnnData, writing gene expression count matrices (and more) as .h5ad files for public download.

As we're continuing to extend our computational pipelines to handle sequencing spatial transcriptomics with associated imaging data (e.g., Visium), with spatial coordinates registered to pixel coordinates, it's become clear that it would be very useful to store physical units directly in the AnnData/.h5ad structures. We've had a very pleasant experience using the Pint package in other contexts, and this package interoperates naturally with NumPy and similar (though not as nicely with SciPy sparse matrices, unfortunately).

This draft PR implements serialization of pint.Quantity objects to H5AD and Zarr, separately writing the magnitude and unit of each Quantity, and then instantiating a new Quantity object when reading from disk.

There are a few caveats to this -- the pint package allows instantiating multiple UnitRegistry objects, and quantities associated with one registry cannot be compared to quantities from another registry. As such, the anndata package instantiates a UnitRegistry on import, accessible at anndata.units and used to instantiate Quantity objects when loading data. Users should almost definitely use that anndata.units registry when creating their own Quantity objects, to interoperate with Quantity objects created by the package.

This PR adds pint as an unconditional requirement for anndata.

I added a test method which demonstrates serialization and loading of pint.Quantity-wrapped arrays and scalars, which works as expected. I ran the full test suite on my machine and encountered no new failures (10 tests failed on current main and on my branch, on macOS 14.4.1, and 29 failed on Ubuntu 22.04).

This PR does not add documentation; I'm happy to do so but wanted to check whether this approach looked reasonable to everyone before expending that effort.

  • Closes #
  • Tests added
  • Release note added (or unnecessary)

Copy link

codecov bot commented Apr 24, 2024

Codecov Report

Attention: Patch coverage is 62.50000% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 84.19%. Comparing base (20535bf) to head (7a51205).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1481      +/-   ##
==========================================
- Coverage   84.22%   84.19%   -0.04%     
==========================================
  Files          35       35              
  Lines        5685     5699      +14     
==========================================
+ Hits         4788     4798      +10     
- Misses        897      901       +4     
Files Coverage Δ
src/anndata/__init__.py 92.00% <100.00%> (ø)
src/anndata/_core/anndata.py 85.00% <100.00%> (+0.03%) ⬆️
src/anndata/_io/specs/methods.py 87.01% <53.84%> (-0.99%) ⬇️

... and 1 file with indirect coverage changes

@flying-sheep
Copy link
Member

Hmm, generally units are a great idea. Two questions:

  1. What’s the effect of using a global unit registry? The idea seems to be that if have two units with the same name defined in two different anndata on-disk representations, loading both will result in interoperability. But isn’t the idea of the registry concept that that shouldn’t happen by accident?

    Does pint have a global unit registry for standard (SI) units? What do you use it for?

  2. Does pint allow to use these units in arrays? I don’t see a huge advantage in just having a handful of constants saved in .uns, but again: maybe it would help to see what you use them for

@mruffalo
Copy link
Author

mruffalo commented Apr 25, 2024

Hmm, generally units are a great idea. Two questions:

Thanks!

  1. What’s the effect of using a global unit registry? The idea seems to be that if have two units with the same name defined in two different anndata on-disk representations, loading both will result in interoperability. But isn’t the idea of the registry concept that that shouldn’t happen by accident?

What I'd expect as an anndata user is for the units to be interoperable between different AnnData objects loaded by the package -- a millimeter is a millimeter, after all, and see the example code I included below as an answer to your second question.

The utility of having multiple UnitRegistry objects only starts to apply when users want to modify the defaults (like using imperial instead of SI as the base units) or adding new custom units, etc.

I referred to the suggestions in https://pint.readthedocs.io/en/stable/getting/pint-in-your-projects.html#having-a-shared-registry when considering how to handle needing an instantiated UnitRegistry object to do any unit handling.

Does pint have a global unit registry for standard (SI) units? What do you use it for?

The pint package does not instantiate a "global" unit registry itself; users (including packages embedding pint like anndata) need to do this manually.

  1. Does pint allow to use these units in arrays? I don’t see a huge advantage in just having a handful of constants saved in .uns, but again: maybe it would help to see what you use them for

Absolutely! The test code I implemented adds one scalar Quantity to .uns (e.g., diameter of a capture bead for a Visium dataset), and the array of spatial coordinates in .obsm['X_spatial'] has units of mm.

I believe this would even allow anndata.concat to operate on objects with spatial coordinates using different units, as long as something like np.vstack is used:

In [1]: import numpy as np

In [2]: from pint import UnitRegistry

In [3]: reg = UnitRegistry()

In [4]: coords_1 = np.array([[1, 2], [2, 1]]) * reg.mm

In [5]: coords_1
Out[5]: 
array([[1, 2],
       [2, 1]]) <Unit('millimeter')>

In [6]: coords_2 = np.array([[0, 3], [4, 0]]) * reg.um

In [7]: coords_2
Out[7]: 
array([[0, 3],
       [4, 0]]) <Unit('micrometer')>

In [8]: np.vstack([coords_1, coords_2])
Out[8]: 
array([[1.   , 2.   ],
       [2.   , 1.   ],
       [0.   , 0.003],
       [0.004, 0.   ]]) <Unit('millimeter')>

(Whether or not it's useful to mix spatial coordinates between two different experiments, it's still an improvement to have that concatenation be unit-aware.)

@mruffalo
Copy link
Author

@flying-sheep After writing my comment and re-reading yours, I could be persuaded that anndata should make it hard to blindly concatenate AnnData objects that have spatial information, if they're read separately because they're from separate experiments. Coordinate (2mm, 2mm) in Visium dataset 1 has no relationship to the same location in another Visium dataset, for example, and one shouldn't plot capture beads at those same locations on top of each other in the same plot. Maybe anndata.concat should grow a new parameter harmonize_units or something, to force converting everything to the same UnitRegistry if a user knows this is what they want to do.

My initial vision for this functionality was to make it easy on users to deal with physical quantities in AnnData objects, hence allowing mixing units between AnnDatas, to allow checking things like "of these two sequencing spatial transcriptomics datasets I have in memory, which has smaller size of capture bead?"

I would be relatively annoyed if I couldn't compare those quantities easily as an anndata user:

>>> (2 * UnitRegistry().mm) < (5 * UnitRegistry().um)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mruffalo/.opt/anaconda3/lib/python3.11/site-packages/pint/facets/plain/quantity.py", line 1381, in <lambda>
    __lt__ = lambda self, other: self.compare(other, op=operator.lt)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mruffalo/.opt/anaconda3/lib/python3.11/site-packages/pint/facets/plain/quantity.py", line 100, in wrapped
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mruffalo/.opt/anaconda3/lib/python3.11/site-packages/pint/facets/plain/quantity.py", line 1369, in compare
    raise ValueError(
ValueError: Cannot operate with Quantity and Quantity of different registries.

and I would probably be left with the impression that unit handling is an obstacle rather than an enhancement.

@flying-sheep
Copy link
Member

Yeah, those are the considerations. Basically the big question is “shall the registry live in each AnnData object or be global”

There’s also mixes possible, like having a context manager that overrides the (otherwise global) registry. @ivirshup @ilan-gold what do you think about this?

@ivirshup
Copy link
Member

Two main things:

  • I think this should be discussed in an issue
  • the use case sound like spatialdata, which already has coordinate spaces and transforms. Why not just use those?

@mruffalo
Copy link
Author

  • I think this should be discussed in an issue

Works for me. I'll make an issue when I get a chance, unless someone beats me to it.

  • the use case sound like spatialdata, which already has coordinate spaces and transforms. Why not just use those?

As noted on https://spatialdata.scverse.org/en/latest/, SpatialData is under active development and doesn't yet claim to offer a stable API. We want to distribute the results of HuBMAP computational pipelines in a stable (enough) format, and 'plain' H5AD is how we're already publishing sc/snRNA-seq processing results.

Our use case is sharing VIsium data in .h5ad format, with spatial coordinates in .obsm['X_spatial'] -- we are already doing this, and wanted to enhance these outputs by including physical units for these coordinates, not to mention metadata like the diameter of each capture bead.

Using a library for handling physical units eliminates a whole class of bugs, and using Pint has simplified a lot of our code, including in handling of physical pixel sizes read from OME-XML metadata. I think there's value in allowing users to use pint.Quantity objects in any appropriate AnnData fields, and supporting serializing those to disk -- users who load such a file can't miss the fact that there are physical sizes associated with some of the values, making it harder to handle these quantities incorrectly.

@@ -47,6 +47,7 @@ dependencies = [
"exceptiongroup; python_version<'3.11'",
"natsort",
"packaging>=20.0",
"pint",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might lean towards putting this in its own optional dependency section.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flying-sheep thoughts?

Comment on lines +313 to +314
_writer.write_elem(g, "magnitude", v.magnitude, dataset_kwargs=dataset_kwargs)
_writer.write_elem(g, "units", str(v.units), dataset_kwargs=dataset_kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the units should be written into the attrs instead of being given their own array. It is just a short string, right? This would also make it easier for interoperability purposes. Arrays with this extra metadata wouldn't break existing readers.

uns={"size": 100 * ad.units["um"]},
obsm={"X_spatial": X_arr * ad.units.mm},
)
write(tmp_path / name, adata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd follow the pattern from above, unless I'm missing a good reason not to i.e.,

read = lambda pth: getattr(ad, f"read_{diskfmt}")(pth)
write = lambda adata, pth: getattr(adata, f"write_{diskfmt}")(pth)

and the corresponding fixtures for the diskfmt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants