Replies: 4 comments
-
Thanks for the detailed issue. It is good to hear that you're considering ASDF-in-FITS, and it is very helpful to hear details of your application so we can better understand what options might best fit your needs. As you've described, ASDF has many benefits for data storage due to its flexibility and readability. Some of these benefits are compromised when writing to a FITS format. For the hybrid ASDF-in-FITS there are many options for how the two formats can coexist (for example, do arrays get stored in separate extensions? do keywords map to tree data?). stdatamodels currently has two ways of working with ASDF-in-FITS files that make different choices about these options:

• Datamodels are tailored to usage with JWST but allow a pretty flexible mapping between DataModel attributes and FITS header keywords, extensions and ASDF data.
• asdf_in_fits is more limited (similar to the previous AsdfInFits in the core asdf library) and only attempts to match array data between the ASDF tree and FITS extensions.

In both cases the ASDF tree is serialized to a single FITS extension (named 'ASDF') that contains a one-dimensional sequence of bytes representing both the tree and any binary data that is not otherwise stored in the other FITS extensions. The existence of two options probably speaks to your question about a general solution. Given the flexibility of both file formats and the varied requirements of different use cases, it seems unlikely that a general solution for ASDF-in-FITS could be made.

Would you elaborate on your point about the YAML block being the only part you care about reading? Are you envisioning something where, during writing, portions of the ASDF tree (or other in-memory object) are used to generate FITS extensions, data and headers, and then the tree is written to a separate extension (let's say 'ASDF')? On read, the embedded 'ASDF' extension is read and the other FITS data discarded?

More generally, do either of the above options provided by stdatamodels sound suitable for your uses?

Also, do you have any example data or examples of the objects you'd like to store? Seeing these would be helpful to better understand how you're hoping to use the format.
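For concreteness, a minimal sketch of the embedding described above (the tree serialized to a byte sequence stored in an extension named 'ASDF'), using only asdf and astropy; the uint8 packing and HDU choices here are illustrative rather than the exact stdatamodels layout:

```python
import io

import numpy as np
import asdf
from astropy.io import fits

# Serialize the ASDF content (YAML tree plus any ASDF-managed binary blocks)
# to a one-dimensional sequence of bytes.
tree = {"meta": {"telescope": "example"}, "background": {"level": 12.3}}
buf = io.BytesIO()
asdf.AsdfFile(tree).write_to(buf)
asdf_bytes = np.frombuffer(buf.getvalue(), dtype=np.uint8)

# Science data goes in ordinary FITS extensions; the ASDF bytes get their own HDU.
hdul = fits.HDUList([
    fits.PrimaryHDU(),
    fits.ImageHDU(data=np.zeros((10, 10), dtype=np.float32), name="SCI"),
    fits.ImageHDU(data=asdf_bytes, name="ASDF"),
])
hdul.writeto("hybrid.fits", overwrite=True)
```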
-
To elaborate on or clarify some of the points made in Brett's comment above:
I’d like to emphasize that what we are discouraging (because of its complex implementation) is the linking of arrays in the ASDF extension with other FITS extensions to the point that when the ASDF is extracted, the linked extensions are used to populate arrays within the ASDF structure. This feature was developed to support JWST requirements that are not present for Roman. We would prefer to keep such features linked only to JWST software.
If, on the other hand, you are interested only in inserting and extracting ASDF content into and from a FITS extension without the JWST behaviors of relocating ASDF arrays to FITS extensions, this is far easier to support. I tend to think that we should provide a simple means of doing so, preferably not bound to the ASDF-in-FITS implementation. That’s why we are interested in the details of how you would like to use ASDF FITS extensions.
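For the simple insert/extract direction, something along these lines would be enough (a rough sketch, not an existing stdatamodels API, and assuming the tree was embedded in an extension named 'ASDF' as in the sketch above):

```python
import io

import asdf
from astropy.io import fits

# Pull the embedded ASDF bytes back out of the 'ASDF' extension and open them as
# a standalone ASDF file, without touching the other FITS HDUs at all.
with fits.open("hybrid.fits") as hdul:
    asdf_bytes = hdul["ASDF"].data.tobytes()

with asdf.open(io.BytesIO(asdf_bytes)) as af:
    print(af.tree["meta"])  # the plain tree; no arrays relocated to FITS extensions
```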
Regards, Perry
-
I misspoke slightly: I was thinking that we'd read the YAML tree from the last HDU and read binary data from all preceding FITS HDUs, and that there would be no ASDF binary block in the last HDU. The ASDF YAML tree would refer to arrays in those preceding FITS HDUs. But we'd ignore all FITS headers, aside from the bare minimum necessary to index into those HDUs.

That's because on write, we'd knowingly only write FITS header metadata that duplicates something in the ASDF tree - the example in my head is a near-standard TAN-SIP FITS WCS that is really just an approximation to the real composed WCS model embedded in the ASDF tree. The idea is to satisfy the spirit of our FITS requirement as much as we reasonably can, not just its letter, especially in cases where the FITS standard has a way to describe a concept. Most of the binary data in the FITS HDUs would be standard images that the ASDF tree would only annotate slightly, but it'd be nice to go further and e.g. save a background model as a binned image HDU where the ASDF tree specifies the interpolant.

So I think I am asking about @perrygreenfield's hard case, but I'm wondering if a read-support-only general version of that might only be moderately hard. In other words, I know writing out our data model to both FITS and ASDF in the same file as described above is a complex problem, and I'm not bothered by having a complex implementation specific to our data model, because the definition of that data model also limits its complexity. But I was hoping that a general implementation for reading that - by reading the ASDF YAML tree and doing partial reads of the preceding FITS HDUs to return a nested dictionary with numpy arrays - wouldn't be too bad. And moreover that this wouldn't be too much to ask of ASDF implementations in other languages, where using the code that knows about our data model would not be an option.

But I also get the impression that this was more or less the approach taken for JWST and you've decided not to go that route for Roman. So maybe we should consider a model in which the ASDF tree has a binary block of its own that's used for everything but the full-scale images, which we definitely have to put in the regular FITS HDUs, but for which references from the ASDF tree are extremely simple, if they're even necessary.
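For concreteness, a very rough sketch of the general read-only path I have in mind; the `{"fits_hdu": ...}` marker used to point from the tree into a FITS HDU is made up purely for illustration, and a real convention would need to be agreed on:

```python
import io

import asdf
import numpy as np
from astropy.io import fits


def read_hybrid(path):
    """Return the embedded ASDF tree as a nested dict, with hypothetical
    {"fits_hdu": ...} references resolved into arrays read from the FITS HDUs."""

    def resolve(node, hdul):
        if isinstance(node, dict):
            if set(node) == {"fits_hdu"}:
                # Partial read: only the referenced HDU's data is loaded and copied.
                return np.array(hdul[node["fits_hdu"]].data)
            return {key: resolve(value, hdul) for key, value in node.items()}
        if isinstance(node, list):
            return [resolve(value, hdul) for value in node]
        return node

    with fits.open(path) as hdul:
        # Convention assumed here: the YAML tree lives in the last HDU, with no
        # ASDF binary block of its own.
        with asdf.open(io.BytesIO(hdul[-1].data.tobytes())) as af:
            return resolve(af.tree, hdul)
```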
-
One of the lessons I took away from working with the JWST code is that duplicating metadata between the ASDF tree and the FITS keywords causes more trouble than it's worth -- it's easy for an unsuspecting user or code to update only one copy of the metadata, and detecting/managing drift between the two introduces complexity. If I were to start over, I'd store whatever I could in the conventional FITS keywords, and only write to ASDF the metadata that can't be easily represented there. On top of that would be a Python data model class that knows how to map each metadata property to the correct location.

The ASDF-in-FITS feature that was recently evicted from this library was a hairball, both in concept and in implementation. On disk the ASDF was embedded in the FITS file, but in Python it was constructed the other way around: the ASDF object was the outermost container, with internal references to a FITS object. I was never able to develop a clear mental model of what was going on. We also struggled with implementation details of the FITS library leaking into the ASDF tree in unexpected ways.

I see value in a library that provides a framework for working with hybrid FITS/ASDF files, including conveniences for defining mappings between data model properties and their (ASDF or FITS) storage locations. That sounds a lot like stdatamodels, and it is, but IMO there were early choices made there that should be re-examined with the benefit of hindsight.
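To illustrate the kind of data model class meant in the first paragraph above (names and keywords here are purely illustrative): each property has exactly one storage location, so there is no second copy to drift.

```python
class ExampleModel:
    """Illustrative sketch: the model is the only thing that knows whether a given
    metadata property lives in a FITS keyword or in the embedded ASDF tree."""

    def __init__(self, hdulist, tree):
        self._hdulist = hdulist  # astropy.io.fits.HDUList holding conventional keywords
        self._tree = tree        # dict destined for the embedded ASDF extension

    @property
    def exposure_time(self):
        # Simple scalar metadata: stored only as a FITS keyword.
        return self._hdulist[0].header["EXPTIME"]

    @exposure_time.setter
    def exposure_time(self, value):
        self._hdulist[0].header["EXPTIME"] = value

    @property
    def wcs(self):
        # Structured metadata FITS can't easily express: stored only in the ASDF tree.
        return self._tree["wcs"]

    @wcs.setter
    def wcs(self, value):
        self._tree["wcs"] = value
```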
-
I've been keeping an eye on ASDF in FITS for a while as a potential thing Rubin Observatory might be able to adopt. Right now we've got something built on top of pure FITS, but the data model is not the kind of thing I'd want to impose on a bunch of science users (let alone document well enough for that). Dropping FITS entirely is a non-starter, but I'd love to be able to add some proper hierarchical metadata in a standardized way.
So I was a bit worried to see that the reference implementation of asdf is deprecating its support for ASDF-in-FITS, in favor of code in stdatamodels. Anyone have any background on this decision, or thoughts on how important ASDF-in-FITS is to the broader ASDF effort these days?
More specifically, we've got a system for serializing arbitrary objects - PSFs, WCSs, transmission curves, aperture correction maps - to our image FITS files that we don't really like. It stuffs them into FITS binary table HDUs in a not-very-intuitive way, it's hard to make self-documenting, and it's a bit wasteful for small objects that don't have much to store. It's also written in C++ and we'd like to move the implementation to Python. One thing that it does nicely (and which we rely on in some important cases) is maintain internal pointers through serialization, much like pickle does.
So my ideal for our replacement serialization framework would be to target a file format that has both a flexible YAML or JSON tree and a place to put strided-memory arrays of various shapes, which sounds a lot like ASDF. But we're pretty locked in to FITS (there are formal requirements way above my head that I don't want to try to change). Implementing it as ASDF-in-FITS seemed pretty viable to me, and I had some thoughts about how we could use a schema-definition system to make sure the FITS metadata wasn't completely bare, even if the YAML block would be authoritative and the only thing we care about reading.
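To make that concrete, the content I'd want to store looks roughly like this, which plain ASDF already handles well (the names and values are made up):

```python
import asdf
import numpy as np

# A free-form metadata tree plus strided arrays, written as a standalone ASDF file.
tree = {
    "psf": {"kind": "example-gaussian", "sigma": 1.7},
    "aperture_correction": {
        "radii": np.linspace(1.0, 12.0, 12),
        "correction": np.ones(12),
    },
}
asdf.AsdfFile(tree).write_to("example.asdf")
```

I believe asdf also writes a single binary block when the same array object appears in more than one place in the tree, which is at least close to the internal-pointer behavior we currently rely on.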
I wasn't certain whether we'd want to actually use the asdf library to implement this (I'm open to opinions), so the removal of ASDF-in-FITS here isn't necessarily a problem. But the standard moving away from ASDF-in-FITS or a lack of third-party general-purpose ASDF-in-FITS readers would certainly make selecting ASDF-in-FITS less advantageous for us.
My guess is that this calculus is similar to what happened with stdatamodels, in that ASDF-in-FITS can work well for a particular fleshed-out data model, but is hard to do in general? If so, is that mostly about it being difficult to take arbitrary ASDF YAML and construct good FITS headers from that, and would that mean less-ambitious general-purpose ASDF-in-FITS support might still be viable?