As demonstrated in #5348, there are some problems with the existing handling of these attributes in netcdf data.
Broadly,
at present these are not treated specially but passed from (netcdf) input to (netcdf) output unchanged.
... which means they will generally be 'inherited' by cubes and coordinates (etc.) which are in some way derived from the input ones, and then applied to those derived objects when they are written out.
... which is a problem, because the content of such 'derived' things may have completely different meanings, units etc., and hence a different valid value range ...
... so the original 'valid range' may no longer be appropriate for the saved data.
... and this can inappropriately affect the results when that data is read back in.
(For context: 'derived' could mean regridding, cube arithmetic, statistics, units conversion or whatever. In the #5348 testcase, the use of 'intersection' produces a modified longitude coordinate; a sketch of that failure mode follows below.)
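To make the problem concrete, here is a minimal sketch of the failure mode, assuming a hypothetical input file whose longitude coordinate carries a valid_range attribute (file names and values are invented for illustration):

```python
import iris

# Hypothetical example file: its longitude coordinate variable carries, say,
# valid_range = [0., 360.]  (file names and values here are illustrative only).
cube = iris.load_cube("input.nc")

# 'intersection' produces a modified longitude coordinate, e.g. wrapped into
# -180..180, but any inherited valid_range attribute is carried over unchanged.
region = cube.intersection(longitude=(-180, 180))
iris.save(region, "region.nc")

# On re-load, the netcdf layer masks the longitude points that now fall
# outside the stale valid_range -- which is not what anyone intended.
reloaded = iris.load_cube("region.nc")
```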
Some background
I already took a long look into the various groups of netcdf attributes which Iris handles in different ways, in the context of the ongoing project to improve handling of global+local attributes: see my previous summary in this comment.
Behaviour of current code (v3.6.0)
The "valid_range", "valid_min" and "valid_max" attributes are not treated specially: they are simply passed through from input to output (unlike scale_factor and add_offset, which are handled as part of the data encoding).
netcdf4 behaviour
In fact the netcdf4-python library docs (e.g. here) don't make it clear that the library respects "valid_range" in this way, but actual practice and this code show that it definitely does, treating it as equivalent to valid_min/valid_max (which it does document).
The effect is simply that, on read, points outside a valid range are masked.
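This is easy to confirm with netCDF4 alone; the following is a small self-contained check of that behaviour (file name and numbers are arbitrary):

```python
import netCDF4
import numpy as np

# Write a tiny file with a valid_range attribute on the variable.
with netCDF4.Dataset("demo.nc", "w") as ds:
    ds.createDimension("x", 4)
    var = ds.createVariable("data", "f8", ("x",))
    var.valid_range = np.array([0.0, 10.0])
    var[:] = [1.0, 5.0, 50.0, -3.0]

# On read, with the default auto-masking, the out-of-range points come back
# masked -- just as they would for valid_min / valid_max.
with netCDF4.Dataset("demo.nc") as ds:
    print(ds.variables["data"][:])
    # e.g. masked_array(data=[1.0, 5.0, --, --],
    #                   mask=[False, False, True, True], ...)
```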
Proposed changes
In short, I think it is arguable that these attributes are part of the low-level encoding and interpretation of netcdf variable data, and as such "ought" to be treated by Iris in the same way as scale_factor and add_offset.
What that might mean ...
on load, a valid-range causes the netcdf data-fetch to mark additional points as 'masked',
so logically, masked points are already our internal Iris representation of that information, and that is what we should rely on.
on save, Iris can't automatically determine a 'valid range' as a concept separate from the masked points, and so I think it should never write these attributes -- except possibly by specific user request.
So I propose that:
on load, we should discard these attributes, exactly as we do for scale_factor, add_offset and _FillValue (see the sketch after this list).
on save, we should not create these attributes by default.
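In code terms, the load-side part might amount to little more than extending the set of attributes that get stripped before building the cube or coord. A rough sketch, using invented names rather than actual Iris internals:

```python
# Attributes consumed by the data-encoding layer, and therefore NOT copied onto
# the loaded cube / coordinate.  (These names are a sketch, not Iris code.)
_ENCODING_ATTRS = {
    "scale_factor", "add_offset", "_FillValue",
    "valid_range", "valid_min", "valid_max",
}

def user_attributes(nc_attributes: dict) -> dict:
    """Return only the attributes that should survive onto the loaded object."""
    return {
        name: value
        for name, value in nc_attributes.items()
        if name not in _ENCODING_ATTRS
    }
```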
Remaining queries
Feedback wanted on these
(1) user overrides
In order to support the "by specific request" idea, it would also be logical to add the valid_xxx as possible entries in the 'packing' keyword of the netcdf Saver.write method.
N.B. however, just as with scale_factor/add_offset, this naturally restricts the usage to cube data, and does not apply to coordinates, as originally raised in #5348.
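For illustration, the existing 'packing' usage and the hypothetical extension might look something like this (the valid_min/valid_max entries are the proposal, not anything currently supported):

```python
import iris

cube = iris.load_cube("input.nc")

# Existing usage: pack the data to int16 with a scale/offset.
iris.save(cube, "packed.nc",
          packing={"dtype": "int16", "scale_factor": 0.1, "add_offset": 5.0})

# Hypothetical extension (NOT currently supported): also request valid-range
# attributes on the output variable.
iris.save(cube, "packed_with_range.nc",
          packing={"dtype": "int16", "scale_factor": 0.1, "add_offset": 5.0,
                   "valid_min": 0, "valid_max": 1000})
```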
In the cause of treating them "just like scale_factor and add_offset", it would also seem logical to disallow these attributes in Iris attribute dictionaries.
However, I'm not sure that the rationale here is really the same as for scaling/offset ...
If you define scale_factor or add_offset for a variable -- typically along with a different dtype -- then that affects how the data is actually stored.
But AFAICT it would in this case also be "safe" to allow the user to explicitly set valid-range attributes in the attributes dictionary, and these would then simply be written with the variable.
In that approach, we would not exclude them from attributes dictionaries, and would not need to support them in 'packing' either. That certainly seems simpler, and it is applicable to coords etc. as well as cubes.
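In that simpler approach, user code would look roughly like this (values are illustrative, and it relies only on attributes being written through as they are today):

```python
import iris

cube = iris.load_cube("input.nc")

# Explicitly request valid-range attributes on the data variable ...
cube.attributes["valid_min"] = 0.0
cube.attributes["valid_max"] = 1000.0

# ... and, unlike 'packing', the same works for a coordinate too.
cube.coord("longitude").attributes["valid_range"] = [-180.0, 180.0]

iris.save(cube, "with_valid_range.nc")
```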
(2) compatibility control
The above schemes will not be fully backwards-compatible, but do seem like an improved standard behaviour for the future.
So we should probably consider introducing a FUTURE switch for it.
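If so, usage might look something like the following sketch, where the flag name is invented purely for illustration:

```python
import iris

# Purely illustrative: the flag name below is invented and does not exist.
with iris.FUTURE.context(discard_netcdf_valid_range=True):
    cube = iris.load_cube("input.nc")
    # ... under the new behaviour, any valid_range / valid_min / valid_max
    # attributes would have been dropped on load.
```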