FixedScaleOffset for handling NaN inputs #511

mullenkamp · 2024-02-22T02:28:36Z

Hi,

I was wondering whether you would be interested in a FixedScaleOffset that can handle np.nan inputs. In the style of HDF/netCDF, it would use a fill value to replace np.nan with an appropriate integer. The fill value could be either user-defined or determined automatically from the astype integer size (for example, the smallest possible integer value).
I can either modify the existing FixedScaleOffset class or create another class. It's a very simple change, though there may be concerns about extra memory usage from the boolean masking.
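
To make the proposal concrete, here is a minimal sketch of the idea using plain NumPy. It is not the numcodecs implementation: the function names `encode_with_fill` and `decode_with_fill` and the `fill_value` parameter are hypothetical, and the scale/offset arithmetic simply follows the usual `(x - offset) * scale` convention.

```python
import numpy as np


def encode_with_fill(values, offset, scale, astype="i2", fill_value=None):
    """Scale floats to integers, storing NaN as an integer sentinel (hypothetical helper)."""
    int_dtype = np.dtype(astype)
    if fill_value is None:
        # Automatic choice: the smallest value the integer type can hold.
        fill_value = int(np.iinfo(int_dtype).min)
    values = np.asarray(values, dtype="f8")
    nan_mask = np.isnan(values)  # the boolean mask that costs extra memory
    # Swap NaN for a harmless placeholder so rounding and casting stay well defined.
    safe = np.where(nan_mask, offset, values)
    encoded = np.around((safe - offset) * scale).astype(int_dtype)
    encoded[nan_mask] = fill_value
    return encoded, fill_value


def decode_with_fill(encoded, offset, scale, fill_value, dtype="f8"):
    """Invert encode_with_fill, turning the sentinel back into NaN."""
    encoded = np.asarray(encoded)
    decoded = encoded.astype(dtype) / scale + offset
    decoded[encoded == fill_value] = np.nan
    return decoded


data = np.array([1000.1, np.nan, 1002.5])
enc, fill = encode_with_fill(data, offset=1000, scale=10, astype="i2")
dec = decode_with_fill(enc, offset=1000, scale=10, fill_value=fill)
```

One caveat with this approach: a real data value that happens to scale to the sentinel would be decoded as NaN, so the fill value needs to lie outside the expected data range.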

Also, is there any reason why dtype shouldn't always be a float and astype shouldn't always be an integer?

Thanks

martindurant · 2024-02-22T03:12:49Z

> is there any reason why dtype shouldn't always be a float and astype shouldn't always be an integer

I can certainly imagine an offset for integers, but a scale less so. Of course, you also get to specify the bitsize of each, and it can be important whether you save as uint8 and load into float32 or something bigger.
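
As an illustration of the integer case, here is a small NumPy sketch with made-up data (the array and the offset are assumptions, not taken from the issue). An offset alone lets a narrow band of large integers fit in a single byte each, and the in-memory type is chosen separately when loading.

```python
import numpy as np

# Hypothetical data: integers in a narrow band far from zero.
counts = np.array([1000, 1003, 1255], dtype="i8")
offset = 1000

# Store as uint8: 1 byte per value instead of 8.
stored = (counts - offset).astype("u1")

# Widen to the in-memory type first, then add the offset back.
restored = stored.astype("i8") + offset

print(counts.nbytes, stored.nbytes)
```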

This commit addresses issue zarr-developers#511 by adding support for handling NaN inputs in the FixedScaleOffset class. The changes include:
- Introduced a check to ensure that when a fill_value is provided, the input dtype must be floating-point.
- Prevented the use of integer dtypes for fill_value, which cannot encode NaN values.
- Updated type and casting validation to ensure that fill_value is correctly cast to the specified astype.
- Only support float -> int -> float transformations, as float -> float already natively supports NaNs without fill_value.
- Added tests for fill_value options.
References: zarr-developers#511

mps01060 · 2024-09-18T16:01:16Z

We have a lot of use for this, especially to move away from the "add_offset" and "scale_factor" attributes in xarray and use the zarr encoding directly instead. I’ve made some preliminary changes to address this issue in my fork. You can check them out in this branch: fixedscaleoffset-nans.

I added to the tests, except for test_backwards_compatibility. That test appears to create fixture files that are not cleaned up after each run, which interferes with the nan/fill_value cases because the codecs are run on all previously saved datasets. This causes a problem when a scale/offset is applied to an "old" dataset whose values do not make sense with the current codec's offset/scale. I am probably overlooking a way to handle this:

https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/tests/common.py#L259-L276

    # save fixture data
    for i, arr in enumerate(arrays):
        arr_fn = os.path.join(fixture_dir, f'array.{i:02d}.npy')
        if not os.path.exists(arr_fn):  # pragma: no cover
            np.save(arr_fn, arr)

    # load fixture data
    for arr_fn in glob(os.path.join(fixture_dir, 'array.*.npy')):
        # setup
        i = int(arr_fn.split('.')[-2])
        arr = np.load(arr_fn, allow_pickle=True)
        arr_bytes = arr.tobytes(order='A')
        if arr.flags.f_contiguous:
            order = 'F'
        else:
            order = 'C'

        for j, codec in enumerate(codecs):

Would this be something worth going forward with? Thank you for the help/suggestions.

This commit addresses issue zarr-developers#511 by adding support for handling NaN inputs in the FixedScaleOffset class. The changes include:
- Introduced a check to ensure that when a fill_value is provided, the input dtype must be floating-point.
- fill_value must be an integer dtype.
- Updated type and casting validation to ensure that fill_value is correctly cast to the specified astype (e.g. a fill_value of 3000 cannot be cast to int8).
- Only support float -> int -> float transformations, as float -> float already natively supports NaNs without fill_value.
- Added tests for fill_value options.
References: zarr-developers#511

This commit addresses issue zarr-developers#511 by adding support for handling NaN inputs in the FixedScaleOffset class. The changes include:
- Introduced a check to ensure that when a fill_value is provided, the input dtype must be floating-point.
- fill_value must be an integer dtype.
- Updated type and casting validation to ensure that fill_value is correctly cast to the specified astype (e.g. a fill_value of 3000 cannot be cast to int8).
- Only support float -> int -> float transformations, as float -> float already natively supports NaNs without fill_value.
- Added tests for fill_value options.
- Added fixtures for the fill_value version of fixedscaleoffset.
References: zarr-developers#511
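
The validation rules listed in the commit message could be sketched roughly as follows. This is an illustration only, not the code from the fork: `validate_fill_value` is a hypothetical helper, and the exception types and messages are assumptions.

```python
import numpy as np


def validate_fill_value(dtype, astype, fill_value):
    """Hypothetical checks mirroring the rules in the commit message."""
    dtype, astype = np.dtype(dtype), np.dtype(astype)
    if fill_value is None:
        return None  # no NaN handling requested
    # Only float -> int -> float is supported when a fill_value is given.
    if not np.issubdtype(dtype, np.floating):
        raise ValueError("fill_value requires a floating-point dtype")
    if not np.issubdtype(astype, np.integer):
        raise ValueError("fill_value requires an integer astype")
    # The fill value itself must be an integer (bool is excluded explicitly).
    if isinstance(fill_value, bool) or not isinstance(fill_value, (int, np.integer)):
        raise TypeError("fill_value must be an integer")
    # It must also be representable in astype, e.g. 3000 does not fit in int8.
    info = np.iinfo(astype)
    if not info.min <= fill_value <= info.max:
        raise ValueError(f"fill_value {fill_value} does not fit in {astype}")
    return astype.type(fill_value)


ok = validate_fill_value("f4", "i1", -128)
```

With these checks, `validate_fill_value("f8", "i1", 3000)` raises a `ValueError`, matching the int8 example in the commit message.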

mps01060 · 2024-10-07T15:36:50Z

I was confused in my previous post. I've added some fixtures for the case where a fill_value is present. Would this be worth a PR, or should this "fill_value" FixedScaleOffset be added as a different filter?
