FixedScaleOffset for handling NaN inputs #511

Open
mullenkamp opened this issue Feb 22, 2024 · 3 comments

@mullenkamp

Hi,

I was wondering whether you would be interested in a FixedScaleOffset that can handle np.nan inputs. In the style of HDF/netCDF, it would use a fill value to replace np.nan with an appropriate integer. The fill value could be either user-defined or determined automatically from the astype integer size (for example, the smallest possible integer value).
I can either modify the existing FixedScaleOffset class or create a new one. It's a very simple change, though there may be concerns about higher memory usage due to boolean masking.
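
To make the proposal concrete, here is a minimal NumPy-only sketch of the idea. It is not the numcodecs implementation: the function names `encode_with_fill` and `decode_with_fill` are hypothetical, and the scale/offset arithmetic is assumed to follow the usual `round((x - offset) * scale)` convention. The default fill value is the smallest integer representable in `astype`, as suggested above.

```python
import numpy as np


def encode_with_fill(arr, offset, scale, astype="i2", fill_value=None):
    """Hypothetical helper: scale/offset floats to ints, mapping NaN to a fill value."""
    arr = np.asarray(arr, dtype="f8")
    if fill_value is None:
        # Assumption: default to the smallest value the integer type can hold.
        fill_value = np.iinfo(astype).min
    nan_mask = np.isnan(arr)  # the boolean mask mentioned as a memory concern
    scaled = np.around((arr - offset) * scale)
    # Replace NaN before the cast, since casting NaN to an integer is not well defined.
    scaled[nan_mask] = fill_value
    return scaled.astype(astype), fill_value


def decode_with_fill(enc, offset, scale, fill_value, dtype="f8"):
    """Hypothetical helper: invert encode_with_fill, restoring NaN at fill positions."""
    enc = np.asarray(enc)
    fill_mask = enc == fill_value
    dec = enc.astype(dtype) / scale + offset
    dec[fill_mask] = np.nan
    return dec


data = np.array([1.5, np.nan, 2.25])
enc, fill = encode_with_fill(data, offset=1.0, scale=100, astype="i2")
dec = decode_with_fill(enc, offset=1.0, scale=100, fill_value=fill)
```

One caveat of this sketch: a real value that happens to encode to the fill value would decode as NaN, so the fill value has to lie outside the range of valid encoded data.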

Also, is there any reason why dtype shouldn't always be a float and astype shouldn't always be an integer?

Thanks

@martindurant
Member

> is there any reason why dtype shouldn't always be a float and astype shouldn't always be an integer

I can certainly imagine an offset for integers, but a scale less so. Of course, you also get to specify the bitsize of each, and it can be important whether you save as uint8 and load into float32 or something bigger.
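
A small NumPy-only illustration of both points, under the assumption that the codec computes `round((x - offset) * scale)` on encode and `enc / scale + offset` on decode. The helper names `fso_encode` and `fso_decode` are hypothetical and only stand in for the codec here.

```python
import numpy as np


def fso_encode(arr, offset, scale, astype):
    # Assumed scale/offset convention; not taken from the numcodecs source.
    return np.around((np.asarray(arr) - offset) * scale).astype(astype)


def fso_decode(enc, offset, scale, dtype):
    return (np.asarray(enc) / scale + offset).astype(dtype)


# Float data stored as uint8 and loaded back as float32.
temps = np.array([20.0, 20.5, 21.0, 32.5], dtype="f4")
enc = fso_encode(temps, offset=20, scale=2, astype="u1")
dec = fso_decode(enc, offset=20, scale=2, dtype="f4")

# Integer data where only the offset does useful work (scale stays at 1).
years = np.array([2000, 2001, 2024], dtype="i4")
enc_years = fso_encode(years, offset=2000, scale=1, astype="u1")
```

Here the encoded arrays take one byte per element instead of four, which is why the choice of bit size on each side matters independently of whether the data are floats or integers.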

mps01060 added a commit to mps01060/numcodecs that referenced this issue Sep 18, 2024
This commit addresses issue zarr-developers#511 by adding support for handling NaN
inputs in the FixedScaleOffset class. The changes include:

- Introduced a check to ensure that when a fill_value is provided, the
  input dtype must be floating-point.
- Prevented the use of integer dtypes for fill_value, which cannot
  encode NaN values.
- Updated type and casting validation to ensure that fill_value is
  correctly cast to the specified astype.
- Only support float -> int -> float transformations, as float -> float
  already natively supports NaNs without fill_value
- Added tests for fill_value options

References: zarr-developers#511
@mps01060

We have a lot of use for this, especially to move away from using attributes "add_offset" and "scale_factor" in xarray, and instead using the zarr encoding directly. I’ve made some preliminary changes to address this issue in my fork. You can check them out in this branch: fixedscaleoffset-nans.

I added to the tests, except for test_backwards_compatibility. That test appears to create fixture files that are not cleaned up after each run, which interferes with the nan/fill_value cases because the codecs are run on all previously saved datasets. This causes a problem when a scale/offset is applied to an "old" dataset whose values do not make sense with the current codec's offset/scale. I am probably overlooking a way to handle this:

https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/tests/common.py#L259-L276

    # save fixture data
    for i, arr in enumerate(arrays):
        arr_fn = os.path.join(fixture_dir, f'array.{i:02d}.npy')
        if not os.path.exists(arr_fn):  # pragma: no cover
            np.save(arr_fn, arr)

    # load fixture data
    for arr_fn in glob(os.path.join(fixture_dir, 'array.*.npy')):
        # setup
        i = int(arr_fn.split('.')[-2])
        arr = np.load(arr_fn, allow_pickle=True)
        arr_bytes = arr.tobytes(order='A')
        if arr.flags.f_contiguous:
            order = 'F'
        else:
            order = 'C'

        for j, codec in enumerate(codecs):

Would this be something worth going forward with? Thank you for the help/suggestions.

mps01060 added a commit to mps01060/numcodecs that referenced this issue Sep 18, 2024
This commit addresses issue zarr-developers#511 by adding support for handling NaN
inputs in the FixedScaleOffset class. The changes include:

- Introduced a check to ensure that when a fill_value is provided, the
  input dtype must be floating-point.
- fill_value must be an integer dtype
- Updated type and casting validation to ensure that fill_value is
  correctly cast to the specified astype (e.g. fill_value of 3000 cannot
  cast to int8)
- Only support float -> int -> float transformations, as float -> float
  already natively supports NaNs without fill_value
- Added tests for fill_value options

References: zarr-developers#511
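
The validation rules listed in this commit message could look roughly like the following. This is a sketch under stated assumptions, not the code from the fork: `check_fill_value` is a hypothetical function name, and the exact exception types and messages are guesses.

```python
import numpy as np


def check_fill_value(dtype, astype, fill_value):
    """Hypothetical validator mirroring the rules in the commit message."""
    dtype = np.dtype(dtype)
    astype = np.dtype(astype)
    if fill_value is None:
        return None  # no NaN handling requested
    # fill_value only makes sense when the input can contain NaN.
    if dtype.kind != "f":
        raise ValueError("fill_value requires a floating-point input dtype")
    # float -> float already preserves NaN, so only float -> int is supported.
    if astype.kind not in "iu":
        raise ValueError("fill_value requires an integer astype")
    # fill_value itself must be an integer (bool is excluded explicitly).
    if isinstance(fill_value, bool) or not isinstance(fill_value, (int, np.integer)):
        raise TypeError("fill_value must be an integer")
    # fill_value must be representable in astype, e.g. 3000 does not fit in int8.
    info = np.iinfo(astype)
    if not info.min <= fill_value <= info.max:
        raise ValueError(f"fill_value {fill_value} cannot be cast to {astype}")
    return astype.type(fill_value)


ok = check_fill_value("f4", "i2", -32768)
```

Calling `check_fill_value("f4", "i1", 3000)` raises `ValueError` in this sketch, matching the int8 example given in the commit message.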
mps01060 added a commit to mps01060/numcodecs that referenced this issue Oct 7, 2024
This commit addresses issue zarr-developers#511 by adding support for handling NaN
inputs in the FixedScaleOffset class. The changes include:

- Introduced a check to ensure that when a fill_value is provided, the
  input dtype must be floating-point.
- fill_value must be an integer dtype
- Updated type and casting validation to ensure that fill_value is
  correctly cast to the specified astype (e.g. fill_value of 3000 cannot
  cast to int8)
- Only support float -> int -> float transformations, as float -> float
  already natively supports NaNs without fill_value
- Added tests for fill_value options
- Added fixtures for fill_value version of fixedscaleoffset

References: zarr-developers#511
@mps01060

mps01060 commented Oct 7, 2024

I was confused in my previous post. I've now added fixtures for the case where a fill_value is present. Would this be worth a PR, or should this "fill_value" FixedScaleOffset be added as a separate filter?

mps01060 added a commit to mps01060/numcodecs that referenced this issue Oct 29, 2024
This commit addresses issue zarr-developers#511 by adding support for handling NaN
inputs in the FixedScaleOffset class. The changes include:

- Introduced a check to ensure that when a fill_value is provided, the
  input dtype must be floating-point.
- fill_value must be an integer dtype
- Updated type and casting validation to ensure that fill_value is
  correctly cast to the specified astype (e.g. fill_value of 3000 cannot
  cast to int8)
- Only support float -> int -> float transformations, as float -> float
  already natively supports NaNs without fill_value
- Added tests for fill_value options
- Added fixtures for fill_value version of fixedscaleoffset

References: zarr-developers#511