mfdataset compat and memory explosion #9172

lbesnard · 2024-06-25T07:32:41Z

lbesnard
Jun 25, 2024

Hi, I'm having various issues concatenating multiple CF NetCDF files together with xarray. I'm seeking a bit of guidance! I'm happy to create some github issues if necessary but I might be the one doing wrong stuff.

1) mfdataset "compat" option

The doc is not clear to me if I'm using the option for example:
compat = "override” # skip comparing and pick variable from first dataset
Will the variable attributes such as scale_factor and add_offset will be the ones used from the first NetCDF ? This is what the doc implies, but from my testing, I'd say the answer is No, and those attributes are kept per file. But I'd like confirmation and I think the doc should clarify this.

2) compat="override" Bug ??

if I'm running the following code:

import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)

s3path = 's3://imos-data/IMOS/SRS/SST/ghrsst/L3S-1d/dn/2024/*'
remote_files = s3.glob(s3path)

# Iterate through remote_files to create a fileset
fileset = [s3.open(file) for file in remote_files]
fileset = fileset[118:130]

ds = xr.open_mfdataset(
    fileset[1:3],
    engine='h5netcdf',
    concat_characters=True,
    mask_and_scale=True,
    decode_cf=True,
    decode_times=True,
    use_cftime=True,
    parallel=True,
    decode_coords=True,
    compat="override",
    lock=False
)

I get a

ValueError: Cannot specify both coords='different' and compat='override'.

But I don't specify anything for coords. If I remove the compat option, the same code works fine for this set of files.

I'm not sure if it's a bug, or me completing misunderstanding what I'm doing.

3) Memory explosion

If I'm running the same code, but on a slightly different set of files (2). All my local memory is being used (I'd like to note here as well that If i'm running the same code on a remote cluster with dask distributed and coiled from my local machine, then my local machine memory is being used to saturation as well, which doesn't make any sense, but I'm not sure if it's an xarray/dask bug).

I'd like to understand why:

The compute seems to be happening while this should be lazy
How to solve this issue since there are no error message.

import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)

s3path = 's3://imos-data/IMOS/SRS/SST/ghrsst/L3S-1d/dn/2024/*'
remote_files = s3.glob(s3path)

# Iterate through remote_files to create a fileset
fileset = [s3.open(file) for file in remote_files]
fileset = fileset[118:130]

# This blues out the mem
ds = xr.open_mfdataset(
    fileset[3:5],
    engine='h5netcdf',
    concat_characters=True,
    mask_and_scale=True,
    decode_cf=True,
    decode_times=True,
    use_cftime=True,
    parallel=True,
    decode_coords=True,
    #compat="override",
    lock=False
)

My env

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.5.0-1023-oem
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: en_IE.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2024.6.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.14.0
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: 2.18.2
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.4.0
dask: 2024.6.2
distributed: 2024.6.2
matplotlib: 3.9.0
cartopy: None
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.6.0
cupy: None
pint: None
sparse: None
flox: 0.9.8
numpy_groupies: 0.11.1
setuptools: 70.1.0
pip: 24.0
conda: None
pytest: 8.2.2
mypy: 1.10.1
IPython: 7.34.0
sphinx: None

4) Thanks :)

Answered by keewis

Jun 25, 2024

Thanks a lot for the clear questions, @lbesnard!

question 1: the decoding is applied to each individual dataset separately before concatenating, so unless you pass decode_cf=False that's correct. In general, we'd be happy to merge a PR that makes the documentation clearer!

question 2: the defaults are compat="equals", coords="different", data_vars="different". This means that if you switch to compat="override", you also have to change coords and data_vars (annoying, I know. See #8778 for a proposal to change this)

question 3: not sure about the actual reason, but could you try with compat="override", coords="minimal", data_vars="minimal"? At least on my machine that completes without memo…

View full answer

keewis · 2024-06-25T09:44:00Z

keewis
Jun 25, 2024
Maintainer

Thanks a lot for the clear questions, @lbesnard!

question 1: the decoding is applied to each individual dataset separately before concatenating, so unless you pass decode_cf=False that's correct. In general, we'd be happy to merge a PR that makes the documentation clearer!

question 2: the defaults are compat="equals", coords="different", data_vars="different". This means that if you switch to compat="override", you also have to change coords and data_vars (annoying, I know. See #8778 for a proposal to change this)

question 3: not sure about the actual reason, but could you try with compat="override", coords="minimal", data_vars="minimal"? At least on my machine that completes without memory explosion. The only difference other than compat is that I've created a LocalCluster:

from distributed import Client

client = Client()
print(client.dashboard_link)

1 reply

lbesnard Jun 26, 2024
Author

thanks heaps @keewis , that really unlocked me! I'll look into creating a PR for the doc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mfdataset compat and memory explosion #9172

{{title}}

Replies: 1 comment 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

mfdataset compat and memory explosion #9172

lbesnard Jun 25, 2024

1) mfdataset "compat" option

2) compat="override" Bug ??

3) Memory explosion

My env

4) Thanks :)

Replies: 1 comment · 1 reply

keewis Jun 25, 2024 Maintainer

lbesnard Jun 26, 2024 Author

lbesnard
Jun 25, 2024

Replies: 1 comment 1 reply

keewis
Jun 25, 2024
Maintainer

lbesnard Jun 26, 2024
Author