Hi, I'm having various issues concatenating multiple CF NetCDF files together with xarray, and I'm seeking a bit of guidance! I'm happy to create some GitHub issues if necessary, but I might be the one doing something wrong.

**1) mfdataset `compat` option**

The doc is not clear to me on how this option interacts with the decode options: for example, is the decoding applied to each individual file before the files are compared and concatenated?

**2) `compat="override"` bug??**

If I'm running the following code:

```python
import s3fs
import xarray as xr
s3 = s3fs.S3FileSystem(anon=True)
s3path = 's3://imos-data/IMOS/SRS/SST/ghrsst/L3S-1d/dn/2024/*'
remote_files = s3.glob(s3path)
# Iterate through remote_files to create a fileset
fileset = [s3.open(file) for file in remote_files]
fileset = fileset[118:130]
ds = xr.open_mfdataset(
    fileset[1:3],
    engine='h5netcdf',
    concat_characters=True,
    mask_and_scale=True,
    decode_cf=True,
    decode_times=True,
    use_cftime=True,
    parallel=True,
    decode_coords=True,
    compat="override",
    lock=False,
)
```

I get a `ValueError` complaining about `coords`.
But I don't specify anything for `coords`. If I remove the `compat` option, the same code works fine for this set of files. I'm not sure if it's a bug, or me completely misunderstanding what I'm doing.

**3) Memory explosion**

If I'm running the same code, but on a slightly different set of files (two files, `fileset[3:5]`), all my local memory gets used up. (I'd like to note here as well that if I run the same code on a remote cluster with dask distributed and Coiled from my local machine, my local machine's memory still gets saturated, which doesn't make any sense to me; I'm not sure if that's an xarray/dask bug.) I'd like to understand why:

```python
import s3fs
import xarray as xr
s3 = s3fs.S3FileSystem(anon=True)
s3path = 's3://imos-data/IMOS/SRS/SST/ghrsst/L3S-1d/dn/2024/*'
remote_files = s3.glob(s3path)
# Iterate through remote_files to create a fileset
fileset = [s3.open(file) for file in remote_files]
fileset = fileset[118:130]
# This blows up the memory
ds = xr.open_mfdataset(
    fileset[3:5],
    engine='h5netcdf',
    concat_characters=True,
    mask_and_scale=True,
    decode_cf=True,
    decode_times=True,
    use_cftime=True,
    parallel=True,
    decode_coords=True,
    # compat="override",
    lock=False,
)
```

My environment:
4) Thanks :)

---

Thanks a lot for the clear questions, @lbesnard!
question 1: the decoding is applied to each individual dataset separately before concatenating, so unless you pass `decode_cf=False`, that's correct. In general, we'd be happy to merge a PR that makes the documentation clearer!
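
Roughly speaking, that means `open_mfdataset` behaves as if each file were opened and decoded on its own and the decoded datasets were combined afterwards. A simplified sketch (not xarray's exact code path; `fileset` is the list of open file handles from the question):

```python
import xarray as xr

# each file is decoded individually, with the decode arguments
# passed through from open_mfdataset ...
datasets = [
    xr.open_dataset(f, engine='h5netcdf', decode_cf=True, use_cftime=True)
    for f in fileset
]

# ... and only afterwards are the decoded datasets combined
# (open_mfdataset uses combine="by_coords" by default)
combined = xr.combine_by_coords(datasets)
```
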
compat="equals", coords="different", data_vars="different"
. This means that if you switch tocompat="override"
, you also have to changecoords
anddata_vars
(annoying, I know. See #8778 for a proposal to change this).
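
In other words, the three options have to be switched together. A minimal sketch of the failing and working combinations (again with `fileset` standing in for the file handles from the question):

```python
import xarray as xr

# raises ValueError: compat="override" conflicts with the
# defaults coords="different" / data_vars="different"
# ds = xr.open_mfdataset(fileset, compat="override")

# works: coords and data_vars are changed along with compat
ds = xr.open_mfdataset(
    fileset,
    engine='h5netcdf',
    compat="override",
    coords="minimal",
    data_vars="minimal",
)
```
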
compat="override", coords="minimal", data_vars="minimal"
? At least on my machine that completes without memo…
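
Note that the `Client` has to be created before `open_mfdataset` runs; otherwise the `parallel=True` file opening happens on dask's default threaded scheduler and won't appear on the dashboard. A sketch combining this with the options suggested above (reusing `fileset[3:5]` from the question):

```python
from distributed import Client
import xarray as xr

client = Client()             # start the local cluster first
print(client.dashboard_link)  # open this URL in a browser

# with the client in place, the parallel file opening and any later
# computations run on the cluster and show up on the dashboard
ds = xr.open_mfdataset(
    fileset[3:5],
    engine='h5netcdf',
    parallel=True,
    compat="override",
    coords="minimal",
    data_vars="minimal",
)
```
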