Replies: 4 comments
-
@kthyng can you provide the references?
-
I don't really have an answer for you, but I can share my investigation into a very similar problem from about 6 months ago. While I haven't used the model data you refer to, I have run OpenDrift simulations using NetCDF files on object storage. For us, there is a notable decrease in performance when the simulation spans a time range that crosses the bounds of individual files. We created zarr + parquet references using kerchunk.

It's been a while since I've read the reader scripts for OpenDrift, but as I recall, how the data is opened there matters. I'm not sure what the main culprit is, but this all seems to factor into it. I guess one experiment that could be done to help narrow things down is taking a single NetCDF file and comparing how it behaves when opened different ways (see the sketch below).
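For what it's worth, here is a minimal sketch of one such comparison, assuming a local copy of a single file and a kerchunk reference built for it; the file name and the variable/dimension names (`temp`, `ocean_time`, `eta_rho`, `xi_rho`) are placeholders, not from the original post:

```python
import time
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

nc_file = "model_output_0001.nc"  # hypothetical single NetCDF file

# Build an in-memory kerchunk reference set for this one file
with fsspec.open(nc_file) as f:
    refs = SingleHdf5ToZarr(f, nc_file).translate()

# Open the same data two ways: directly, and through the reference filesystem
ds_direct = xr.open_dataset(nc_file, chunks={})
ds_ref = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": refs, "remote_protocol": "file"},
    },
    chunks={},
)

# Time a small rectangular read, the kind of request OpenDrift makes
for label, ds in [("direct", ds_direct), ("kerchunk", ds_ref)]:
    t0 = time.perf_counter()
    ds["temp"].isel(ocean_time=0, eta_rho=slice(0, 50), xi_rho=slice(0, 50)).load()
    print(label, time.perf_counter() - t0)
```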
-
Yes, this sounds like an issue with chunking. In essence, at each time step of the simulation, OpenDrift will read the smallest possible rectangle around the active elements at the given time (but it stores/caches this internally if the calculation time step is smaller than the forcing model time step). Thus requests during simulations are of the type sketched below. I do not have a deep insight into how Xarray/python-netCDF4 actually works, so any suggestions on how this can be optimized are welcome.
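Not OpenDrift's actual reader code, just a schematic of the request pattern described above, assuming ROMS-style dimension names (`ocean_time`, `eta_rho`, `xi_rho`):

```python
import xarray as xr

def read_block(ds: xr.Dataset, varname: str, t_index: int,
               x_inds, y_inds, buffer: int = 8) -> xr.DataArray:
    """Read the smallest rectangle (plus a buffer) covering the active elements
    at one forcing time step, similar to what OpenDrift requests internally."""
    xmin, xmax = int(min(x_inds)) - buffer, int(max(x_inds)) + buffer + 1
    ymin, ymax = int(min(y_inds)) - buffer, int(max(y_inds)) + buffer + 1
    block = ds[varname].isel(
        ocean_time=t_index,                  # a single forcing time step
        eta_rho=slice(max(ymin, 0), ymax),   # small window around the particles
        xi_rho=slice(max(xmin, 0), xmax),
    )
    # With time chunks of 1, each such request should touch roughly one chunk per
    # variable; larger chunks mean far more data is read than is actually needed.
    return block.load()
```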
-
I am opening the Dataset outside of OpenDrift and passing it into the ROMS reader as the Dataset object, which gives me the ability to choose all those xarray options myself. So, `ds = xr.open_dataset([kerchunk file.parq], engine="kerchunk", chunks={})`, and then I pass `ds` into the ROMS reader to run the OpenDrift simulation. The time chunks are 1. There is a new netCDF file per 24 hours of model output, which is 24 outputs (one per hour), so there is a slowdown each time a new file is accessed, but that is not the major slowdown I'm describing here. Rich, here is the parquet file, but the individual netCDF files aren't accessible publicly: ciofs_kerchunk.parq.zip
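For reference, roughly what that setup looks like end to end. The path and dates are placeholders, and this assumes an OpenDrift version where the ROMS reader accepts an already-open Dataset, as described above:

```python
import xarray as xr
from opendrift.models.oceandrift import OceanDrift
from opendrift.readers import reader_ROMS_native

# Open the parquet kerchunk references with the kerchunk xarray engine;
# chunks={} keeps the data lazy (dask), with chunks matching the stored chunking.
ds = xr.open_dataset("ciofs_kerchunk.parq", engine="kerchunk", chunks={})

# Subset to just the simulation window before handing it to OpenDrift
ds = ds.sel(ocean_time=slice("2022-01-01", "2022-01-03"))  # placeholder dates

# Pass the already-open Dataset to the ROMS reader instead of file paths
reader = reader_ROMS_native.Reader(ds)

o = OceanDrift()
o.add_reader(reader)
```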
-
I'm throwing this out here because I got some encouragement in other forums to do so; let's see what people have to say about this topic. There are a lot of details.
I am running OpenDrift with ROMS model output from 3 models:
I have a kerchunk reference file representing the netCDF files for each of the three models, which I access locally, and that file is parquet for all three. Also, in my code I subset the resulting xarray Dataset to the time range I need for the particle tracking simulation I am running, since that helped save some time (maybe this is a related issue?).
The thing is, the two CIOFS models cover the same domain; one is a hindcast run over 24 years in the past and one is run more recently in a forecast sense with less output available. So I think it is odd that simulations using their output take such different amounts of time, given that the dimensions of the models are the same (the horizontal and vertical grids are exactly the same, etc.).
My understanding is that if the kerchunk file is json, then we would expect a bigger file representing a longer model time series to take longer to access. However, I am not sure we expect that to be the case for a parquet file.
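For context, a minimal sketch of how such parquet references can be written with kerchunk. This is an assumed workflow, not necessarily the exact script used here, and the file and coordinate names are placeholders:

```python
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.df import refs_to_dataframe

files = ["ciofs_0001.nc", "ciofs_0002.nc"]  # placeholder daily output files

# Per-file reference sets
single_refs = []
for f in files:
    with fsspec.open(f) as nc:
        single_refs.append(SingleHdf5ToZarr(nc, f).translate())

# Combine along the time dimension into one reference set
combined = MultiZarrToZarr(
    single_refs,
    concat_dims=["ocean_time"],
    identical_dims=["lon_rho", "lat_rho"],
).translate()

# Write as parquet (a directory of reference tables) instead of one big json,
# which is intended to keep reference loading lighter for long time series
refs_to_dataframe(combined, "ciofs_kerchunk.parq")
```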
I am making smaller kerchunk files for my particle tracking simulations to get around this issue, but I thought I'd bring it up to see what people think. Depending on the discussion, I can bring it up in other forums too. Thanks.
cc @rsignell