Replies: 4 comments
-
@kthyng can you provide the references?
-
I don't really have an answer for you, but I can share my investigation into a very similar problem from about 6 months ago. While I haven't used the model data you refer to, I have run OpenDrift simulations using NetCDF files on object storage. For us, there is a notable decrease in performance when the simulation spans a time range that crosses the bounds of individual files. We created zarr + parquet references using kerchunk.

It's been a while since I've read the reader scripts for OpenDrift, but as I recall, how the data is opened there matters. I'm not sure what the main culprit is, but this all seems to factor into it. I guess one experiment that could be done to help narrow things down is taking a single NetCDF file and comparing how it behaves when opened different ways (see the sketch below).
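For what it's worth, here is a minimal sketch of one such comparison, assuming a local copy of a single file and a kerchunk reference built for it; the file name and the variable/dimension names (`temp`, `ocean_time`, `eta_rho`, `xi_rho`) are placeholders, not from the original post:

```python
import time
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

nc_file = "model_output_0001.nc"  # hypothetical single NetCDF file

# Build an in-memory kerchunk reference set for this one file
with fsspec.open(nc_file) as f:
    refs = SingleHdf5ToZarr(f, nc_file).translate()

# Open the same data two ways: directly, and through the reference filesystem
ds_direct = xr.open_dataset(nc_file, chunks={})
ds_ref = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": refs, "remote_protocol": "file"},
    },
    chunks={},
)

# Time a small rectangular read, the kind of request OpenDrift makes
for label, ds in [("direct", ds_direct), ("kerchunk", ds_ref)]:
    t0 = time.perf_counter()
    ds["temp"].isel(ocean_time=0, eta_rho=slice(0, 50), xi_rho=slice(0, 50)).load()
    print(label, time.perf_counter() - t0)
```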
-
Yes, this sounds like an issue with chunking. In essence, at each time step of the simulation, OpenDrift will read the smallest possible rectangle around the active elements at the given time (but it stores/caches this internally if the calculation time step is smaller than the forcing model time step). Thus requests during simulations are of the type sketched below. I do not have a deep insight into how Xarray/python-netCDF4 actually works, so any suggestions on how this can be optimized are welcome.
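Not OpenDrift's actual reader code, just a schematic of the request pattern described above, assuming ROMS-style dimension names (`ocean_time`, `eta_rho`, `xi_rho`):

```python
import xarray as xr

def read_block(ds: xr.Dataset, varname: str, t_index: int,
               x_inds, y_inds, buffer: int = 8) -> xr.DataArray:
    """Read the smallest rectangle (plus a buffer) covering the active elements
    at one forcing time step, similar to what OpenDrift requests internally."""
    xmin, xmax = int(min(x_inds)) - buffer, int(max(x_inds)) + buffer + 1
    ymin, ymax = int(min(y_inds)) - buffer, int(max(y_inds)) + buffer + 1
    block = ds[varname].isel(
        ocean_time=t_index,                  # a single forcing time step
        eta_rho=slice(max(ymin, 0), ymax),   # small window around the particles
        xi_rho=slice(max(xmin, 0), xmax),
    )
    # With time chunks of 1, each such request should touch roughly one chunk per
    # variable; larger chunks mean far more data is read than is actually needed.
    return block.load()
```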
-
I am opening the Dataset outside of OpenDrift and passing it into the ROMS reader as the Dataset object, which gives me the ability to choose all those xarray options myself. So, `ds = xr.open_dataset([kerchunk file.parq], engine="kerchunk", chunks={})`, and then I pass `ds` into the ROMS reader to run the OpenDrift simulation. The time chunks are 1. There is a new netCDF file per 24 hours of model output, which is 24 outputs (one per hour), so there is a slowdown each time a new file is accessed, but that is not the major slowdown I'm describing here. Rich, here is the parquet file, but the individual netCDF files aren't accessible publicly: ciofs_kerchunk.parq.zip
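For reference, roughly what that setup looks like end to end. The path and dates are placeholders, and this assumes an OpenDrift version where the ROMS reader accepts an already-open Dataset, as described above:

```python
import xarray as xr
from opendrift.models.oceandrift import OceanDrift
from opendrift.readers import reader_ROMS_native

# Open the parquet kerchunk references with the kerchunk xarray engine;
# chunks={} keeps the data lazy (dask), with chunks matching the stored chunking.
ds = xr.open_dataset("ciofs_kerchunk.parq", engine="kerchunk", chunks={})

# Subset to just the simulation window before handing it to OpenDrift
ds = ds.sel(ocean_time=slice("2022-01-01", "2022-01-03"))  # placeholder dates

# Pass the already-open Dataset to the ROMS reader instead of file paths
reader = reader_ROMS_native.Reader(ds)

o = OceanDrift()
o.add_reader(reader)
```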
-
I'm throwing this out here because I got some encouragement in other forums to do so; let's see what people have to say about this topic. There are a lot of details.
I am running OpenDrift with ROMS model output from 3 models:
I have a kerchunk reference file representing the netCDF files for each of the three models, which I access locally, and that file is parquet for all three. Also, in my code I subset the resulting xarray Dataset to the time range I need for the particle tracking simulation I am running, since that helped save some time (maybe this is a related issue?).
The thing is, the two CIOFS models cover the same domain; one is a hindcast run over 24 years in the past and one is run more recently in a forecast sense with less output available. So I think it is odd that simulations using their output take such different amounts of time, given that the dimensions of the models are the same (the horizontal and vertical grids are exactly the same, etc.).
My understanding is that if the kerchunk file is json, then we would expect a bigger file representing a longer model time series to take longer to access. However, I am not sure we expect that to be the case for a parquet file.
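For context, a minimal sketch of how such parquet references can be written with kerchunk. This is an assumed workflow, not necessarily the exact script used here, and the file and coordinate names are placeholders:

```python
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.df import refs_to_dataframe

files = ["ciofs_0001.nc", "ciofs_0002.nc"]  # placeholder daily output files

# Per-file reference sets
single_refs = []
for f in files:
    with fsspec.open(f) as nc:
        single_refs.append(SingleHdf5ToZarr(nc, f).translate())

# Combine along the time dimension into one reference set
combined = MultiZarrToZarr(
    single_refs,
    concat_dims=["ocean_time"],
    identical_dims=["lon_rho", "lat_rho"],
).translate()

# Write as parquet (a directory of reference tables) instead of one big json,
# which is intended to keep reference loading lighter for long time series
refs_to_dataframe(combined, "ciofs_kerchunk.parq")
```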
I am making smaller kerchunk files for my particle tracking simulations to get around this issue, but I thought I'd bring it up to see what people think. Depending on the discussion, I can bring it up in other forums too. Thanks.
cc @rsignell