Investigate issues associated with .zarr format of new Parcels releases #384
The recipe at the moment puts the output in scratch:
and the Parcels docs recommend Zarr (https://docs.oceanparcels.org/en/latest/examples/tutorial_output.html#Reading-the-output-file), so maybe we can just add a note about this (i.e. why we are using scratch, plus a warning about the large number of files).
I think what Hannah was getting at is that we need to provide an example of how to postprocess the zarr files to reduce the file numbers before they are transferred to gdata. We recently had an example of a relatively small particle tracking project (only thousands of particles) that resulted in >4 million files stored on gdata. In that case, each particle had a separate file for each of lon, lat, depth, time etc. at EVERY time/position! I'm not sure if that's the default for Parcels now, because our notebook example uses an old Parcels version that saved in NetCDF. So, as Hannah says above, the first step is to check how many files the new default produces and what options there are for reducing that number if necessary.
Oh I see! That's definitely problematic.
Our example is up to date: we moved to Parcels 3 at the end of last year, when 'conda-analysis' moved over to Parcels 3, so the example is using zarr already.
Okay, well in that case maybe we just need to:
@anton-seaice is there anything else you think would be worth updating too?
Sounds good. It's also possible that changing the argument passed when creating the ParticleFile (https://docs.oceanparcels.org/en/latest/reference/particlefile.html#module-parcels.particlefile) would reduce the number of files produced, but that would need some experimentation.
This article makes a good point: storing tracks (i.e. lines) is more efficient in a vector format (e.g. GeoJSON, KML) than in a raster/array format (e.g. NetCDF). I don't know how much we want to mess with that, but whatever we do, compressing the output will most likely save a lot of space.
Aloha, I need to better understand how we can / could use this. I intend to tackle it here: Thomas-Moore-Creative/Climatology-generator-demo#12, where I plan to employ Zarr ZipStore. There are apparently some limitations and important details, but I can't speak to them fully until I try it myself. A few further throwaway comments for consideration:
@hrsdawson do you know if progress was made on this at the Hackathon? It doesn't seem like it would take too long to add this warning and link, then we can close this issue?
Newer versions of Parcels output trajectory data in zarr format rather than NetCDF. In some (all?) cases, this may lead to the creation of many, many files, clogging NCI projects on Gadi.
To do: