AODN Cloud Optimised Conversion

A tool that converts IMOS NetCDF and CSV files into Cloud Optimised formats such as Zarr and Parquet.

Documentation

Visit the documentation on ReadTheDocs for detailed information.

Key Features

  • Conversion of CSV/NetCDF to Cloud Optimised format (Zarr/Parquet)
  • Clustering capability:
    • Local dask cluster
    • Remote Coiled cluster
    • Driven by configuration; settings can easily be overridden
    • Zarr: gridded datasets are processed in batches and in parallel with xarray.open_mfdataset
    • Parquet: tabular files are processed in batches and in parallel as independent tasks, using futures
  • Reprocessing:
    • Zarr: reprocessing is achieved by writing to specific regions with slices. Non-contiguous regions are handled
    • Parquet: reprocessing is done via pyarrow's internal overwriting function, but can also be forced in case an input file has significantly changed
  • Chunking:
    • Parquet: to facilitate the query of geospatial data, polygon and timestamp slices are created as partitions
    • Zarr: done via dataset configuration
  • Metadata:
    • Parquet: Metadata is created as a sidecar _metadata.parquet file
  • Unit testing of modules: very close to integration testing; a local cluster is used to create cloud optimised files
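
The Zarr reprocessing feature above mentions writing to specific regions with slices, including non-contiguous ones. As a rough illustration of the idea only (this is not the package's own implementation, and the helper name `contiguous_slices` is hypothetical), the sketch below groups a set of integer positions along one dimension into contiguous slices. Each slice could then, in principle, be passed as a `region` to xarray's `Dataset.to_zarr`:

```python
from typing import Iterable, List


def contiguous_slices(indices: Iterable[int]) -> List[slice]:
    """Group integer positions into contiguous, half-open slices.

    Hypothetical helper for illustration; duplicates and ordering of the
    input are ignored.
    """
    ordered = sorted(set(indices))
    slices: List[slice] = []
    if not ordered:
        return slices

    start = prev = ordered[0]
    for idx in ordered[1:]:
        if idx != prev + 1:
            # A gap was found: close the current run and start a new one.
            slices.append(slice(start, prev + 1))
            start = idx
        prev = idx
    # Close the final run.
    slices.append(slice(start, prev + 1))
    return slices


# Assumed scenario: time steps 3-5 and 9-10 of an existing store must be rewritten.
regions = contiguous_slices([3, 4, 5, 9, 10])
print(regions)  # [slice(3, 6, None), slice(9, 11, None)]

# Each slice would then be written separately, for example (not executed here):
#   subset.to_zarr(store, region={"time": regions[0]})
```

Splitting the positions into runs keeps every write contiguous, which is what a region write expects, while still covering inputs that are scattered along the dimension.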

Quick Guide

Installation

Requirements:

  • Python >= 3.10.14
  • AWS SSO to push files to S3
  • An account on Coiled for remote clustering (Optional)

Automatic installation of the latest wheel release

curl -s https://raw.githubusercontent.com/aodn/aodn_cloud_optimised/main/install.sh | bash

Otherwise, go to the release page.

Development

See ReadTheDocs - Dev

Usage

See ReadTheDocs - Usage

Notebooks

Notebooks can be imported directly into Google Colab.