-
I wouldn't categorically expect Zarr performance to be better than HDF5 / NetCDF4 in all scenarios. As with all technology, there are tradeoffs with each. In particular, the details of your filesystem, chunking, and access patterns will affect the performance of each format in different ways. This paper gives a great overview.
-
This will write the array sequentially, i.e. with no parallelism. Zarr (the format) is designed so that chunks can be written in parallel, even though Zarr (the Python library) doesn't exploit this at all. I recommend writing the chunks in parallel (e.g., with dask) to get a more representative indication of performance.
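A minimal sketch of what such a dask-based parallel write could look like; the array shape, chunking, store path, and use of `da.random.randint` are my assumptions, not code from this thread:

```python
import dask.array as da

# 200 random uint8 frames of 600x500, one frame per chunk (illustrative choice)
frames = da.random.randint(0, 256, size=(200, 600, 500),
                           chunks=(1, 600, 500), dtype="uint8")

# to_zarr pushes all chunks through dask's scheduler, so chunks are written
# concurrently instead of one sequential Python call at a time
frames.to_zarr("frames.zarr", overwrite=True)
```

Whether this helps depends on the store and chunk size; with many tiny chunks, per-chunk overhead can still dominate.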
-
I agree that Zarr has, on average, similar performance to HDF5 and NetCDF4. But in my test it takes 45 seconds to write 200 frames of 600×500 uint8, while HDF5 takes 0.6 seconds. When I replace the sequential writing code with a dask version (converting the list of images into a Dask array with the right chunking), I get the same results.
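For reference, a sketch of the kind of conversion described here; the `images` list, chunk shape, and store path are assumptions for illustration:

```python
import dask.array as da
import numpy as np

# Stand-in for the list of 600x500 uint8 images (assumption)
images = [np.random.randint(0, 256, (600, 500), dtype="uint8") for _ in range(200)]

# Stack the frames into one (200, 600, 500) Dask array and rechunk it so each
# Zarr chunk holds a block of whole frames rather than a single frame
stack = da.stack([da.from_array(img, chunks=(600, 500)) for img in images])
stack = stack.rechunk((20, 600, 500))
stack.to_zarr("frames_from_list.zarr", overwrite=True)
```

At ~0.3 MB per frame, one-frame chunks are small, so enlarging the chunks is usually the first thing to try before blaming the scheduler.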
-
I have made a test to compare read/write performance between Zarr, NetCDF, and HDF5. The results show that Zarr performs very poorly against HDF5 and NetCDF. From what I've seen, Zarr is known for very good performance, and I'm wondering what part of the test code is slowing me down. Here is my code:
(code outline: a function to generate random image data; sample sizes from 5 to 200, generated on a logarithmic scale; a write section and a read section for each format; plotting of the results)
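Since only the outline survives here, the following is a guess at the write half of such a benchmark (file names, chunking, and the use of `zarr.open`/`h5py.File` are my assumptions, not the original code); the read half would mirror it with `mode="r"` and timed slicing:

```python
import time
import numpy as np
import zarr
import h5py

def generate_images(n_frames, height=600, width=500):
    """Generate n_frames of random uint8 image data."""
    return np.random.randint(0, 256, (n_frames, height, width), dtype="uint8")

# Sample sizes from 5 to 200 on a logarithmic scale
sample_sizes = np.unique(np.logspace(np.log10(5), np.log10(200), 10).astype(int))

for n in sample_sizes:
    data = generate_images(n)

    # Zarr write: chunk the whole stack at once so the assignment below is a
    # handful of chunk writes, not one store call per frame
    t0 = time.perf_counter()
    z = zarr.open("test.zarr", mode="w", shape=data.shape,
                  chunks=(n, 600, 500), dtype="uint8")
    z[:] = data
    zarr_write = time.perf_counter() - t0

    # HDF5 write via h5py
    t0 = time.perf_counter()
    with h5py.File("test.h5", "w") as f:
        f.create_dataset("images", data=data)
    hdf5_write = time.perf_counter() - t0

    print(f"{n:4d} frames  zarr: {zarr_write:.3f}s  hdf5: {hdf5_write:.3f}s")
```

A common source of a 45 s vs 0.6 s gap in loops like this is writing one frame per store call with tiny chunks; matching the chunk shape to the write pattern usually closes most of it.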