Logically, an HDF5 file stored in a file system is just a sequence of bytes formatted according to the HDF5 file format specification. Several factors affect the performance with which such a sequence can be manipulated, including (but not limited to) the latency and throughput of the underlying storage hardware. Memory-backed HDF5 files eliminate this constraint by maintaining the underlying byte sequence in contiguous RAM buffers.
The life-cycle of memory-backed HDF5 files is very simple:
- Instruct the HDF5 library to create an HDF5 file in memory rather than in a file system. (It is also possible to load existing HDF5 files into memory buffers; see the sketch after this list.)
- For the lifetime of the application, perform HDF5 I/O operations as usual.
- Before application shutdown, decide whether to abandon the RAM buffer or to persist certain parts of it in an HDF5 file in a file system.
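Here is a minimal sketch of that life-cycle for the variant that loads an existing file into memory. The file name old.h5 is hypothetical, and the details of H5Pset_fapl_core are covered in section sec:mem-file-creation; note that the decision whether changes are written back on close (the third argument) is made up front, when the file access property list is created.

#include "hdf5.h"
int main(void)
{
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_core(fapl, 1024*1024, 0);  /* 1 MiB increments; 0 = don't write back */
  hid_t file = H5Fopen("old.h5", H5F_ACC_RDWR, fapl);  /* reads the file into RAM */
  /* ... perform HDF5 I/O operations as usual ... */
  H5Fclose(file);  /* with backing store disabled, the RAM buffer is abandoned */
  H5Pclose(fapl);
  return 0;
}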
The obvious use case for in-memory HDF5 files is their use as I/O buffers, for example, in applications with large numbers of small, random-access I/O operations. A second, often overlooked, use case is the ability to deal with a high degree of uncertainty, especially in application back-ends. The idea is that, by maintaining a memory-backed shadow HDF5 file for highly uncertain candidate objects, the decision about what to put into a persisted (in storage) HDF5 file can often be delayed, resulting in faster access and smaller files.
The vehicle for implementing this behavior in the HDF5 library is the virtual file layer (VFL), specifically, the so-called core virtual file driver (VFD). This is, however, not the place to learn about implementation details. See ??? for that.
This document is organized as follows: First, we present three examples (sections sec:basic-use, sec:copying-objects, sec:delaying-decisions) with their code outlines and their actual behavior. In section sec:building-blocks, we show the implementation of their (common) building blocks.
The HDF5 library supports an in-memory, a.k.a. in-core, representation of HDF5 files, which should not be confused with memory-mapped files. Unlike files in a file system, which are represented by inodes and ultimately blocks on a block device, memory-backed HDF5 files are just contiguous memory regions (“buffers”) formatted according to the HDF5 file format specification. Since they are HDF5 files, the API for such “files” is identical to the one for “regular” (= on-disk) HDF5 files.
The purpose of the first example is to show that working with memory-backed HDF5 files is as straightforward as working with HDF5 files in a file system.
We would like to write an array of integers to a 2D dataset in a memory-backed HDF5 file.
Before we look at memory-backed HDF5 files, let’s recap the steps for ordinary HDF5 files!
Given our important array data, we:
- Create an HDF5 file, disk.h5
- Create a suitably sized and typed dataset /2x3, and write data
- Close the dataset
- Close the file, after printing the HDF5 library version and the on-disk file size
**Note:** The paragraph following the outline shows the actual program output.
#include "literate-hdf5.h"
int main(int argc, char** argv)
{
  int data[] = {0, 1, 2, 3, 4, 5};
  hid_t file = (*<<make-disk-file>>) ("disk.h5");                   // (ref:vfd0-blk0)
  hid_t dset = (*<<make-2D-dataset>>) (file, "2x3", H5T_STD_I32LE,  // (ref:vfd0-blk1)
                                       (hsize_t[]){2,3}, data);
  H5Dclose(dset);
  (*<<print-lib-version>>) ();
  (*<<print-file-size>>) (file);
  H5Fclose(file);
  return 0;
}
The <<make-disk-file>> block (line (vfd0-blk0)) is merely a call to H5Fcreate (see section sec:disk-file-creation), and the <<make-2D-dataset>> block (line (vfd0-blk1)) is a call to H5Dcreate with all the trimmings (see section sec:dataset-creation).
The outline for memory-backed HDF5 files is almost identical to the one for on-disk files. The <<make-mem-file>> block on line (mem-file-creation) has two additional arguments (see section sec:mem-file-creation). The first is the increment (in bytes) by which the backing memory buffer will grow, should that be necessary; in this example, it's 1 MiB. The second additional argument, a flag, controls whether the memory-backed file is persisted in storage after closing. Any argument passed to the executable will be interpreted as TRUE and the file persisted. By default (no arguments), there won't be a core.h5 file after running the program.
#include "literate-hdf5.h"
int main(int argc, char** argv)
{
  int data[] = {0, 1, 2, 3, 4, 5};
  hid_t file = (*<<make-mem-file>>) ("core.h5", 1024*1024, (argc > 1)); // (ref:mem-file-creation)
  hid_t dset = (*<<make-2D-dataset>>) (file, "2x3", H5T_STD_I32LE,
                                       (hsize_t[]){2,3}, data);
  H5Dclose(dset);
  (*<<print-lib-version>>) ();
  (*<<print-file-size>>) (file);
  H5Fclose(file);
  return 0;
}
The only difference between the on-disk and the memory-backed version is line (mem-file-creation), which shows that:
- We are dealing with HDF5 files after all.
- The switch to memory-backed HDF5 files requires only minor changes to existing applications.
See section sec:mem-file-creation for the implementation of <<make-mem-file>>.
When running the executable core-vfd1 for the memory-backed HDF5 file, we are informed that, for HDF5 library version 1.13.0, the (in-memory) file has a size of 1,048,576 bytes (1 MiB). The dataset itself, however, accounts for only about 24 bytes (six four-byte integers) plus metadata. Since we told the core VFD to grow the file in 1 MiB increments, that's the minimum allocation.
Running the program with any argument will persist the memory-backed HDF5 file as core.h5. Surprisingly, that file is only 2,072 bytes (for HDF5 1.13.0). The reason is that the HDF5 library truncates and eliminates any unused space in the memory-backed HDF5 file before closing it.
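To make the truncation tangible, here is a self-contained sketch that repeats the experiment and compares the in-memory size with the persisted size. It assumes the POSIX stat(2) interface; the sizes in the comments are the ones reported above for HDF5 1.13.0 and may vary across library versions.

#include "literate-hdf5.h"
#include <sys/stat.h>
int main(void)
{
  int data[] = {0, 1, 2, 3, 4, 5};
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_core(fapl, 1024*1024, 1);  /* 1 = persist the buffer on close */
  hid_t file = H5Fcreate("core.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  hid_t dspace = H5Screate_simple(2, (hsize_t[]){2,3}, NULL);
  hid_t dset = H5Dcreate(file, "2x3", H5T_STD_I32LE, dspace,
                         H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
  hsize_t mem_size;
  H5Fget_filesize(file, &mem_size);  /* ~1 MiB: one full increment */
  printf("in-memory size: %llu bytes\n", (unsigned long long) mem_size);
  H5Dclose(dset);
  H5Sclose(dspace);
  H5Fclose(file);  /* unused space is truncated here */
  H5Pclose(fapl);
  struct stat sb;
  if (stat("core.h5", &sb) == 0)  /* ~2 KiB after truncation */
    printf("on-disk size:   %lld bytes\n", (long long) sb.st_size);
  return 0;
}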
**Bottom line:** Memory-backed HDF5 files are as easy to use as HDF5 files in file systems.
We can copy HDF5 objects such as groups and datasets inside the same HDF5 file or across HDF5 files. A common scenario is to use a memory-backed HDF5 file as a scratch space (or RAM disk) and, before closing it, to store only a few selected objects of interest in an on-disk HDF5 file.
We would like to copy a dataset from a memory-backed HDF5 file to an HDF5 file stored in a file system.
In this example, we are working with two HDF5 files, one memory-backed and the other in a file system. We re-use the file creation building blocks (lines (copy-file1), (copy-file2)) and the dataset creation building block (line (copy-mdset)) to create a dataset dset_m in the memory-backed HDF5 file file_m. Fortunately, the HDF5 library provides a function, H5Ocopy, for copying HDF5 objects between HDF5 files. All we have to do is call it on line (copy-call).
#include "literate-hdf5.h"
int main(int argc, char** argv)
{
  int data[] = {0, 1, 2, 3, 4, 5};
  hid_t file_d = (*<<make-disk-file>>) ("disk.h5");                    // (ref:copy-file1)
  hid_t file_m = (*<<make-mem-file>>) ("core.h5", 4096, (argc > 1));   // (ref:copy-file2)
  hid_t dset_m = (*<<make-2D-dataset>>) (file_m, "2x3", H5T_STD_I32LE, // (ref:copy-mdset)
                                         (hsize_t[]){2,3}, data);
  H5Dclose(dset_m);
  (*<<print-lib-version>>) ();
  (*<<print-file-size>>) (file_m);
  H5Ocopy(file_m, "2x3", file_d, "2x3copy", H5P_DEFAULT, H5P_DEFAULT); // (ref:copy-call)
  H5Fclose(file_m);
  (*<<print-file-size>>) (file_d);
  H5Fclose(file_d);
  return 0;
}
When running the program core-vfd2, we are informed that, for HDF5 library version 1.13.0, both files have a size of 4 KiB. That is a coincidence of two independent factors: firstly, on line (copy-file2), we instructed the HDF5 library to grow the memory-backed HDF5 file in 4 KiB increments, and one increment is plenty to accommodate our small dataset. Secondly, the 4 KiB size of the disk.h5 file is due to paged allocation, with 4 KiB being the default page size. (Really?)
**Bottom line:** Transferring objects or parts of a hierarchy from a memory-backed HDF5 file to another HDF5 file, be it in a file system or another memory-backed file, is easy thanks to H5Ocopy!
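Incidentally, H5Ocopy's fifth parameter, an object-copy property list, offers some control over what gets copied. A hedged sketch (the source name /group and the destination name are made up for illustration):

hid_t ocpypl = H5Pcreate(H5P_OBJECT_COPY);
H5Pset_copy_object(ocpypl, H5O_COPY_WITHOUT_ATTR_FLAG); /* copy the hierarchy, skip attributes */
H5Ocopy(file_m, "/group", file_d, "/group-copy", ocpypl, H5P_DEFAULT);
H5Pclose(ocpypl);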
The developers and maintainers of certain application types, for example, data persistence back-ends of interactive applications, face specific challenges that stem from the uncertainty over the particular course of action(s) their users take as part of a transaction or over the duration of a session. Ideally, any decisions that amount to commitments not easily undone later can be postponed until a better-informed decision can be made.
As stated earlier, when creating new objects, the HDF5 library needs certain information (e.g., creation properties) which stays with an object throughout its lifetime and which is immutable. The copy approach from the previous example won’t work, because it preserves HDF5 objects’ creation properties. Still, a memory-backed HDF5 “shadow” file can be used effectively alongside other HDF5 files as a holding area for objects whose final whereabouts are uncertain at object creation time.
We would like to maintain a potentially very large 2D dataset in a memory-backed HDF5 file and eventually persist it to an HDF5 file in a file system.
There are a few new snippets in this example. The <<make-big-2D-dataset>> block on line (big-dset) appears identical to <<make-2D-dataset>>, but the implementation in section sec:big-dataset-creation shows that we are dealing with a dataset of potentially arbitrary extent, using the chunked storage layout. Between lines (uncert1) and (uncert2), we mimic the uncertainty around its extent during an application's lifetime by growing and shrinking it using H5Dset_extent.
On line (size-check), we check its size once more (see section sec:dataset-size). If the size doesn't exceed 60,000 bytes, we optimize its persisted representation by using the so-called compact storage layout (line (compact) and section sec:compact-replica). In this case, we need to transfer the data manually (line (data-xfer) and section sec:dataset-xfer). Otherwise, we fall back on H5Ocopy (line (big-copy)).
#include "literate-hdf5.h"
int main(int argc, char** argv)
{
  int data[] = {0, 1, 2, 3, 4, 5};
  hid_t file_d = (*<<make-disk-file>>) ("disk.h5");
  hid_t file_m = (*<<make-mem-file>>) ("core.h5", 1024*1024, (argc > 1));
  hid_t dset_m = (*<<make-big-2D-dataset>>) (file_m, "2x3", // (ref:big-dset)
                                             H5T_NATIVE_INT32,
                                             (hsize_t[]){2,3}, data);
  (*<<print-lib-version>>) ();
  (*<<print-file-size>>) (file_m);
  { /* UNCERTAINTY */
    H5Dset_extent(dset_m, (hsize_t[]){200,300}); // (ref:uncert1)
    H5Dset_extent(dset_m, (hsize_t[]){200000,300000});
    H5Dset_extent(dset_m, (hsize_t[]){2,3}); // (ref:uncert2)
  }
  if ((*<<dataset-size>>) (dset_m) < 60000) // (ref:size-check)
  {
    hid_t dset_d = (*<<create-compact>>) (dset_m, file_d, "2x3copy"); // (ref:compact)
    (*<<xfer-data>>) (dset_m, dset_d); // (ref:data-xfer)
    H5Dclose(dset_d);
  }
  else
  {
    H5Ocopy(file_m, "2x3", file_d, "2x3copy", H5P_DEFAULT, H5P_DEFAULT); // (ref:big-copy)
  }
  H5Dclose(dset_m);
  H5Fclose(file_m);
  (*<<print-file-size>>) (file_d);
  H5Fclose(file_d);
  return 0;
}
When running the program core-vfd3, we are informed that, for HDF5 library version 1.13.0, the memory-backed HDF5 file has a size of over 4 MiB while the persisted file is just 2 KiB.
As can be seen in section sec:big-dataset-creation, the chunk size chosen for the /2x3 dataset is 4 MiB. Although we are writing only six 32-bit integers (24 bytes), a full 4 MiB chunk needs to be allocated, which explains the overall size of the memory-backed HDF5 file.
The compact storage layout is particularly storage- and access-efficient: the dataset elements are stored as part of the dataset’s object header (metadata). This header is read whenever the dataset is opened, and the dataset elements “travel along for free”, which means that there is no separate storage access necessary for subsequent read or write operations.
**Bottom line:** The use of memory-backed HDF5 files can lead to substantial storage and access performance improvements, if applications “keep their cool” and do not prematurely commit storage resources to HDF5 objects.
H5Fcreate has four parameters, of which the first two, the file name and access flags, are usually in the limelight. Creating an on-disk HDF5 file is as easy as this:
lambda(hid_t, (const char* name),
{
  return H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
})
The third and the fourth parameter, a file creation and a file access property list (handle), unlock a few extra treats, as we will see in a moment.
We use the fourth parameter of H5Fcreate, a file access property list, to do the in-memory magic.
lambda(hid_t, (const char* name, size_t increment, hbool_t flg),
{
  hid_t retval;
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_core(fapl, increment, flg); // (ref:fapl-core)
  retval = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  H5Pclose(fapl);
  return retval;
})
That’s right, a suitably initialized property list (line (fapl-core)) makes all the difference. This is in fact the ONLY difference between an application using regular vs. memory-backed HDF5 files.
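Should we ever need to confirm which driver a file ended up with, the file's access property list can be interrogated after the fact. A short sketch, using standard HDF5 calls, for an open file handle file:

hid_t fapl = H5Fget_access_plist(file);
if (H5Pget_driver(fapl) == H5FD_CORE)  /* is the core VFD in use? */
{
  size_t increment;
  hbool_t backing_store;
  H5Pget_fapl_core(fapl, &increment, &backing_store);
  printf("core VFD: increment=%zu bytes, backing store=%d\n",
         increment, (int) backing_store);
}
H5Pclose(fapl);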
To create a dataset, we must specify a name, its element type dtype, its shape dims, and, optionally, an initial value buffer. Without additional customization, the default dataset storage layout is H5D_CONTIGUOUS, i.e., the (fixed-size) dataset elements are laid out in a contiguous (memory or storage) region.
lambda(hid_t,
       (hid_t file, const char* name, hid_t dtype, const hsize_t* dims, void* buffer),
{
  hid_t retval;
  hid_t dspace = H5Screate_simple(2, dims, NULL);
  retval = H5Dcreate(file, name, dtype, dspace, // (ref:dset-dtype1)
                     H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  if (buffer)
    H5Dwrite(retval, dtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buffer); // (ref:dset-dtype2)
  H5Sclose(dspace);
  return retval;
})
**WARNING:** This snippet contains an important assumption that may not be obvious to many readers: the datatype handle dtype is used in two places with different interpretations. In the first instance, line (dset-dtype1), it refers to the in-file element type of the dataset to be created. In the second instance, line (dset-dtype2), it refers to the datatype of the elements in buffer. The assumption is that the two are the same. While this assumption is valid in many practical examples, it can lead to subtle errors if its violation goes undetected. In production code, this should either be documented and enforced, or an additional datatype argument should be passed to distinguish the two.
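For completeness, here is a sketch of the second option, under the assumption that we add a separate in-memory datatype parameter (called mtype here, a name invented for this sketch); the HDF5 library converts between the two types during H5Dwrite:

lambda(hid_t,
       (hid_t file, const char* name, hid_t ftype, hid_t mtype,
        const hsize_t* dims, void* buffer),
{
  hid_t retval;
  hid_t dspace = H5Screate_simple(2, dims, NULL);
  retval = H5Dcreate(file, name, ftype, dspace,  /* ftype: in-file element type */
                     H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  if (buffer)  /* mtype: type of the elements in buffer */
    H5Dwrite(retval, mtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buffer);
  H5Sclose(dspace);
  return retval;
})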
lambda(void, (void),
{
  unsigned majnum;
  unsigned minnum;
  unsigned relnum;
  H5get_libversion(&majnum, &minnum, &relnum);
  printf("HDF5 library version %u.%u.%u\n", majnum, minnum, relnum);
})
lambda(void, (hid_t file),
{
  hsize_t size;
  H5Fget_filesize(file, &size);
  printf("File size: %llu bytes\n", (unsigned long long) size);
})
This lambda returns a handle to the potentially large dataset in the memory-backed HDF5 file. Since the dataset's final size will only be known eventually (e.g., at the end of an epoch or transaction), we can't impose a finite maximum extent. On line (big-sky), we set the maximum extent as unlimited in all (2) dimensions. Currently, the only HDF5 storage layout that supports such an arrangement is the chunked storage layout. By passing a non-default dataset creation property list dcpl to H5Dcreate (line (big-dcpl)), we instruct the HDF5 library to use the chunked storage layout instead of the default contiguous layout. For the chunked layout, we must specify the size of an individual chunk in terms of dataset elements per chunk; see line (big-chunk). The size of a chunk in bytes depends on the element datatype. In our example (32-bit integers), a 1024x1024-element chunk occupies 4 MiB of memory or storage.
lambda(hid_t,
       (hid_t file, const char* name, hid_t dtype, const hsize_t* dims, void* buffer),
{
  hid_t retval;
  hid_t dspace = H5Screate_simple(2, dims,
                                  (hsize_t[]){H5S_UNLIMITED, H5S_UNLIMITED}); // (ref:big-sky)
  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl, 2, (hsize_t[]){1024, 1024}); // (ref:big-chunk)
  retval = H5Dcreate(file, name, dtype, dspace, // (ref:big-dtype1)
                     H5P_DEFAULT, dcpl, H5P_DEFAULT); // (ref:big-dcpl)
  if (buffer)
    H5Dwrite(retval, dtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buffer); // (ref:big-dtype2)
  H5Pclose(dcpl);
  H5Sclose(dspace);
  return retval;
})
The same warning and assumptions expressed at the end of section sec:dataset-creation apply to dtype.
This lambda returns the size (in bytes) of the source dataset in the memory-backed HDF5 file. It's a matter of determining the storage size of an individual dataset element and counting how many there are (lines (size-calc1), (size-calc2)).
lambda(hid_t, (hid_t dset),
{
  size_t retval;
  hid_t ftype = H5Dget_type(dset);
  hid_t dspace = H5Dget_space(dset);
  retval = H5Tget_size(ftype) * // (ref:size-calc1)
           (size_t) H5Sget_simple_extent_npoints(dspace); // (ref:size-calc2)
  H5Sclose(dspace);
  H5Tclose(ftype);
  return retval;
})
This lambda returns a handle to the freshly minted compact replica of the source dataset. (It's a placeholder, because the actual values are transferred separately.)
What sets this dataset creation apart from the default case occurs on lines (compact-layout)-(compact-dcpl). By passing a non-default dataset creation property list dcpl to H5Dcreate, we instruct the HDF5 library to use the compact storage layout instead of the default contiguous (H5D_CONTIGUOUS) layout.
lambda(hid_t, (hid_t src_dset, hid_t file, const char* name),
{
  hid_t retval;
  hid_t ftype = H5Dget_type(src_dset);
  hid_t src_dspace = H5Dget_space(src_dset); // (ref:compact-src)
  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  hid_t dspace = H5Scopy(src_dspace); // (ref:compact1)
  hsize_t dims[H5S_MAX_RANK]; // (ref:compact-max-rank)
  H5Sget_simple_extent_dims(dspace, dims, NULL);
  H5Sset_extent_simple(dspace, H5Sget_simple_extent_ndims(dspace),
                       dims, NULL); // (ref:compact2)
  H5Pset_layout(dcpl, H5D_COMPACT); // (ref:compact-layout)
  retval = H5Dcreate(file, name, ftype, dspace,
                     H5P_DEFAULT, dcpl, H5P_DEFAULT); // (ref:compact-dcpl)
  H5Pclose(dcpl);
  H5Sclose(dspace);
  H5Tclose(ftype);
  return retval;
})
Two other things are worth mentioning about this snippet.
- The dataspace construction on lines (compact1)-(compact2) appears a little clumsy. Since the extent of the source dataset src_dset is not changing, why not just work with src_dspace (line (compact-src))? The reason is that dataspaces with H5S_UNLIMITED extent bounds, for obvious reasons, are not supported with the compact layout. In that case, in our example (!), passing src_dspace as an argument to H5Dcreate would generate an error. It's easier to just create a copy of the dataspace and “kill” (NULL) whatever maximum extent there might be.
- On line (compact-max-rank), we use the HDF5 library macro H5S_MAX_RANK to avoid the dynamic allocation of the dims array.
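If in doubt, the layout of the freshly created replica can be verified with standard HDF5 property-list calls; a brief sketch, with dset_d denoting the handle returned by this lambda:

hid_t dcpl_chk = H5Dget_create_plist(dset_d);
if (H5Pget_layout(dcpl_chk) == H5D_COMPACT)  /* did we get compact storage? */
  printf("replica uses the compact storage layout\n");
H5Pclose(dcpl_chk);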
The HDF5 library does not currently have a function to “automagically” transfer data between two datasets, especially datasets with different storage layouts. There is not much else we can do but to read (line (xfer-read)) the data from the source dataset and to write (line (xfer-write)) to the destination dataset.
Since we transfer the data through memory, we need to determine first the size of the transfer buffer needed (line (xfer-size)).
lambda(void, (hid_t src, hid_t dst),
{
  hid_t ftype = H5Dget_type(src);
  hid_t dspace = H5Dget_space(src);
  size_t size = H5Tget_size(ftype) * H5Sget_simple_extent_npoints(dspace); // (ref:xfer-size)
  char* buffer = (char*) malloc(size);
  H5Dread(src, ftype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buffer); // (ref:xfer-read)
  H5Dwrite(dst, ftype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buffer); // (ref:xfer-write)
  free(buffer);
  H5Sclose(dspace);
  H5Tclose(ftype);
})