Writing a h5py.Dataset loads the whole thing into memory #1623

ivirshup · 2024-08-28T22:37:03Z

Please make sure these conditions are met

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of anndata.
(optional) I have confirmed this bug exists on the master branch of anndata.

Report

Code:

%load_ext memory_profiler

import h5py
from anndata.experimental import write_elem
import numpy as np

f = h5py.File("tmp.h5", "w")
X = np.ones((10_000, 10_000))

%memit write_elem(f, "X", X)
# peak memory: 940.14 MiB, increment: 0.00 MiB

%memit write_elem(f, "X2", f["X"])
# peak memory: 1702.89 MiB, increment: 762.75 MiB

The second write roughly doubles peak memory, which suggests the source dataset is read fully into memory before being written. We could move to a chunked approach to writing fairly easily, based on the solution suggested here:

Cannot create dataset from another astype wrapped dataset h5py/h5py#1761 (comment)

dst_ds = f.create_dataset_like('dst', src_ds, dtype=np.int64)

for chunk in src_ds.iter_chunks():
    dst_ds[chunk] = src_ds[chunk]
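
For reference, here is a self-contained sketch of that chunk-by-chunk copy. The helper name `copy_chunked` and the demo file and dataset names are made up for illustration and are not part of anndata or h5py. It assumes the source dataset uses chunked storage, so that `iter_chunks` is available, and it keeps the source dtype rather than casting.

```python
import os
import tempfile

import h5py
import numpy as np


def copy_chunked(src, dest_group, name):
    """Copy `src` into `dest_group[name]` one storage chunk at a time.

    Hypothetical helper for illustration. Assumes `src` is a chunked
    h5py.Dataset; only one chunk is held in memory at any moment.
    """
    # Reuse shape, dtype and chunk layout from the source dataset.
    dst = dest_group.create_dataset_like(name, src)
    # iter_chunks yields tuples of slices, one tuple per stored chunk.
    for chunk in src.iter_chunks():
        dst[chunk] = src[chunk]
    return dst


# Small demo in a temporary directory (file name is arbitrary).
with tempfile.TemporaryDirectory() as tmpdir:
    with h5py.File(os.path.join(tmpdir, "demo.h5"), "w") as f:
        data = np.arange(10_000, dtype="f8").reshape(100, 100)
        src_ds = f.create_dataset("X", data=data, chunks=(10, 100))
        dst_ds = copy_chunked(src_ds, f, "X2")
        print(dst_ds.shape, dst_ds.chunks)
```

With this pattern, peak memory should scale with the chunk size rather than with the full array, though I have not profiled this sketch.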

Versions

-----
IPython             8.26.0
anndata             0.11.0.dev168+g8cc5a18
h5py                3.11.0
numpy               1.26.4
session_info        1.0.0
-----
asciitree           NA
asttokens           NA
bottleneck          1.4.0
cloudpickle         3.0.0
cython_runtime      NA
dask                2024.8.1
dateutil            2.9.0.post0
decorator           5.1.1
executing           2.0.1
importlib_metadata  NA
jedi                0.19.1
jinja2              3.1.4
markupsafe          2.1.5
memory_profiler     0.61.0
msgpack             1.0.8
natsort             8.4.0
numcodecs           0.13.0
numexpr             2.10.1
packaging           24.1
pandas              2.2.1
parso               0.8.4
prompt_toolkit      3.0.47
psutil              5.9.8
pure_eval           0.2.2
pyarrow             15.0.2
pygments            2.18.0
pytz                2024.1
scipy               1.12.0
setuptools          70.3.0
six                 1.16.0
stack_data          0.6.3
tblib               3.0.0
tlz                 0.12.1
toolz               0.12.1
traitlets           5.14.3
typing_extensions   NA
wcwidth             0.2.13
yaml                6.0.1
zarr                2.18.2
zipp                NA
-----
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
Linux-6.8.0-1010-aws-x86_64-with-glibc2.39
-----
Session information updated at 2024-08-28 22:36

ivirshup · 2024-08-28T23:01:48Z

Some complications:

iter_chunks errors if the HDF5 array is memory mapped and not chunked
What if the output is chunked but the input isn't?
What if neither is chunked? It would still be valuable to cut down memory usage there.
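
One possible way to handle the non-chunked case is to copy fixed-size blocks of rows instead of relying on `iter_chunks`. The sketch below is only an illustration: `copy_in_blocks` and its `max_bytes` parameter are hypothetical names, it assumes the dataset has at least one dimension, and it slices along the first axis only.

```python
import math
import os
import tempfile

import h5py
import numpy as np


def copy_in_blocks(src, dest_group, name, max_bytes=64 * 1024 * 1024):
    """Copy `src` into `dest_group[name]` in blocks of rows.

    Hypothetical helper for illustration. It does not depend on the
    chunk layout of `src`, so it also works for contiguous datasets.
    `max_bytes` is an assumed budget for a single block.
    """
    dst = dest_group.create_dataset(name, shape=src.shape, dtype=src.dtype)
    # Bytes needed for one index along the first axis (at least 1).
    row_bytes = max(1, src.dtype.itemsize * math.prod(src.shape[1:]))
    rows_per_block = max(1, max_bytes // row_bytes)
    for start in range(0, src.shape[0], rows_per_block):
        stop = min(start + rows_per_block, src.shape[0])
        # Only rows start..stop are read into memory here.
        dst[start:stop] = src[start:stop]
    return dst


# Small demo with a contiguous (non-chunked) source dataset.
with tempfile.TemporaryDirectory() as tmpdir:
    with h5py.File(os.path.join(tmpdir, "demo.h5"), "w") as f:
        src_ds = f.create_dataset("X", data=np.ones((1_000, 100)))
        dst_ds = copy_in_blocks(src_ds, f, "X2", max_bytes=80_000)
        print(src_ds.chunks, dst_ds.shape)
```

This leaves open how to pick block boundaries when the output is chunked but the input is not; aligning blocks with the output chunks would probably be better than a fixed byte budget.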

ivirshup added performance 🐌 Bug 🐛 Triage 🩺 labels Aug 28, 2024

ivirshup linked a pull request Aug 28, 2024 that will close this issue

Chunked writing of h5py.Dataset and zarr.Array #1624

Open

4 tasks

ilan-gold removed the Triage 🩺 label Aug 29, 2024

ilan-gold assigned ilan-gold and ivirshup and unassigned ilan-gold Aug 29, 2024
