load_ctc_data makes two copies of the loaded array #165

Open
DragaDoncila opened this issue Sep 19, 2024 · 0 comments
Labels
bug (Something isn't working)

Comments

@DragaDoncila
Collaborator

Description

Our current _load_tiffs function temporarily holds two full copies of the data when loading from disk: the list of individual frames and the stacked array built from it. The list is only garbage collected once the function returns, so peak memory use is roughly twice the size of the dataset. For medium to large datasets this can cause a MemoryError even though the dataset alone fits into memory just fine. I've confirmed this behaviour with a Python memory profiler (see the tracemalloc sketch after the snippet below).

# module-level imports used by the loader (imread assumed to be tifffile's,
# which the out= proposal below relies on)
import glob

import numpy as np
from tifffile import imread
from tqdm import tqdm


def _load_tiffs(data_dir):
    """Load a directory of individual frames into a stack.

    Args:
        data_dir (Path): Path to directory of tiff files

    Raises:
        FileNotFoundError: No tif files found in data_dir

    Returns:
        np.array: 4D array with dims TYXC
    """
    files = np.sort(glob.glob(f"{data_dir}/*.tif*"))
    if len(files) == 0:
        raise FileNotFoundError(f"No tif files were found in {data_dir}")

    ims = []
    for f in tqdm(files, "Loading TIFFs"):
        ims.append(imread(f))
    # ims now holds a full sized copy of the data

    mov = np.stack(ims)
    # both ims and mov hold full sized copies of the data, before we return

    return mov
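
For reference, the extra copy shows up as a peak of roughly twice the size of the returned array. A minimal sketch of how this can be confirmed with tracemalloc (which tracks numpy allocations; the dataset path below is a placeholder):

import tracemalloc

tracemalloc.start()
mov = _load_tiffs("path/to/ctc_dataset/01")  # placeholder path
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"returned array: {mov.nbytes / 1e9:.2f} GB")
print(f"peak during load: {peak / 1e9:.2f} GB")  # roughly 2x the returned array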

We should update this code to peek at the first frame to get its shape and dtype, then preallocate a single numpy array and read each frame directly into it. Roughly, as below:

def _load_tiffs(data_dir):
    files = np.sort(glob.glob(f"{data_dir}/*.tif*"))
    if len(files) == 0:
        raise FileNotFoundError(f"No tif files were found in {data_dir}")

    # peek at the first frame to get the shape and dtype, then preallocate
    # the full stack once
    first_im = imread(files[0])
    shape = (len(files), *first_im.shape)
    dtype = first_im.dtype
    stack = np.zeros(shape=shape, dtype=dtype)
    stack[0] = first_im

    # read each remaining frame directly into its slot in the stack
    for i, f in enumerate(tqdm(files[1:], "Loading TIFFs")):
        imread(f, out=stack[i + 1])

    return stack
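
The out= call above assumes tifffile's imread, which accepts a preallocated output buffer. If the imread in use doesn't support that, plain assignment into the preallocated stack still caps the peak at one full copy plus a single frame:

    for i, f in enumerate(tqdm(files[1:], "Loading TIFFs")):
        stack[i + 1] = imread(f)  # only one frame-sized temporary at a time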

Minimal example to reproduce the bug

The best way to reproduce this is to load a dataset that is more than half the size of your RAM. I've usually noticed it when running pipelines over multiple datasets but, as mentioned above, I have confirmed it with a Python memory profiler (I'll try to reproduce the profile at some stage if we want, but I don't think I have a copy anymore...).
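
If no suitably large dataset is to hand, one hedged way to reproduce it is to write synthetic frames with tifffile and load them back, scaling the frame count and size until the stack exceeds half of the available RAM (directory name and sizes below are arbitrary):

from pathlib import Path

import numpy as np
import tifffile

data_dir = Path("synthetic_frames")  # arbitrary scratch directory
data_dir.mkdir(exist_ok=True)

frame = np.zeros((2048, 2048), dtype=np.uint16)  # ~8 MB per frame
for t in range(500):  # scale up until the stack is > half your RAM
    tifffile.imwrite(data_dir / f"t{t:03d}.tif", frame)

mov = _load_tiffs(data_dir)  # peak memory is ~2x mov.nbytes with the current code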

Severity

  • Unusable
  • Annoying, but still functional
  • Very minor
@DragaDoncila added the bug label on Sep 19, 2024