
Memory leak when looping through data variables of a dataset loaded from a VRT #774

Open
amaissen opened this issue May 3, 2024 · 4 comments
Labels: bug (Something isn't working), question (Further information is requested)

Comments

amaissen commented May 3, 2024

Code Sample, a copy-pastable example if possible

A "Minimal, Complete and Verifiable Example" will make it much easier for maintainers to help you:
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

import rioxarray as rxr
import xarray as xr
import gc

PATH = "path_to_multi_band_vrt.vrt"

def memory_leak():
  raster = rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1})
  bands = list(raster.data_vars)
  
  for band in bands:
    data = raster[band].copy(deep=True).load()
    
    del data
    gc.collect()

Problem description

The allocated memory increases after each iteration.

Expected Output

The memory is released after each iteration, so one can process multi-band datasets that do not fit in memory.
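For reference, here is a minimal sketch of how the per-iteration growth can be observed (the psutil-based helper is illustrative, not part of the original report):

import gc
import os

import psutil
import rioxarray as rxr

PATH = "path_to_multi_band_vrt.vrt"

def report_rss_per_band():
    # Print resident memory after each band is loaded and released,
    # which makes the per-iteration growth visible.
    proc = psutil.Process(os.getpid())
    raster = rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1})
    for band in raster.data_vars:
        data = raster[band].copy(deep=True).load()
        del data
        gc.collect()
        print(f"{band}: RSS = {proc.memory_info().rss / 1e6:.1f} MB")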

Environment Information

rioxarray (0.15.5) deps:
 rasterio: 1.3.10
   xarray: 2024.3.0
     GDAL: 3.8.4
     GEOS: 3.11.1
     PROJ: 9.3.1
PROJ DATA: /opt/conda/envs/some-env/share/proj
GDAL DATA: /opt/conda/envs/some-env/share/gdal

Other python deps:
    scipy: 1.13.0
   pyproj: 3.6.1

System:
   python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
executable: /opt/conda/envs/some-env/bin/python
  machine: Linux-5.15.0-101-generic-x86_64-with-glibc2.35

Conda environment information (if you installed with conda):


Environment (conda list):
gdal                      3.8.5           py310h3b926b6_2    conda-forge
libgdal                   3.8.5                hf9625ee_2    conda-forge
rasterio                  1.3.10                   pypi_0    pypi
rioxarray                 0.15.5                   pypi_0    pypi
xarray                    2024.3.0                 pypi_0    pypi

amaissen added the bug (Something isn't working) label May 3, 2024
snowman2 added the question (Further information is requested) label and removed the bug label May 3, 2024
snowman2 (Member) commented May 3, 2024

Add these kwargs to open_rasterio to disable caching:

lock=False,  # disable internal caching
cache=False,  # don't keep data loaded in memory. pull from disk every time
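Applied to the snippet above, that would look like:

raster = rxr.open_rasterio(
    PATH,
    band_as_variable=True,
    chunks={"x": -1, "y": -1},
    lock=False,   # disable internal caching
    cache=False,  # don't keep data loaded in memory; pull from disk every time
)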

amaissen (Author) commented May 3, 2024

@snowman2, thanks for pointing to these options. I tried the options you suggested, but they did not help to release the memory.

However, when I store the entire raster to zarr storage with to_zarr() and load it back with raster = xarray.open_zarr(...), I don't see any memory leak when iterating through the data variables. This would look like:

import rioxarray as rxr
import xarray as xr
import gc

PATH = "path_to_multi_band_vrt.vrt"

def no_memory_leak():
  # Read from VRT and save to zarr (one chunk per band)
  rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1}).to_zarr(some_temp_dataset)
  
  # Open zarr and iterate over data vars.
  raster = xr.open_zarr(some_temp_dataset, chunks={"x": -1, "y": -1})
  bands = list(raster.data_vars)
  
  for band in bands:
    data = raster[band].copy(deep=True).load()
  
    del data
    gc.collect()
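To keep the intermediate store tidy, the same pattern can be wrapped in a temporary directory (a sketch reusing the imports and PATH from above; the local bands.zarr path is a hypothetical stand-in for some_temp_dataset):

import os.path
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    store = os.path.join(tmpdir, "bands.zarr")  # hypothetical stand-in for some_temp_dataset
    rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1}).to_zarr(store)
    raster = xr.open_zarr(store, chunks={"x": -1, "y": -1})
    for band in raster.data_vars:
        data = raster[band].copy(deep=True).load()
        del data
        gc.collect()
# the temporary zarr store is deleted when the with-block exits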


GregoryPetrochenkov-NOAA commented Jun 11, 2024

I have experienced similar issues with memory leaks. I ran an experiment with rioxarray loading GeoTIFFs, and the same experiment with xarray loading NetCDF files.

The tests use:

  • python: 3.10.14
  • memray: 1.12.0
  • rasterio: 1.3.9
  • xarray: 2024.3.0
  • rioxarray: 0.15.3
  • netCDF4: 1.6.5

I ran each operation 5 times and profiled memory with memray as follows (in a Jupyter notebook):

import time
import subprocess
import os
import gc
from functools import partial

import memray
import rioxarray as rxr
import xarray as xr

%load_ext memray

def run_test(func):
    """
    Run func 5 times to simulate a batch process and accumulate memory
    """

    for x in range(5):
        func()

    time.sleep(1)

I also set up these arguments in advance:

xarray_kwargs = {
    "cand_file": "./subsample_benchmark_mean.nc",
    "bench_file": "./subsample_candidate_mean.nc",
    "cache": False,
    "lock": False
}

rio_kwargs = {
    "cand_file": "./c_uint8.tif",
    "bench_file": "./b_uint8.tif",
    "cache": False,
    "lock": False
}

The first snippet loads the two GeoTIFFs with rioxarray inside context managers, which automatically call the close method when finished:

%%memray_flamegraph --temporal

def run_context_load(cand_file, bench_file, cache=False, lock=False):
    """
    Loads with context wrappers
    """

    with (rxr.open_rasterio(cand_file, cache=cache, lock=lock) as ds,
            rxr.open_rasterio(bench_file, cache=cache, lock=lock) as ds2):

        # Pure load
        ds.load()
        ds2.load()
              
run_test(partial(run_context_load, **rio_kwargs))

[memray temporal flamegraph: memory from the rioxarray loads accumulates and is never released across the five runs]

As can be seen, the memory is never released. Over many iterations in practice, the accumulation leads to very large memory consumption.

The same method is done with xarray and NetCDF files as follows:

%%memray_flamegraph --temporal

def run_context_load_xr(cand_file, bench_file, cache=False, lock=False):
    """
    Loads with context wrappers
    """

    with (xr.open_dataset(cand_file, cache=cache, lock=lock) as ds,
             xr.open_dataset(bench_file, cache=cache, lock=lock) as ds2):

        # Pure load
        ds.load()
        ds2.load()
              
run_test(partial(run_context_load_xr, **xarray_kwargs))

[memray temporal flamegraph: the xarray/NetCDF runs release all memory by the end of each iteration]

As can be seen, all memory is released by the end of the operation, as expected.

I tried changing the cache and lock arguments to no avail; I could not get rioxarray to behave similarly. The only way I found to fully release the memory is to delete the objects directly and force garbage collection:

%%memray_flamegraph --temporal

def run_context_load_delete_gc(cand_file, bench_file, cache=False, lock=False):
    """
    Loads with context wrappers and deletes objects
    """

    with (rxr.open_rasterio(cand_file, cache=cache, lock=lock) as ds,
            rxr.open_rasterio(bench_file, cache=cache, lock=lock) as ds2):

        # Pure load
        ds.load()
        ds2.load()

    del ds, ds2
    gc.collect()
    
run_test(partial(run_context_load_delete_gc, **rio_kwargs))

[memray temporal flamegraph: with explicit del and gc.collect(), memory is fully released after each run]

While this works, it is not a clean solution, and it would mean prescribing the pattern to every user. I would suggest relabeling this issue as a bug, because the behavior takes extra work to diagnose and a user would not expect it when loading GeoTIFFs with rioxarray. rioxarray is a dependency of a package I support, and the memory accumulation caused problems for workflows, as can be seen in this memory profiling example:

[memory profile of a downstream workflow showing memory accumulating across iterations]
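One way to package the workaround so it doesn't have to be prescribed ad hoc (a hypothetical helper, not an existing rioxarray API):

import gc

import rioxarray as rxr

def with_loaded(path, func, **open_kwargs):
    # Open and fully load a raster, apply func, then drop all references
    # and force a garbage-collection pass before returning the result.
    with rxr.open_rasterio(path, **open_kwargs) as ds:
        ds.load()
        result = func(ds)
    del ds
    gc.collect()
    return result

# e.g. mean = with_loaded("./c_uint8.tif", lambda ds: float(ds.mean()), cache=False, lock=False)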

Also, I apologize for not being able to simply paste the data, as would be preferable, but if you like I can provide my notebooks and data, which total about 100 MB.

snowman2 (Member) commented Jun 21, 2024

The GDAL cache settings may be worth looking into: https://gdal.org/user/configoptions.html#performance-and-caching
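For example, the size of GDAL's raster block cache can be capped with the GDAL_CACHEMAX config option; with rasterio it can be scoped like this (a sketch; the 64 MB value is arbitrary and the right setting depends on the workload):

import rasterio
import rioxarray as rxr

# Cap GDAL's raster block cache at 64 MB for everything opened in this scope.
with rasterio.Env(GDAL_CACHEMAX=64):
    raster = rxr.open_rasterio("path_to_multi_band_vrt.vrt", band_as_variable=True)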

snowman2 added the bug (Something isn't working) label Jun 21, 2024