
Memory leak when looping through data variables of a dataset loaded from a VRT #774

Open
amaissen opened this issue May 3, 2024 · 4 comments
Labels: bug (Something isn't working), question (Further information is requested)

Comments

amaissen commented May 3, 2024

Code Sample, a copy-pastable example if possible

A "Minimal, Complete and Verifiable Example" will make it much easier for maintainers to help you:
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

import rioxarray as rxr
import xarray as xr
import gc

PATH = "path_to_multi_band_vrt.vrt"

def memory_leak():
  raster = rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1})
  bands = list(raster.data_vars)
  
  for band in bands:
    data = raster[band].copy(deep=True).load()
    
    del data
    gc.collect()

Problem description

The allocated memory increases after each iteration.

Expected Output

The memory is released after each iteration, so one can process multi-band datasets that do not fit in memory.
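For reference, here is a minimal sketch of how the per-iteration growth can be observed (the psutil-based helper is illustrative, not part of the original report):

import gc
import os

import psutil
import rioxarray as rxr

PATH = "path_to_multi_band_vrt.vrt"

def report_rss_per_band():
    # Print resident memory after each band is loaded and released,
    # which makes the per-iteration growth visible.
    proc = psutil.Process(os.getpid())
    raster = rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1})
    for band in raster.data_vars:
        data = raster[band].copy(deep=True).load()
        del data
        gc.collect()
        print(f"{band}: RSS = {proc.memory_info().rss / 1e6:.1f} MB")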

Environment Information

rioxarray (0.15.5) deps:
 rasterio: 1.3.10
   xarray: 2024.3.0
     GDAL: 3.8.4
     GEOS: 3.11.1
     PROJ: 9.3.1
PROJ DATA: /opt/conda/envs/some-env/share/proj
GDAL DATA: /opt/conda/envs/some-env/share/gdal

Other python deps:
    scipy: 1.13.0
   pyproj: 3.6.1

System:
   python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
executable: /opt/conda/envs/some-env/bin/python
  machine: Linux-5.15.0-101-generic-x86_64-with-glibc2.35

Conda environment information (if you installed with conda):


Environment (conda list):
gdal                      3.8.5           py310h3b926b6_2    conda-forge
libgdal                   3.8.5                hf9625ee_2    conda-forge
rasterio                  1.3.10                   pypi_0    pypi
rioxarray                 0.15.5                   pypi_0    pypi
xarray                    2024.3.0                 pypi_0    pypi

amaissen added the bug (Something isn't working) label May 3, 2024
snowman2 added the question (Further information is requested) label and removed the bug label May 3, 2024
snowman2 (Member) commented May 3, 2024

Add these kwargs to open_rasterio to disable caching:

lock=False,  # disable internal caching
cache=False,  # don't keep data loaded in memory. pull from disk every time
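Applied to the snippet above, that would look like:

raster = rxr.open_rasterio(
    PATH,
    band_as_variable=True,
    chunks={"x": -1, "y": -1},
    lock=False,   # disable internal caching
    cache=False,  # don't keep data loaded in memory; pull from disk every time
)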

amaissen (Author) commented May 3, 2024

@snowman2, thanks for pointing to these options. I tried the options you suggested, but they did not help to release the memory.

However, when I store the entire raster to zarr storage with to_zarr() and load it back with raster = xarray.open_zarr(...), I don't see any memory leak when iterating through the data variables. This would look like:

import rioxarray as rxr
import xarray as xr
import gc

PATH = "path_to_multi_band_vrt.vrt"

def no_memory_leak():
  # Read from VRT and save to zarr (one chunk per band)
  rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1}).to_zarr(some_temp_dataset)
  
  # Open zarr and iterate over data vars.
  raster = xr.open_zarr(some_temp_dataset, chunks={"x": -1, "y": -1})
  bands = list(raster.data_vars)
  
  for band in bands:
    data = raster[band].copy(deep=True).load()
  
    del data
    gc.collect()
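To keep the intermediate store tidy, the same pattern can be wrapped in a temporary directory (a sketch reusing the imports and PATH from above; the local bands.zarr path is a hypothetical stand-in for some_temp_dataset):

import os.path
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    store = os.path.join(tmpdir, "bands.zarr")  # hypothetical stand-in for some_temp_dataset
    rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1}).to_zarr(store)
    raster = xr.open_zarr(store, chunks={"x": -1, "y": -1})
    for band in raster.data_vars:
        data = raster[band].copy(deep=True).load()
        del data
        gc.collect()
# the temporary zarr store is deleted when the with-block exits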


GregoryPetrochenkov-NOAA commented Jun 11, 2024

I have experienced similar issues with memory leaks. I ran an experiment with rioxarray loading GeoTIFFs, and the same experiment with xarray loading NetCDF files.

The tests use:

  • python: 3.10.14
  • memray: 1.12.0
  • rasterio: 1.3.9
  • xarray: 2024.3.0
  • rioxarray: 0.15.3
  • netCDF4: 1.6.5

I ran each operation 5 times and profiled memory with memray as follows (in a Jupyter notebook):

import time
import subprocess
import os
import gc
from functools import partial

import memray
import rioxarray as rxr
import xarray as xr

%load_ext memray

def run_test(func):
    """
    Run func 5 times to simulate a batch process and accumulate memory
    """

    for x in range(5):
        func()

    time.sleep(1)

I also set up these arguments in advance:

xarray_kwargs = {
    "cand_file": "./subsample_benchmark_mean.nc",
    "bench_file": "./subsample_candidate_mean.nc",
    "cache": False,
    "lock": False
}

rio_kwargs = {
    "cand_file": "./c_uint8.tif",
    "bench_file": "./b_uint8.tif",
    "cache": False,
    "lock": False
}

The first snippet loads the two GeoTIFFs with rioxarray inside context managers, which automatically call the close method when finished:

%%memray_flamegraph --temporal

def run_context_load(cand_file, bench_file, cache=False, lock=False):
    """
    Loads with context wrappers
    """

    with (rxr.open_rasterio(cand_file, cache=cache, lock=lock) as ds,
            rxr.open_rasterio(bench_file, cache=cache, lock=lock) as ds2):

        # Pure load
        ds.load()
        ds2.load()
              
run_test(partial(run_context_load, **rio_kwargs))

[memray temporal flamegraph: memory from the rioxarray loads accumulates and is never released across the five runs]

As can be seen, the memory is never released. Over many iterations in practice, the accumulation leads to very large memory consumption.

The same method is done with xarray and NetCDF files as follows:

%%memray_flamegraph --temporal

def run_context_load_xr(cand_file, bench_file, cache=False, lock=False):
    """
    Loads with context wrappers
    """

    with (xr.open_dataset(cand_file, cache=cache, lock=lock) as ds,
             xr.open_dataset(bench_file, cache=cache, lock=lock) as ds2):

        # Pure load
        ds.load()
        ds2.load()
              
run_test(partial(run_context_load_xr, **xarray_kwargs))

[memray temporal flamegraph: the xarray/NetCDF runs release all memory by the end of each iteration]

As can be seen, all memory is released by the end of the operation, as expected.

I tried changing the cache and lock arguments to no avail; I could not get rioxarray to behave similarly. The only way I found to fully release the memory is to delete the objects directly and force garbage collection:

%%memray_flamegraph --temporal

def run_context_load_delete_gc(cand_file, bench_file, cache=False, lock=False):
    """
    Loads with context wrappers and deletes objects
    """

    with (rxr.open_rasterio(cand_file, cache=cache, lock=lock) as ds,
            rxr.open_rasterio(bench_file, cache=cache, lock=lock) as ds2):

        # Pure load
        ds.load()
        ds2.load()

    del ds, ds2
    gc.collect()
    
run_test(partial(run_context_load_delete_gc, **rio_kwargs))

[memray temporal flamegraph: with explicit del and gc.collect(), memory is fully released after each run]

While this works, it is not a clean solution, and it would mean prescribing the pattern to every user. I would suggest relabeling this issue as a bug, because the behavior takes extra work to diagnose and a user would not expect it when loading GeoTIFFs with rioxarray. rioxarray is a dependency of a package I support, and the memory accumulation caused problems for workflows, as can be seen in this memory profiling example:

[memory profile of a downstream workflow showing memory accumulating across iterations]
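One way to package the workaround so it doesn't have to be prescribed ad hoc (a hypothetical helper, not an existing rioxarray API):

import gc

import rioxarray as rxr

def with_loaded(path, func, **open_kwargs):
    # Open and fully load a raster, apply func, then drop all references
    # and force a garbage-collection pass before returning the result.
    with rxr.open_rasterio(path, **open_kwargs) as ds:
        ds.load()
        result = func(ds)
    del ds
    gc.collect()
    return result

# e.g. mean = with_loaded("./c_uint8.tif", lambda ds: float(ds.mean()), cache=False, lock=False)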

Also, I apologize for not being able to simply paste the data, as would be preferable, but if you like I can provide my notebooks and data, which total about 100 MB.

snowman2 (Member) commented Jun 21, 2024

The GDAL cache settings may be worth looking into: https://gdal.org/user/configoptions.html#performance-and-caching
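For example, the size of GDAL's raster block cache can be capped with the GDAL_CACHEMAX config option; with rasterio it can be scoped like this (a sketch; the 64 MB value is arbitrary and the right setting depends on the workload):

import rasterio
import rioxarray as rxr

# Cap GDAL's raster block cache at 64 MB for everything opened in this scope.
with rasterio.Env(GDAL_CACHEMAX=64):
    raster = rxr.open_rasterio("path_to_multi_band_vrt.vrt", band_as_variable=True)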

snowman2 added the bug (Something isn't working) label Jun 21, 2024