Memory leak when looping through data variables of a dataset loaded from a VRT #774
Add these kwargs to
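The comment above is truncated here; judging from the `cache=False, lock=False` values used in the follow-up tests below, a sketch of what passing the suggested options might look like (the path is hypothetical, and the exact suggestion was cut off):

```python
import rioxarray as rxr

# Assumed from the follow-up comments; the original suggestion was truncated.
ds = rxr.open_rasterio("path_to_multi_band_vrt.vrt", cache=False, lock=False)
```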
@snowman2, thanks for pointing to these options. I tried the options you suggested, but they did not help to release the memory. However, when I store the entire raster to zarr storage and iterate over the zarr-backed dataset instead, the memory is released as expected:

```python
import gc

import rioxarray as rxr
import xarray as xr

PATH = "path_to_multi_band_vrt.vrt"
some_temp_dataset = "path_to_temp_store.zarr"  # placeholder for a temporary zarr store

def no_memory_leak():
    # Read from VRT and save to zarr (one chunk per band)
    rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1}).to_zarr(some_temp_dataset)
    # Open zarr and iterate over data vars.
    raster = xr.open_zarr(some_temp_dataset, chunks={"x": -1, "y": -1})
    bands = list(raster.data_vars)
    for band in bands:
        data = raster[band].copy(deep=True).load()
        del data
        gc.collect()
```
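A minimal variant of the workaround above, assuming one wants the temporary zarr store cleaned up automatically; the `tempfile` usage is my addition, not from the original comment:

```python
import gc
import tempfile
from pathlib import Path

import rioxarray as rxr
import xarray as xr

PATH = "path_to_multi_band_vrt.vrt"  # hypothetical input path

def iterate_via_temp_zarr():
    # The temporary directory (and the zarr store inside it) is removed on exit.
    with tempfile.TemporaryDirectory() as tmpdir:
        store = Path(tmpdir) / "bands.zarr"
        rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1}).to_zarr(store)
        raster = xr.open_zarr(store, chunks={"x": -1, "y": -1})
        for band in raster.data_vars:
            data = raster[band].copy(deep=True).load()
            del data
            gc.collect()
```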
I have experienced similar issues with memory leaks. I ran an experiment with rioxarray loading GeoTIFFs and, for comparison, with xarray loading NetCDF files. The tests use the GeoTIFF and NetCDF files referenced in the kwargs below.
I ran each operation 5 times and memory-profiled with memray as follows (run in a Jupyter Notebook):

```python
import time
import subprocess
import os
import gc
from functools import partial

import memray
import rioxarray as rxr
import xarray as xr

%load_ext memray

def run_test(func):
    """
    Driver to run the memory accumulation test by running 5 times,
    simulating a batch process.
    """
    for x in range(5):
        func()
        time.sleep(1)
```

I also set up these arguments in advance:

```python
xarray_kwargs = {
    "cand_file": "./subsample_benchmark_mean.nc",
    "bench_file": "./subsample_candidate_mean.nc",
    "cache": False,
    "lock": False,
}

rio_kwargs = {
    "cand_file": "./c_uint8.tif",
    "bench_file": "./b_uint8.tif",
    "cache": False,
    "lock": False,
}
```

The first snippet shows rioxarray loading the two GeoTIFFs in a context wrapper, which automatically calls the close method when finished:

```python
%%memray_flamegraph --temporal
def run_context_load(cand_file, bench_file, cache=False, lock=False):
    """
    Loads with context wrappers
    """
    with (rxr.open_rasterio(cand_file, cache=cache, lock=lock) as ds,
          rxr.open_rasterio(bench_file, cache=cache, lock=lock) as ds2):
        # Pure load
        ds.load()
        ds2.load()

run_test(partial(run_context_load, **rio_kwargs))
```

As can be seen, the memory is never released. In practice, over many iterations this leads to very large memory consumption as the leaked memory accumulates. The same method with xarray and NetCDF files is as follows:

```python
%%memray_flamegraph --temporal
def run_context_load_xr(cand_file, bench_file, cache=False, lock=False):
    """
    Loads with context wrappers
    """
    with (xr.open_dataset(cand_file, cache=cache, lock=lock) as ds,
          xr.open_dataset(bench_file, cache=cache, lock=lock) as ds2):
        # Pure load
        ds.load()
        ds2.load()

run_test(partial(run_context_load_xr, **xarray_kwargs))
```

As can be seen, and as expected, all memory is released by the end of the operation. I tried changing the rioxarray function to delete the objects and garbage collect explicitly:

```python
%%memray_flamegraph --temporal
def run_context_load_delete_gc(cand_file, bench_file, cache=False, lock=False):
    """
    Loads with context wrappers and deletes objects
    """
    with (rxr.open_rasterio(cand_file, cache=cache, lock=lock) as ds,
          rxr.open_rasterio(bench_file, cache=cache, lock=lock) as ds2):
        # Pure load
        ds.load()
        ds2.load()
    del ds, ds2
    gc.collect()

run_test(partial(run_context_load_delete_gc, **rio_kwargs))
```

While this does work, it is not a clean solution and would require prescribing that users do the same. I would suggest relabeling this issue as a bug, because it takes extra work for a user to diagnose, and a user would not expect this behavior when loading GeoTIFFs in rioxarray. rioxarray is a dependency of a package I am supporting, and the memory accumulation caused issues for workflows, as can be seen in this memory profiling example (flamegraph image not preserved here). Also, I apologize for not being able to simply paste the data, as would be preferable, but if you like I can provide my notebooks and data, which total about 100 MB.
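As a lighter-weight alternative to full memray profiling, one could also watch the process RSS between runs; this sketch is my addition, assumes `psutil` is installed, and reuses the `run_context_load` and `rio_kwargs` defined above:

```python
import gc
import os

import psutil  # assumed available; not part of the original tests

proc = psutil.Process(os.getpid())
for i in range(5):
    run_context_load(**rio_kwargs)
    gc.collect()
    # If memory were being released, RSS should plateau rather than grow each run.
    print(f"run {i}: rss = {proc.memory_info().rss / 1e6:.1f} MB")
```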
The GDAL cache settings may be worth looking into: https://gdal.org/user/configoptions.html#performance-and-caching
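For example, the GDAL block cache can be capped via the `GDAL_CACHEMAX` configuration option; a sketch using rasterio's environment handling (the 64 MB value and the path are arbitrary illustrations, not from this issue):

```python
import rasterio
import rioxarray as rxr

# GDAL interprets small GDAL_CACHEMAX values as megabytes.
with rasterio.Env(GDAL_CACHEMAX=64):
    ds = rxr.open_rasterio("path_to_multi_band_vrt.vrt")  # hypothetical path
    ds.load()
```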
Code Sample, a copy-pastable example if possible
A "Minimal, Complete and Verifiable Example" will make it much easier for maintainers to help you:
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
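The original copy-pastable sample did not survive in this page; based on the issue title and the first reply, a minimal sketch of the reported pattern might look like this (the path is hypothetical):

```python
import gc

import rioxarray as rxr

PATH = "path_to_multi_band_vrt.vrt"  # hypothetical path

raster = rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1})
for band in raster.data_vars:
    data = raster[band].copy(deep=True).load()
    del data
    gc.collect()  # memory is reported to keep growing despite this
```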
Problem description
The allocated memory increases after each iteration.
Expected Output
The memory is released after each iteration, so one can process multi-band datasets that do not fit in memory.
Environment Information
Conda environment information (if you installed with conda):
Environment (`conda list`):