-
-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] LZ4 decompression error when using r.patch with MASK in effect #2809
Comments
Is this something new, or is it reproducible on any of the latest releases (7.8, 8.0 or 8.2)? |
I'll check it against the 8.2 release and report back in a bit. |
Good! If you could also share a minimal set of files to reproduce this, I'd be more than thankful. |
I also get the same error in 8.2.1. I've uploaded a tar package containing these files (.pack files created with r.pack): The first two files I'm trying to patch together while using the provided MASK which is a reclass map of mask_region. Projection is EPSG:3395 (World Mercator). Total size is 100MB, uploaded to Proton Drive here: |
Many thanks! We'll see what I can do. |
I'm also getting ZSTD errors when running the same r.patch command with a mask in effect:
r.patch completes normally with no mask though. |
For some reason I haven't got the MASK.pack file to work (problem with referencing mask_region in another non-existing location). I did test with creating mask directly with mask_region though and patched without issues. I'll do some more testing, but thought letting you know. (FWIW, I'm testing on recent 8.3.dev on Mac). |
OK, thanks for testing. I will post gdb backtrace output soon, maybe that will shed some light. |
Just a thought, is the reclassified mask_region (your MASK) valid as mask file (as per https://grass.osgeo.org/grass82/manuals/r.mask.html)? I cannot check now myself. |
I now solved my initial problem with non-existing mapset. My tests succeeded, I couldn't reproduce your issues:
|
Yes, it was created with 'r.mask raster=mask_region'. |
It seems I run into the LZ4 decompression error whenever I try doing a raster operation (r.patch, r.resamp.interp, etc.) when a mask is created in a very large region. After some more investigation, I found I was able to use r.patch and r.resamp.interp successfully with a mask enabled in relatively small region. When doing a r.patch operation while using a mask in a region with 1,017,578,100 cells in it causes the aforementioned lz4 error. |
I increased the resolution in the same location I previously worked in to 1,098,572,200 cells. Still couldn't reproduce the error. Please share a backtrace log if you can. |
Ok, here's the backtrace: gdb r.patch
Starting program: /usr/local/grass83/bin/r.patch Thread 5 "r.patch" received signal SIGSEGV, Segmentation fault. (gdb) bt |
Thanks! Have to admit, I didn't get much wiser with that though. The trace path through GRASS is missing. Perhaps |
and |
Do you want the command run inside gdb with the DEBUG=3 or outside it? |
I didn't actually have the binary As of now, all zstd libraries and binary are version 1.4.4. |
I was thinking of just an ordinary run of the r.patch after setting DEBUG. Could be that the gdb debugging reveals little about GRASS because it wasn't build with |
Ok, I set DEBUG=3, re-ran it and attached the output as a text file; ~24K lines long, though! |
Thanks! Getting a better picture of it, but nowhere close to a solution. Sooo difficult without being able to reproduce it. Could you perhaps try with |
Thats it! With a MASK in effect, r.patch completes with |
Grand! At last we narrowed it down a bit. Still, I did try with nprocs>1, but couldn't reproduce the failure. That's why I figured it a long shot, but I guess this is the "problem" with parallel code: frustratingly difficult to debug. Just to rule out, would adding Could you reproduce the failure also with the test files you made available? (Creating new location, r.unpack, etc...) |
It makes no difference, unfortunately. Similar errors as before.
Wow, I unpacked my files into a new location, and Grass can't read the header files of the rasters. You didn't have any problem reading them? |
I had to create a mapset "UNB_2016" and activate and unpack into that. |
@ericrpatton did you recompile (after |
Hi @neteler , I ran make distclean, and recompiled with `gdb r.patch For help, type "help". No sure why there's not more output :/ |
I was afraid of this. My guess it's because of multithreading processing. Did you manage to also test with the sample files? |
Yes, the test files seemed to generate (possibly) more informative output in gdb:
|
I didn't have to, but I believe |
And it should only take a sec or two. |
Right, before
(maybe edit the script above in #2809 (comment) ?) |
Hm, it's over 13 billion cells, no? So it should take a while (especially with valgrind). |
Ok, with the Note that I hit "enter" frequently in the terminal to see when it would fail.
So, it roughly happened at 90 % of the |
I tried debugging with
|
Now, setting the region to that, I also can reproduce this. |
Glad that at the very least the error is reproducible; thanks for confirming, guys. |
I can confirm this too, the openmp implementation did not consider using mask. Reading the rasters themselves works fine, but not with the mask. So a quick solution for 8.3 is to disable parallelization when mask is present. Alternatively the reading could be done in serial, but that would slow it down, or the mask could be disabled for reading and applied at the end. Unfortunately, I suspect this could be a problem for other parallelized modules as well. |
I couldn't produce any meaningful backtrace output. But just wanted to share how a debug session could look like with lldb. I ran the r.patch command and put a breakpoint at GRASS : grass > lldb r.patch -- in=Leg1_01_20m,Leg3_10_datalistp_20m out=outputmap nprocs=8 mem=2000 --v --o Mapset <UNB_2016> in <test_lz4>
(lldb) target create "r.patch"
Current executable set to 'r.patch' (arm64).
(lldb) settings set -- target.run-args "in=Leg1_01_20m,Leg3_10_datalistp_20m" "out=outputmap" "nprocs=8" "mem=2000" "--v" "--o"
(lldb) breakpoint set --file /Dev/grass/lib/raster/get_row.c --line 174
Breakpoint 1: where = libgrass_raster.8.3.dylib`read_data_compressed + 756 at get_row.c:174:17, address = 0x00000000000175a4
(lldb) run
Process 54428 launched: '/Dev/grass/dist.aarch64-apple-darwin21.6.0/bin/r.patch' (arm64)
Percent complete...
WARNING: ZSTD compression error -10: Unknown frame descriptor
Process 54428 stopped
* thread #8, stop reason = breakpoint 1.1
frame #0: 0x00000001002275a4 libgrass_raster.8.3.dylib`read_data_compressed(fd=1, row=197250, data_buf="", nbytes=0x000000010400b1e0) at get_row.c:174:17
171 if ((n = G_expand(cmp, readamount, data_buf, bufsize,
172 fcb->cellhd.compressed)) < 0 ||
173 (unsigned int)n != bufsize) {
-> 174 G_fatal_error(
175 _("Error uncompressing raster data for row %d of <%s>"),
176 row, fcb->name);
177 }
Target 0: (r.patch) stopped.
(lldb) gui In the terminal GUI, it is possible to "walk about" in the threads and frames, examining the variables in each: We should perhaps not throw in the towel just yet, but disabling parallelisation when mask is present for seems to me the least bad (i.e. safest) option for 8.3 if it comes to that. |
From what @petrasovaa showed me, I think the issue is that the same mask file descriptor is used in all threads. The library works just fine reading data from individual rasters in parallel, but the seeking and reading of mask is done using the same file descriptor. So, disabling parallelization when mask is present for v8.3 seems like a good option to me given we want to release soon(ish). Anyway, I'm curious what you think about the source of the issue. |
I'm inclined to believe just that is the root issue: multiple threads are opening and reading the same file. I don't know what mechanism triggers this: is it on GRASS side or the compressor side. The funny thing is that, there is apparently no problem "reading" the file, only "decompressing" what has just been read. In this latest aspect I've given some thought on the global variable |
While I don't have exact explanation, I think that's because you can read using the file descriptor just fine, but what is read is garbage from point of view of the other thread.
My understanding is, and I think it also stems from @aaronsms analysis of the code, that, given how the code is written, this is only a problem when opening the rasters. When reading the rasters, the per-raster globals are not modified, only read, so everything is okay except the mask which is modified in the sense that seek and read all called by all threads. |
|
I meant one thread can read one map, although now when I'm saying that I realize that in r.patch multiple rasters are read by multiple threads. |
Only, the MASK raster is read together with every raster and therefore it is read concurrently in multiple threads. |
It is indeed, as I suspected, the case that the global variable Here it is set (fcb = R__.fileinfo[x]) :
Line 595 in 1bb3943
and here it is read (row = cur_row):
Lines 133 to 134 in 1bb3943
Concurrent reading of MASK in multiple threads will obviously lead to very unexpected results. Fixing this is, as I see it, a non-trivial task, I'd suggest to merge #2829 for 8.3 release. |
Merged and backported. Should we create a new, more readable ticket and close this one? |
I consider this bug fixed. The rest is an enhancement. |
@ericrpatton Thanks for the report and your valuable assistance in tracking this issue down! |
You're welcome. So the verdict on this bug is a wontfix? |
Yes, that is unfortunately the case. |
I would say this was fixed in #2829. The error no longer occurs. Now a new valid feature request would be for mask and parallel support to work together. Wontfix sounds like we did nothing and plan nothing which is not the case. |
Sorry, I hadn't seen the fix; I re-synced my git checkout and I can confirm the error no long occurs - thanks for that, much appreciated. |
Describe the bug
A clear and concise description of what the bug is.
I receive the following error when trying to patch a series of raster maps together when a MASK is in existence:
r.patch in=Leg1_01_20m,Leg1_02_20m,Leg1_03_20m,Leg2_01_20m,Leg4_05_20m out=AAA --v --o
Percent complete...
WARNING: LZ4 decompressionWARNING: LZ4 decompression error
error
ERROR: Error uncompressing null dataERROR: Error uncompressing null data for row 5 of
for row 43 of
WARNING: Unable to close file fcell for raster map AAA: Bad file descriptor
WARNING: Unable to close file nullcmpr for raster map AAA: Bad file
descriptor
double free or corruption (!prev)
Aborted (core dumped)
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
No LZ4 decompression errors are encountered, and r.patch competes successfully.
System description (please complete the following information):
uname -a:
Linux machinename 5.15.0-58-generic #64~20.04.1-Ubuntu SMP Fri Jan 6 16:42:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Grass version:
version=8.3.dev
date=2023
revision=2a7d5ec65e
build_date=2023-02-08
build_platform=x86_64-pc-linux-gnu
build_off_t_size=8
libgis_revision=d22ee4748e
libgis_date=2023-02-02T07:48:36+00:00
proj=8.2.0
gdal=3.5.3
geos=3.11.0
sqlite=3.31.1
lz4 --version
*** LZ4 command line interface 64-bits v1.9.2, by Yann Collet ***
python3 -c "import sys, wx; print(sys.version); print(wx.version())"
3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0]
4.0.7 gtk3 (phoenix) wxWidgets 3.0.4
Additional context
Add any other context about the problem here.
The text was updated successfully, but these errors were encountered: