PnetCDF mcoll_perf detects incorrect data #757

Open
adammoody opened this issue Jan 12, 2023 · 0 comments
adammoody commented Jan 12, 2023

The test/nonblocking/mcoll_perf.c test detects incorrect data when comparing two files that were written in two different ways and should have identical content.

cd test/nonblocking
srun -n2 ./mcoll_perf /unifyfs/testfile.nc
<snip>
P0: diff at line 282 variable[2] var1_2: NC_INT buf1 != buf2 at position 32762

Tracing the pwrite and pread calls under a debugger shows that both ranks write to the same byte offsets without any synchronization in between. In this case, rank 1 writes a fill value and rank 0 later writes actual data, so which value ends up in the file depends on a race.

The fill call is here:

https://github.com/Parallel-NetCDF/PnetCDF/blob/bb59553ca3542bc09ead12c6ce4e65b913ef51fa/test/nonblocking/mcoll_perf.c#L521

When filling variable 2, rank 1 writes to (offset=648, length=8) and (offset=680, length=8).

And the write call is here:

https://github.com/Parallel-NetCDF/PnetCDF/blob/bb59553ca3542bc09ead12c6ce4e65b913ef51fa/test/nonblocking/mcoll_perf.c#L526

In that write, rank 0 writes to (offset=640, length=16) and (offset=672, length=16). These ranges overlap the regions that rank 1 wrote during the fill operation, specifically bytes 648-655 and 680-687.
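The overlap can be checked mechanically from the traced (offset, length) pairs. The following is a minimal sketch, not part of PnetCDF or UnifyFS; the `byte_range` type and the `ranges_overlap` and `count_overlaps` helpers are hypothetical names introduced only for illustration. It treats each write as a half-open byte range and reports whether two ranges share any byte.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper type: a write covering bytes [offset, offset + length). */
typedef struct {
    long long offset;
    long long length;
} byte_range;

/* Returns true if the two half-open ranges share at least one byte. */
static bool ranges_overlap(byte_range a, byte_range b)
{
    return a.length > 0 && b.length > 0 &&
           a.offset < b.offset + b.length &&
           b.offset < a.offset + a.length;
}

/* Counts the pairs (one range from each list) that overlap. */
static size_t count_overlaps(const byte_range *a, size_t na,
                             const byte_range *b, size_t nb)
{
    size_t count = 0;
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            if (ranges_overlap(a[i], b[j])) {
                count++;
            }
        }
    }
    return count;
}
```

Applied to the values above, the fill by rank 1 at (648, 8) overlaps the put by rank 0 at (640, 16), and the fill at (680, 8) overlaps the put at (672, 16), while the two fill writes at (640, 8) and (648, 8) from different ranks do not overlap each other.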

The test case can be fixed by adding a call to ncmpi_sync(ncid) between the fill loop and the put loop:

            for (i=2; i<nvars; i++){
                /* fill record variables to silence valgrind complaining about uninitialised bytes */
                for (j=0; j<array_of_gsizes[0]; j++) {
                    err = ncmpi_fill_var_rec(ncid, varid[i], j);
                    CHECK_ERR
                }
            }
            /* add sync here to fix the test case: it separates the fill
             * writes from the put writes that target the same bytes */
            err = ncmpi_sync(ncid);
            CHECK_ERR
            for (i=0; i<nvars; i++){
                err = ncmpi_put_vara_all(ncid, varid[i], starts[i], counts[i], buf[i], bufcounts[i], MPI_INT);
                CHECK_ERR
            }

For reference, here is the sequence of (offset, length) values for writes from different ranks when k==0. There are multiple overlapping writes, one of which is shown below:

offset, length values for writes
--------  -------
rank 0    rank 1
--------  -------
  0, 336
512, 32   544, 32
576, 32   608, 32
640, 8    648, 8  <--- this "fill" by rank 1
  4, 4
672, 8    680, 8
  4, 4
704, 8    712, 8
  4, 4
736, 8    744, 8
  4, 4
656, 8    664, 8
688, 8    696, 8
720, 8    728, 8
752, 8    760, 8
512, 32   544, 32
576, 32   608, 32
640, 16   704, 16  <-- overlaps with this "put" by rank 0
672, 16   736, 16
656, 16   720, 16
688, 16   752, 16
adammoody changed the title from "PnetCDF test/nonblocking/mcoll_perf detects incorrect data" to "PnetCDF mcoll_perf detects incorrect data" on Jan 12, 2023