
Test failure in pnetcdf with netCDF 4.8.0 #2038

Closed
ArchangeGabriel opened this issue Jul 22, 2021 · 23 comments · Fixed by #2310
Comments

@ArchangeGabriel
Contributor

While rebuilding pnetcdf for netCDF 4.8.0, I encountered a test failure, reported at Parallel-NetCDF/PnetCDF#72.

@wkliao thinks this is an issue with netCDF, as he explains in the linked ticket. My MPI/netCDF knowledge is rusty, so I trust him on that and decided to open a ticket here. ;)

@WardF
Member

WardF commented Jul 22, 2021

Thanks! I’ll take a look :)

@WardF
Member

WardF commented Jul 22, 2021

I am traveling so won't be able to dig into this today, but I'll take a look when I get back to the 'office'. Also tagging the author of the MPI code, @edhartnett, in case an obvious answer leaps out! Thanks again @ArchangeGabriel!

@WardF WardF self-assigned this Jul 26, 2021
@WardF WardF added this to the 4.8.1 milestone Jul 26, 2021
@ArchangeGabriel
Contributor Author

Was this solved in 4.8.1 (I can’t test right now because of #2085)?

@ArchangeGabriel
Contributor Author

@DennisHeimbigner I think you wanted to post that in #2085. ;)
But yes, I’ll build that way to check this one here indeed.

@ArchangeGabriel
Contributor Author

So it’s not fixed.

@ArchangeGabriel
Contributor Author

netCDF 4.8 has been stuck in the [staging] repository on Arch for two months and is blocking other updates, so I’ve been asked to do something about it.
Do you and/or @wkliao think this is minor enough that we should ignore it and publish 4.8.1 to users, or should we rather withdraw the 4.8 build until this gets fixed?

@WardF
Member

WardF commented Oct 1, 2021

Revisiting this now, leaving a comment so that it rises to the top of my queue.

@WardF
Member

WardF commented Oct 1, 2021

So it bears further testing, but in a new debug environment with pnetcdf 1.12.2 and netcdf-c 4.8.1, the failure previously observed (and replicated) does not occur. I'll continue to dig into this.

@WardF
Member

WardF commented Oct 4, 2021

So it seems I'm going to need a bit more information to duplicate this. In my environment (Ubuntu, mpicc built for MPICH v3.4.1, gcc 10.3.0) I am not observing any errors with the bundled pnetcdf tests, nor with the test program described here (using the testfile.nc file from the pnetcdf tarball). It's possible I'm using the wrong binary test file, or that there is some other mismatch in my environment. At this point, if I knew a specific version of mpicc to use, or had other pertinent info about the environment, I could do more to try to replicate this issue. Thanks!

@ArchangeGabriel
Contributor Author

As always with my issues, you can find the version of any package on https://archlinux.org/packages/.

In this case, it is openmpi 4.1.1 (I still haven’t finished packaging mpich) and gcc 11.1.0.

@wkliao
Contributor

wkliao commented Oct 4, 2021

I don't think the MPICH/OpenMPI version or the gcc version matters.
If you have a NetCDF 4.8.0 or 4.8.1 that was built with parallel I/O enabled but with PnetCDF disabled, then you should be able to run the test program, tst_nc4.c, and see the same error message. That test program requires no input file and does not use PnetCDF underneath.
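
For readers without the PnetCDF ticket open, here is a rough sketch of the kind of test being described, assuming one unlimited record dimension and one record written per rank in collective mode; the file name, variable names, and error reporting are illustrative, not the original tst_nc4.c source:

    #include <stdio.h>
    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>

    #define CHECK_ERR { if (err != NC_NOERR) { \
        printf("Error at line %d: %s\n", __LINE__, nc_strerror(err)); \
        MPI_Abort(MPI_COMM_WORLD, 1); } }

    int main(int argc, char **argv)
    {
        int err, rank, nprocs, ncid, dimid, varid, buf;
        size_t start[1], count[1], nrecs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Create a NetCDF-4 file for parallel access; PnetCDF is not involved. */
        err = nc_create_par("tst_nc4.nc", NC_CLOBBER | NC_NETCDF4,
                            MPI_COMM_WORLD, MPI_INFO_NULL, &ncid); CHECK_ERR
        err = nc_def_dim(ncid, "time", NC_UNLIMITED, &dimid); CHECK_ERR
        err = nc_def_var(ncid, "var", NC_INT, 1, &dimid, &varid); CHECK_ERR
        err = nc_enddef(ncid); CHECK_ERR
        err = nc_var_par_access(ncid, varid, NC_COLLECTIVE); CHECK_ERR

        /* Each rank collectively writes one record, at record index == rank. */
        start[0] = rank;
        count[0] = 1;
        buf = rank;
        err = nc_put_vara_int(ncid, varid, start, count, &buf); CHECK_ERR

        /* After nprocs records have been written collectively, every rank
           should see nprocs records; in the failing case each rank only
           reports the extent implied by its own write (rank + 1). */
        err = nc_inq_dimlen(ncid, dimid, &nrecs); CHECK_ERR
        if (nrecs != (size_t)nprocs)
            printf("Rank %d: after writing start=%d count=1, expecting number "
                   "of records = %d but got %zu\n", rank, rank, nprocs, nrecs);

        err = nc_close(ncid); CHECK_ERR
        MPI_Finalize();
        return 0;
    }

Built with mpicc and run with something like "mpiexec -n 4 ./tst_nc4", this is the pattern that yields the "expecting number of records = 4 but got N" output quoted below.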

@WardF
Member

WardF commented Oct 4, 2021

I will try installing openmpi and see if the issue occurs; I do not observe any failures with tst_nc4.c in my environment. Thanks for the additional information!

@wkliao
Contributor

wkliao commented Oct 4, 2021

Did you run it with more than one MPI process?
The example there uses the command "mpiexec -n 4 ./tst_nc4".

@WardF
Member

WardF commented Oct 4, 2021

@wkliao Yes, I'm executing via mpiexec -n 4 and do not observe any failures.

@wkliao
Contributor

wkliao commented Oct 4, 2021

Could you please add the following lines to your tst_nc4.c before the call
to nc_create_par to print the NetCDF version string?
This is to ensure we are using the same NetCDF library.

    if (rank == 0)
        printf("\nNetCDF library version is: %s\n\n", nc_inq_libvers());

Here is what I am getting.


NetCDF library version is: 4.8.1 of Oct  4 2021 16:24:17 $

Rank 0 error at line 43 of file tst_nc4.c:
	after writing start=0 count=1
	expecting number of records = 4 but got 1
Rank 1 error at line 43 of file tst_nc4.c:
	after writing start=1 count=1
	expecting number of records = 4 but got 2
Rank 2 error at line 43 of file tst_nc4.c:
	after writing start=2 count=1
	expecting number of records = 4 but got 3

@WardF
Member

WardF commented Oct 4, 2021

We may be on to something here, as I am receiving an error when using mpicc as provided by openmpi.

@WardF
Member

WardF commented Oct 4, 2021

The output I'm observing is as follows, using mpicc as provided by openmpi. Let me revert my VM image to mpich2 and generate the same output.

wfisher@ubuntudev-arm:~/Desktop$ mpiexec -n 4 tst_nc4

NetCDF library version is: 4.8.1 of Oct  4 2021 15:23:46 $

Rank 1 error at line 43 of file tst_nc4.c:
	after writing start=1 count=1
	expecting number of records = 4 but got 2
Rank 2 error at line 43 of file tst_nc4.c:
	after writing start=2 count=1
	expecting number of records = 4 but got 3
Rank 0 error at line 43 of file tst_nc4.c:
	after writing start=0 count=1
	expecting number of records = 4 but got 1

@WardF
Member

WardF commented Oct 4, 2021

Interesting, seeing inconsistent behavior now. I'll follow up after I pursue this further, but it appears that the issue may also now be manifesting in the mpich2-based VM image. Frustrating, as it very much wasn't before, but that's the nature of software development I guess. Stand by as we sort this out.

@WardF WardF pinned this issue Oct 5, 2021
@WardF WardF modified the milestones: 4.8.1, 4.8.2 Dec 13, 2021
@edwardhartnett
Contributor

OK, I've been taking a look at this, and indeed this seems to be a bug in the netCDF parallel I/O code.

The problem is here:

err = nc_put_vara_int(ncid, varid, start, count, &buf); CHECK_ERR
err = nc_inq_dimlen(ncid, dimid, &nrecs); CHECK_ERR

Obviously the dimlen is not giving the correct result: the record count gets updated on each processor, but the update is not propagated to the other processors.

I will take a look at what can be done.

To start with, I have converted the test from the pnetcdf issue into a netcdf test, tst_parallel6.c.

I will update here when I get more progress...

@edwardhartnett
Contributor

OK, the problem is in libhdf5/hdf5internal.c:

find_var_dim_max_length(NC_GRP_INFO_T *grp, int varid, int dimid,
                        size_t *maxlen)
{
    hid_t datasetid = 0, spaceid = 0;
    NC_VAR_INFO_T *var;
    hsize_t *h5dimlen = NULL, *h5dimlenmax = NULL;
    int d, dataset_ndims = 0;
    int retval = NC_NOERR;

    *maxlen = 0;

    /* Find this var. */
    var = (NC_VAR_INFO_T*)ncindexith(grp->vars,varid);
    if (!var) return NC_ENOTVAR;
    assert(var->hdr.id == varid);

    /* If the var hasn't been created yet, its size is 0. */
    if (!var->created)
    {
        *maxlen = 0;
    }
    else
    {
        /* Get the number of records in the dataset. */
        if ((retval = nc4_open_var_grp2(grp, var->hdr.id, &datasetid)))
            BAIL(retval);
        if ((spaceid = H5Dget_space(datasetid)) < 0)
            BAIL(NC_EHDFERR);

        /* If it's a scalar dataset, it has length one. */
        if (H5Sget_simple_extent_type(spaceid) == H5S_SCALAR)
        {
            *maxlen = (var->dimids && var->dimids[0] == dimid) ? 1 : 0;
        }
        else
        {
            /* Check to make sure ndims is right, then get the len of each
               dim in the space. */
            if ((dataset_ndims = H5Sget_simple_extent_ndims(spaceid)) < 0)
                BAIL(NC_EHDFERR);
            if (dataset_ndims != var->ndims)
                BAIL(NC_EHDFERR);
            if (!(h5dimlen = malloc(dataset_ndims * sizeof(hsize_t))))
                BAIL(NC_ENOMEM);
            if (!(h5dimlenmax = malloc(dataset_ndims * sizeof(hsize_t))))
                BAIL(NC_ENOMEM);
            if ((dataset_ndims = H5Sget_simple_extent_dims(spaceid,
                                                           h5dimlen, h5dimlenmax)) < 0)
                BAIL(NC_EHDFERR);
            LOG((5, "find_var_dim_max_length: varid %d len %d max: %d",
                 varid, (int)h5dimlen[0], (int)h5dimlenmax[0]));
            for (d=0; d<dataset_ndims; d++) {
                if (var->dimids[d] == dimid) {
                    *maxlen = *maxlen > h5dimlen[d] ? *maxlen : h5dimlen[d];
                }
            }
        }
    }

OK, so H5Dget_space() is returning local information. I have tried using H5fsync() to sync the file, but this did not work either.

I don't know any way to force HDF5 to update this information except by closing and reopening the file. And that's no good.

So what should happen here? @wkliao, suggestions are welcome...

@wkliao
Contributor

wkliao commented Apr 11, 2022

In PnetCDF, we call MPI_Allreduce to get the maximum record number among all writing processes when the I/O mode is collective. In independent mode, consistency of the record number is not guaranteed.
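
As a sketch of that reconciliation (this is neither the actual PnetCDF code nor the eventual netcdf-c fix in #2310, just an illustration of the idea), the collective-mode record count can be obtained by taking MPI_MAX over the locally visible lengths:

    #include <mpi.h>

    /* Hypothetical helper: local_nrecs is the record count this rank sees
       locally (e.g. from H5Sget_simple_extent_dims); the reduction returns
       the largest count seen by any rank. */
    static unsigned long long
    reconcile_numrecs(unsigned long long local_nrecs, MPI_Comm comm)
    {
        unsigned long long global_nrecs = local_nrecs;
        MPI_Allreduce(&local_nrecs, &global_nrecs, 1, MPI_UNSIGNED_LONG_LONG,
                      MPI_MAX, comm);
        return global_nrecs;
    }

Taking the maximum works because each rank's local extent covers at least the records that rank wrote, so the largest local value is the global record count after a collective write.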

@edwardhartnett
Contributor

OK I will take a stab at this tomorrow morning...

@edwardhartnett
Contributor

OK, I have a PR up with a fix (#2310). Feedback welcome!
