
nc_create_par triggers HDF5 error stack message related to H5FDunregister #2990

Closed
brtnfld opened this issue Aug 23, 2024 · 19 comments · Fixed by #3013

brtnfld (Contributor) commented Aug 23, 2024

To report a non-security related issue, please provide:

  • the version of the software with which you are encountering an issue
    v4.9.3-rc1

  • environmental information (e.g., operating system, compiler info, Java version, Python version, etc.)
    Frontier (ORNL), Cray clang version 17.0.0 (b59b7a8e9169719529cf5ab440f3c301e515d047)

  • a description of the issue with the steps needed to reproduce it

A call to status = nc_create_par(output_path, NC_NETCDF4 | NC_CLOBBER, MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);

returns the following HDF5 error message:

HDF5-DIAG: Error detected in HDF5 (1.15.0):
  #000: /ccs/home/brtnfld/packages/hdf5/src/H5FD.c line 368 in H5FDunregister(): not a file driver
    major: Invalid arguments to routine
    minor: Inappropriate type

It does not cause an error; it is just a distraction, since all the ranks print the message. It seems to be triggered by H5FD_http_finalize.
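
For illustration, a minimal reproducer sketch along these lines (assuming a parallel-enabled netCDF-C build; the path "repro.nc" and the build/run commands are placeholders, not part of the original report):

    /* Minimal sketch of a reproducer. Build with something like:
     *   mpicc repro.c -o repro -lnetcdf
     * and run with: mpiexec -n 2 ./repro */
    #include <stdio.h>
    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>

    int main(int argc, char **argv)
    {
        int ncid, status;

        MPI_Init(&argc, &argv);

        /* The reported HDF5-DIAG message is printed on every rank even
         * though the call itself returns NC_NOERR. */
        status = nc_create_par("repro.nc", NC_NETCDF4 | NC_CLOBBER,
                               MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);
        if (status != NC_NOERR)
            fprintf(stderr, "nc_create_par: %s\n", nc_strerror(status));
        else
            nc_close(ncid);

        MPI_Finalize();
        return 0;
    }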

WardF self-assigned this Aug 23, 2024
WardF (Member) commented Aug 23, 2024

Thanks for highlighting this; I haven't tested against HDF5 1.15.0 yet. Do you know offhand if this issue also happens when using HDF5 1.14.x or earlier?

WardF added this to the 4.9.3 milestone Aug 23, 2024
edhartnett (Contributor) commented Aug 23, 2024 via email

WardF (Member) commented Aug 23, 2024 via email

brtnfld (Contributor, Author) commented Aug 23, 2024

I confirmed that it also happens with 1.14.4-3.

WardF (Member) commented Aug 23, 2024

I've rolled back to older combinations of HDF5 and netCDF that definitely did not have this issue before, but I am observing it with them now. I will continue to poke around.

WardF (Member) commented Aug 23, 2024

I am also seeing the issue in the pure-h5 test h5_test/run_par_tests.

edwardhartnett (Contributor) commented:

Yes, this is due to a change in HDF5. I will try to track this down this week. I am also having trouble getting my parallel zstd test working, which would be scary, but I believe parallel filters are already being tested elsewhere, so they must work (but is zstd being tested?).

So it seems like time for this cowboy to saddle up my pony and ride into NetCDF country...
[image attachment]

WardF (Member) commented Aug 26, 2024

I'm not 100% certain it's just a change in HDF5. Testing against HDF5 1.10.10 and netCDF-C 4.9.1, I observe the following:

On Ubuntu 24.04, using gcc version 13.2.0 (as reported by mpicc --version), I see errors reported.

On Ubuntu 22.04, using gcc version 11.4.0, the tests run successfully.

This doesn't entirely rule out a change in HDF5, either, but I wonder about changes in the newer gcc/mpicc versions.

The errors reported using gcc 13.2.0 are as follows:

195/239 Test #193: h5_test_run_par_tests .................***Failed    0.37 sec

Testing parallel I/O with HDF5...
*** Creating file for parallel I/O read, and rereading it...
p=1, write_rate=4362.35, read_rate=1340.99
ok.
*** Tests successful!
*** Creating file for parallel I/O read, and rereading it...
p=1, write_rate=3476.18HDF5-DIAG: Error detected in HDF5 (1.10.10) MPI-process 0:
  #000: H5F.c line 412 in H5Fopen(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #001: H5Fint.c line 1826 in H5F_open(): unable to read superblock
    major: File accessibility
    minor: Read failed
  #002: H5Fsuper.c line 413 in H5F__super_read(): file signature not found
    major: File accessibility
    minor: Not an HDF5 file
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/h5_test/tst_h_par.c, line: 170
*** Creating file for parallel I/O read, and rereading it...
p=1, write_rate=3151.62, read_rate=1200.91
ok.
*** Tests successful!

edwardhartnett (Contributor) commented:

What I have been seeing is these error messages in parallel I/O tests, but they do not cause the tests to fail; they just print.

WardF (Member) commented Aug 26, 2024

Interesting/frustrating. Something is going on, that much is certain. Let me double-check against the latest HDF5 again.

WardF (Member) commented Aug 26, 2024

For completeness' sake, the issues I'm seeing (with old versions like 1.10.10, and now suddenly with later versions in 1.14.x) are as follows:

*** Testing parallel IO for raw-data with MPI-IO (driver)...Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 121
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 305
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 91
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 91

*** Testing more advanced parallel access.
*** Testing parallel IO for raw-data with MPI-IO (driver)...Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 91
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 305
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 92

*** Testing more advanced parallel access.
*** Testing parallel IO for raw-data with MPI-IO (driver)...
*** Testing more advanced parallel access.
*** Testing parallel IO for raw-data with MPI-IO (driver)...Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 92
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 337
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 92

*** Testing more advanced parallel access.
*** Testing parallel IO for raw-data with MPI-IO (driver)...Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 92
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 92
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 91

*** Testing more advanced parallel access.
*** Testing parallel IO for raw-data with MPI-IO (driver)...Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 91

*** Testing more advanced parallel access.
*** Testing parallel IO for raw-data with MPI-IO (driver)...Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 91
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 91
ok.
*** Testing parallel IO for meta-data with MPI-IO (driver)...ok.
*** Testing parallel IO for meta-data with MPI-IO (driver)...ok.
*** Testing parallel IO for different hyperslab selections with MPI-IO (driver)...ok.
*** Testing parallel IO for extending variables with MPI-IO (driver)...ok.
*** Testing parallel IO for different hyperslab selections with MPI-IO (driver)...ok.
*** Testing parallel IO for extending variables with MPI-IO (driver)...ok.
*** Testing parallel IO for raw-data with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for raw-data with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for meta-data with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for meta-data with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for different hyperslab selections with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for extending variables with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for different hyperslab selections with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for extending variables with MPIPOSIX-IO (driver)...ok.
ok.
*** Tests successful!
*** Tests successful!
HDF5-DIAG: Error detected in HDF5 (1.10.10) thread 0:
  #000: H5FD.c line 324 in H5FDunregister(): not a file driver
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.10) thread 0:
  #000: H5FD.c line 324 in H5FDunregister(): not a file driver
    major: Invalid arguments to routine
    minor: Inappropriate type

jhendersonHDF commented:

@WardF @edhartnett Hi all, I believe what may be going on here is an ordering issue: HDF5 has already closed the ID used for the HTTP VFD before it gets to the point where the H5FD_http_finalize function (which calls H5FDunregister) is called. So the VFD is holding on to what it thinks is a valid ID, but it actually isn't. Normally, the termination callback of a VFD (in this case, H5FD_http_term) is where one would unregister and reset the ID for a VFD, but it looks like the termination callback for the HTTP VFD does nothing, and the ID isn't unregistered until later, when the HDF5 dispatch code calls H5FD_http_finalize directly.

The easiest fix would probably be to add a check in H5FD_http_finalize that the ID is valid before calling H5FDunregister, as in:

    if (H5FD_HTTP_g && (H5Iis_valid(H5FD_HTTP_g) > 0))
        H5FDunregister(H5FD_HTTP_g);

https://github.com/Unidata/netcdf-c/blob/main/libhdf5/H5FDhttp.c#L259-L261

Alternatively, the code to unregister the ID could be moved to H5FD_http_term instead of H5FD_http_finalize, but I don't know if that would break something from the netCDF-C perspective.
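
For context, a minimal sketch of how that guard might sit inside the finalize routine; the exact signature, sentinel value, and surrounding bookkeeping in libhdf5/H5FDhttp.c may differ, so this is not verbatim source:

    #include "hdf5.h"

    /* Cached driver ID registered by H5FD_http_init(); 0 means "not
     * registered". (Name taken from H5FDhttp.c; the real declaration
     * lives in that file.) */
    static hid_t H5FD_HTTP_g = 0;

    int
    H5FD_http_finalize(void)
    {
        /* Guard the unregister: during library shutdown HDF5 may already
         * have released this ID, in which case H5FDunregister() would push
         * the "not a file driver" error stack shown above. */
        if (H5FD_HTTP_g && (H5Iis_valid(H5FD_HTTP_g) > 0))
            (void)H5FDunregister(H5FD_HTTP_g);

        H5FD_HTTP_g = 0; /* forget the cached ID either way */
        return 0;
    }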

DennisHeimbigner (Collaborator) commented:

That sounds very plausible. When is H5FD_http_term called? In any case, I do not see any immediate problem with using your alternate suggestion at the end of your comment.

jhendersonHDF commented:

When is H5FD_http_term called?

That function should be called when HDF5 is terminating and closing IDs and gets around to releasing the ID that was registered for the VFD by the VFD itself. In that sense, calling H5FDunregister isn't strictly necessary, since HDF5 will release the ID anyway, and it may even cause problems by trying to unregister an ID that is already in the process of being unregistered. Most HDF5 VFDs just reset their internal value for the ID in their termination callback:

https://github.com/HDFGroup/hdf5/blob/develop/src/H5FDmulti.c#L254-L261
https://github.com/hpc-io/vfd-gds/blob/master/src/H5FDgds.c#L463-L464

DennisHeimbigner (Collaborator) commented:

Well, there should be no reason we can't follow what other VFDs do.

jhendersonHDF commented:

I believe it should be fine to just reset the H5FD_HTTP_g value and skip calling H5FDunregister, though I'd definitely like to investigate a bit to see if something changed in HDF5 to start causing this.
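
A hedged sketch of that approach, following the pattern of the VFDs linked above; the callback name H5FD_http_term and the global H5FD_HTTP_g come from the discussion and H5FDhttp.c, and the actual change merged in #3013 may be structured differently:

    #include "hdf5.h"

    /* Mirrors the existing global in H5FDhttp.c; 0 means "not registered". */
    static hid_t H5FD_HTTP_g = 0;

    /* Termination callback invoked by HDF5 while it is releasing the driver
     * ID itself: just drop the cached copy instead of calling
     * H5FDunregister() on an ID that is already being torn down. */
    static herr_t
    H5FD_http_term(void)
    {
        H5FD_HTTP_g = 0;
        return 0;
    }

    int
    H5FD_http_finalize(void)
    {
        /* No explicit H5FDunregister() here either; HDF5 releases the ID
         * during shutdown via the termination callback above. */
        H5FD_HTTP_g = 0;
        return 0;
    }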

DennisHeimbigner (Collaborator) commented:

I agree. There should be nothing particularly different in the H5FDhttp.c code.

WardF (Member) commented Aug 27, 2024

So I believe I am seeing something different from what the OP originally reported; I will try to sort it out, but it boils down to issues with mpich/gcc 13.x+: even rolling back to old versions of HDF5 and netCDF that worked previously now gives errors (not just messages that can otherwise be ignored). I'm putting this information here in case it is connected in a way that is obvious to folks more familiar with MPI. Otherwise, I'll continue sorting it out on my end.

vagrant@ubuntu2404gdm:~/Desktop/netcdf-c/build/nc_test4$ ./tst_parallel3 

*** Testing more advanced parallel access.
*** Testing parallel IO for raw-data with MPI-IO (driver)...ok.
*** Testing parallel IO for meta-data with MPI-IO (driver)...ok.
*** Testing parallel IO for different hyperslab selections with MPI-IO (driver)...ok.
*** Testing parallel IO for extending variables with MPI-IO (driver)...ok.
*** Testing parallel IO for raw-data with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for meta-data with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for different hyperslab selections with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for extending variables with MPIPOSIX-IO (driver)...ok.
*** Tests successful!
vagrant@ubuntu2404gdm:~/Desktop/netcdf-c/build/nc_test4$ mpiexec -n 4 ./tst_parallel3 

*** Testing more advanced parallel access.
*** Testing parallel IO for raw-data with MPI-IO (driver)...
*** Testing more advanced parallel access.
*** Testing parallel IO for raw-data with MPI-IO (driver)...
*** Testing more advanced parallel access.
*** Testing parallel IO for raw-data with MPI-IO (driver)...
*** Testing more advanced parallel access.
*** Testing parallel IO for raw-data with MPI-IO (driver)...Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 92
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 91
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 284
Sorry! Unexpected result, /home/vagrant/Desktop/netcdf-c/nc_test4/tst_parallel3.c, line: 91
ok.
*** Testing parallel IO for meta-data with MPI-IO (driver)...ok.
*** Testing parallel IO for different hyperslab selections with MPI-IO (driver)...ok.
*** Testing parallel IO for extending variables with MPI-IO (driver)...ok.
*** Testing parallel IO for raw-data with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for meta-data with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for different hyperslab selections with MPIPOSIX-IO (driver)...ok.
*** Testing parallel IO for extending variables with MPIPOSIX-IO (driver)...ok.
*** Tests successful!
vagrant@ubuntu2404gdm:~/Desktop/netcdf-c/build/nc_test4$ 

WardF (Member) commented Sep 5, 2024

This was closed as part of the PR merges that went in; I recognize that may be premature and will re-open if folks are still observing this. We've incorporated #3012, which added testing for the issue (thanks @edwardhartnett), and #3013, which incorporated a fix (thanks @jhendersonHDF); after those, the tests went from failing to passing.

I'm still seeing MPI-related issues with the most recent Ubuntu, but I will open a new issue for that.
