build_ascent.sh failures #1379

Open

mlohry opened this issue Sep 5, 2024 · 17 comments

@mlohry (Contributor) commented Sep 5, 2024

On the latest develop branch (e0100bf5), running env enable_mpi=ON install_dir=/path/to/install build_jobs=10 ./scripts/build_ascent/build_ascent.sh ends with the following error in the Ascent configure step:

**** Creating Ascent host-config (ascent-config.cmake)
**** Configuring Ascent
loading initial cache file /lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/ascent-config.cmake
-- The C compiler identification is Clang 18.1.6
-- The CXX compiler identification is Clang 18.1.6
-- Cray Programming Environment 2.7.32 C
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/cray/pe/craype/2.7.32/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Cray Programming Environment 2.7.32 CXX
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/cray/pe/craype/2.7.32/bin/CC - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at cmake/SetupBLT.cmake:43 (include):
  include could not find requested file:

    blt/SetupBLT.cmake
Call Stack (most recent call first):
  CMakeLists.txt:119 (include)


CMake Error at cmake/SetupBLT.cmake:69 (message):
  Cannot use CMake imported targets for MPI.(ENABLE_MPI == ON, but
  MPI::MPI_CXX CMake target is missing.)
Call Stack (most recent call first):
  CMakeLists.txt:119 (include)


-- Configuring incomplete, errors occurred!
See also "/lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/build/ascent-checkout/CMakeFiles/CMakeOutput.log".
@cyrush (Member) commented Sep 5, 2024

@mlohry For the Cray compiler wrappers, we have to tell CMake that MPI will magically work and that it shouldn't look for it.

I am working on a path that uses the Cray compiler wrappers as well as the MPICH MPI compiler wrappers for Frontier with the new modules.

Can you confirm what modules would be ideal for your case?
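For reference, a minimal hand-configure sketch of what "don't look for MPI" usually means, assuming the Ascent build honors BLT's ENABLE_FIND_MPI option (turning it off tells BLT to trust the wrappers instead of requiring the MPI::MPI_CXX imported target); the compiler paths are the ones from the log above and the source path is a placeholder:

cmake -C ascent-config.cmake \
  -DENABLE_MPI=ON \
  -DENABLE_FIND_MPI=OFF \
  -DCMAKE_C_COMPILER=/opt/cray/pe/craype/2.7.32/bin/cc \
  -DCMAKE_CXX_COMPILER=/opt/cray/pe/craype/2.7.32/bin/CC \
  <path/to/ascent/src>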

@mlohry (Contributor, Author) commented Sep 5, 2024

In the short term, really any CPU build with a working replay_mpi on Frontier would be of use for looking at some large Blueprint saves.

The above error was from the default Frontier modules. The modules used for our solver case on Frontier are these:

module load craype-x86-trento perftools-base/24.07.0 libfabric/1.20.1 cpe/24.07 craype-network-ofi rocm/6.2.0 xpmem/2.8.4-1.0_7.3__ga37cbd9.shasta gcc-native/13.2 Core/24.00 craype/2.7.32 tmux/3.2a cray-dsmml/0.3.0 hsi/default cray-mpich/8.1.30 lfs-wrapper/0.0.1 cray-libsci/24.07.0 DefApps PrgEnv-gnu/8.5.0 cray-pmi/6.1.15.21 craype-accel-amd-gfx90a

So ideally those (GCC 13), if that prevents issues. Running with that set:

[[email protected] ascent]$ module list

Currently Loaded Modules:
  1) craype-x86-trento        5) craype-network-ofi                     9) Core/24.00         13) DefApps            17) cray-libsci/23.12.5      21) cmake/3.23.2
  2) perftools-base/24.07.0   6) rocm/6.2.0                            10) tmux/3.2a          14) craype/2.7.31.11   18) PrgEnv-gnu/8.5.0
  3) cpe/24.07                7) xpmem/2.8.4-1.0_7.3__ga37cbd9.shasta  11) hsi/default        15) cray-dsmml/0.2.2   19) cray-pmi/6.1.15.21
  4) libfabric/1.20.1         8) gcc-native/13.2                       12) lfs-wrapper/0.0.1  16) cray-mpich/8.1.28  20) craype-accel-amd-gfx90a

Inactive Modules:
  1) darshan-runtime

HDF5 hits a shared-library loading error during the build:

/lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/build/hdf5-1.14.1-2/bin/H5make_libsettings: error while loading shared libraries: libamdhip64.so.5: cannot open shared object file: No such file or directory
gmake[2]: *** [src/CMakeFiles/gen_hdf5-static.dir/build.make:85: src/gen_SRCS.stamp2] Error 127
gmake[2]: *** Waiting for unfinished jobs....
/lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/build/hdf5-1.14.1-2/bin/H5detect: error while loading shared libraries: libamdhip64.so.5: cannot open shared object file: No such file or directory
gmake[2]: *** [src/CMakeFiles/gen_hdf5-static.dir/build.make:80: src/gen_SRCS.stamp1] Error 127
gmake[1]: *** [CMakeFiles/Makefile2:2042: src/CMakeFiles/gen_hdf5-static.dir/all] Error 2
gmake[1]: *** Waiting for unfinished jobs....
/lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/build/hdf5-1.14.1-2/bin/H5make_libsettings: error while loading shared libraries: libamdhip64.so.5: cannot open shared object file: No such file or directory
gmake[2]: *** [src/CMakeFiles/gen_hdf5-shared.dir/build.make:97: src/gen_SRCS.stamp2] Error 127
gmake[2]: *** Waiting for unfinished jobs....
/lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/build/hdf5-1.14.1-2/bin/H5detect: error while loading shared libraries: libamdhip64.so.5: cannot open shared object file: No such file or directory
gmake[2]: *** [src/CMakeFiles/gen_hdf5-shared.dir/build.make:92: src/gen_SRCS.stamp1] Error 127
gmake[1]: *** [CMakeFiles/Makefile2:2095: src/CMakeFiles/gen_hdf5-shared.dir/all] Error 2

The (recently updated) rocm/6.2.0 module has libamdhip64.so.6, not .5. Loading rocm/6.1.3 fixes the HDF5 problem, but then later I hit the original error of the missing SetupBLT.cmake. I also noticed that at least some of the builds, e.g. RAJA, seem to be picking up GCC 7.5.0 from /usr/bin/c++, not GCC 13 as expected, even though the compiler wrapper /opt/cray/pe/craype/2.7.32/bin/cc does point to gcc-13.2.1.
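A quick way to confirm both suspicions (which hip runtime the loaded rocm module actually ships, and which compilers get used) is something like the following sketch; the ROCM_PATH variable and the HDF5 build path are assumptions based on the module setup and log above:

# which libamdhip64 versions the loaded rocm module provides
ls ${ROCM_PATH}/lib/libamdhip64.so*

# which libraries the generated HDF5 tool is trying to load
ldd build/hdf5-1.14.1-2/bin/H5detect | grep amdhip

# which compilers the Cray wrapper and the bare c++ resolve to
cc --version
/usr/bin/c++ --version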

Trying to minimize the modules a bit,

module load Core/24.00 # makes cmake available
module load cmake # loads cmake/3.23.2
module load PrgEnv-gnu

[[email protected] ascent]$ module list

Currently Loaded Modules:
  1) Core/24.00     3) gcc-native/12.3    5) cray-dsmml/0.2.2   7) craype-network-ofi   9) cray-libsci/23.12.5
  2) cmake/3.23.2   4) craype/2.7.31.11   6) libfabric/1.20.1   8) cray-mpich/8.1.28   10) PrgEnv-gnu/8.5.0

I still hit the original build error of BLT_SOURCE_DIR not being defined correctly. (Where do you normally pick this up from? It's not being exported to ascent-config.cmake.)

@cyrush (Member) commented Sep 5, 2024

@mlohry

Thanks for the details - try this branch:

https://github.com/Alpine-DAV/ascent/tree/task/2024_09_frontier

run:

https://github.com/Alpine-DAV/ascent/blob/task/2024_09_frontier/scripts/build_ascent/build_ascent_hip_frontier.sh

These aren't the same modules you need for the integrated case, but I was able to run the Ascent MPI tests successfully (I tried two ranks).

Here are the modules that need to be loaded to run (from the top of the Frontier build script):

module load cmake #3.23.2
module load PrgEnv-cray
module load craype-accel-amd-gfx90a
module load rocm/5.7.1
module load cray-mpich/8.1.28
module load cce/17.0.0
module load cray-python/3.11.5
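With those modules loaded, the Frontier script should run the same way as build_ascent.sh; a sketch, assuming the Frontier variant honors the same install_dir and build_jobs environment variables:

env install_dir=/path/to/install build_jobs=10 \
    ./scripts/build_ascent/build_ascent_hip_frontier.sh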

@cyrush (Member) commented Sep 5, 2024

If you see BLT_SOURCE_DIR missing, that means you missed --recursive on the git clone.

You can fix that with:

git submodule init
git submodule update
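Equivalently, a single command that also covers any nested submodules:

git submodule update --init --recursive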

@cyrush (Member) commented Sep 5, 2024

Looking again -- I think the MPI issue confused me: a missing BLT_SOURCE_DIR would also cause that MPI error. A submodule update should fix that.

@mlohry (Contributor, Author) commented Sep 5, 2024

@cyrush Thanks, that built, but now I'm back to the original issue I hit: when I execute ascent_replay_mpi, it fails in the file-check step:

terminate called after throwing an instance of 'conduit::Error'
  what():
file: /lustre/orion/ard174/proj-shared/mlohry/ascent-test/ascent/src/utilities/replay/replay.cpp
line: 214
message:
Actions file not found: ascent_actions_relay_no_boundary.yaml

srun: error: frontier06127: task 1: Aborted

That file exists and rank 0 sees it, but the MPI broadcast of the bool seems to leave the other ranks seeing false. It looks like that code is fairly recent:
e504a28

Are you able to successfully run ascent_replay_mpi?

@cyrush (Member) commented Sep 5, 2024

OK, sounds like a new bug, and the system MPI is OK. We did have a change there recently; looking into it.

@cyrush (Member) commented Sep 5, 2024

@mlohry On the task/2024_09_frontier branch -- I changed the actions-file checking logic in replay to match another implementation we have. Can you see if this resolves your issue?

@mlohry (Contributor, Author) commented Sep 5, 2024

Sitting in queues, will let you know.

What is MPI_BOOL in that code?

@cyrush (Member) commented Sep 5, 2024

The other code uses MPI_INT instead of MPI_BOOL.
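For anyone following along, the portable version of that check broadcasts an int flag from rank 0, since MPI_BOOL is not a standard MPI datatype. A minimal self-contained sketch of the idea (not the exact replay code; the stat-based file check stands in for whatever helper replay actually uses):

#include <mpi.h>
#include <string>
#include <sys/stat.h>

// rank 0 checks whether the actions file exists and broadcasts the answer
// to all ranks as an int, avoiding the non-existent MPI_BOOL datatype
bool actions_file_exists(const std::string &file_name, MPI_Comm comm)
{
    int rank = 0;
    MPI_Comm_rank(comm, &rank);

    int has_file = 0;
    if(rank == 0)
    {
        struct stat info;
        has_file = (stat(file_name.c_str(), &info) == 0) ? 1 : 0;
    }

    MPI_Bcast(&has_file, 1, MPI_INT, 0, comm);
    return has_file != 0;
}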

@mlohry (Contributor, Author) commented Sep 5, 2024

I was wondering where MPI_BOOL was actually being defined, since that's not an MPI datatype and I'm not seeing it when grepping the dependencies.

The latest branch looks like it might have worked, but the post-processing run I expected to take 10 minutes on 471 nodes timed out after 60 minutes and didn't produce any images, so I can't tell whether it was hanging. I'll try it again on a smaller dataset.

@cyrush (Member) commented Sep 5, 2024

That is a great question. The ifdef guard was wrong, so the MPI_BOOL code was never actually getting compiled, which makes a bit more sense.
I pushed another fix to the frontier branch.

@mlohry (Contributor, Author) commented Sep 5, 2024

relay::mpi::broadcast_using_schema(actions, 0, mpi_comm);

is missing the conduit:: namespace qualifier:

@@ -232,7 +232,7 @@ void load_actions(const std::string &file_name, int mpi_comm_id, conduit::Node &
                      << "\n" << emsg);
     }
 #ifdef ASCENT_REPLAY_MPI
-    relay::mpi::broadcast_using_schema(actions, 0, mpi_comm);
+    conduit::relay::mpi::broadcast_using_schema(actions, 0, mpi_comm);
 #endif
 }

@cyrush (Member) commented Sep 5, 2024

Pushed a fix -- checked that it compiles, and it worked.

@mlohry (Contributor, Author) commented Sep 6, 2024

An aside -- trying to build a past working version, commit a5f51b:

git clone --recursive https://github.com/Alpine-DAV/ascent.git
cd ascent
git checkout a5f51b
env enable_mpi=ON ./scripts/build_ascent/build_ascent.sh

this check

if [ -d ${ascent_checkout_dir} ]; then

fails, and the script does a fresh clone of ascent develop and ends up building that, not the checked-out commit.

@cyrush (Member) commented Sep 6, 2024

@mlohry I looked into this: the logic to reuse an existing checkout was added after a5f51b (#1324) -- it was part of #1339. So I think that explains that specific issue.
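For reference, the shape of that reuse-existing-checkout guard is roughly the following (the message and clone flags here are illustrative, the variable name is the one quoted above), which is why a5f51b still always clones develop:

if [ -d ${ascent_checkout_dir} ]; then
    echo "Using existing Ascent checkout: ${ascent_checkout_dir}"
else
    git clone --recursive https://github.com/Alpine-DAV/ascent.git ${ascent_checkout_dir}
fi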

@cyrush (Member) commented Sep 6, 2024

Overall, develop (or the frontier branch) should be the best option -- but sorry for the bumps in the road with the recent replay bugs. We are planning to add extensive replay testing. Since the ifdef typos have happened twice now, we will need to think of a good way to protect against those errors, which the compiler doesn't help us with :-)
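One possible way to get the compiler's help here (an editor's sketch, not a decision from this thread) is to define the guard macro to 0 or 1 and test it with #if plus -Wundef, so a misspelled name becomes a warning instead of silently dead code:

// compile with: -DASCENT_REPLAY_MPI=1 (or =0) and -Wundef
#if ASCENT_REPLAY_MPI
    conduit::relay::mpi::broadcast_using_schema(actions, 0, mpi_comm);
#endif
// a typo such as "#if ASCENT_REPLY_MPI" now triggers:
// warning: "ASCENT_REPLY_MPI" is not defined, evaluates to 0 [-Wundef]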
