Potential fix for Apple silicon build #272

Open
mjs271 opened this issue Jan 19, 2023 · 11 comments

@mjs271
Contributor

mjs271 commented Jan 19, 2023

For the past few months (since around the beginning of November), I haven't been able to build EKAT successfully on my Mac laptop with an M1 chip running macOS Monterey. For context, the EKAT version I've been using (661dbc5) is the one used in the EAGLES project by haero and, by extension, mam4xx.

I've been attempting to build with the following configuration flags,

-DCMAKE_CXX_COMPILER=mpic++
-DCMAKE_Fortran_COMPILER=mpifort
-DKokkos_ENABLE_DEPRECATED_CODE=OFF
-DKokkos_ENABLE_DEBUG=TRUE
-DKokkos_ENABLE_AGGRESSIVE_VECTORIZATION=OFF
-DKokkos_ENABLE_CUDA=OFF
-DKokkos_ENABLE_SERIAL=ON
-DEKAT_ENABLE_FPE=OFF

in which mpic++ is built with Apple clang v14 and mpifort with gfortran 12.2.
When I run make, I see errors of the form:

EKAT/src/ekat/util/ekat_feutils.hpp:xx:yy: error: no member named '__control' in 'fenv_t' [...]
EKAT/src/ekat/util/ekat_feutils.hpp:xx:yy: error: no member named '__mxcsr' in 'fenv_t' [...]

The apparent fix turns out to be a matter of adding an #ifdef statement around an #include in ekat_arch.cpp, namely

#ifdef EKAT_ENABLE_FPE
  #include "ekat/util/ekat_feutils.hpp"
#endif
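
As a minimal sketch of the same guard pattern outside EKAT: the include and everything that depends on it sit behind the same macro, with a stub on the other branch so callers still compile. This is not EKAT code. `raised_fp_flags` and `fpe_compiled_in` are hypothetical names, `<cfenv>` stands in for `ekat/util/ekat_feutils.hpp`, and `EKAT_ENABLE_FPE` is assumed to be defined by the build system only when FPE support is requested.

```cpp
// EKAT_ENABLE_FPE is assumed to come from the build system; it is
// deliberately left undefined here, so the stub branch is compiled.
#ifdef EKAT_ENABLE_FPE
  #include <cfenv>  // stand-in for "ekat/util/ekat_feutils.hpp"

  // Hypothetical helper: returns the floating-point exception flags
  // that are currently raised, using only the portable <cfenv> API.
  inline int raised_fp_flags() { return std::fetestexcept(FE_ALL_EXCEPT); }
  constexpr bool fpe_compiled_in = true;
#else
  // Stub: callers compile without touching any fenv_t internals.
  inline int raised_fp_flags() { return 0; }
  constexpr bool fpe_compiled_in = false;
#endif
```

If a source file guards only the include but still calls helpers from that header elsewhere, the build would fail at those call sites, so the uses need the same guard or a stub like the one above.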

After a successful build, I get the following from make test

95% tests passed, 4 tests failed out of 75

Label Time Summary:
MustFail    =   0.64 sec*proc (3 tests)

Total Test time (real) =  22.83 sec

The following tests FAILED:
	 53 - comm_np1 (Failed)
	 54 - comm_np2 (Failed)
	 55 - comm_np3 (Failed)
	 56 - comm_np4 (Failed)

And the failure log output for these tests indicates that this is expected on mac, noting

A request was made to bind a process, but at least one node does NOT
support binding processes to cpus.

Node: <node>

Open MPI uses the "hwloc" library to perform process and memory
binding. This error message means that hwloc has indicated that
processor binding support is not available on this machine.

On OS X, processor and memory binding is not available at all (i.e.,
the OS does not expose this functionality).

Given all of this, I am not sure if this is a tenable fix or whether there may be knock-on effects. I did want to put it on the EKAT team's radar, though.

@jeff-cohere

@welcome

welcome bot commented Jan 19, 2023

Thanks for opening your first issue here! Be sure to follow the issue template!

@bartgol
Contributor

bartgol commented Jan 19, 2023

Binding to cores is usually a good thing, which is why we do it by default. However, you are free to change the extra MPI args used when launching executables during tests. The CMake var EKAT_TEST_MPI_EXTRA_ARGS can be set to an empty string in your config file, so that CMake won't use the default binding options.

@bartgol
Contributor

bartgol commented Jan 19, 2023

It looks like CMake has a tool to check if a class/struct has a member. This would allow us to check that fenv_t exists, and that it contains the expected members.

Perhaps where we do

check_cxx_symbol_exists(feenableexcept "fenv.h" EKAT_HAVE_FEENABLEEXCEPT)

we should also do

check_struct_has_member(fenv_t __member "fenv.h" EKAT_FENV_HAS_MEMBER LANGUAGE CXX)

and enable FENV stuff only if EKAT_FENV_HAS_MEMBER=TRUE.
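
For reference, `check_struct_has_member` works by try-compiling a small program at configure time. The same probe can be sketched inside C++ with the detection idiom, shown here under the assumption that C++17 (`std::void_t`) is available. The trait name `has_mxcsr` and the two stand-in structs are hypothetical; they only mirror the member names from the error messages and are not the real system types.

```cpp
#include <type_traits>
#include <utility>

// Primary template: assume the member is absent.
template <typename T, typename = void>
struct has_mxcsr : std::false_type {};

// Specialization selected only when `t.__mxcsr` is a valid expression.
template <typename T>
struct has_mxcsr<T, std::void_t<decltype(std::declval<T&>().__mxcsr)>>
    : std::true_type {};

// Hypothetical stand-ins for two possible fenv_t layouts. Double-underscore
// names are reserved for the implementation; they appear here only to
// mirror the identifiers in the compiler errors.
struct X86LikeEnv    { unsigned short __control; unsigned int __mxcsr; };
struct NonX86LikeEnv { unsigned long long status; unsigned long long control_bits; };

static_assert(has_mxcsr<X86LikeEnv>::value, "member should be detected");
static_assert(!has_mxcsr<NonX86LikeEnv>::value, "member should not be detected");
```

With `<cfenv>` included, `has_mxcsr<std::fenv_t>::value` would give the per-platform answer at compile time. The CMake check is likely still the better fit here, since it lets the build system skip the FENV code entirely.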

@mahf708
Contributor

mahf708 commented Jul 25, 2024

> Thanks for opening your first issue here! Be sure to follow the issue template!

I love this hearty welcome.

Any update on this issue?

@bartgol, this is the first, not necessarily the last/only, problem that appears when trying to build pyscream on macos M? machines: https://github.com/mahf708/experimental-scream-feedstock

@mjs271
Contributor Author

mjs271 commented Jul 25, 2024

@mahf708 The short answer is, "no updates/progress from me."

Marginally expanding on that... I was successfully building EKAT on my M1 machine until earlier this year using the above hack. However, once I started work on SCREAM, I ran out of time to squash the other build issues I was running into and moved my work to a Linux box exclusively.

Once I'm through this push on mam4xx, I'd be up for comparing notes and working on getting SCREAM/pyscream building on apple silicon. I'd personally love being able to run simple test cases locally for the sake of faster/offline development.

@mahf708
Contributor

mahf708 commented Jul 25, 2024

Yeah, sounds good!

I think the best way to iterate is through non-local setups, e.g., github actions. That's how I've been building and publishing the pyscream packages for linux (two mpi impls, four python versions, for a total of 8 builds). See an example run here: https://github.com/mahf708/experimental-scream-feedstock/actions/runs/10024430474, which automatically uploads the python packages here https://anaconda.org/mahf708/pyscream. We are undertaking this python effort to make it really simple to do some basic SCREAM science testing (on the fly without compilation) in python. I left an item on the meeting notes for today's call with an example and more information on linux :)

@bartgol
Contributor

bartgol commented Jul 25, 2024

@mahf708 Have you tried implementing the suggestion in my last comment? I don't have a macos machine, so I can't test it quickly (and not quick => not doing it, in this case, sorry).

@mahf708
Contributor

mahf708 commented Jul 25, 2024

> @mahf708 Have you tried implementing the suggestion in my last comment? I don't have a macos machine, so I can't test it quickly (and not quick => not doing it, in this case, sorry).

That's precisely why I was suggesting the non-local setup :) I will set up a workflow for this on github actions on the repo linked above (with something running on macos machines) and start iterating with your fix above.

@bartgol
Contributor

bartgol commented Jul 25, 2024

Is it really a priority though?

@mahf708
Contributor

mahf708 commented Jul 25, 2024

No, linux is enough for the foreseeable future

@mahf708
Contributor

mahf708 commented Aug 18, 2024

Update:
I could get this built with either of the edits above, but only for static linking. For shared builds, I get the following error:

[ 52%] Linking CXX shared library libekat_test_main.dylib
2 warnings generated.
[ 53%] Linking CXX executable tridiag
Undefined symbols for architecture arm64:
  "ekat_finalize_test_session()", referenced from:
      _main in ekat_catch_main.cpp.o
  "ekat_initialize_test_session(int, char**, bool)", referenced from:
      _main in ekat_catch_main.cpp.o
ld: symbol(s) not found for architecture arm64
clang-16: error: linker command failed with exit code 1 (use -v to see invocation)
