Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add Apple Silicon support to Mesh #104

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

camelid
Copy link
Contributor

@camelid camelid commented Nov 19, 2023

Apple Silicon uses 16K pages, while most OSes use 4K. Mesh makes several assumptions of a 4K page size for performance reasons, so it is nontrivial to add support for 16K pages.

@camelid
Copy link
Contributor Author

camelid commented Nov 19, 2023

This is a work-in-progress since the tests are still segfaulting. They segfault while gmock is printing a message to stdout; specifically while stdout is about to be flushed.

@bpowers
Copy link
Member

bpowers commented Apr 6, 2024

@camelid sorry for letting this linger for so long - for me tests are failing too, but they're hitting assertions. I've done a few things, so it make sense to see if our setups are the same.

First, I'm running on macOS Sonoma 14.3.1 (the latest), and I've just run brew update; brew upgrade. I've checked Heap-Layers out to your arm64 branch, but I had to apply the following 1-line change:

diff --git a/wrappers/mmapwrapper.h b/wrappers/mmapwrapper.h
index 0226c9f..2fc24ee 100644
--- a/wrappers/mmapwrapper.h
+++ b/wrappers/mmapwrapper.h
@@ -185,7 +185,7 @@ namespace HL {
       #ifdef HL_APPLE_SILICON
         #include <mach/vm_page_size.h>
         assert(sz % vm_page_size == 0);
-        assert((ptr % vm_page_size) == 0);
+        assert((reinterpret_cast<uintptr_t>(ptr) % vm_page_size) == 0);
       #endif

       if (ptr == MAP_FAILED) {

to work past a compilation failure.

Next, I had to update bazel to work with Python 3.12 locally. I fixed that in: #107 -- since you also have changes to bazel, I think it would make sense to rebase this branch on master.

Ok, so at this point I could run:

$ make clean; make

which fails with the following test failure:

==================== Test output for //src:unit-tests:
Running main() from gmock_main.cc
[==========] Running 23 tests from 7 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from Alignment
[ RUN      ] Alignment.NaturalAlignment
src/mini_heap.h:65:mesh::Flags::Flags(uint32_t, uint32_t, uint32_t, uint32_t): ASSERTION 'this->maxCount() == maxCount' FAILED:
================================================================================

This is due to us increasing the number of bits needed for maxCount, but not changing the bitmask on the getter, fixed with:

diff --git a/src/mini_heap.h b/src/mini_heap.h
index 14e38c8..9512aea 100644
--- a/src/mini_heap.h
+++ b/src/mini_heap.h
@@ -62,7 +62,7 @@ public:
     d_assert(svOffset < (kPageSize / kMinObjectSize - 1));
     d_assert_msg(sizeClass < 255, "sizeClass: %u", sizeClass);
     d_assert(maxCount <= (kPageSize / kMinObjectSize));
-    d_assert(this->maxCount() == maxCount);
+    d_assert_msg(this->maxCount() == maxCount, "maxCount() (%u) != maxCount (%u)", this->maxCount(), maxCount);
   }

   inline uint32_t freelistId() const {
@@ -79,7 +79,7 @@ public:

   inline uint32_t maxCount() const {
     // XXX: does this assume little endian?
-    return (_flags.load(std::memory_order_seq_cst) >> MaxCountShift) & 0x1ff;
+    return (_flags.load(std::memory_order_seq_cst) >> MaxCountShift) & 0x7ff;
   }

   inline uint32_t sizeClass() const {

With this fix applied, I then get:

==================== Test output for //src:unit-tests:
Running main() from gmock_main.cc
[==========] Running 23 tests from 7 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from Alignment
[ RUN      ] Alignment.NaturalAlignment
src/shuffle_vector.h:234:void mesh::ShuffleVector::reinit(): ASSERTION 'mh->isAttached()' FAILED:
================================================================================

Which is fixed with: #108 (which I merged to master). Boy this was a hard/frustrating one to find! I made a lot of use of mh->dumpDebug(), and eventually came across it by noticing that if I call dumpDebug() here:
https://github.com/plasma-umass/Mesh/blob/master/src/global_heap.h#L318

everything looks good, but immediately after that while loop if I add a miniheaps[0]->dumpDebug() the MiniHeap's data has changed, and the only thing that could reasonably impact that is allocating a new MiniHeap via allocMiniheapLocked (when finally looking at meshable_areana.h the bare 64 we were using for the MiniHeap's size really stood out). You'll also have to rebase onto master to get this (and then update/change the definition of kMiniHeapSize to 160).

At this point I see 2 failures, depending on whether I run the unit test binary under LLDB or not.

First, not under LLDB I see:

Running main() from gmock_main.cc
[==========] Running 23 tests from 7 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from Alignment
[ RUN      ] Alignment.NaturalAlignment
[       OK ] Alignment.NaturalAlignment (1728 ms)
[ RUN      ] Alignment.NonOverlapping
[       OK ] Alignment.NonOverlapping (0 ms)
[----------] 2 tests from Alignment (1728 ms total)

[----------] 11 tests from BitmapTest
[ RUN      ] BitmapTest.RepresentationSize
[       OK ] BitmapTest.RepresentationSize (0 ms)
[ RUN      ] BitmapTest.LowestSetBitAt
[       OK ] BitmapTest.LowestSetBitAt (0 ms)
[ RUN      ] BitmapTest.HighestSetBitAt
[       OK ] BitmapTest.HighestSetBitAt (0 ms)
[ RUN      ] BitmapTest.SetAndExchangeAll
[       OK ] BitmapTest.SetAndExchangeAll (0 ms)
[ RUN      ] BitmapTest.SetAll
[       OK ] BitmapTest.SetAll (0 ms)
[ RUN      ] BitmapTest.SetGet
[       OK ] BitmapTest.SetGet (192 ms)
[ RUN      ] BitmapTest.SetGetRelaxed
unit-tests(7311,0x1e0fc5c40) malloc: Incorrect checksum for freed object 0x13880a800: probably modified after being freed.
Corrupt value: 0x280ca6000
unit-tests(7311,0x1e0fc5c40) malloc: *** set a breakpoint in malloc_error_break to debug

Which is baffling, as all that does is call calloc and free in a loop, and eyeballing it it sure doesn't seem like we could be doing anything out-of-bounds there. Running under LLDB doesn't hit that problem - I sort of assume because Apple turns off (?) some malloc checks in debugging mode, but IDK. I think its possibly a real thing to fix, and I'd be tempted to try to add -fsanitize=address to our unit test compilation/link flags to see if it catches anything.

Finally, there is also:

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SizeClass
[ RUN      ] SizeClass.MinObjectSize
src/testing/unit/size_class_test.cc:22: Failure
Expected equality of these values:
  alignof(max_align_t)
    Which is: 8
  kMinObjectSize
    Which is: 16
[  FAILED  ] SizeClass.MinObjectSize (0 ms)
[----------] 1 test from SizeClass (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SizeClass.MinObjectSize

 1 FAILED TEST

I think this test case just needs to be updated - I don't think those strictly have to be equal (but I could have forgotten some context).

Hope this helps! The meat and potatoes tests are currently skipped because you added BAZEL_CONFIG = --config=disable-meshing to the makefile:

[----------] 2 tests from ConcurrentMeshTest
[ RUN      ] ConcurrentMeshTest.TryMesh
src/testing/unit/concurrent_mesh_test.cc:67: Skipped

If I remove that flag, and change:

+++ b/src/testing/unit/concurrent_mesh_test.cc
@@ -20,7 +20,7 @@ using namespace std;
 using namespace mesh;

 static constexpr uint32_t StrLen = 128;
-static constexpr uint32_t ObjCount = 32;
+static constexpr uint32_t ObjCount = 128;

and also slightly change the segfault handler:

diff --git a/src/runtime.cc b/src/runtime.cc
index 528604c..adfdc61 100644
--- a/src/runtime.cc
+++ b/src/runtime.cc
@@ -358,7 +358,7 @@ void Runtime::segfaultHandler(int sig, siginfo_t *siginfo, void *context) {

   // okToProceed is a barrier that ensures any in-progress meshing has
   // completed, and the reason for the fault was 'just' a meshing
-  if (siginfo->si_code == SEGV_ACCERR && runtime().heap().okToProceed(siginfo->si_addr)) {
+  if ((siginfo->si_code == SEGV_ACCERR || siginfo->si_code == SEGV_MAPERR)  && runtime().heap().okToProceed(siginfo->si_addr)) {
     // debug("TODO: trapped access violation from meshing, log stat\n");
     return;
   }

The concurrent mesh tests pass!

[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from ConcurrentMeshTest
[ RUN      ] ConcurrentMeshTest.TryMesh
[       OK ] ConcurrentMeshTest.TryMesh (3 ms)
[ RUN      ] ConcurrentMeshTest.TryMeshInverse
[       OK ] ConcurrentMeshTest.TryMeshInverse (3 ms)
[----------] 2 tests from ConcurrentMeshTest (6 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (6 ms total)
[  PASSED  ] 2 tests.

So I think we're close!

@camelid
Copy link
Contributor Author

camelid commented May 2, 2024

Thanks for the detailed reply! And apologies for my delayed response. I've applied the fixes you mentioned (including rebasing on master), but my tests are still failing, and indeed with no test output at all:

make test
INFO: Analyzed target //src:unit-tests (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
FAIL: //src:unit-tests (see /private/var/tmp/_bazel_noahlev/059a892b8842c15c2c7b084bc67a8744/execroot/org_libmesh/bazel-out/darwin_arm64-fastbuild/testlogs/src/unit-tests/test.log)
INFO: From Testing //src:unit-tests:
==================== Test output for //src:unit-tests:
================================================================================
Target //src:unit-tests up-to-date:
  bazel-bin/src/unit-tests
INFO: Elapsed time: 0.149s, Critical Path: 0.05s
INFO: 2 processes: 2 local.
INFO: Build completed, 1 test FAILED, 2 total actions
//src:unit-tests                                                         FAILED in 0.0s
  /private/var/tmp/_bazel_noahlev/059a892b8842c15c2c7b084bc67a8744/execroot/org_libmesh/bazel-out/darwin_arm64-fastbuild/testlogs/src/unit-tests/test.log

INFO: Build completed, 1 test FAILED, 2 total actions
make: *** [test] Error 3

Any idea what's causing this lack of output and the discrepancy between our test results?

@camelid
Copy link
Contributor Author

camelid commented May 2, 2024

First, I'm running on macOS Sonoma 14.3.1 (the latest), and I've just run brew update; brew upgrade.

Oh, I forgot to mention that I'm on Sonoma 14.1 (slightly older), and I've also updated all my homebrew packages.

@camelid
Copy link
Contributor Author

camelid commented May 2, 2024

Weirder still, if I cd into build and then make, I get a static assertion error regarding a heap-layers object (only including the last bit of output). Were you using my branch of heap-layers for your testing (emeryberger/Heap-Layers#54)? Either way, it's weird that running make from build has a different compile result.

In file included from /Users/noahlev/code/plasma/mesh/src/global_heap.cc:8:
/Users/noahlev/code/plasma/mesh/src/./global_heap.h:75:3: error: static assertion failed due to requirement 'HL::gcd<4096, 16384>::value == Alignment': expected MmapHeap to have 16-byte alignment
  static_assert(HL::gcd<MmapHeap::Alignment, Alignment>::value == Alignment,
  ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/noahlev/code/plasma/mesh/src/./global_heap.h:75:64: note: expression evaluates to '4096 == 16384'
  static_assert(HL::gcd<MmapHeap::Alignment, Alignment>::value == Alignment,
                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
/Users/noahlev/code/plasma/mesh/src/./global_heap.h:108:68: warning: unused para
meter 'sizeClass' [-Wunused-parameter]
  inline MiniHeap *ATTRIBUTE_ALWAYS_INLINE allocMiniheapLocked(int sizeClass, size_t pageCount, size_t objectCount,
                                                                   ^
/Users/noahlev/code/plasma/mesh/src/global_heap.cc:439:44: warning: unused parameter 'beDetailed' [-Wunused-parameter]
void GlobalHeap::dumpStats(int level, bool beDetailed) const {
                                           ^
10 warnings and 1 error generated.
make[2]: *** [src/CMakeFiles/mesh.dir/global_heap.cc.o] Error 1
make[1]: *** [src/CMakeFiles/mesh.dir/all] Error 2
make: *** [all] Error 2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants