
Remove coordinator and support forking #1067

Merged: 59 commits merged into mmtk:master from feature/fork on Apr 9, 2024

Conversation

@wks (Collaborator) commented Jan 12, 2024

This PR removes the coordinator (a.k.a. controller) thread, and allows temporarily terminating and restarting all GC worker threads in order to support forking.

Major changes include:

  • GCController is removed. All synchronization mechanisms involving the controller are removed, too. Important synchronization operations, such as opening buckets and declaring GC finished, are done by the last parked worker. The work packet EndOfGC is removed, and its job is now done by the last parked worker.
  • The WorkerMonitor, which previously synchronizes between GC workers, now also synchronizes between mutators and GC workers. This allows mutators to trigger GC by directly communicating with GC workers.
  • Introduced a new mechanism: "goals". Mutators can now request "goals", and GC workers will work on one goal at a time. Currently, a "goal" can be either GC or StopForFork. Triggering GC is now implemented as requesting the GC goal.
  • Added a pair of new APIs, namely MMTK::prepare_to_fork() and MMTK::after_fork(). VM bindings call them before and after forking so that the MMTK instance can prepare properly for the fork (see the sketch below).
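
As a rough illustration (a sketch only, not code from this PR's diff), a binding's fork path could look like the following. Here mmtk_instance() and current_vm_thread() are hypothetical binding-side helpers, and the assumption that after_fork takes the current thread's handle should be checked against the actual mmtk-core API.

    // Sketch only: `mmtk_instance()` and `current_vm_thread()` are hypothetical
    // binding-side helpers; consult the mmtk-core docs for the exact signatures.
    fn fork_runtime() -> libc::pid_t {
        // Stop all GC worker threads so that no live native threads
        // cross the fork() boundary.
        mmtk_instance().prepare_to_fork();

        let pid = unsafe { libc::fork() };

        // In both the parent and the child, restart the GC worker threads.
        mmtk_instance().after_fork(current_vm_thread());

        pid
    }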

Fixes: #1053
Fixes: #1054

@wks changed the title from "Support forking" to "Remove coordinator and support forking" on Jan 31, 2024
The change is unnecessary because GC workers now use the active goal to determine what to do when the last worker has parked.
@wks (Collaborator, Author) commented Apr 3, 2024

I have gone through my changes and updated stale comments. I also added some unit tests. I think I have addressed all comments except (1) some assertions about initialize_collection and allocation (#1067 (comment)) and (2) creating GCWorkerShared later (#1067 (comment)). We can do the refactoring after this PR.

I am running the performance evaluation again on vole.moma to check whether any recent changes introduced a performance regression. The results should come out this evening.

@k-sareen (Collaborator) left a comment

LGTM, thanks! The only thing I'm concerned about is how the additional complexity in the scheduler affects @wenyuzhao's work packet/scheduler redesign. I presume PRs for OpenJDK and the other bindings to remove the coordinator thread will follow? Also, could you fix the typos I highlighted in a previous review?

@wks (Collaborator, Author) commented Apr 4, 2024

The results: on vole.moma, lusearch from DaCapo Chopin, 40 invocations, 5 iterations each, 3× the minimum heap size w.r.t. OpenJDK G1, using 5 different plans, comparing the master branch against this PR. Reference processing is off. (I thought I had accidentally turned it on, but I had not.)

[figure: total-time comparison of master vs. this PR across the five plans on vole.moma]

Plotty: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2024-04-03-Wed-070951&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&mmtk_gc&build;build1|40&Histogram%20(with%20CI)^build^mmtk_gc&

The result shows an average 2% reduction in total time across the five plans, with GenImmix showing a 4% reduction and SemiSpace a 0.5% reduction. The results in #1062 showed only a moderate 0.5% improvement on average. I can't explain this difference yet, but I'll verify the result on a different machine.

@wks (Collaborator, Author) commented Apr 4, 2024

Here are the results from two other machines. They use the same binary and running script as vole.moma, except that I slightly modified the running script to use taskset on lynx.moma.

lynx.moma (using taskset -c 0-15 to limit execution to the big cores only): (plotty)

[figure: lynx.moma results]

fisher.moma: (plotty)

[figure: fisher.moma results]

These results are similar: the reduction in total time is noticeable for the generational plans but insignificant for the non-generational ones. The total-time improvement on lynx and fisher is not as large as on vole.moma.

@wks (Collaborator, Author) commented Apr 4, 2024

The extended-tests-openjdk CI test failed because it hung while triggering the first GC when running antlr from DaCapo 2006 with MarkSweep. However, I can't reproduce this failure locally.

JikesRVM failed with "Failing instruction starting at f797b0a1 wasn't in RVM address space".

I restarted those tests.

@wks (Collaborator, Author) commented Apr 5, 2024

The OpenJDK binding test failed again because it got stuck while running MarkSweep. I managed to reproduce the failure locally. In mmtk::policy::marksweepspace::native_ms::block::Block::simple_sweep, the cell_size variable can sometimes be 0, causing the while loop below to run forever.

        let cell_size = self.load_block_cell_size(); // This sometimes returns 0.
        let mut cell = self.start();
        // ...
        while cell + cell_size <= self.start() + Block::BYTES {
            // ...
            cell += cell_size; // If `cell_size` is 0, this loop will go on forever.
        }
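
A defensive check (a hypothetical sketch, not the fix adopted upstream) would make the broken invariant fail fast instead of hanging:

        // Hypothetical guard; the real root cause is tracked separately.
        let cell_size = self.load_block_cell_size();
        debug_assert_ne!(cell_size, 0, "cell size must be non-zero when sweeping");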

It is likely an unrelated bug, and I'll report it in a separate issue and try to find the cause later.

Update: It is reproducible on the master branch, so the bug already existed before this PR.

@wks wks added this pull request to the merge queue Apr 8, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 8, 2024
@wks wks added this pull request to the merge queue Apr 9, 2024
@wks wks removed this pull request from the merge queue due to a manual request Apr 9, 2024
@wks wks added this pull request to the merge queue Apr 9, 2024
Merged via the queue into mmtk:master with commit 5ab62f9 Apr 9, 2024
23 of 24 checks passed
@wks wks deleted the feature/fork branch April 9, 2024 03:36
mmtkgc-bot added a commit to mmtk/mmtk-v8 that referenced this pull request Apr 9, 2024
Upstream PR: mmtk/mmtk-core#1067

---------

Co-authored-by: mmtkgc-bot <[email protected]>
mmtkgc-bot added a commit to mmtk/mmtk-openjdk that referenced this pull request Apr 9, 2024
Upstream PR: mmtk/mmtk-core#1067

---------

Co-authored-by: mmtkgc-bot <[email protected]>
mmtkgc-bot added a commit to mmtk/mmtk-jikesrvm that referenced this pull request Apr 9, 2024
Upstream PR: mmtk/mmtk-core#1067

---------

Co-authored-by: mmtkgc-bot <[email protected]>
mmtkgc-bot added a commit to mmtk/mmtk-julia that referenced this pull request Apr 9, 2024
Upstream PR: mmtk/mmtk-core#1067

---------

Co-authored-by: mmtkgc-bot <[email protected]>
qinsoon pushed a commit to qinsoon/mmtk-julia that referenced this pull request Apr 24, 2024
Upstream PR: mmtk/mmtk-core#1067

---------

Co-authored-by: mmtkgc-bot <[email protected]>
k-sareen added a commit to k-sareen/art that referenced this pull request Jul 16, 2024
MMTk PR: mmtk/mmtk-core#1067

Change-Id: I316728aeefd0127b5d1fc1d85e464062cc629aa8
k-sareen added a commit to k-sareen/mmtk-art that referenced this pull request Jul 17, 2024
Labels
PR-extended-testing: Run extended tests for the pull request

Development

Successfully merging this pull request may close these issues:

  • Supporting the fork() system call
  • Remove the coordinator (controller) thread

4 participants