
Add GC heuristic #3997

Draft: wants to merge 1 commit into master

Conversation

@JoJoDeveloping (Contributor) commented Oct 27, 2024

Miri's GC runs every 10000 basic blocks by default. It turns out that sometimes it is better if the GC runs more often. This is especially true for some Tree Borrows programs: in Tree Borrows, memory accesses take time linear in the size of a certain tree, and this tree can grow quite large but is usually compacted significantly at every GC run. For some programs, it can be advantageous to run the GC every 200 blocks, or even more often.

Unfortunately, such a GC frequency would slow down other programs. To strike a balance, this PR proposes a "heuristic" that allows running the GC more often based on how large the Tree Borrows tree grows. It is quite likely that other parts of Miri will eventually want to plug into this mechanism as well; however, those concrete other cases are future work, as long as the system is designed to be general enough. This PR only contributes the integration with Tree Borrows. It is also possible that the actual heuristic needs more tweaks, but for now, the performance improvements look promising.

It is further likely that the overall design needs some changes, since it currently relies on Cells. Also, I was not too sure where to put the new enums. So this PR is WIP for now, although it is functionally complete.
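For illustration, the dispatch between a fixed interval and the heuristic mode could look roughly like the following sketch. The enum variants and the `since_gc` counter mirror the code visible in the review diff further down; the standalone `should_run_gc` function and its `heuristic_requests_gc` parameter are invented for this example.

```rust
// Illustrative sketch only; the function signature is not from the PR.
pub enum ProvenanceGcSettings {
    /// Run the GC every `interval` basic blocks (the status quo).
    Regularly { interval: u32 },
    /// Let heuristics (currently: Tree Borrows tree growth) request
    /// earlier runs, subject to a lower bound on the GC period.
    Heuristically,
}

pub fn should_run_gc(
    settings: &ProvenanceGcSettings,
    since_gc: u32,
    heuristic_requests_gc: bool,
) -> bool {
    match settings {
        ProvenanceGcSettings::Regularly { interval } => since_gc >= *interval,
        // No GC run if the last one was recent, even if a heuristic asks.
        ProvenanceGcSettings::Heuristically => since_gc >= 75 && heuristic_requests_gc,
    }
}
```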

@saethlin (Member)

Hm, I'm wary of this heuristic because I tried something like it to solve the inverse problem and didn't get good results. Do you have any demonstrations of workloads where this implementation makes a huge difference?

@RalfJung (Member)

My main immediate concern is that this can lead to significant slowdown if there are large trees that do not get compacted. So I feel like if we have such a heuristic, it should also take into account "did the big tree actually shrink on the last GC run" or something like that, and increase the GC period if the answer is "no".
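The suggested "back off when the GC was useless" policy could be sketched like this. Everything here is hypothetical: the `GcBackoff` type, its thresholds, and the caps are all invented for illustration, not code from the PR.

```rust
// Hypothetical sketch of the suggested back-off; not Miri code.
struct GcBackoff {
    /// Current GC period, in basic blocks.
    period: u32,
}

impl GcBackoff {
    /// Called after each GC run with the tree size before and after.
    fn after_gc(&mut self, nodes_before: usize, nodes_after: usize) {
        if nodes_after >= nodes_before * 9 / 10 {
            // The GC barely shrank the tree: double the period so we
            // stop paying for useless runs (capped at the default).
            self.period = (self.period * 2).min(10_000);
        } else {
            // The GC helped: move back toward the aggressive period.
            self.period = (self.period / 2).max(75);
        }
    }
}
```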

@RalfJung (Member)

What does this PR do on MIRIFLAGS=-Zmiri-tree-borrows ./miri bench? Please show numbers from before and after this change.

(Unfortunately we don't have a way to automatically compare two bench reports and print "things got X% slower/faster"... see #3999 )

@JoJoDeveloping (Contributor, author) commented Oct 28, 2024

My main immediate concern is that this can lead to significant slowdown if there are large trees that do not get compacted. So I feel like if we have such a heuristic, it should also take into account "did the big tree actually shrink on the last GC run" or something like that, and increase the GC period if the answer is "no".

The way this works is by initially having some exponential back-off on how often the GC runs (whenever the tree size doubles since the last GC run). Eventually that goes to linear (whenever a tree grew by X nodes since the last GC run). The reason is that I reckon most programs do not keep many different provenances around (e.g. by storing a large array of pointers that only differ in their provenance). For the tree to not get compacted by the GC, a program has to execute lots of retags and then keep all these pointers alive. On such workloads, the GC already does nothing. And even such programs would have to create and store a new reference every 7 basic blocks (with the current settings) to be slowed down by this patch. But note that for programs with a lot of memory use, the GC (probably) already needs to be tuned down for things to work.
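The growth trigger described here could be sketched as follows. The `.max(50)` clamp mirrors the code under review further down; the linear-growth constant is a placeholder, since the PR's exact value is not quoted in this thread.

```rust
// Sketch of the exponential-then-linear growth trigger; the linear
// threshold below is illustrative, not the PR's actual value.
fn tree_wants_gc(nodes_at_last_gc: usize, live_nodes: usize) -> bool {
    // Ignore small trees entirely: trees of <= 100 nodes never trigger.
    let last = nodes_at_last_gc.max(50);
    // Exponential phase: trigger when the tree doubled since the last run.
    let doubled = live_nodes >= 2 * last;
    // Linear phase: for large trees, waiting for a doubling takes too
    // long, so also trigger once the tree grew by a fixed node count.
    const LINEAR_GROWTH: usize = 10_000;
    let grew_linearly = live_nodes >= last + LINEAR_GROWTH;
    doubled || grew_linearly
}
```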

In the end it is all trade-offs. This one seems to be worth it. I am not sure the current performance testsuite consists of programs where this shows (one way or another) but I can add some where it does.

@JoJoDeveloping (Contributor, author) commented Oct 28, 2024

After some musing with @saethlin, we figured that a better answer than having a GC might be to use (some form of) ref-counting, at least in theory. Unfortunately that would require a large refactoring (and past attempts at doing it failed?), and so for now we seem to be stuck with the GC, for better or worse.

@RalfJung (Member) commented Oct 28, 2024

The way this works is by initially having some exponential back off with how often the GC runs (whenever the size doubles since the last GC run).

Right, that makes sense, it just wasn't clear from the PR description. :)

Eventually that goes to linear (whenever a tree grew by X nodes since the last GC run).

That's a bit more surprising. What is the rationale for that? Did you compare the PR with and without this part?

Many programs not having a ton of provenances would already be captured by the first part, no? (Also I am not sure that's true, each live reference has its own provenance and there can easily be a ton of those if someone has a large array of things that involve references.)

I am not sure the current performance testsuite consists of programs where this shows (one way or another) but I can add some where it does.

Yes that would be good -- preferably one benchmark demonstrating a clear benefit. (We don't want to grow our number of benchmarks too much, either...)

ProvenanceGcSettings::Regularly { interval } => this.machine.since_gc >= interval,
ProvenanceGcSettings::Heuristically =>
// no GC run if the last one was recently
this.machine.since_gc >= 75
Member:

75 is an oddly specific number... did you try various values? On which benchmark?

@JoJoDeveloping (Contributor, author), Oct 28, 2024:

It is mostly based on vibes/previous experience from running TB on a bunch of crates. I found that usually there is some local minimum (of the GC-frequency-to-performance function) around 100-500, so 75 seemed like a reasonable lower bound.

Unfortunately I don't think I can pack these all up into a nice test case. But in general, you want to put the cutoff somewhere. If you feel like a different number should be chosen, suggest a better one.

Member:

But in general, you want to put the cutoff somewhere.

This is somewhat surprising, given that the heuristic only kicks in when a tree doubled in size. Did you try different values with the heuristic already in place? Measurements done before you added the heuristic wouldn't really be valid any more.

Also note that this code here is not TB-specific.

This looks more like "guarding against an overeager heuristic" than anything else? Given it is entirely based on vibes, I'd go for 100 and add a comment for why the number was picked. But I am not sure we want a lower bound here at all; each heuristic is really responsible for ensuring a lower bound itself since it has more data.

Contributor (author):

This looks more like "guarding against an overeager heuristic" than anything else?

Indeed, that is also one of the points. It ensures that performance does not tank when a heuristic goes wrong. With the current setup, all the heuristics can do is request that the GC happen earlier, so some measure should be taken to ensure it does not happen too early.

Member:

A buggy heuristic making the GC fire all the time is bad even if we "cap" it at some number here this way. So this would paper over the issue and make it harder to notice, not fix it.

let last = self.nodes_at_last_gc.max(50);
// trigger the GC if the tree doubled since the last run,
// or otherwise got "significantly" larger.
// Note that for trees < 100 nodes, nothing happens.
Member:

Suggested change
// Note that for trees < 100 nodes, nothing happens.
// Note that for trees <= 100 nodes, nothing happens.

@JoJoDeveloping (Contributor, author)

We don't want to grow our number of benchmarks too much, either...

Do we not? Changes like the one in this PR would be much easier to find/develop/test if the suite of benchmarks was larger and more diverse. In its current form, it is useful only for ensuring there are no really major performance regressions. But that's a separate issue.

@RalfJung (Member)

More benchmarks means they take forever to run, and at least until we figure out #3999 also forever to analyze.

@bors (Contributor) commented Nov 2, 2024

☔ The latest upstream changes (presumably #4009) made this pull request unmergeable. Please resolve the merge conflicts.

@RalfJung RalfJung self-assigned this Nov 4, 2024
@RalfJung RalfJung added the S-waiting-on-author Status: Waiting for the PR author to address review comments label Nov 25, 2024
@JoJoDeveloping (Contributor, author)

I'm a bit unsure what is the best way to go forward here. I found that there are some more cases where changing the numbers here so that the GC runs less often.

The more I think about it, the more I think you want some "more clever" tuning here that looks at Miri's current execution speed, compares it to the speed right after the most recent GC run, and tries to gauge when it's time for the next one. Perhaps @saethlin also has thoughts to weigh in with on that.

@RalfJung (Member)

I found that there are some more cases where changing the numbers here so that the GC runs less often.

Sorry I can't parse this, what do you mean?

@JoJoDeveloping (Contributor, author)

Huh, I seem to have swallowed a sentence there. What I mean is that I found some programs where this change is bad because it makes the GC run too often. Instead, these programs achieve optimal performance when the "magic numbers" in this PR are moved up a bit, so that the trees need to grow more before the GC is triggered. So finding a good number there is nontrivial/impossible, which makes me more unhappy with the current approach.

@RalfJung (Member)

Well that's the usual issue with heuristics...

@saethlin (Member)

I'm not a GC expert, so if you haven't already, it might be good to at least draw inspiration from the techniques the really good mark-and-sweep GCs use to get around this problem. I bet the severity of the problem is significantly reduced by having a generational GC (which we don't have), so it's also possible other GCs just don't suffer from this as much.

I think the only way out of this is to make the GC tune its frequency based on some kind of metric for how well its time is being spent: things like how many of the tags were garbage on the last run, or what fraction of time is being spent in the GC itself. The lack of self-tuning is the crux of #3209.
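A time-fraction-based self-tuning rule like the one suggested here could look roughly like the following. This is entirely hypothetical: the target fraction, the adjustment factors, and the clamps are made up for the sketch.

```rust
// Hypothetical self-tuning rule: adjust the GC interval so that the
// GC consumes roughly a target fraction of total execution time.
fn tune_interval(interval: u32, gc_time: f64, total_time: f64) -> u32 {
    let frac = gc_time / total_time;
    // Aim to spend roughly 1% of total time inside the GC.
    const TARGET: f64 = 0.01;
    if frac > 2.0 * TARGET {
        // GC is eating too much time: run it half as often.
        interval.saturating_mul(2).min(100_000)
    } else if frac < TARGET / 2.0 {
        // GC is cheap: run it twice as often (down to a lower bound).
        (interval / 2).max(75)
    } else {
        // Within the acceptable band: leave the interval alone.
        interval
    }
}
```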
