Add GC heuristic #3997
base: master
Conversation
Hm, I'm wary of this heuristic because I tried something like it to solve the inverse problem and didn't get good results. Do you have any demonstrations of workloads where this implementation makes a huge difference?
My main immediate concern is that this can lead to significant slowdown if there are large trees that do not get compacted. So I feel like if we have such a heuristic, it should also take into account "did the big tree actually shrink on the last GC run" or something like that, and increase the GC period if the answer is "no".
What does this PR do on the benchmark suite? (Unfortunately we don't have a way to automatically compare two bench reports and print "things got X% slower/faster"... see #3999)
The way this works is by initially having some exponential back-off in how often the GC runs (whenever the size doubled since the last GC run). Eventually that goes to linear (whenever a tree grew by X nodes since the last GC run). The reason is that I reckon that most programs do not keep many different provenances around (for example, by storing a large array of pointers that differ only in their provenance). For the tree to not get compacted by the GC, a program needs to execute lots of retags and then keep all these pointers alive. On such workloads, the GC already does nothing. And even such programs would have to create and store a new reference every 7 basic blocks (with the current settings) to be slowed down by this patch. But note that for programs with a lot of memory use, the GC probably already needs to be tuned down for things to work. In the end it is all trade-offs. This one seems to be worth it. I am not sure the current performance test suite contains programs where this shows (one way or another), but I can add some where it does.
After some musing with @saethlin, we figured the better answer than having a GC might be to use (some form of) ref-counting, at least in theory. Unfortunately that would require a large refactoring (and past attempts at doing it have failed?).
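The exponential-to-linear back-off described here could be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code: the function name, the constant, and the exact thresholds are all hypothetical.

```rust
/// Hypothetical sketch of the back-off described above: trigger the GC
/// when the tree has doubled since the last run (exponential phase),
/// but never wait for more than a fixed absolute growth (linear phase).
/// The constant is illustrative, not the PR's actual value.
const MAX_GROWTH_BEFORE_GC: usize = 4096;

fn tree_wants_gc(nodes_now: usize, nodes_at_last_gc: usize) -> bool {
    // Exponential phase: wait until the tree doubled...
    let doubled = nodes_now >= 2 * nodes_at_last_gc;
    // ...but cap the wait: for big trees, fixed absolute growth suffices.
    let grew_enough = nodes_now >= nodes_at_last_gc + MAX_GROWTH_BEFORE_GC;
    doubled || grew_enough
}

fn main() {
    // Small tree: must double before the GC fires (exponential back-off).
    assert!(!tree_wants_gc(150, 100));
    assert!(tree_wants_gc(200, 100));
    // Large tree: doubling would take too long, linear growth fires first.
    assert!(tree_wants_gc(100_000 + 4096, 100_000));
    println!("ok");
}
```

The key property is that the trigger threshold grows with the tree only up to a point, so large trees still get compacted regularly instead of being allowed to double.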
Right, that makes sense, it just wasn't clear from the PR description. :)
That's a bit more surprising. What is the rationale for that? Did you compare the PR with and without this part? Many programs not having a ton of provenances would already be captured by the first part, no? (Also I am not sure that's true, each live reference has its own provenance and there can easily be a ton of those if someone has a large array of things that involve references.)
Yes that would be good -- preferably one benchmark demonstrating a clear benefit. (We don't want to grow our number of benchmarks too much, either...)
ProvenanceGcSettings::Regularly { interval } => this.machine.since_gc >= interval,
ProvenanceGcSettings::Heuristically =>
    // no GC run if the last one was recent
    this.machine.since_gc >= 75
75 is an oddly specific number... did you try various values? On which benchmark?
It is mostly based on vibes/previous experience with running TB on a bunch of crates. I found that usually, you find some local minimum (of the GC frequency -> performance function) around 100-500, so 75 seemed like a reasonable lower bound.
Unfortunately I don't think I can pack these all up into a nice test case. But in general, you want to put the cutoff somewhere. If you feel like a different number should be chosen, suggest a better one.
> But in general, you want to put the cutoff somewhere.
This is somewhat surprising, given that the heuristic only kicks in when a tree doubled in size. Did you try different values with the heuristic already in place? Measurements done before you added the heuristic wouldn't really be valid any more.
Also note that this code here is not TB-specific.
This looks more like "guarding against an overeager heuristic" than anything else? Given it is entirely based on vibes, I'd go for 100 and add a comment for why the number was picked. But I am not sure we want a lower bound here at all; each heuristic is really responsible for ensuring a lower bound itself since it has more data.
> This looks more like "guarding against an overeager heuristic" than anything else?
Indeed, that is also one of the points. It ensures that performance does not tank when a heuristic goes wrong. With the current setup, all the heuristics can do is request that the GC happen earlier, so some measure should be taken to ensure it does not happen too early.
A buggy heuristic making the GC fire all the time is bad even if we "cap" it at some number here this way. So this would paper over the issue and make it harder to notice, not fix it.
let last = self.nodes_at_last_gc.max(50);
// trigger the GC if the tree doubled since the last run,
// or otherwise got "significantly" larger.
// Note that for trees < 100 nodes, nothing happens.
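The excerpt cuts off before the actual comparison, but a self-contained sketch of how the `max(50)` floor interacts with a doubling rule could look like the following. The condition and names are a guess for illustration; the real code in the PR may differ.

```rust
/// Illustrative reconstruction: the floor of 50 on the baseline means a
/// tree must exceed 100 live nodes before the doubling rule can fire.
fn should_trigger(live_nodes: usize, nodes_at_last_gc: usize) -> bool {
    // Clamp the baseline so tiny trees never trigger the GC.
    let last = nodes_at_last_gc.max(50);
    // Strict comparison: a tree of exactly 100 nodes does NOT trigger,
    // which is why the review suggests "<= 100" in the comment.
    live_nodes > 2 * last
}

fn main() {
    // Trees of <= 100 nodes never trigger, regardless of the baseline.
    assert!(!should_trigger(100, 10));
    assert!(should_trigger(101, 10));
    // Larger baseline: must more than double.
    assert!(!should_trigger(400, 200));
    assert!(should_trigger(401, 200));
    println!("ok");
}
```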
Suggested change:
-// Note that for trees < 100 nodes, nothing happens.
+// Note that for trees <= 100 nodes, nothing happens.
Do we not? Changes like the one in this PR would be much easier to find/develop/test if the suite of benchmarks was more diverse and larger in general. In its current form, it is useful only to ensure there are no really major performance regressions. But that's a separate issue.
More benchmarks means they take forever to run, and at least until we figure out #3999 also forever to analyze.
☔ The latest upstream changes (presumably #4009) made this pull request unmergeable. Please resolve the merge conflicts.
I'm a bit unsure what is the best way to go forward here. I found that there are some more cases where changing the numbers here so that the GC runs less often. The more I think about it, the more I think you want some "more clever" tuning thing here that looks at Miri's current execution speed, compared to the one after the most recent GC run, and tries to gauge when it's time for the next one. Perhaps @saethlin also has something to weigh in on that.
Sorry I can't parse this, what do you mean?
Huh, I seem to have swallowed a sentence there. What I mean is that I found some programs where this change is bad because it makes the GC run too often. Instead, these programs achieve optimal performance when the "magic numbers" in this PR are moved up a bit, so that the trees need to grow more before the GC is triggered. So finding what is a good number there is nontrivial/impossible, which makes me more unhappy with the current approach.
Well that's the usual issue with heuristics...
I'm not a GC expert, so if you haven't already, it might be good to at least draw inspiration from what techniques the really good mark-and-sweep GCs use to get around this problem. I bet the severity of the problem is significantly reduced by having a generational GC (which we don't have), so it's also possible other GCs just don't suffer this as much. I think the only way out of this is to make the GC tune the frequency based on some kind of metric for how well its time is being spent: looking at things like how many of the tags were garbage on the last run, or what fraction of time is being spent in the GC itself. The lack of self-tuning is the crux of #3209.
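One form such self-tuning could take is a feedback loop on the fraction of garbage reclaimed by the last run. This is a purely illustrative sketch; none of these names, constants, or bounds exist in Miri.

```rust
/// Illustrative sketch of a self-tuning GC interval, as discussed above:
/// shrink the interval when the last run reclaimed a lot, grow it when
/// the run was mostly wasted work. All constants are hypothetical.
struct GcTuner {
    /// Basic blocks between GC runs.
    interval: u32,
}

impl GcTuner {
    /// Adjust the interval based on how many nodes the last run freed.
    fn after_gc(&mut self, nodes_before: usize, nodes_after: usize) {
        let freed = nodes_before.saturating_sub(nodes_after);
        // Fraction of the tree that turned out to be garbage.
        let garbage_ratio = freed as f64 / nodes_before.max(1) as f64;
        if garbage_ratio > 0.5 {
            // The GC was very effective: run it more often.
            self.interval = (self.interval / 2).max(100);
        } else if garbage_ratio < 0.1 {
            // Mostly wasted work: back off toward the default.
            self.interval = (self.interval * 2).min(10_000);
        }
    }
}

fn main() {
    let mut tuner = GcTuner { interval: 10_000 };
    tuner.after_gc(1_000, 200); // 80% garbage: halve the interval
    assert_eq!(tuner.interval, 5_000);
    tuner.after_gc(1_000, 990); // 1% garbage: double it again
    assert_eq!(tuner.interval, 10_000);
    println!("ok");
}
```

The upper and lower clamps play the same role as the fixed lower bound debated earlier, but the steady-state frequency is now driven by measured effectiveness rather than by magic numbers alone.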
Miri's GC runs every 10000 basic blocks, by default. It turns out that sometimes, it is better if the GC runs more often. This is especially true for some Tree Borrows programs, since in Tree Borrows, memory accesses take time linear in the size of a certain tree, and this tree can grow quite large, but is usually compacted significantly at every GC run. For some programs, it can be advantageous to run the GC every 200 blocks, or even more often.
Unfortunately, such a GC frequency would slow down other programs. To achieve a balance, this PR proposes a "heuristic" that allows running the GC more often based on how large the Tree Borrows tree grows. It is quite likely that other parts of Miri will eventually want to plug into this mechanism as well. However, such concrete other cases are left for future work, as long as the system is designed to be general enough; this PR only contributes the integration with Tree Borrows. It is also possible that the actual heuristics need more tweaks, but for now, the performance improvements look promising.
It is further likely that the overall design needs some changes, since it currently relies on `Cell`s. Also, I was not too sure where to put the new `enum`s. So this PR is WIP for now, although it is functionally complete.
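Based on the variants visible in the review comments above (`Regularly { interval }` and `Heuristically`), the new enum presumably looks roughly like the following. The exact definition is not part of this excerpt, so this is a reconstruction for illustration only.

```rust
/// Rough reconstruction of the settings enum from the variants seen in
/// the diff above; the real definition in the PR may differ.
enum ProvenanceGcSettings {
    /// Run the GC at a fixed interval of basic blocks
    /// (the old behavior, 10000 by default).
    Regularly { interval: u32 },
    /// Let per-component heuristics (currently only Tree Borrows)
    /// request earlier runs, subject to a lower bound.
    Heuristically,
}

/// Hypothetical dispatch mirroring the snippet quoted in the review.
fn wants_gc(settings: &ProvenanceGcSettings, since_gc: u32, heuristic_fired: bool) -> bool {
    match settings {
        ProvenanceGcSettings::Regularly { interval } => since_gc >= *interval,
        // The lower bound guards against an overeager heuristic.
        ProvenanceGcSettings::Heuristically => since_gc >= 75 && heuristic_fired,
    }
}

fn main() {
    let fixed = ProvenanceGcSettings::Regularly { interval: 10_000 };
    assert!(!wants_gc(&fixed, 9_999, true));
    assert!(wants_gc(&fixed, 10_000, false));
    let heur = ProvenanceGcSettings::Heuristically;
    assert!(!wants_gc(&heur, 50, true)); // last run was too recent
    assert!(wants_gc(&heur, 75, true));
    println!("ok");
}
```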