-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wasmtime: Remove ALL constant phis #8565
base: main
Are you sure you want to change the base?
Conversation
When we're trying to delete block-params that can be replaced by a single dominating value, we weren't handling some cases. In particular, if we concluded that a block formal parameter (say, `v3`) had more than one value, then we believed that any uses of that parameter (say, defining another formal parameter `v4`) also had more than one value, and therefore could not be deleted. However, in such cases we can try using `v3` itself as the definition of `v4`. If there are no other definitions of `v4`, then it can be deleted. With this change, if a block has only one predecessor, it is now true that this pass will delete all of its block params. We couldn't rely on that property before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, thanks! Intuitively it seems the difference to me is the two-stage meet of sorts -- we get an actual Value
(replacement
) then test that against the abstract value and handle the case where it's actually the same as the existing One
; is that where these two diverged?
I'm not sure if there's a comment of some sort that would help with this subtlety (I also didn't write the original and haven't looked at this code in a while so I may have misunderstood some of the original thinking); but it's already quite well commented, so I'll leave it up to you if you can think of any further explanation-comments to add or not. In any case, thanks!
@jameysharp would you mind also doing a Sightglass run on this? It should, from first principles, mainly result in speedups; but I'm getting some weird results when trying it on some weval'd code (-5% deltas) and I'm wondering if it has some unfortunate interactions in my case with regalloc (longer liveranges) or if it's something else. In any case having a baseline measurement on "normal" code (not my horrific spaghetti) would be useful I suspect :-) |
I'm not sure I understand the way you phrased this question, so I'm going to try answering a different question and see if that helps. If we decide that a formal block parameter can be replaced by another value, then we want to forward that value in the dataflow analysis if that formal parameter is used as the actual parameter for another branch. I'm thinking of it like continuing the analysis as if one rewrite had already happened, to see if that makes more rewrites visible. The existing implementation did that, except it forwarded any abstract value from the formal parameter, including the As a result it turns out that the only abstract value we can get from that check is the By the way, one thing I worried about is whether this dataflow transfer function is monotonic, and I had trouble convincing myself. Since this is a fix-point, it can optimistically assume that a rewrite will succeed until it finds evidence to the contrary, at which point that evidence propagates. But that means we can have a formal parameter which we assigned an abstract value of Now that I've written this out I remember the answer is trivially "no": we join the previous abstract value with the new one, and
Sightglass on the default benchmark set says there is no significant difference in:
During execution, by instructions retired, this PR is very slightly worse on all three benchmarks. By CPU cycles, bz2 is faster with this PR and the others are a wash. Now that I know this, I'm not sure what to do with it.
|
OK, cool; that's what I had understood, I was mostly just wondering if there was some sort of doc-comment we could add to clarify. I think it's probably fine as-is. Re: termination and monotonicity, I agree. The stickiness of values and using the meet-function of a finite-height lattice alone is enough to ensure termination, but a unique solution (i.e. ensuring that the result is not dependent on processing order or whatnot) requires monotonicity of the transfer functions too (i.e. Re: performance: the mixed results are a little concerning combined with my observations from earlier; maybe this is something we could dig a bit further into? I agree the change seems very nice; but if in practice it does bad things to regalloc today then perhaps we can keep it around 'til those issues are resolved or we understand the conditions better... |
Oh right, uniqueness of the solution would be nice too. Yeah, this is what I had trouble reasoning about. Let's say we see that Now later let's say we discover that I think there's something here about the fact that we do this analysis pass in reverse post-order that helps, maybe only for reducible flow graphs. But I also wonder if this needs a different lattice, distinguishing between "the one value coming from a blockparam" versus "the one value coming from a concrete source such as an instruction result".
Let me make the performance question more complicated 😅 I discovered this missed-optimization bug because I was trying to address #7639, which was prompted by work Amanieu wanted to do in bytecodealliance/regalloc2#170. This PR establishes the guarantee that any block which has only one predecessor has no incoming block-params, and that makes it easier to avoid having any block-params on branches that have multiple outgoing destinations, since then only the critical edges have block-params and you can move them into the block that's created for splitting the critical edge. I've done a Sightglass run comparing this PR against the local branch I have for eliminating block-params from branches with multiple destinations; that branch is based on this one. Measured by instructions retired during execution, all the usual benchmarks are a little faster on that branch. Spidermonkey is also slightly faster by CPU cycles on that branch.
I'll do another Sightglass run comparing that branch to |
Okay, I've compared Overall effect size is small in all cases. The biggest percentage change is for pulldown-cmark, where the combination is maybe 1% slower during execution by either instructions retired or CPU cycles. No significant difference in compilation. pulldown-cmark Sightglass results
Spidermonkey is marginally faster during both compilation and execution, by instructions retired; no significant difference in CPU cycles. spidermonkey Sightglass results
And bz2 is slower by CPU cycles but faster by instructions retired, during execution. No significant difference during compilation. bz2 Sightglass results
|
Interesting; given the choice between judging performance by instructions-retired and judging performance by cycles, let's take the latter since it's what the user sees; so we have slower (pulldown-cmark), no change (spidermonkey), slower (bz2). Unfortunately it seems there's more to understand here. I share your (assumed) frustration that a clear and obvious optimization doesn't result in a speedup -- in fact I also implemented redundant-phi elimination in waffle yesterday just to see if it changed with wasm-producer-side vs native-side optimizations and got similar slowdowns. I suspect it means we have work to do in regalloc :-/ |
When we're trying to delete block-params that can be replaced by a single dominating value, we weren't handling some cases.
In particular, if we concluded that a block formal parameter (say,
v3
) had more than one value, then we believed that any uses of that parameter (say, defining another formal parameterv4
) also had more than one value, and therefore could not be deleted.However, in such cases we can try using
v3
itself as the definition ofv4
. If there are no other definitions ofv4
, then it can be deleted.With this change, if a block has only one predecessor, it is now true that this pass will delete all of its block params. We couldn't rely on that property before.