cranelift: Delete scalar {u,s}loadNN and storeNN instructions #8785
Conversation
Instead of using the `istore8` special store opcodes, perform an `ireduce` followed by a `store`.
These instructions are essentially a combination of `ireduce` + `store`, so remove the special store instructions and start recognizing that idiom in the backends. Since `ireduce` is a no-op from the backends' point of view, we can simply delete those lowering rules: they don't add any novel instructions beyond separately lowering both `ireduce` and `store`. There are a lot of s390x tests deleted here; I've checked that equivalent `store`-based tests exist for them, so there's no point in keeping them around.
These are going to be removed in follow-up commits.
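As a rough sketch of what the store side of the change looks like in CLIF (the value names and the `i32`/`i8` types here are just for illustration, not taken from the PR):

```clif
;; before: the fused narrow-store instruction
istore8 v1, v0        ;; store the low 8 bits of the i32 value v1 to [v0]

;; after: an explicit ireduce followed by a plain store; the backends are
;; taught to recognize this pair and emit the same byte store as before
v2 = ireduce.i8 v1    ;; type-only narrowing, a no-op for the backends
store v2, v0
```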
For context, I brought this up in the Cranelift meeting today as I was curious about the genesis of these instructions. My assumption was that they were added on the theory that the load+extend or ireduce+store pattern would be so common that having fused ops in Cranelift would reduce the size of the resident IR during compilation and perhaps have other memory/compile-time benefits. The conclusion, though, was that while this was probably the intended purpose of the instructions, no one was aware of any benchmarking one way or the other to show the impact. I think it'd be worthwhile to put this through Sightglass to make sure there's not, for example, a 10% slowdown compiling SpiderMonkey, but otherwise I think we're all, in the abstract, for cleanups that simplify the IR.
Ran some quick benchmarks on this; it doesn't seem to make much of a difference, except for SpiderMonkey, where there is a slight compile-time regression. I should also note that this version suffers from the issue described in #8787, but it looks like that doesn't affect larger programs too much? It does affect the test cases, which slightly bothers me.
Sightglass results
Thanks for running those benchmarks! IMHO, a 1-3% compile-time regression is unfortunately a bit too significant to take for a pure "cleanliness win" change (others may disagree, happy to discuss; in the CL meeting this morning I gave a clean 1% as an example of a number I'd personally be fine with and 10% as an example that clearly isn't; 3% is somewhere in the middle). I wonder if it gets any better if we handle merging better, along the lines of #8787 -- the compile slowdown could arise from larger VCode on average rather than larger CLIF. Worth trying to resolve that first and then coming back to benchmark this again, perhaps?
Yeah, I don't think it's worth it if we have this regression.
Maybe; I'm not too familiar with how we do elaboration on egraphs, so it might be slightly harder for me to pick that up. Reading #6154 got me interested in the jump-threading pass idea, even though I think moving the extends next to the loads would be more effective in this case. I probably won't have time to look at it now, but maybe later.
By the way, I was doing some unrelated benchmarking and found out that Sightglass is not consistent enough on my machine to give me confidence that the result above is real or meaningful. I'm going to try to run the benchmarks again on a more stable setup when I get some time, although I suspect it's probably going to amount to the same.
@afonso360 you could try measuring instructions retired instead of cycles; pass something like …
I re-ran the benchmarks measuring instructions retired, as suggested. However, the results are now completely different from the results above.
Sightglass Results
Weirdly, there is now no longer any compilation difference, but there are execution differences. I'm going to close this for now; however, I'm planning on redoing these benchmarks when we have a jump-threading pass, which I'm currently working on (slowly).
👋 Hey,
This is a WIP PR to delete the scalar load+extend instructions. I'm still working on cleaning up the s390x backend, but I saw there was some discussion in today's meeting regarding this. (Unfortunately those meetings now clash with other commitments and I'm unable to attend.)
I haven't done any benchmarking, and I just noticed there are some cases where we are not correctly fusing the load+extend. Once that is working correctly, I'll try to run Sightglass on it.
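For reference, the load+extend fusing mentioned here refers to the backends recognizing a pattern along these lines and lowering it to a single extending load; a CLIF sketch with made-up value names and types:

```clif
;; replacement for the removed uload8: a plain byte load plus an explicit extend
v1 = load.i8 v0        ;; load 8 bits from [v0]
v2 = uextend.i32 v1    ;; zero-extend to i32; ideally merged with the load so
                       ;; the backend emits one zero-extending load instruction
```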
CC: #6056