Fix reinterpret performance #28707
Glad to see this works. Can we just transparently make this the implementation of
@Keno Does this make the alignment workaround as fast as under 0.6? I needed to change this here
0dbc329 to 1fe5cbf
Ok, let's do it that way. We can always add the aligned version if we need it later.
Probably
This fixes #25014 by making it more obvious to LLVM what's going on. Instead of a memcpy loop, we use a ccall to :memcpy and turn this into llvm.memcpy at the IR level, which is enough for LLVM to fold everything away. In the benchmark from #25014, we still see some regressions from 0.6, but that is because the benchmark has to dereference through the pointers in the reinterpret and reshape wrappers. In any real code, that dereferencing is loop-invariant and should be hoisted out of the inner loop.
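As a rough illustration of the approach described above (not the code from this PR), the sketch below reads one element of an array as a different bits type by copying its bytes with a single `ccall` to `:memcpy`. The helper name `load_as` is hypothetical, and the sketch assumes both types are bits types of the same size.

```julia
# Hypothetical helper, for illustration only: read a[i] as a value of type T
# by copying its bytes with one memcpy call instead of a byte-by-byte loop.
function load_as(::Type{T}, a::Vector{S}, i::Int) where {T,S}
    # Assumption of this sketch: both types are plain bits types of equal size.
    @assert isbitstype(T) && isbitstype(S)
    @assert sizeof(T) == sizeof(S)
    @assert 1 <= i <= length(a)
    r = Ref{T}()                      # destination storage for the result
    GC.@preserve a r begin
        src = pointer(a, i)           # address of the i-th source element
        dst = Base.unsafe_convert(Ptr{T}, r)
        ccall(:memcpy, Ptr{Cvoid}, (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
              dst, src, sizeof(T))
    end
    return r[]
end

# Same bit pattern as reinterpret(UInt32, 1.0f0)
load_as(UInt32, Float32[1.0], 1) == 0x3f800000
```

Because the copy size is a compile-time constant here, a fixed-size copy like this is the kind of pattern an optimizer can reduce to a plain load; whether that happens for this particular sketch is not something the PR discussion states.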
1fe5cbf to 93164b7
This seems eligible for 1.0.1, right?
Yes, already has the backport label.
This fixes #25014 by making it more obvious to LLVM what's going on.
Instead of a memcpy loop, we use a new intrinsic that puts an actual
llvm.memcpy into the IR, which is enough for LLVM to fold everything
away. In the benchmark from #25014, we still see some regressions from
0.6, but that is because the benchmark has to dereference through the
pointers in the reinterpret and reshape wrappers. In any real code,
that dereferencing is loop-invariant and should be hoisted out of the
inner loop.
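To make the wrapper chain mentioned above concrete, here is a small sketch (not the benchmark from #25014) that stacks a reshape on top of a reinterpret and then passes the result to a function, so the wrappers are looked up once outside the loop body. The function name `xor_all` is hypothetical.

```julia
# A plain array wrapped twice: reinterpret first, then reshape.
a = collect(Float64, 1:8)
r = reshape(reinterpret(UInt64, a), 2, 4)

# Each wrapper holds a reference to the array it wraps.
parent(r) isa Base.ReinterpretArray   # the reinterpret wrapper
parent(parent(r)) === a               # the original storage

# Hypothetical kernel: taking the wrapped array as an argument gives the
# compiler one concrete type to specialize on, so loading the parent
# references is loop-invariant work.
function xor_all(r)
    s = zero(eltype(r))
    for x in r
        s = xor(s, x)
    end
    return s
end

xor_all(r) == reduce(xor, reinterpret.(UInt64, a))
```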