[CPU] Slow copy coming from tensor.insert_slice with dynamic dims #15195
Comments
and try. That should fix it.
This gives us:

This is essentially what we want to be able to fuse now:

So either we allow these to be in the same dispatch, or we just add
If we generalize the batch matmul that consumes the
Is there a specific reason this wouldn't be folded away? To me, it just looks like a copy with some weird indexing maps (in this case

On another note, assuming we can fold away the above dispatch and we don't materialize the concat result: generalizing the batch matmul causes about a 10x slowdown on the batch matmul itself, based on the profile. The materialization of the concat result is the bigger hit to performance for now, but this batch matmul will ultimately slow down the model quite a bit too, even if the overall performance is better when we generalize it.
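For illustration, here is a minimal sketch (with made-up shapes and maps, not taken from the actual dispatch) of what such a "copy with weird indexing maps" can look like as a linalg.generic: the body only yields the input element, and all of the work is in the permuted input map.

```mlir
// Hypothetical example: a linalg.generic that is a pure copy, but routed
// through a permuted (transposing) input indexing map instead of identities.
#map_in  = affine_map<(d0, d1, d2) -> (d1, d0, d2)>
#map_out = affine_map<(d0, d1, d2) -> (d0, d1, d2)>

func.func @copy_with_permuted_maps(%src: tensor<4x?x32xf32>,
                                   %dst: tensor<?x4x32xf32>) -> tensor<?x4x32xf32> {
  %0 = linalg.generic {
      indexing_maps = [#map_in, #map_out],
      iterator_types = ["parallel", "parallel", "parallel"]}
      ins(%src : tensor<4x?x32xf32>)
      outs(%dst : tensor<?x4x32xf32>) {
  ^bb0(%in: f32, %out: f32):
    // No computation: every element is just forwarded to the output.
    linalg.yield %in : f32
  } -> tensor<?x4x32xf32>
  return %0 : tensor<?x4x32xf32>
}
```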
I've reduced the problem to a sequence of 2 transpose operations that are inverses of each other:
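A minimal sketch of that kind of reproducer, with assumed shapes (the actual reduced IR is not reproduced here): two linalg.transpose ops whose permutations compose to the identity, so the pair should fold to a no-op.

```mlir
// Hypothetical reduced form: two transposes that are inverses of each other.
// permutation = [1, 0, 2] is its own inverse, so the pair cancels out.
func.func @transpose_roundtrip(%arg0: tensor<2x?x32xf32>) -> tensor<2x?x32xf32> {
  %c1 = arith.constant 1 : index
  %d1 = tensor.dim %arg0, %c1 : tensor<2x?x32xf32>
  %init0 = tensor.empty(%d1) : tensor<?x2x32xf32>
  %t0 = linalg.transpose ins(%arg0 : tensor<2x?x32xf32>)
                         outs(%init0 : tensor<?x2x32xf32>)
                         permutation = [1, 0, 2]
  %init1 = tensor.empty(%d1) : tensor<2x?x32xf32>
  %t1 = linalg.transpose ins(%t0 : tensor<?x2x32xf32>)
                         outs(%init1 : tensor<2x?x32xf32>)
                         permutation = [1, 0, 2]
  return %t1 : tensor<2x?x32xf32>
}
```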
I figure we want to simply be folding this away, but instead
Here is the dump compiled to flow for reference:
huh, this should just be folded away
https://github.com/llvm/llvm-project/blob/fb5047f5244d81aa89f68210a9cd34ddddcc8af4/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp#L1081 should be the pattern that folds this away. There might be a bug in it that explains why it is missing this case.
After fixing the above bug, the generalized batch matmul is turning out to be quite expensive. I have uploaded IR dumps and objdumps of a simple example of the transposed batch matmul in question here:

There are 2 versions: one with the additional experimental optimizations related to quantized vecmats enabled, and one without. It is possible that these optimizations are impacting performance, so I uploaded both. Here is the input IR as well:
Edit: compile commands

default:

quantized matmul changes:
I found why this is slow. The quantized matmul reassociation changes enable split reduction on ops with 2 inputs, so the generalized batch matmul with a dynamic reduction dim was going through the

The overall latency of the model is marginally better now, shaving off ~5-10ms for longer context lengths (original latency of 90+ ms). We could probably get a bit better performance if we also add splitReduction support for dynamic dims and let both generalized batch matmuls go through splitReduction.
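For reference, here is a rough sketch (shapes and maps are assumptions, not the actual op) of a generalized, transposed batch matmul whose reduction dimension d3 is dynamic, i.e. the kind of op that currently cannot take the splitReduction path:

```mlir
// Hypothetical generalized batch matmul with a dynamic reduction dimension (d3).
// Split reduction bails on the dynamic reduction size, so this op stays as-is.
#mapA = affine_map<(d0, d1, d2, d3) -> (d0, d3, d1)>   // transposed LHS access
#mapB = affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>
#mapC = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>

func.func @generalized_bmm(%lhs: tensor<4x?x64xf32>, %rhs: tensor<4x?x128xf32>,
                           %acc: tensor<4x64x128xf32>) -> tensor<4x64x128xf32> {
  %0 = linalg.generic {
      indexing_maps = [#mapA, #mapB, #mapC],
      iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
      ins(%lhs, %rhs : tensor<4x?x64xf32>, tensor<4x?x128xf32>)
      outs(%acc : tensor<4x64x128xf32>) {
  ^bb0(%a: f32, %b: f32, %c: f32):
    %mul = arith.mulf %a, %b : f32
    %add = arith.addf %c, %mul : f32
    linalg.yield %add : f32
  } -> tensor<4x64x128xf32>
  return %0 : tensor<4x64x128xf32>
}
```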
What happened?
While burning down performance on llama2 for CPU, I ran into a slow copy dispatch that came from a tensor.insert_slice op inserting on an inner dim. After doing some rewrites, the insert now looks like this (see the sketch after this paragraph):

This should be able to turn into a flow.tensor.update op since the inner dimensions are contiguous, but it fails to rewrite due to the dynamic dim:
https://github.com/openxla/iree/blob/ebdb098b216c3e59a9977902823ede613f553f71/compiler/src/iree/compiler/Dialect/Flow/Conversion/TensorToFlow/Utils.cpp#L73-L81
I'm figuring on CPU it should be okay to have dynamic dims here since we don't have to worry about round trips to device. Should we enable dynamic dims in the flow.tensor.update rewrite here?

Steps to reproduce your issue
I have been playing with the relevant workload, and am working with this IR now:
https://gist.github.com/Max191/f9705764cc3cd650f3c547071dcc03a9
Compiling at ToM (tip of main) with:
What component(s) does this issue relate to?
No response
Version information
No response
Additional context
Here is the dump after all passes from the above IR:
https://drive.google.com/file/d/1rIAF7zbVZ5m4IX69_XJ__-PxJ21AzWY0/view?usp=sharing