[CPU] Performance tracking for llama2 with data tiling + ukernels #15566
Comments
I have updated the test branches in the above comment with the latest changes.
Update on ConstEval folding: I have updated the branches in the above comment with some new fixes. I am now seeing ConstEval folding on the transpose and packing of the model weights, and I am getting reasonable benchmarks, but it is still slower than the V1 approach. The repro instructions are the same as described above.
With this, we get ConstEval folding of the pack ops on the model weights. Here is the Tracy profile with a single thread on the benchmark module. From the profile, we can see where most of the time is spent. For comparison, this is the Tracy profile with the old V1 (VectorContractCustomKernels) approach.
Been looking this morning with @Max191. Main things so far:
A fix for this would be to look at dispatches that have ukernels and turn off loop unrolling during LLVM compilation?
I am not sure turning off that folding based on data-tiling is a good solution. It is valid for the input program to come with ...
I think they are mostly batch_mmt4d kernels, so we need to look at the snippet below. Looking at the profile, 85 ns for a launch is definitely too small; we should have larger distribution tile sizes. For the batch_mmt4d op, the root cause is that we force the batch dim to be 1. We will be able to relax that after #15531 lands. I can help with landing it.
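To make the launch-overhead point concrete, here is a small standalone C++ sketch (not IREE code; everything except the 85 ns per-launch figure quoted above is an assumed number) comparing the total time when the batch dimension is distributed with a tile size of 1 versus a larger batch tile:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical batch_mmt4d dispatch: numBatches batch slices, each taking
  // ~85 ns of compute (the per-launch time seen in the profile), plus a fixed
  // (assumed) scheduling overhead for every workgroup launch.
  const int64_t numBatches = 32;        // assumed batch count
  const double computePerSliceNs = 85;  // per-launch compute time from the profile
  const double launchOverheadNs = 500;  // assumed per-launch overhead

  // Batch distribution tile size 1: one launch per batch slice.
  double tile1TotalNs = numBatches * (computePerSliceNs + launchOverheadNs);

  // Larger batch tile (e.g. 8 slices per workgroup): fewer launches, same compute.
  const int64_t batchTile = 8;
  int64_t numLaunches = (numBatches + batchTile - 1) / batchTile;
  double tile8TotalNs =
      numLaunches * launchOverheadNs + numBatches * computePerSliceNs;

  std::printf("batch tile 1: %.0f ns total\nbatch tile 8: %.0f ns total\n",
              tile1TotalNs, tile8TotalNs);
  return 0;
}
```

With these assumed numbers, pinning the batch tile to 1 spends most of the time on launch overhead, which is why relaxing the batch-dim-equals-1 restriction and using larger distribution tile sizes should help.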
Loop unrolling is enabled by default; we can disable it here:
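For reference, this is a minimal sketch of what turning that off could look like when building the LLVM optimization pipeline with the new pass manager. It uses the standard llvm::PipelineTuningOptions knob and is not necessarily the exact spot in IREE's LLVMCPU target that the comment above refers to:

```cpp
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Target/TargetMachine.h"

using namespace llvm;

// Build an O3 module pipeline with LLVM's loop-unrolling passes disabled,
// the kind of switch one could flip for dispatches that call into ukernels.
ModulePassManager buildNoUnrollPipeline(TargetMachine *targetMachine,
                                        LoopAnalysisManager &LAM,
                                        FunctionAnalysisManager &FAM,
                                        CGSCCAnalysisManager &CGAM,
                                        ModuleAnalysisManager &MAM) {
  PipelineTuningOptions tuningOptions;
  tuningOptions.LoopUnrolling = false;  // skip the unroll passes in the default pipeline

  PassBuilder passBuilder(targetMachine, tuningOptions);
  passBuilder.registerModuleAnalyses(MAM);
  passBuilder.registerCGSCCAnalyses(CGAM);
  passBuilder.registerFunctionAnalyses(FAM);
  passBuilder.registerLoopAnalyses(LAM);
  passBuilder.crossRegisterProxies(LAM, FAM, CGAM, MAM);

  return passBuilder.buildPerModuleDefaultPipeline(OptimizationLevel::O3);
}
```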
I think we should just make SetEncoding take ContractionOpInterface as an input argument and check if the rank is all ...
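In that spirit, here is a rough sketch of the kind of check that suggestion implies. The helper name and the accepted rank values are assumptions on my part (the rest of the comment did not survive extraction), and this is not the actual SetEncoding implementation:

```cpp
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/IR/BuiltinTypes.h"

using namespace mlir;

// Hypothetical helper: only set encodings on a contraction op when all of its
// tensor operands have a rank we know how to data-tile (assumed here to be
// rank-2 matmul or rank-3 batch-matmul operands).
static bool shouldSetEncoding(linalg::ContractionOpInterface contractionOp) {
  Operation *op = contractionOp.getOperation();
  for (Type type : op->getOperandTypes()) {
    auto tensorType = dyn_cast<RankedTensorType>(type);
    if (!tensorType)
      return false;
    int64_t rank = tensorType.getRank();
    if (rank != 2 && rank != 3)
      return false;
  }
  return true;
}
```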
The work laid out in #15158 has been completed, and we are now moving forward with e2e testing of the llama2 7B model with the new changes. This issue tracks performance and the remaining e2e issues for full model testing.
As of right now, a few changes have yet to land, so the following LLVM and IREE branches are needed while I work on landing the rest:
LLVM: https://github.com/Max191/llvm-project/tree/quantized-matmul-v2-testing
IREE: https://github.com/Max191/iree/tree/quantized-matmul-v2-testing
Model file: https://storage.googleapis.com/shark_tank/dan/fp32_i4_cpu_llamas/llama2_7b_int4.mlir
iree-compile and iree-benchmark-module commands for performance testing on the llama2 7B model:

The 64 1x32x1x128xf32 inputs are dynamic on d1 (1x32x?x128xf32). This dim is the context length, so we can increase that size to benchmark larger context lengths.

The main thing we have left to achieve is to get ConstEval to kick in and fold the packing of the weights away.