Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache reorder tensors #331

Open
wants to merge 1 commit into
base: ideep_pytorch
Choose a base branch
from

Conversation

Ryo-not-rio
Copy link

@Ryo-not-rio Ryo-not-rio commented Aug 21, 2024

Reorder weights are cached in the LRU cache on aarch64, avoiding unnecessary reordering. To take advantage of this feature, the IDEEP_CACHE_MATMUL_REORDERS environment variable needs to be set to 1 and the LRU_CACHE_CAPACITY needs to be set to a meaningful amount.

Speedups on aarch64:
gemma 1 thread - 3.27x
gemma 4 threads - 3.64x
gemma 16 threads - 3.73x

clip 1 thread - 1.04x
clip 4 threads - 1.06x
clip 16 threads - 1.09x

Memory usage increase on aarch64:
gemma - 1.36x
clip - 1.08x

Reorder weights are cached in the LRU cache on aarch64,
avoiding unnecessary reordering. To take advantage of
this feature, the IDEEP_CACHE_MATMUL_REORDERS environment
variable needs to be set to 1 and the LRU_CACHE_CAPACITY
needs to be set to a meaningful amount.

Speedups on aarch64:
gemma 1 thread - 3.27x
gemma 4 threads - 3.64x
gemma 16 threads - 3.73x

clip 1 thread - 1.04x
clip 4 threads - 1.06x
clip 1 thread - 1.04x
clip 4 threads - 1.06x
clip 16 threads - 1.09x

Memory usage increase on aarch64:
gemma - 1.36x
clip - 1.08x

Change-Id: I59e53c63bb706708aa15ec712909dfbac75d9f3f
@Ryo-not-rio
Copy link
Author

Hi, could @Xia-Weiwen, @yanbing-j, @jgong5 have a look at this please? Thank you

@Xia-Weiwen
Copy link

Hi @Ryo-not-rio Thanks for your PR. I have two questions. Weights are supposed to be reordered ahead of time so that reordering can be avoided at runtime. Why does not this work in your case? Will your change also benefit X86?

@Ryo-not-rio
Copy link
Author

Ryo-not-rio commented Aug 21, 2024

@Xia-Weiwen In the model we've tested, it doesn't seem to be the case that weights are reordered ahead of time since the do_compute function is called during runtime. Could you point me to the part where the reorders should be happening ahead of time? I suspect it would also benefit x86 if we want to enable it for x86.

@jgong5
Copy link

jgong5 commented Aug 22, 2024

@Xia-Weiwen In the model we've tested, it doesn't seem to be the case that weights are reordered ahead of time since the do_compute function is called during runtime. Could you point me to the part where the reorders should be happening ahead of time? I suspect it would also benefit x86 but I will post performance numbers for x86 shortly.

The upfront reorder happens in the upper-level SW stack in inductor when the model is frozen: https://github.com/pytorch/pytorch/blob/6bddfb95463332070cec0aaedd103f42a74564d0/torch/_inductor/fx_passes/mkldnn_fusion.py#L1028

Please note that we will be moving ideep into PyTorch core. Adding runtime packing support in ideep that depends on special environment variables complicate things. Can we leverage the upfront weight packing design in PyTorch?

@Xia-Weiwen
Copy link

@Ryo-not-rio You will need to run prepare before compute. Weight is reordered when you call prepare. You may also try what Jiong commented.

@Ryo-not-rio
Copy link
Author

@jgong5 This change is to improve performance during eager mode so we can't take advantage of the inductor. @Xia-Weiwen I've also checked and it looks like prepare() is being called but since the reorder_if_differ() is in the do_compute() function which is called after prepare(), there still seems to be reorders happening

@jgong5
Copy link

jgong5 commented Aug 23, 2024

@jgong5 This change is to improve performance during eager mode so we can't take advantage of the inductor. @Xia-Weiwen I've also checked and it looks like prepare() is being called but since the reorder_if_differ() is in the do_compute() function which is called after prepare(), there still seems to be reorders happening

In my opinion, if we want to do that for eager mode, it is more Pythonic to leverage PyTorch frontend APIs (e.g. https://github.com/pytorch/pytorch/blob/7c93c4f8cf343774d26486bedc5a85892df08024/torch/utils/mkldnn.py, https://github.com/pytorch/pytorch/blob/7c93c4f8cf343774d26486bedc5a85892df08024/torch/fx/experimental/optimization.py#L131) to let user opt in, instead of via a low-level environment variable that controls the behavior of the underlying library. Also, more benefit of a frontend API is that you don't have to keep two copies of weights. The old weights can be swapped out then.

@Ryo-not-rio
Copy link
Author

@jgong5 Unfortunately I don't think we can leverage the frontend API since we don't know the required weight format at the frontend level for aarch64. The weight format is determined in oneDNN after requesting the weight format from ComputeLibrary here

@Xia-Weiwen
Copy link

Hi @Ryo-not-rio. On X86, we reorder weights ahead-of-time, which is called prepacking. Example are https://github.com/pytorch/pytorch/blob/cf11fc0dcbb9c907cf6e851109b92f4157e445c9/aten/src/ATen/native/quantized/cpu/qlinear_prepack.cpp#L292 and https://github.com/pytorch/pytorch/blob/cf11fc0dcbb9c907cf6e851109b92f4157e445c9/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp#L493
Prepacking can be done when you prepare/quantize the model for inference or on the first run of inference. So, we don't have to reorder weight at runtime in most cases. May I know any difficulties preventing you doing that?

@Ryo-not-rio
Copy link
Author

@Xia-Weiwen this patch is targeting aarch64 machines where we don't prepack and the expected weight format is not known at the pytorch level

@Xia-Weiwen
Copy link

we don't prepack

Why not? I didn't see any difference between aarch64 and x86 when you call these APIs. Are there any particular reasons that prevent you doing that?

the expected weight format is not known at the pytorch level

You don't have to know the format. Just provide some info and let oneDNN determine.

@Ryo-not-rio
Copy link
Author

@Xia-Weiwen I have implemented your suggestion in pytorch/pytorch#139387. Please have a look and let me know if you have any feedback, thank you

@Xia-Weiwen
Copy link

@Ryo-not-rio Sure. Also CC. @jgong5 for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants