[VM][Hexagon] Cache operations when bypass mode is enabled #16762
Conversation
abhikran-quic
commented
Mar 21, 2024
- This is needed because the Hexagon DMA engine expects cache maintenance to be done by applications.
- This change ensures accuracy in bypass_cache mode.
@abhikran-quic Just so that I understand: this PR adds cache flush and invalidate. Are these necessary for every DMA operation? This likely has performance implications. For normal cases where we have an accelerator (NPU) and a CPU, we can use DMA from the accelerator to copy data into the NPU, and cache invalidation/flush is only needed when the CPU would like to see that piece of memory. So an optimization would be: for the ops that the NPU runs, we never do flush/invalidation (as they are always coherent from the NPU's point of view), and only during CPU-to-NPU or NPU-to-CPU copies do we do cache flush/invalidation. Say our op is …; we only do a cache flush after ….
Hi @tqchen, in the case of Hexagon there is a dedicated DMA engine to allow async copy of data into TCM. The DMA engine supports a mode where the cache (L1/L2) can be bypassed and data is copied directly from DDR -> TCM. In such a scenario, the hardware engine expects the application to manage cache operations; otherwise stale data gets picked up, leading to inaccuracy. Hence, this change is introduced specifically for Hexagon and might not be applicable for normal CPU <-> NPU communication.
@abhikran-quic thank you! Can you give an example of the intended use case, just so we can understand more of the background context? The PR now seems to suggest that if bypass_cache=True, then flush/invalidation will also happen to ensure correctness, but can that cause suboptimal performance? (I am just using the NPU example; e.g., is flush/invalidation always necessary?)
Hi @tqchen, to give you some more background: the goal of the DMA builtins is to allow asynchronous copies of data from DDR into TCM.

In the present PR, we intend to pair those builtins with cache maintenance so that bypass mode stays accurate. For your question on whether cache flush/invalidation can cause performance degradation: in theory this is not expected, and when we tried some experiments we observed about a 7-10% performance improvement with bypass mode enabled.
Got it. One further question: is there a case where we choose to bypass and not explicitly flush/invalidate at all? Or would it be helpful to have an explicit memory barrier once for certain kernels? Say, for example, we have multiple DRAM => DMA => VTCM copies from slices of the same buffer in a kernel; would it be more efficient to do a cache invalidate once for the entire buffer, then always bypass without invalidation?
Ah, I see: it is a vm.builtin and not the normal dma_wait in TIR loops. Now it makes sense, since we don't normally do slice access at the Relax level. Thank you @abhikran-quic
Thank you @tqchen!
Thanks @abhikran-quic and @tqchen