
[VM][Hexagon] Cache operations when bypass mode is enabled #16762

Merged
1 commit merged into apache:main on Mar 23, 2024

Conversation

abhikran-quic
Contributor

  • This is needed as the Hexagon DMA engine expects cache maintenance to be done by applications.
  • This change ensures accuracy in bypass_cache mode.

@tqchen
Member

tqchen commented Mar 21, 2024

@abhikran-quic just so that I understand: this PR adds cache flush and invalidate. Are these necessary for every DMA operation? This likely has performance implications.

For normal cases where we have an accelerator (NPU) and a CPU, we can use DMA from the accelerator to copy data into the NPU, and cache invalidation/flush is only needed when the CPU would like to see that piece of memory.

So an optimization would be: for the ops that the NPU runs, we never do flush/invalidation (as they are always coherent from the NPU's point of view); only during CPU-to-NPU or NPU-to-CPU ops do we do cache flush/invalidation.

Say our ops are

cpuopA => npuOpB => npuOpC => npuOpD => cpuopE

We only do a cache flush after cpuopA (so the NPU can see the result in DRAM), and a cache invalidate after npuOpD.
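
To make the placement concrete, here is a minimal pseudocode sketch of this scheme; flush_cache and invalidate_cache are hypothetical helpers used only for illustration, not real TVM APIs.

# Pseudocode sketch; flush_cache/invalidate_cache are hypothetical helpers.
a = cpuopA(inp)        # runs on the CPU; the result may sit in the CPU cache
flush_cache(a)         # write back so the NPU sees a in DRAM
b = npuOpB(a)          # NPU-side ops stay coherent among themselves,
c = npuOpC(b)          # so no flush/invalidate is needed between them
d = npuOpD(c)
invalidate_cache(d)    # drop stale CPU cache lines before the CPU reads d
out = cpuopE(d)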

@abhikran-quic
Contributor Author

abhikran-quic commented Mar 22, 2024

> @abhikran-quic just so that I understand: this PR adds cache flush and invalidate. Are these necessary for every DMA operation? This likely has performance implications.
>
> For normal cases where we have an accelerator (NPU) and a CPU, we can use DMA from the accelerator to copy data into the NPU, and cache invalidation/flush is only needed when the CPU would like to see that piece of memory.
>
> So an optimization would be: for the ops that the NPU runs, we never do flush/invalidation (as they are always coherent from the NPU's point of view); only during CPU-to-NPU or NPU-to-CPU ops do we do cache flush/invalidation.
>
> Say our ops are
>
> cpuopA => npuOpB => npuOpC => npuOpD => cpuopE
>
> We only do a cache flush after cpuopA (so the NPU can see the result in DRAM), and a cache invalidate after npuOpD.

Hi @tqchen,
Yes, you are right. For normal CPU <-> NPU communication, cache operations are needed when the CPU shares data with the NPU (i.e., when providing the input to npuOpB) and when it reads the final output of a model (e.g., the output of npuOpD).

In the case of Hexagon, there is a dedicated DMA engine that allows asynchronous copying of data into TCM. The DMA engine supports a mode where the cache (L1/L2) can be bypassed and data is copied directly from DDR to TCM. In such a scenario, the hardware engine expects the application to manage cache operations; otherwise, stale data gets picked up, leading to inaccuracy. Hence, this change is introduced specifically for Hexagon and might not be applicable to normal CPU <-> NPU communication.
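
As a rough illustration of the cache maintenance the application takes on in bypass mode (dcache_flush, dma_start_bypass, and dma_wait below are illustrative stand-ins, not the actual runtime calls):

# Pseudocode sketch; helper names are hypothetical, not the real runtime API.
def bypass_copy_ddr_to_vtcm(src_ddr, dst_vtcm, nbytes):
    # The DMA engine reads DDR directly and skips L1/L2, so dirty cache lines
    # covering the source must be written back first, or the engine sees stale DDR data.
    dcache_flush(src_ddr, nbytes)
    ticket = dma_start_bypass(dst_vtcm, src_ddr, nbytes)  # async DDR -> VTCM copy
    return ticket  # the consumer calls dma_wait(ticket) before touching dst_vtcm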

@tqchen
Member

tqchen commented Mar 22, 2024

@abhikran-quic thank you! Can you give an example of the intended use case, just so we can understand more of the background context? The PR now seems to suggest that if bypass_cache=True, then flush/invalidation will also happen to ensure correctness, but can that cause suboptimal performance (I am just using the NPU example, e.g. is flush/invalidation always necessary)?

@abhikran-quic
Contributor Author

abhikran-quic commented Mar 22, 2024

> @abhikran-quic thank you! Can you give an example of the intended use case, just so we can understand more of the background context? The PR now seems to suggest that if bypass_cache=True, then flush/invalidation will also happen to ensure correctness, but can that cause suboptimal performance (I am just using the NPU example, e.g. is flush/invalidation always necessary)?

Hi @tqchen,

To give you some more background: the goal of the DMA builtins (dma_wait and dma_copy) for Hexagon is to replace synchronous (blocking) copy operations, which can lead to stalls at runtime, with asynchronous copy operations. DMA copies can be performed in parallel while other operators run on the Hexagon vector/scalar core. An example is shown below:

@R.function
def main(input: R.Tensor((...), dtype="uint8"), weight: R.Tensor((...), dtype="uint8")) -> R.Tensor((...), dtype="uint8"):
    R.func_attr({"operator_name": "main"})
    cls = Module
    with R.dataflow():
        lv0 = R.call_builtin_with_ctx("vm.builtin.hexagon.dma_copy", (weight,), mem_scope="global.vtcm")
        lv1 = R.call_tir(bias_add, (input,), out_sinfo=R.Tensor((...), dtype="uint8"))
        lv2 = R.call_tir(relu, (lv1,), out_sinfo=R.Tensor((...), dtype="uint8"))
        lv3 = R.call_builtin_with_ctx("vm.builtin.hexagon.dma_wait", (lv0,), mem_scope="global.vtcm")
        gv = R.call_tir(conv, (lv2, lv3), out_sinfo=R.Tensor((...), dtype="uint8"))
        R.output(gv)
    return gv

The IR above shows that the dma_copy operation is the first op in the graph and instructs the DMA engine to copy the weights from DDR to VTCM. While the async copy happens, the bias_add and relu ops can execute on the HVX/scalar Hexagon core. A dma_wait operation is introduced before the conv operation to ensure that the DMA engine has finished copying the weights and the data is available in VTCM before conv proceeds.

In the present PR, we intend to use the bypass_cache mode supported by the DMA engine to copy data into VTCM, which is expected to be faster than going through the cache.

For your question on whether cache flush/invalidation can cause performance degradation: in theory this is not expected, and in our experiments we observed roughly a 7-10% performance improvement when bypass_cache is enabled.

@tqchen
Member

tqchen commented Mar 22, 2024

Got it. One further question: is there a case where we choose to bypass and not explicitly flush/invalidate at all? Or would it be helpful to have an explicit memory barrier once for certain kernels?

Say, for example, we have multiple DRAM => DMA => VTCM copies from slices of the same buffer in a kernel. Would it be more efficient to do the cache invalidate once for the entire buffer and then always bypass without invalidation?

invalidate_cache(buffer)          # invalidate the whole buffer once, up front
for i in range(num_slices):       # num_slices is illustrative: how many 256-element slices we copy
    # no need to invalidate the buffer again, since we already invalidated it once before the DMA copies
    dma_copy(buffer[i*256 : i*256 + 256], dst_vtcm, bypass=True)

@tqchen
Member

tqchen commented Mar 22, 2024

Ah, I see, it is a vm.builtin and not the normal dma_wait in TIR loops. Now it makes sense, since we don't normally do slice access at the Relax level. Thank you @abhikran-quic

@abhikran-quic
Contributor Author

Thank you @tqchen!

Contributor

@quic-sanirudh quic-sanirudh left a comment


Thanks @abhikran-quic and @tqchen

@quic-sanirudh quic-sanirudh merged commit 1223989 into apache:main Mar 23, 2024
20 checks passed
@abhikran-quic abhikran-quic deleted the abhikran/dma_cache branch March 23, 2024 16:02
thaisacs pushed a commit to thaisacs/tvm that referenced this pull request on Apr 3, 2024