
[VM][Hexagon] Cache operations when bypass mode is enabled #16762

Merged
1 commit merged into apache:main on Mar 23, 2024

Conversation

abhikran-quic
Contributor

  • This is needed as the Hexagon DMA engine expects cache maintenance to be done by applications.
  • This change ensures accuracy in bypass_cache mode.

@tqchen
Member

tqchen commented Mar 21, 2024

@abhikran-quic just so that I understand: this PR adds cache flush and invalidate. Are these necessary for every DMA operation? This likely has performance implications.

For normal cases where we have an accelerator (NPU) and a CPU, we can use DMA from the accelerator to copy data into the NPU, and cache invalidation/flush is only needed when the CPU would like to see that piece of memory.

So an optimization would be: for the ops that the NPU runs, we never do flush/invalidation (as they are always coherent from the NPU's point of view); only during CPU-to-NPU or NPU-to-CPU ops do we do cache flush/invalidation.

Say our ops are

cpuopA => npuOpB => npuOpC => npuOpD => cpuopE

We only do a cache flush after cpuopA (so the NPU can see the result in DRAM), and a cache invalidate after npuOpD.
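
To make the placement concrete, here is a minimal pseudocode sketch of this scheme; flush_cache and invalidate_cache are hypothetical helpers used only for illustration, not real TVM APIs.

# Pseudocode sketch; flush_cache/invalidate_cache are hypothetical helpers.
a = cpuopA(inp)        # runs on the CPU; the result may sit in the CPU cache
flush_cache(a)         # write back so the NPU sees a in DRAM
b = npuOpB(a)          # NPU-side ops stay coherent among themselves,
c = npuOpC(b)          # so no flush/invalidate is needed between them
d = npuOpD(c)
invalidate_cache(d)    # drop stale CPU cache lines before the CPU reads d
out = cpuopE(d)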

@abhikran-quic
Contributor Author

abhikran-quic commented Mar 22, 2024

> @abhikran-quic just so that I understand: this PR adds cache flush and invalidate. Are these necessary for every DMA operation? This likely has performance implications.
>
> For normal cases where we have an accelerator (NPU) and a CPU, we can use DMA from the accelerator to copy data into the NPU, and cache invalidation/flush is only needed when the CPU would like to see that piece of memory.
>
> So an optimization would be: for the ops that the NPU runs, we never do flush/invalidation (as they are always coherent from the NPU's point of view); only during CPU-to-NPU or NPU-to-CPU ops do we do cache flush/invalidation.
>
> Say our ops are
>
> cpuopA => npuOpB => npuOpC => npuOpD => cpuopE
>
> We only do a cache flush after cpuopA (so the NPU can see the result in DRAM), and a cache invalidate after npuOpD.

Hi @tqchen,
Yes, you are right. For normal CPU <-> NPU communication, cache operations are needed when the CPU shares data with the NPU (i.e., when providing the input to npuOpB) and when it reads the final output of a model (e.g., the output of npuOpD).

In the case of Hexagon, there is a dedicated DMA engine that allows asynchronous copying of data into TCM. The DMA engine supports a mode where the cache (L1/L2) can be bypassed and data is copied directly from DDR to TCM. In such a scenario, the hardware engine expects the application to manage cache operations; otherwise, stale data gets picked up, leading to inaccuracy. Hence, this change is introduced specifically for Hexagon and might not be applicable to normal CPU <-> NPU communication.
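
As a rough illustration of the cache maintenance the application takes on in bypass mode (dcache_flush, dma_start_bypass, and dma_wait below are illustrative stand-ins, not the actual runtime calls):

# Pseudocode sketch; helper names are hypothetical, not the real runtime API.
def bypass_copy_ddr_to_vtcm(src_ddr, dst_vtcm, nbytes):
    # The DMA engine reads DDR directly and skips L1/L2, so dirty cache lines
    # covering the source must be written back first, or the engine sees stale DDR data.
    dcache_flush(src_ddr, nbytes)
    ticket = dma_start_bypass(dst_vtcm, src_ddr, nbytes)  # async DDR -> VTCM copy
    return ticket  # the consumer calls dma_wait(ticket) before touching dst_vtcm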

@tqchen
Member

tqchen commented Mar 22, 2024

@abhikran-quic thank you! Can you give an example of the intended use case, just so we can understand more of the background context? The PR now seems to suggest that if bypass_cache=True, then flush/invalidation will also happen to ensure correctness, but can that cause suboptimal performance (I am just using the NPU example, e.g. is flush/invalidation always necessary)?

@abhikran-quic
Contributor Author

abhikran-quic commented Mar 22, 2024

> @abhikran-quic thank you! Can you give an example of the intended use case, just so we can understand more of the background context? The PR now seems to suggest that if bypass_cache=True, then flush/invalidation will also happen to ensure correctness, but can that cause suboptimal performance (I am just using the NPU example, e.g. is flush/invalidation always necessary)?

Hi @tqchen,

To give you some more background: the goal of the DMA builtins (dma_wait and dma_copy) for Hexagon is to replace synchronous (blocking) copy operations, which can lead to stalls at runtime, with asynchronous copy operations. DMA copies can be performed in parallel while other operators run on the Hexagon vector/scalar core. An example is shown below:

@R.function
def main(input: R.Tensor((...), dtype="uint8"), weight: R.Tensor((...), dtype="uint8")) -> R.Tensor((...), dtype="uint8"):
    R.func_attr({"operator_name": "main"})
    cls = Module
    with R.dataflow():
        lv0 = R.call_builtin_with_ctx("vm.builtin.hexagon.dma_copy", (weight,), mem_scope="global.vtcm")
        lv1 = R.call_tir(bias_add, (input,), out_sinfo=R.Tensor((...), dtype="uint8"))
        lv2 = R.call_tir(relu, (lv1,), out_sinfo=R.Tensor((...), dtype="uint8"))
        lv3 = R.call_builtin_with_ctx("vm.builtin.hexagon.dma_wait", (lv0,), mem_scope="global.vtcm")
        gv = R.call_tir(conv, (lv2, lv3), out_sinfo=R.Tensor((...), dtype="uint8"))
        R.output(gv)
    return gv

The IR above shows that the dma_copy operation is the first op in the graph and instructs the DMA engine to copy the weights from DDR to VTCM. While the async copy happens, the bias_add and relu ops can execute on the HVX/scalar Hexagon core. A dma_wait operation is introduced before the conv operation to ensure that the DMA engine has finished copying the weights and the data is available in VTCM before conv proceeds.

In the present PR, we intend to use the bypass_cache mode supported by the DMA engine to copy data into VTCM, which is expected to be faster than going through the cache.

For your question on whether cache flush/invalidation can cause performance degradation: in theory this is not expected, and in our experiments we observed roughly a 7-10% performance improvement when bypass_cache is enabled.

@tqchen
Member

tqchen commented Mar 22, 2024

Got it. One further question: is there a case where we choose to bypass and not explicitly flush/invalidate at all? Or would it be helpful to have an explicit memory barrier once for certain kernels?

Say, for example, we have multiple DRAM => DMA => VTCM copies from slices of the same buffer in a kernel. Would it be more efficient to do the cache invalidate once for the entire buffer and then always bypass without invalidation?

invalidate_cache(buffer)          # invalidate the whole buffer once, up front
for i in range(num_slices):       # num_slices is illustrative: how many 256-element slices we copy
    # no need to invalidate the buffer again, since we already invalidated it once before the DMA copies
    dma_copy(buffer[i*256 : i*256 + 256], dst_vtcm, bypass=True)

@tqchen
Member

tqchen commented Mar 22, 2024

Ah, I see, it is a vm.builtin and not the normal dma_wait in TIR loops. Now it makes sense, since we don't normally do slice access at the Relax level. Thank you @abhikran-quic

@abhikran-quic
Contributor Author

Thank you @tqchen!

Contributor

@quic-sanirudh quic-sanirudh left a comment


Thanks @abhikran-quic and @tqchen

@quic-sanirudh quic-sanirudh merged commit 1223989 into apache:main Mar 23, 2024
20 checks passed
@abhikran-quic abhikran-quic deleted the abhikran/dma_cache branch March 23, 2024 16:02
thaisacs pushed a commit to thaisacs/tvm that referenced this pull request on Apr 3, 2024