Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MetaSchedule][Hexagon] Add MultiLevelTilingHexagon to schedule asyn… #13721

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

nverke
Copy link
Contributor

@nverke nverke commented Jan 7, 2023

…c pipelines that utilize DMA

This will create schedules that utilize async dma pipelines on hexagon. Currently this has some limitations due to the need to have dma lowering be contiguous memory copies. This PR is reliant on #13720 and will not build until that is merged into main.

Best perf that I found was around 5 GFLOPS for a the conv2d in the test.

These changes were made in collaboration with @masahi

@adstraw @csullivan @Lunderberg

@tvm-bot
Copy link
Collaborator

tvm-bot commented Jan 7, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@nverke
Copy link
Contributor Author

nverke commented Jan 13, 2023

@tvm-bot rerun

@nverke nverke force-pushed the mlt-hexagon branch 2 times, most recently from c408d94 to 29b17f6 Compare January 17, 2023 21:52
@nverke nverke changed the title [MetaScheduler][Hexagon] Add MultiLevelTilingHexagon to schedule asyn… [MetaSchedule][Hexagon] Add MultiLevelTilingHexagon to schedule asyn… Jan 18, 2023
* names of tensor intrinsics, must be registered via
* TensorIntrin.register(...) beforehand
* \param structure The tiling structure. Recommended:
* - 'SRSRS' on Hexagon
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might map SRSRS to the layout NCHWc in the comment

Copy link
Contributor Author

@nverke nverke Feb 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be applicable outside of schedules that have input of NCHWc but are you suggesting adding that to help better map this to something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just took me a minute to make the connection between the Hexagon layout (NCHWc) and the tiling structure (SRSRS) which led me to suggest a comment to clarify. The key thing for my understanding when I originally reviewed the code was to connect R=Reduction to the C=Channel axes.

TensorIntrin.register(...) beforehand
structure : str
The tiling structure. Recommended:
- 'SRSRS' on Hexagon
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might map SRSRS to the layout NCHWc in the comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

* \param tile_binds For each level of tiles, which thread axis it is bound to. These are not
* supported on hexagon.
* \param max_innermost_factor The maximum size of the innermost factor. NullOpt means no limit
* \param vector_load_lens The length of vector lane in vectorized cooperative fetching.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confused as to why we have both the vector load length and the max innermost factor. These seem redundant. Aren't we always going to vectorize over the innermost loop? And, if vectorization is enabled, won't we use vector_load_lens for the size of the innermost loop? Also a "max" for max_innermost_factor seems strange. E.g. if the user says the max is 8 I imagine that trying 7 is a bad choice whereas a list of [2, 4, 8] are likely good choices. Anyway... I see that this API is inherited from MultiLevelTilingInitCommon so I won't harp on it. Just an observation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These have two separate uses, vector_load_lens is used for vector loads outside of the tiling loops where as max_innermost_factor is for loop tiling. As for the naming and usage I am not 100% sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I said, just an observation given you are inheriting this API.

@@ -68,12 +69,12 @@ def sync_dma_load_impl(a: T.handle, c: T.handle) -> None:
return sync_dma_load_desc, sync_dma_load_impl


def generate_dot_product_32x4_u8u8i32(mem_scope="global"):
def generate_dot_product_32x4_u8u8i32(read_mem_scope="global", write_mem_scope="global"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the defaults here "global" instead of "global.vtcm"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because typically these are used without vtcm

if (!use_software_pipeline) {
return {state};
}
// The current config is not suitable for software pipelining.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update comment to indicate what r_indices_ represents (reduction axes) since it's not a member of MultiLevelTilingHexagonNode class. And also why we need at least 2 of them as I can't quite figure out what that's the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment on this!

reduction_length *= extent->value;
}
}
if (reduction_length <= 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious use of <= 1 here as opposed to == 1. Do we support zero or negative extents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not be possible but since extents with value 0 can happen then the length could end up being 0.

Array<Integer> software_pipeline_stage;
Array<Integer> software_pipeline_order;
Array<Integer> software_pipeline_async_stages;
if (cache_read_count == 2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct but this notation is difficult to read for many folks. Some comments might help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to explain as best I could!

sch->Annotate(fused, tir::attr::software_pipeline_async_stages, software_pipeline_async_stages);

// TODO(nverke): Add support for nested async pipelines.
// TODO(nverke): Add support for async cache writes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the lack of cache write support here due to the issue in the InjectSWPipeline pass where there is no "wait" on cache write stage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No its just from limiting complexity of this to start.

executor = relay.backend.Executor("graph", {"link-params": True})
mod = mod.with_attr("executor", executor)

use_async = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems strange especially with no else case below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this!


std::vector<State> MultiLevelTilingHexagonNode::ApplySubRules(std::vector<State> states) {
states = MultiLevelTilingWithIntrinNode::ApplySubRules(states);
states = SubRule(std::move(states), [&](State state) { return AddSoftwarePipeline(state); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that MultiLevelTilingHexagon could be its own schedule rule as adds the ability to AddSoftwarePipeline but otherwise defers to MultiLevelTilingWithIntrin. I understand the main reason for the inheritance here is to control the order of the application of the schedule rules, that tiling must precede software pipelining. Wondering there is some other solution besides inheritance to solve this problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I follow. If we don't inherit we would have to copy all of the logic needed over from MLT with intrin. But we don't want to add software pipelines to all usages of MLT with intrin as this will try to optimize for hexagon.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment may be naïve. My read of the code was that the only connection between MultiLevelTilingWithIntrin and AddSoftwarePipeline schedule rules was the order in which they were applied. And that we were using inheritance to ensure that order was maintained. And that there was no further value to the inheritance beyond maintaining that order. If any of that is wrong, let me know and we can scrap this comment. If I am correct, perhaps there is a way to create two separate passes ... one for software pipeline and one for multi-level tiling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants