[MetaSchedule][Hexagon] Add MultiLevelTilingHexagon to schedule async pipelines that utilize DMA #13721
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
@tvm-bot rerun
Force-pushed from c408d94 to 29b17f6
* names of tensor intrinsics, must be registered via
* TensorIntrin.register(...) beforehand
* \param structure The tiling structure. Recommended:
* - 'SRSRS' on Hexagon
You might map SRSRS to the layout NCHWc in the comment.
This should be applicable outside of schedules that have NCHWc input, but are you suggesting adding that to help map this to something more concrete?
It just took me a minute to make the connection between the Hexagon layout (NCHWc) and the tiling structure (SRSRS), which led me to suggest a comment to clarify. The key thing for my understanding when I originally reviewed the code was to connect R = reduction to the C = channel axis.
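To make that mapping concrete, here is a minimal sketch (plain Python, not TVM code) of how a structure string is read: each letter adds one tile level, with 'S' tiling the spatial loops and 'R' tiling the reduction loops. For an NCHWc conv2d, the reduction loops are the input-channel and kernel axes, which is the R = reduction to C = channel connection.

```python
# Minimal sketch of interpreting a tiling-structure string.
# "S" = a tile level over spatial loops (n, oc_chunk, oh, ow, oc_block for NCHWc),
# "R" = a tile level over reduction loops (ic, kh, kw).
structure = "SRSRS"
s_levels = [i for i, c in enumerate(structure) if c == "S"]
r_levels = [i for i, c in enumerate(structure) if c == "R"]
print(s_levels)  # [0, 2, 4] -> three spatial tile levels
print(r_levels)  # [1, 3]    -> two reduction tile levels
```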
TensorIntrin.register(...) beforehand
structure : str
    The tiling structure. Recommended:
    - 'SRSRS' on Hexagon
You might map SRSRS to the layout NCHWc in the comment.
see above
* \param tile_binds For each level of tiles, which thread axis it is bound to. These are not
* supported on hexagon.
* \param max_innermost_factor The maximum size of the innermost factor. NullOpt means no limit
* \param vector_load_lens The length of vector lane in vectorized cooperative fetching.
Confused as to why we have both the vector load length and the max innermost factor. These seem redundant. Aren't we always going to vectorize over the innermost loop? And, if vectorization is enabled, won't we use vector_load_lens for the size of the innermost loop? Also, a "max" for max_innermost_factor seems strange: e.g. if the user says the max is 8, I imagine that trying 7 is a bad choice, whereas a list of [2, 4, 8] is likely all good choices. Anyway... I see that this API is inherited from MultiLevelTilingInitCommon, so I won't harp on it. Just an observation.
These have two separate uses: vector_load_lens is used for vector loads outside of the tiling loops, whereas max_innermost_factor is for loop tiling. As for the naming and usage, I am not 100% sure.
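To illustrate the distinction, here is a hedged sketch using the existing MultiLevelTilingWithIntrin Python binding (the intrin name and parameter values are made up): max_innermost_factor caps the extent sampled for the innermost tile of the tiling structure, while vector_load_lens lists candidate vector lanes for the vectorized cooperative-fetch (cache-read) loops outside it.

```python
from tvm import meta_schedule as ms

rule = ms.schedule_rule.MultiLevelTilingWithIntrin(
    intrin="dot_32x4_u8u8i32",      # hypothetical name; must be registered via TensorIntrin.register
    structure="SRSRS",
    tile_binds=None,                # thread binds are not supported on Hexagon
    max_innermost_factor=64,        # tiling knob: innermost tile extent <= 64
    vector_load_lens=[1, 2, 4, 8],  # cache-read knob: candidate vector lanes
    reuse_read=None,
    reuse_write=None,
)
```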
Like I said, just an observation given you are inheriting this API.
@@ -68,12 +69,12 @@ def sync_dma_load_impl(a: T.handle, c: T.handle) -> None:
     return sync_dma_load_desc, sync_dma_load_impl


-def generate_dot_product_32x4_u8u8i32(mem_scope="global"):
+def generate_dot_product_32x4_u8u8i32(read_mem_scope="global", write_mem_scope="global"):
Why are the defaults here "global" instead of "global.vtcm"?
Because typically these are used without VTCM.
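A hedged usage sketch of the new signature (the (desc, impl) return convention mirrors the sync_dma_load generator above; the VTCM scope string for the async-DMA case is an assumption):

```python
# Typical use: plain global memory, hence the "global" defaults.
desc, impl = generate_dot_product_32x4_u8u8i32()

# For the async-DMA pipelines in this PR, reads come from VTCM-staged buffers.
desc_vtcm, impl_vtcm = generate_dot_product_32x4_u8u8i32(
    read_mem_scope="global.vtcm",  # operands DMA-copied into VTCM
    write_mem_scope="global",      # result still written to global memory
)
```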
if (!use_software_pipeline) {
  return {state};
}
// The current config is not suitable for software pipelining.
Update the comment to indicate what r_indices_ represents (reduction axes), since it's not a member of the MultiLevelTilingHexagonNode class. And also why we need at least 2 of them, as I can't quite figure out why that's the case.
Added a comment on this!
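For context on why a reduction loop to pipeline over must exist at all, here is an illustrative plain-Python sketch of the general two-stage software-pipelining pattern (not the PR's code): stage 0 prefetches the operand tile for iteration k+1 while stage 1 computes on iteration k.

```python
# Stage 0 (the async DMA copies in the PR) runs one iteration ahead of
# stage 1 (the tensorized compute), overlapping load and compute.
def pipelined_reduction(tiles, load, compute):
    acc = 0
    buf = load(tiles[0])  # prologue: fill the first buffer
    for k in range(len(tiles)):
        nxt = load(tiles[k + 1]) if k + 1 < len(tiles) else None  # stage 0
        acc = compute(acc, buf)                                   # stage 1
        buf = nxt
    return acc

print(pipelined_reduction([1, 2, 3], load=lambda t: t * 10, compute=lambda a, b: a + b))  # 60
```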
      reduction_length *= extent->value;
    }
  }
  if (reduction_length <= 1) {
Curious use of <= 1 here as opposed to == 1. Do we support zero or negative extents?
It should not be possible, but since extents with value 0 can happen, the length could end up being 0.
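A tiny sketch of the guard as discussed, assuming the surrounding loop accumulates the product of the reduction-loop extents:

```python
from functools import reduce
from operator import mul

def worth_pipelining(extents):
    reduction_length = reduce(mul, extents, 1)
    # A zero extent makes the product 0, which <= 1 catches but == 1 would not.
    return reduction_length > 1

print(worth_pipelining([4, 2]))  # True  -> non-trivial reduction
print(worth_pipelining([1]))     # False -> trivial reduction
print(worth_pipelining([0, 3]))  # False -> degenerate zero-extent loop
```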
Array<Integer> software_pipeline_stage;
Array<Integer> software_pipeline_order;
Array<Integer> software_pipeline_async_stages;
if (cache_read_count == 2) {
This looks correct, but this notation is difficult to read for many folks. Some comments might help.
Tried to explain as best I could!
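For readers newer to this notation: each entry of software_pipeline_stage/order annotates one statement of the loop body with its pipeline stage and its position in the rotated body, and software_pipeline_async_stages marks which stages run asynchronously. Below is a self-contained, hedged sketch on a toy kernel (the kernel, the two cache reads, and the stage/order values are illustrative, not the PR's exact config):

```python
import tvm
from tvm import tir
from tvm.script import tir as T


@T.prim_func
def toy(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "int8")
    B = T.match_buffer(b, (128, 128), "int8")
    C = T.match_buffer(c, (128, 128), "int32")
    for i, j in T.grid(128, 128):
        with T.block("compute"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = T.cast(A[vi, vj], "int32") + T.cast(B[vi, vj], "int32")


sch = tir.Schedule(toy)
blk = sch.get_block("compute")
i, _ = sch.get_loops(blk)
# Two cache reads, analogous to the cache_read_count == 2 branch above.
a_cache = sch.cache_read(blk, 0, "global.vtcm")
b_cache = sch.cache_read(blk, 1, "global.vtcm")
sch.compute_at(a_cache, i)
sch.compute_at(b_cache, i)
# The body of loop i now holds three statements: [copy A, copy B, compute].
sch.annotate(i, "software_pipeline_stage", [0, 0, 1])   # both copies in stage 0, compute in stage 1
sch.annotate(i, "software_pipeline_order", [0, 1, 2])   # keep the original statement order
sch.annotate(i, "software_pipeline_async_stages", [0])  # stage 0 (the copies) runs async
```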
sch->Annotate(fused, tir::attr::software_pipeline_async_stages, software_pipeline_async_stages);

// TODO(nverke): Add support for nested async pipelines.
// TODO(nverke): Add support for async cache writes.
Is the lack of cache write support here due to the issue in the InjectSWPipeline pass where there is no "wait" on the cache write stage?
No, it's just from limiting the complexity of this to start.
executor = relay.backend.Executor("graph", {"link-params": True})
mod = mod.with_attr("executor", executor)

use_async = True
This seems strange, especially with no else case below.
removed this!
std::vector<State> MultiLevelTilingHexagonNode::ApplySubRules(std::vector<State> states) {
  states = MultiLevelTilingWithIntrinNode::ApplySubRules(states);
  states = SubRule(std::move(states), [&](State state) { return AddSoftwarePipeline(state); });
Seems that MultiLevelTilingHexagon could be its own schedule rule, as it adds the ability to AddSoftwarePipeline but otherwise defers to MultiLevelTilingWithIntrin. I understand the main reason for the inheritance here is to control the order of application of the schedule rules: tiling must precede software pipelining. Wondering if there is some other solution besides inheritance to solve this problem.
I am not sure I follow. If we don't inherit, we would have to copy over all of the needed logic from MLT with intrin. But we don't want to add software pipelines to all usages of MLT with intrin, as this one will try to optimize for Hexagon.
My comment may be naïve. My read of the code was that the only connection between the MultiLevelTilingWithIntrin and AddSoftwarePipeline schedule rules was the order in which they were applied, that we were using inheritance to ensure that order was maintained, and that there was no further value to the inheritance beyond maintaining that order. If any of that is wrong, let me know and we can scrap this comment. If I am correct, perhaps there is a way to create two separate passes... one for software pipelining and one for multi-level tiling.
This will create schedules that utilize async DMA pipelines on Hexagon. Currently this has some limitations due to the need for DMA lowering to be contiguous memory copies. This PR relies on #13720 and will not build until that is merged into main.
The best perf I found was around 5 GFLOPS for the conv2d in the test.
These changes were made in collaboration with @masahi
@adstraw @csullivan @Lunderberg