Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TIR] Add software pipelining #10066

Merged
merged 18 commits into from
Feb 18, 2022
Merged

Conversation

vinx13
Copy link
Member

@vinx13 vinx13 commented Jan 25, 2022

This PR added a pass InjectSoftwarePipeline that transform a (annotated) loop into a pipelined one.

Co-authored-by: Junru Shao [email protected]
Co-authored-by: Xiyou Zhou [email protected]
Co-authored-by: Bohan Hou [email protected]
Co-authored-by: Siyuan Feng [email protected]
Co-authored-by: Hongyi Jin [email protected]
Co-authored-by: Ruihang Lai [email protected]

@junrushao1994 @masahi @JosephTheOctonaut @Hzfengsy @spectrometerHBH @jinhongyii @MasterJH5574

Co-authored-by: Junru Shao <[email protected]>
Co-authored-by: Xiyou Zhou <[email protected]>
Co-authored-by: Bohan Hou <[email protected]>
Co-authored-by: Siyuan Feng <[email protected]>
Co-authored-by: Hongyi Jin <[email protected]>
Co-authored-by: Ruihang Lai <[email protected]>
return std::move(load);
}

int GetWmmaFragmentSize(const Buffer& buffer) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we decouple the logic with TensorCore? i.e. Let Software pipeline work for all possible backends

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is backend independent. However, it need to analyze and rewrite buffer accesses, including regular ones via BufferLoad, BufferStore and opaque accesses. For opaque accesses, we need to add specific rule like this one. If there are new backends or new intrinsics, the only thing need to do is to add another rule here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just came here to ask the exact same question :) I really want to see the decoupled implementation here if possible. I can imagine things becoming quickly messy as we add more backends.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making each specific backend inherit from this class?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the logic the admittedly backend independent, I believe that backend-specific logic should be split out as separate functionalities. For example, in this particular case, the refactoring we need is:

  • Move this method into FragmentInfo::GetSize
  • The caller side uses fragment_info_.at(old_buffer->data.get()).GetSize()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Furthermore, we might want to consolidate all these special handling logic into special classes, where WMMA could be one of those instances

T.reads(A[tx, i, 0:16])
T.writes(C[tx, i, 0:16])
A_shared = T.alloc_buffer((16, 1, 16), dtype="float32", scope="shared")
for j in T.serial(0, 16):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this example as well as the last two (nested_pipeline_interleaving and nested_pipeline_double_buffer ), the same index variable is used in multiple loops. While not incorrect, it can make it hard to compare the pre- and post-transformed TIR because the variable (j in the examples) could belong to multiple source loops. It might make the mapping clearer to use all unique index vars.

T.reads([A_shared[tx, 0, j]])
T.writes([A_local[(i + 1) % 2, 0, 0, j]])
T.block_attr({"double_buffer_scope": 0})
A_local[(i + 1) % 2, 0, 0, j] = A_shared[tx, i + 1, j]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the annotation "double_buffer_scope": 0 force buffer access to use index 0? Here it's using the normal results of the transform (i+1) % 2.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double_buffer_scope is a hint for buffer resizing inside the software pipeline. In some cases, the buffer doesn't need to be resized, this annotation forces it to be resized (double-buffered). double_buffer_scope: 0 means double buffer should be applied to 0-th buffer written by this block, i.e. A_local. The buffer index is always (i + loop_offset) / num_versions)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see; I misunderstood the "override" that was happening. The annotation stops the transform from "optimizing" away the rewrite to make it double-buffered. The argument refers to the i th buffer.

Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JosephTheOctonaut do you think we could improve our doc to make it less misleading? If so, would you love to suggest some change? Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junrushao1994 Sure, I'd be happy to! I think I need a bit more time to finish understanding everything going on in the examples, but afterwards I'll try to put together some coherent suggestions or edits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great! Let’s work together to make our doc really slick :-)

PrimExpr VisitExpr_(const CallNode* op) final {
// Intrinsic calls should be handled explicitly here as they are opaque accesses to
// buffer.
static const auto& load_matrix_sync = builtin::tvm_load_matrix_sync();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we need to handle a few opaque intrinsics to make our analysis possible. Given the set of intrinsics could expand, do you think it's possible to generalize this mechanism and make the logic more clear?

@vinx13
Copy link
Member Author

vinx13 commented Feb 15, 2022

comments are addressed

@junrushao
Copy link
Member

Hey thanks @vinx13 for the huge effort! The PR overall is in pretty good shape, but there are only one thing we need to further improve: I noticed that there are a few interesting special handlings for:

  • GetWmmaFragmentSize
  • RewriteWmmaFragmentIndex
  • tvm_load_matrix_sync
  • tvm_store_matrix_sync
  • tvm_mma_sync
  • tvm_access_ptr

A good news to me is that all of the above are fundamentally used in only one method RewriteOpaqueAccesses. To make sure future developers are able to easier extend the system, we would probably want to make sure that the pass itself is intrinsic agnostic, and the special handling logic is split into a separate method out of the class.

Let me know if it could work! Thanks a lot!

@vinx13
Copy link
Member Author

vinx13 commented Feb 17, 2022

@junrushao1994 the comment has been addressed. I've refactored all wmma related logics into a class

Copy link
Member

@junrushao junrushao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vinx13! This is huge effort!!

@junrushao junrushao merged commit 316c506 into apache:main Feb 18, 2022
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022
* [TIR] Add software pipelining

Co-authored-by: Junru Shao <[email protected]>
Co-authored-by: Xiyou Zhou <[email protected]>
Co-authored-by: Bohan Hou <[email protected]>
Co-authored-by: Siyuan Feng <[email protected]>
Co-authored-by: Hongyi Jin <[email protected]>
Co-authored-by: Ruihang Lai <[email protected]>

* fix

* fix

* lint

* fix

* format

* doc

* remove print

* lint

* lint

* doc

* Apply suggestions from code review

Co-authored-by: Junru Shao <[email protected]>

* address comments

* address comments

* refactor FragmentInfo::GetSize

* remove unused

* refactor

* address comments

Co-authored-by: Junru Shao <[email protected]>
Co-authored-by: Xiyou Zhou <[email protected]>
Co-authored-by: Bohan Hou <[email protected]>
Co-authored-by: Siyuan Feng <[email protected]>
Co-authored-by: Hongyi Jin <[email protected]>
Co-authored-by: Ruihang Lai <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants