
[RFC] Scalable Matrix Extension enablement #107

Merged: 4 commits into apache:main on Mar 19, 2024
Conversation

@lhutton1 (Contributor) commented Jan 31, 2024

An RFC for enabling Scalable Matrix Extension (SME) code generation in TVM.

Rendered

An RFC for enabling Scalable Matrix Extension code generation in TVM.

Change-Id: If2cc84de2ccc09ec8c526bf154ba099715e46596
@tqchen (Member) commented Feb 1, 2024

I like how we can leverage tensorization and keep most things within the existing infrastructure. I would love to see how we can align some of the scheduling support towards IRModule=>IRModule transformations in dlight-style mechanisms, so we can get even better composability.

I took some time to write down related thoughts here: https://discuss.tvm.apache.org/t/discuss-tvm-core-strategy-for-operator-scheduling-and-tuning/16352, which should help clarify some of the context.
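
(To make the above concrete, here is a minimal, illustrative skeleton of an IRModule=>IRModule pass; the pass name and body are placeholders, not code from the RFC.)

```python
import tvm

# A module pass consumes an IRModule and returns a (possibly rewritten)
# IRModule; dlight-style scheduling slots into exactly this shape.
@tvm.transform.module_pass(opt_level=0, name="ApplySMESchedules")
def apply_sme_schedules(mod, ctx):
    # A real pass would pattern-match PrimFuncs here and rewrite them
    # with TIR schedules before returning the updated module.
    return mod
```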

@lhutton1 (Contributor, Author) commented:

Thanks for taking a look @tqchen! Since scheduling will be completed with TensorIR, it will provide the building blocks for being plugged into an IRModule=>IRModule transformation pass. For our current use-case, it's important to be able to fall back to previous optimizations in the form of TE schedules / TOPI where coverage of the TensorIR schedules doesn't exist.

From the proposed strategy, I understand it's important to ensure the schedule can operate on a generic compute definition of the operation. In the case of matmul-style operations, we'd want to apply "array packing" to the input, which is currently expressed via the compute definition. Is it possible to express this through TIR scheduling alone?
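
(For context, "array packing" expressed via the compute definition looks roughly like the classic TVM tutorial pattern below; shapes, names, and the tile size bn are illustrative.)

```python
# Sketch of array packing baked directly into the TE compute definition.
from tvm import te

M = N = K = 128
bn = 4  # illustrative pack width
A = te.placeholder((M, K), name="A", dtype="float32")
B = te.placeholder((K, N), name="B", dtype="float32")
# Pre-pack B into (N // bn, K, bn) so the innermost axis is contiguous.
packedB = te.compute(
    (N // bn, K, bn),
    lambda bigN, k, littleN: B[k, bigN * bn + littleN],
    name="packedB",
)
k = te.reduce_axis((0, K), name="k")
C = te.compute(
    (M, N),
    lambda m, n: te.sum(A[m, k] * packedB[n // bn, k, n % bn], axis=k),
    name="C",
)
```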

@tqchen (Member) commented Feb 14, 2024

To clarify a bit: we do not have to ask for everything to be done in the form of a schedule, so it is OK, for example, to generate a compute definition that already contains packing (you can view that as one special dispatch pass).

The main ask is that the TIR schedule pass should detect the already-packed TIR and continue scheduling it (one way might be to detect an attached tag in the block). That way, the ApplySchedule pass can be done independently of the compute definition.

That being said, I think it should be possible to insert array packing through cache_read and transform_layout.
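
(A rough sketch of that suggestion, reusing the matmul shapes above; the buffer index and the packed layout are illustrative assumptions, not code from the RFC.)

```python
# Insert array packing purely with TIR scheduling primitives:
# stage B through a cache, then rewrite the cached copy's layout.
from tvm import te, tir

M = N = K = 128
A = te.placeholder((M, K), name="A", dtype="float32")
B = te.placeholder((K, N), name="B", dtype="float32")
k = te.reduce_axis((0, K), name="k")
C = te.compute(
    (M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C"
)

sch = tir.Schedule(te.create_prim_func([A, B, C]))
block = sch.get_block("C")
# B is read buffer index 1 of the matmul block (A is index 0).
packed = sch.cache_read(block, read_buffer_index=1, storage_scope="global")
# Rewrite the cache's output buffer into a packed (c // 4, r, c % 4) form,
# mirroring the packing expressed in the compute definition earlier.
sch.transform_layout(
    packed, buffer=("write", 0), index_map=lambda r, c: (c // 4, r, c % 4)
)
```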

@lhutton1 (Contributor, Author) commented:

Got it, thanks @tqchen :) It sounds as though we're already doing something similar by adding a tag in the compute definition to identify the block during scheduling.
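
(A hypothetical sketch of that tagging approach: attach an attribute in the compute definition and read it back as a block annotation when scheduling. The tag name sme_matmul is made up for illustration.)

```python
# Tag the compute, then dispatch on the block annotation in TIR.
from tvm import te, tir

A = te.placeholder((32, 32), name="A", dtype="float32")
B = te.placeholder((32, 32), name="B", dtype="float32")
k = te.reduce_axis((0, 32), name="k")
C = te.compute(
    (32, 32),
    lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
    name="C",
    attrs={"schedule_rule": "sme_matmul"},  # illustrative tag
)

sch = tir.Schedule(te.create_prim_func([A, B, C]))
block = sch.get_block("C")
# The attrs surface as block annotations, so a schedule pass can dispatch
# on them without knowing how the compute definition was produced.
print(sch.get(block).annotations)
```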

… error in example

Change-Id: I042523e0bd34dc3b8bc62176e983604a6af33b4d
@lhutton1 (Contributor, Author) commented:

Thanks for the discussion so far @tqchen, I added a small example detailing how we're registering schedules for the Relay flow. I believe this will have minimal impact on how the schedule might be used in a Relax-based flow, but it would be good to hear your thoughts.
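
(For readers following along, the registration mechanism in question is Relay's op strategy; a hedged sketch is below. The SME compute/schedule stubs and the has_sme feature check are assumptions rather than the RFC's actual code, and this is purely illustrative since a real TVM build already registers a dense strategy for "arm_cpu".)

```python
from tvm import topi
from tvm.relay.op import op as _op
from tvm.relay.op.strategy.generic import (
    dense_strategy,
    wrap_compute_dense,
    wrap_topi_schedule,
)

def dense_sme_compute(data, weight, bias=None, out_dtype=None):
    # Hypothetical stand-in for an SME-aware dense compute.
    return topi.nn.dense(data, weight, bias, out_dtype)

def schedule_dense_sme(outs):
    # Hypothetical stand-in for the SME TensorIR schedule.
    return topi.generic.schedule_dense(outs)

@dense_strategy.register("arm_cpu")
def dense_strategy_arm_cpu(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    # Generic TE/TOPI fallback, kept for coverage where the TensorIR
    # schedule does not apply.
    strategy.add_implementation(
        wrap_compute_dense(topi.nn.dense),
        wrap_topi_schedule(topi.generic.schedule_dense),
        name="dense.generic",
    )
    if getattr(target.features, "has_sme", False):  # assumed feature flag
        strategy.add_implementation(
            wrap_compute_dense(dense_sme_compute),
            wrap_topi_schedule(schedule_dense_sme),
            name="dense.arm_cpu.sme",
            plevel=12,  # prefer over the fallback when SME is available
        )
    return strategy
```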

@tqchen (Member) commented Mar 12, 2024

Thanks @lhutton1. For Relax and moving forward, one canonical example that can be helpful is the https://github.com/apache/tvm/tree/main/python/tvm/dlight package, which defines pattern matching and application of transforms that can then be used as part of a pass.

Right now dlight started from GPU-based schedules for LLMs, but it would be great to expand it to include CPU flows. Notably, the operator definition still resides in TOPI or other places; dlight focuses on detecting TIR patterns and applying transformations.
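
(A hedged sketch of what a dlight-style rule for a CPU flow could look like. The rule, its matching heuristic, and the transform are all illustrative; dlight's real rules, e.g. dl.gpu.Matmul, are considerably more involved.)

```python
from typing import Optional

from tvm import dlight as dl
from tvm import tir
from tvm.target import Target

class NaiveCPUMatmul(dl.base.ScheduleRule):
    def apply(
        self, func: tir.PrimFunc, target: Target, tunable: bool
    ) -> Optional[tir.Schedule]:
        sch = tir.Schedule(func)
        root = sch.get_block("root")
        for block in sch.get_child_blocks(root):
            # Dispatch on the tag attached in the compute definition,
            # as discussed earlier in this thread.
            if "schedule_rule" in sch.get(block).annotations:
                loops = sch.get_loops(block)
                sch.parallel(loops[0])  # placeholder transform
                return sch
        return None  # not a match; let other rules try

# Used as an IRModule=>IRModule pass:
#   with target:
#       mod = dl.ApplyDefaultSchedule(NaiveCPUMatmul())(mod)
```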

@lhutton1 (Contributor, Author) commented:

Thanks @tqchen. The Relax flow is out of scope for our current use-cases, although we'd want to make sure this RFC doesn't introduce obstacles to porting to the Relax flow in the future. Do you foresee any blockers with the current approach, or could we consider merging?

@tqchen (Member) commented Mar 12, 2024

I think it would be helpful to add a discussion about how the flow would fit into the DLight use-cases. I don't think it would likely cause too much overhead :)

Change-Id: Icefa54694706faef0330c1988af3a2528394540d
Change-Id: I2f239b3eaeb76245c8e79057126578ee5830796e
@leandron (Contributor) commented:

This was approved a few days back, so I'm merging it now so that we can continue the discussions in the context of the tracking issue and the upcoming PRs.

Thank you for all the discussion, everyone!

@leandron leandron merged commit 176a14e into apache:main Mar 19, 2024
@lhutton1 lhutton1 deleted the sme-rfc branch March 19, 2024 09:28
lhutton1 added a commit to lhutton1/tvm that referenced this pull request Apr 24, 2024
This commit adds a new scalable fp32 dense schedule that calls SME
intrinsics according to the SME RFC:
apache/tvm-rfcs#107.

Currently the schedule does not make use of predication, meaning the
output from the matmul compute must be copied in a subsequent compute
stage. This will be removed once support for predication is added.

Change-Id: I9d5ec03d10b03b0637a48116d0cb4076f0ca8192
lhutton1 added a commit to lhutton1/tvm that referenced this pull request May 8, 2024
lhutton1 added a commit to apache/tvm that referenced this pull request May 15, 2024