[Relax][Pass] Lowering passes for GPU IPC memory and allreduce #16759

Merged

Conversation

MasterJH5574 (Contributor) commented:

This PR introduces the lowering passes for GPU IPC memory and all-reduce. It contains the following changes:

1. a pass `IPCAllreduceRewrite`, which rewrites `"runtime.disco.allreduce"` to `"runtime.disco.cuda_ipc.custom_allreduce"` and accordingly rewrites the storage scope of the all-reduce inputs from `"global"` to `"ipc_memory"` (a minimal pass sketch closes this description).

2. a memory planning enhancement that makes the planning aware of storage scopes, so that each storage scope is planned independently (a toy illustration follows this list).

3. a pass `LowerGPUIPCAllocStorage`, which rewrites the storage allocation of IPC memory from builtin ops to calls to the function `"runtime.disco.cuda_ipc.alloc_storage"`.

4. support for the op `relax.builtin.alloc_tensor` with a storage scope; the default storage scope is `"global"` (see the first snippet right after this list).
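
As a usage example for items 3 and 4, the snippet below builds a tiny Relax function that allocates a tensor in the `"ipc_memory"` scope. This is a hedged sketch assuming the post-PR Python signature `relax.op.builtin.alloc_tensor(shape, dtype, runtime_device_index, storage_scope)` with `storage_scope` defaulting to `"global"`:

```python
# Sketch: request a tensor in the "ipc_memory" storage scope.
# Assumes the post-PR signature of relax.op.builtin.alloc_tensor
# (shape, dtype, runtime_device_index, storage_scope="global").
import tvm
from tvm import relax

bb = relax.BlockBuilder()
with bb.function("alloc_demo", params=[]):
    with bb.dataflow():
        # A 64x64 fp16 tensor placed in IPC memory; omitting
        # storage_scope keeps the default "global".
        tensor = bb.emit(
            relax.op.builtin.alloc_tensor(
                relax.ShapeExpr([64, 64]),
                "float16",
                runtime_device_index=0,
                storage_scope="ipc_memory",
            )
        )
        out = bb.emit_output(tensor)
    bb.emit_func_output(out)

bb.get().show()  # inspect the IRModule before/after the lowering passes
```

After `LowerGPUIPCAllocStorage`, the storage backing such a tensor is allocated through `"runtime.disco.cuda_ipc.alloc_storage"` instead of the builtin alloc-storage op.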
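
For item 2, the key invariant is simply that buffers in different storage scopes never share a pool. The toy snippet below is a conceptual illustration only, not TVM's Relax memory planner (which operates on IR with real liveness information): it groups hypothetical allocation requests by scope and plans one reusable block per scope.

```python
# Toy illustration of scope-aware planning (not the actual TVM planner):
# allocations are pooled per storage scope, so an "ipc_memory" buffer is
# never folded into a "global" pool.
from collections import defaultdict
from typing import Dict, List, Tuple


def plan_per_scope(allocs: List[Tuple[str, int]]) -> Dict[str, int]:
    """Plan one reusable block per scope, assuming non-overlapping
    lifetimes so the block only needs to fit the largest request."""
    pools: Dict[str, int] = defaultdict(int)
    for scope, size in allocs:
        pools[scope] = max(pools[scope], size)
    return dict(pools)


print(plan_per_scope([("global", 1 << 20), ("ipc_memory", 1 << 16), ("global", 1 << 18)]))
# -> {'global': 1048576, 'ipc_memory': 65536}
```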

We write the new passes in Python for experimentation and fast development. They are good demonstrations of how the architecture enabled by TVM supports efficient development.
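
To make point 1 concrete, here is a minimal, hedged sketch of the shape such a Python pass can take. It is not the actual `IPCAllreduceRewrite` implementation (the real pass also retags the inputs' storage scope to `"ipc_memory"`), it assumes the all-reduce appears as a call to the `ExternFunc` `"runtime.disco.allreduce"`, and the name `SketchIPCAllreduceRewrite` is illustrative.

```python
# Minimal sketch of a Relax rewrite pass written in Python; the real
# IPCAllreduceRewrite additionally rewrites the inputs' storage scopes.
import tvm
from tvm import relax
from tvm.relax.expr_functor import PyExprMutator, mutator


@mutator
class _AllreduceRewriter(PyExprMutator):
    def visit_call_(self, call: relax.Call):
        call = super().visit_call_(call)  # mutate children first
        # Swap the callee when it is the generic disco all-reduce.
        if (
            isinstance(call.op, relax.ExternFunc)
            and call.op.global_symbol == "runtime.disco.allreduce"
        ):
            return relax.Call(
                relax.ExternFunc("runtime.disco.cuda_ipc.custom_allreduce"),
                call.args,
                call.attrs,
                call.sinfo_args,
            )
        return call


@tvm.transform.module_pass(opt_level=0, name="SketchIPCAllreduceRewrite")
class SketchIPCAllreduceRewrite:
    def transform_module(self, mod: tvm.IRModule, _ctx) -> tvm.IRModule:
        rewriter = _AllreduceRewriter(mod)
        for gvar, func in mod.functions.items():
            if isinstance(func, relax.Function):
                mod[gvar] = rewriter.visit_expr(func)
        return mod
```

Registered this way, the sketch composes with the rest of the Relax pipeline, e.g. inside a `tvm.transform.Sequential`.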

@MasterJH5574 force-pushed the tvm-dev/2024-03-20-lowering-ipc-mem branch from 0211a3a to 3b8183f on Mar 21, 2024.
@tqchen merged commit 858486f into apache:main on Mar 21, 2024.
19 checks passed
thaisacs pushed a commit to thaisacs/tvm that referenced this pull request on Apr 3, 2024.