
MeshBuffer integration in TTNN #17215

Open
omilyutin-tt opened this issue Jan 28, 2025 · 0 comments

@omilyutin-tt
Contributor

As part of the TT-Distributed effort, MeshBuffer will be integrated into TTNN to abstract away allocations made across the entire mesh of devices.

The integration work includes:

  1. Add MeshBuffer-backed storage as one of the tensor storage variants (a sketch follows below).
  2. Extend the processing logic to handle this new case; e.g. allocations, reads, and writes should go through the corresponding mesh APIs.
  3. Switch to the MeshBuffer-backed storage for all initializations of storage across a mesh of devices.

Follow-up work will include refactoring the tensor storage variants to unify the single- and multi-device code paths.
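
A minimal sketch of what item 1 could look like, assuming a new storage struct is added alongside the existing tensor storage variants; the `MeshDeviceStorage` name and its members are assumptions for illustration, not the final design:

```cpp
// Hypothetical sketch only: the struct name and members are assumptions.
#include <memory>
#include <variant>

namespace tt::tt_metal {

class MeshBuffer;  // single allocation spanning the entire mesh of devices

// A storage variant backed by one mesh-wide allocation instead of
// individually managed per-device buffers.
struct MeshDeviceStorage {
    std::shared_ptr<MeshBuffer> mesh_buffer;
};

// The tensor storage variant set would gain the new alternative, e.g.:
// using Storage = std::variant<OwnedStorage, DeviceStorage,
//                              MultiDeviceStorage, MeshDeviceStorage>;

}  // namespace tt::tt_metal
```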

omilyutin-tt self-assigned this Jan 28, 2025
omilyutin-tt added a commit that referenced this issue Jan 29, 2025
### Ticket
#17215

### Problem description
Explicit deallocation at the `MeshBuffer` level is required because TTNN
allows users to explicitly deallocate tensors that are no longer in use.
Setting tensors to `None` or using `del` is one possible path forward, but
that is a larger effort requiring the refactoring of hundreds of files.

### What's changed
Added `is_allocated` and `deallocate` methods to `MeshBuffer` and modified
the corresponding test.

Minor changes to documentation and code style; no functional changes.
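
A minimal sketch of the added methods and how TTNN could use them; the signatures below are assumptions based on this description, not the actual implementation:

```cpp
// Hypothetical sketch: signatures are assumed from the description above.
class MeshBuffer {
public:
    // True while the mesh-wide allocation is still live.
    bool is_allocated() const;

    // Explicitly releases the allocation across the entire mesh, so device
    // memory is freed without waiting for the owning tensor to be destroyed.
    void deallocate();
};

// Example of an explicit tensor-deallocation path calling into the buffer.
void deallocate_mesh_backed_tensor(MeshBuffer& buffer) {
    if (buffer.is_allocated()) {
        buffer.deallocate();
    }
}
```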

### Checklist
- [ ] [All post commit
tests](https://github.com/tenstorrent/tt-metal/actions/runs/13020708085)
- CI appears to be entirely broken. Ran `distributed_unit_tests` locally; these are the only tests that can potentially affect `MeshBuffer`.
- [X] New/Existing tests provide coverage for changes - ran the
`MeshBuffer.Deallocation` test locally.
omilyutin-tt added a commit that referenced this issue Jan 30, 2025
### Ticket
#17215 

### Problem description
See #17215

### What's changed
Extend `MultiDeviceStorage` with a `MeshBuffer` that is optionally created to
back the individual per-device shards. This allows incrementally switching
over to the `MeshBuffer`-backed variant without breaking any of the existing
ops.
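
A rough sketch of the shape this extension could take; the member names below are illustrative assumptions, not the actual definition:

```cpp
// Hypothetical sketch: field names and the device-id key are illustrative only.
#include <memory>
#include <unordered_map>

namespace tt::tt_metal {

class Buffer;      // existing per-device buffer consumed by ops
class MeshBuffer;  // mesh-wide allocation

struct MultiDeviceStorage {
    // Existing per-device shard buffers, keyed here by device id.
    std::unordered_map<int, std::shared_ptr<Buffer>> buffers;

    // Optionally set when the shards are views into a single mesh-wide
    // allocation; ops keep consuming the per-device buffers unchanged.
    std::shared_ptr<MeshBuffer> mesh_buffer;

    bool is_mesh_buffer_backed() const { return mesh_buffer != nullptr; }
};

}  // namespace tt::tt_metal
```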

Long-term plan for tensor storage:
* `MeshBuffer`-backed `MultiDeviceStorage` will become the default in TTNN. It will eventually be renamed to `MeshDeviceStorage`, and `DeviceStorage` will be removed.
* Interactions with `MeshBuffer` will be entirely synchronous and will happen on the main thread. This allows removing all of the async code in `Tensor`.

Next steps for integrating with `MeshBuffer`:
- [X] Implement an explicit dealloc routine for `MeshBuffer` (done in #17265 and integrated here).
- [ ] Implement shard read / write APIs for `MeshBuffer`. From the TTNN perspective, interacting with these APIs will be entirely synchronous (a hypothetical sketch follows after this list).
- [ ] Use the shard read / write APIs when writing data to `MeshBuffer`-backed `MultiDeviceStorage`.
- [ ] When launching multi-device operations, create a `MeshBuffer`-backed `MultiDeviceStorage` first, then supply the individual shards to the ops. This allows allocation to be performed in lock-step across the mesh, while maintaining compatibility with the existing ops infra. Note this will change with the introduction of `MeshWorkload` and will require further exploration.
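
To make the unchecked items above concrete, here is one possible shape for the shard write flow; every name and signature below is a placeholder for APIs that are still to be designed, stubbed out so the sketch compiles:

```cpp
// Placeholder sketch: write_shard and MeshCoordinate are hypothetical
// stand-ins for the shard APIs described above.
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct MeshCoordinate { uint32_t row = 0; uint32_t col = 0; };
struct MeshBuffer {};  // stand-in for the mesh-wide allocation

// Assumed to be synchronous and called from the main thread; a real
// implementation would dispatch through the mesh command queue.
void write_shard(MeshBuffer& /*buffer*/, const MeshCoordinate& /*coord*/,
                 const void* /*host_data*/, std::size_t /*size_bytes*/) {}

// Lock-step style flow: the MeshBuffer is allocated once for the whole mesh,
// then each host shard is written to its mesh coordinate.
void write_all_shards(
        MeshBuffer& buffer,
        const std::vector<std::pair<MeshCoordinate, std::vector<std::byte>>>& host_shards) {
    for (const auto& [coord, data] : host_shards) {
        write_shard(buffer, coord, data.data(), data.size());
    }
}
```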

### Checklist
- [X] [Post commit CI
passes](https://github.com/tenstorrent/tt-metal/actions/runs/13036677827)
- [X] [T3K unit
tests](https://github.com/tenstorrent/tt-metal/actions/runs/13036686228)
- [X] New/Existing tests provide coverage for changes
williamlyTT pushed a commit that referenced this issue Jan 30, 2025
williamlyTT pushed a commit that referenced this issue Jan 30, 2025
yieldthought pushed a commit that referenced this issue Jan 31, 2025
yieldthought pushed a commit that referenced this issue Jan 31, 2025
nikileshx pushed a commit to nikileshx/tt-metal that referenced this issue Feb 3, 2025
omilyutin-tt added a commit that referenced this issue Feb 7, 2025
…#17513)

### Ticket
#17215

### Problem description
Tensors allocated on a mesh buffer (aka "mesh tensors") need read and write
APIs exposed to TTNN.

### What's changed
* Extended the mesh CQ interface to read / write shards, to accommodate the TTNN multi-device sharding APIs.
* Future work includes parallelizing the per-device dispatches internally, within Metal.
* Added `to_device_mesh_tensor` and `to_host_mesh_tensor`, which will be the main APIs used in TTNN to read / write mesh buffer tensors (see the sketch below).
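
A hedged usage sketch of the two new APIs named above; only the function names come from this description, while the types, signatures, and stub bodies below are stand-ins so the example compiles:

```cpp
// Hypothetical sketch: signatures and parameter types are assumptions.
class Tensor {};        // stand-in for the TTNN tensor type
class MeshDevice {};    // stand-in for the mesh of devices
class MemoryConfig {};  // stand-in for the target memory configuration

// Assumed shape: synchronously writes the host shards into a mesh-buffer-backed
// device tensor (stubbed here).
Tensor to_device_mesh_tensor(const Tensor& /*host_tensor*/, MeshDevice& /*mesh_device*/,
                             const MemoryConfig& /*memory_config*/) {
    return Tensor{};
}

// Assumed shape: synchronously reads all shards back into host storage (stubbed here).
Tensor to_host_mesh_tensor(const Tensor& /*device_tensor*/) { return Tensor{}; }

// Round-trip: write a sharded host tensor to the mesh, then read it back.
Tensor round_trip(const Tensor& host_tensor, MeshDevice& mesh_device, const MemoryConfig& memory_config) {
    Tensor device_tensor = to_device_mesh_tensor(host_tensor, mesh_device, memory_config);
    return to_host_mesh_tensor(device_tensor);
}
```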

### Checklist
- [X] [Post commit CI
passes](https://github.com/tenstorrent/tt-metal/actions/runs/13167605541)
- pending
- [X] [T3K unit
tests](https://github.com/tenstorrent/tt-metal/actions/runs/13167605541)
- [X] New/Existing tests provide coverage for changes