
[Review] Adding multi-device support through the IREE compilation pipelines. #17482

Closed
wants to merge 24 commits

Conversation

@benvanik benvanik force-pushed the users/benvanik/device-attrs branch 5 times, most recently from 8870d72 to ddf78be Compare May 23, 2024 18:18
@sogartar (Contributor) commented Jun 5, 2024

@benvanik, great work!
I was taking a sneak preview and I am wondering: is the intended way of setting the affinity of an arbitrary operation through the `stream.affinity` attribute? E.g.:

```mlir
stream.affinity = #hal.device.affinity<@device>
```

Or is there an op interface for that which hides this detail?

Another forward-looking question I have is how a dynamic number of devices/queues would be handled. If we bake things into an attribute we can't handle that.

@benvanik benvanik force-pushed the users/benvanik/device-attrs branch 6 times, most recently from 0539ca4 to 89d4597 Compare June 11, 2024 16:16
@benvanik benvanik force-pushed the users/benvanik/device-attrs branch 6 times, most recently from 12630d0 to 67759a4 Compare June 19, 2024 15:14
@benvanik benvanik force-pushed the users/benvanik/device-attrs branch from 67759a4 to 0956638 Compare June 25, 2024 02:27
@benvanik benvanik force-pushed the users/benvanik/device-attrs branch 7 times, most recently from 8601e30 to 17f7523 Compare June 27, 2024 21:34
@benvanik benvanik force-pushed the users/benvanik/device-attrs branch 2 times, most recently from adf845a to 89d248c Compare July 1, 2024 13:51
@benvanik benvanik force-pushed the users/benvanik/device-attrs branch from ddf26ba to 1cf9b44 Compare July 22, 2024 17:21
@stellaraccident (Collaborator) left a comment

Pass ~1/2.

```cpp
// This pass can't handle that and assumes it's been checked earlier by
// spooky action at a distance. This needs to be fixed.
if (executableTargets->size() != 1) {
  funcOp.emitOpError() << "has multiple executable targets and CPU data "
```
Collaborator:

oof. I had run across this before and wondered what the plan was. Now I know.

Collaborator (Author):

I hear this will be going away soon 🤞

```cpp
// We only support immutable initialized device globals.
// We could track usage up through stores to handle the mutable case but
// the compiler does not generate such programs today.
auto *globalInfo = solver.getExplorer().getGlobalInfo(globalOp);
```
Collaborator:

Does it ever make sense to have a mutable device global? If not, I'm just wondering whether we should have some sort of verifier: catch more illegal programs rather than failing in analysis.

Collaborator (Author):

My thinking is that we'll have cases where we want to adopt devices or pass devices across module boundaries. E.g. a top-level module acting as the pipeline/application could pick the device and pass the handle in to a lower-level module, or a low-level module that is device-specific could pick a vmfb compiled for that specific device while a higher-level module (e.g. the pipeline) is compiled to work with many devices and just inherits it. I believe the latter should work today if the higher-level modules are always compiled with a superset of the devices in the low-level ones, but the former needs mutable globals: the global would be passed into a `set_device` method or something that shares the device instance instead of relying on enumeration to select the same device from the list of available devices.

May be a YAGNI, but hopefully we get more pipelines authored in torch/mlir/etc and this becomes a normal working mode :)
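
A rough sketch of what that mutable-global case might look like (all names here, like `@shared_device` and `@set_device`, are invented for illustration; the compiler does not generate this today):

```mlir
// Hypothetical lower-level module that adopts its device from a pipeline
// module instead of enumerating one itself. The mutable global is exactly
// the case the current immutable-only analysis cannot handle.
util.global private mutable @shared_device : !hal.device

// The top-level pipeline/application module would call this once at startup
// to share its own device instance.
util.func public @set_device(%device: !hal.device) {
  util.global.store %device, @shared_device : !hal.device
  util.return
}
```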

@stellaraccident (Collaborator) left a comment

Aside from the one suspected crash bug in the string manipulation, this looks good to me. It is certainly a lot of code but represents a relatively simple "rotation" of the design. Would have certainly been easier to have started this way -- thanks for doing it. I'm not sure many could have.

Aside from the structural changes, the analyses and associated passes represent a large amount of the meat, and I reviewed them closely.

benvanik added 20 commits July 23, 2024 08:53
* This will fail on cases where a query can't be tracked to a single device, but it's possible in the future to hoist/propagate across CFG edges before running this pass so that it doesn't happen. Today we inline most things and don't deduplicate functions, so it'll be rare that we end up being unable to memoize. Hopefully.
* This materializes device globals early on and sets the affinity so that all following passes can assume the affinity exists.
* This allows for devices to be referenced prior to materialization.
* This changes the passes to be module-level and to look up their targets based on their function context. The passes are not long for this world in their current form, and the spaghettification that happened with the VMVX and LLVM-CPU paths makes it near impossible to factor them properly without a rewrite.
* I think we can generate one benchmark per device and only include dispatches used on that device, but for now that is left as follow-on work.
* This allows for distinguishing multiple devices matching the same requirements, such as multiple GPUs on the same node.
* This allows for less verbose "I don't care, pick something for me" attributes that are expanded into the full target devices and their executable configurations. Resolution happens early in the process so that any flags that may be influencing the resolved configurations are captured and no longer required by the pipeline. Tests and tooling could use these attributes in place of `#hal.device.target` but would need to run the pass as part of their pipeline in order to perform the expansion. Resolving in a pass vs doing so inline also allows for signaling errors and passing in scoped device target registries instead of relying on the globals that are not available in API usage.
* These map to an opaque affinity on the tensor import/export ops and act as a seed to placement when lowering into stream.
* This allows frontends to specify a clone of a tensor to a target context. This is lowered into a stream.async.transfer and with analysis will allow for hinting placement. More flow-level optimizations are likely to be required in larger programs, but until we start to see those, things are kept simple here.
* The legacy pass has been moved aside so that the old flags still work, but it will be removed in the future.
* This should make it more efficient to load/store partial values at the cost of possibly transferring multiple slices when loading/storing many values. Those should be changed to use larger staging buffer transfers anyway, though.
* This performs whole-program analysis to enable querying the ideal affinity for globals, execution ops, and resources. It can run at most phases of compilation (including on linalg/flow IR) though it's primarily used by the stream dialect passes such as conversion. The `AnnotateAffinitiesPass` has been added to aid debugging and can be turned on with the compiler flag `iree-stream-annotate-input-affinities`; it has no impact on the generated program but can be useful if affinity analysis fails during conversion.
* This reworks some of the prior stack to support transfer ops and analysis to determine the placement of ops for execution and resource control.
@benvanik benvanik force-pushed the users/benvanik/device-attrs branch from 1cf9b44 to 9d6cb08 Compare July 23, 2024 15:56
@benvanik benvanik changed the title [WIP] Adding multi-device support through the IREE compilation pipelines. [Review] Adding multi-device support through the IREE compilation pipelines. Jul 23, 2024
@benvanik (Collaborator, Author)

Closing now in favor of a shared/multi-device to main merge PR.

@benvanik benvanik closed this Jul 23, 2024
@benvanik benvanik deleted the users/benvanik/device-attrs branch July 23, 2024 15:58
benvanik added a commit that referenced this pull request Jul 30, 2024
**TLDR**: nothing should break, `--iree-hal-target-backends=` is
deprecated, use `--iree-hal-target-device=` and appropriate
target-specific flags instead.

This reworks the target device concept in the IREE pipeline - in some
cases introducing the concept (flow and HAL) and in others replacing
placeholder mechanisms around stream affinity. This builds upon prior
work that added support for enumerating available devices via the HAL
and providing multiple devices to the runtime tools by adding the
ability to define devices, allowing for execution and storage resources
to be assigned a device, and upgrading passes to support multiple
devices. "Multi-device" here means several things and all are
accomplished with the same mechanism: a single device that may be one of
multiple types (multiple CPU/GPU archs, CPU _or_ GPU, etc), multiple
homogeneous devices (4 of the same exact GPUs accessed through the same
runtime HAL driver), multiple heterogeneous devices (a CPU and a
GPU/NPU/etc), and optional devices (a CPU with some portions offloaded
to a GPU/NPU if it's compatible and available at runtime). In this way
we can provide cross-compilation/targeting, multi-targeting, and
multiple devices with one set of flags, compiler analysis, passes
dealing with the devices, and runtime infrastructure.

Early warning: **it's strongly discouraged to use device information
prior to codegen** - any pass using such information earlier on is a red
flag that will receive pushback. IREE is designed first and foremost as
a cross-compiler with multi-targeting at its core and radically changing
program behavior near the frontend makes it nearly impossible to have
configuration control over the compilation pipeline. Consider
specializing on device prior to codegen tantamount to using C
preprocessor macros based on operating system or architecture: it means
that a problem has not been solved and a workaround has been taken.
There are exceptionally few cases that require device information early,
and those that do can do so in generic ways that do not disturb the
debuggability of the program. For example, far better than preprocessor
macros in C++ are function calls and if statements (as we can do in our
programs), and even better than that are virtual interfaces (ops that
are only lowered to one of multiple implementations later on). That
disclaimer out of the way: it's now possible to query device information
after the input pipeline (global opt/preprocessing/flow). Upstream will
push back against doing so in nearly all cases but it is a useful
mechanism for downstream projects.

The key change here is that the `--iree-hal-target-backends=` compiler
flag has been deprecated. It continues to work for now with the same
behavior as before but usage will shift to the replacement
`--iree-hal-target-device=` flag. A single instance of this flag defines
a single device within the program and repeated uses of it will define
new devices. Devices may be named ("my_device") or anonymous (in which
case they will be assigned an ordinal like 0 or 1), and each device may
be backed by one or more target devices (Vulkan, local host, HIP, etc).
Each target device in the compiler (represented by
`IREE::HAL::TargetDevice`) may have any number of backends with various
configurations (multiple archs, different deployment formats, etc
represented by one or more `IREE::HAL::ExecutableTargetAttr` values).

Example flags:
```sh
# Two devices, one the local host device and the other a Vulkan device:
--iree-hal-target-device=local --iree-hal-target-device=vulkan

# One device that selects Vulkan if available and otherwise falls back to the local host device:
--iree-hal-target-device=vulkan,local

# Two CUDA devices selected by runtime ordinal; at runtime two --device=
# flags are required to configure both devices:
--iree-hal-target-device=cuda[0] --iree-hal-target-device=cuda[1]

# A fully-defined target specification:
--iree-hal-target-device=#hal.device.target<"cuda", {...}, [#hal.executable.target<...>]>

# Named device for defining a reference by #hal.device.promise<@some_name>:
--iree-hal-target-device=some_name=vulkan
```

The device metadata as specified in the compiler is used to produce
enumeration code that executes at runtime and queries the available
devices to find the appropriate matches. This means that if the program
is compiled to target two CUDA devices then at runtime there must be two
CUDA devices specified - the indirection allows for the compiled
artifact to work with any two CUDA devices targeted by UUID, device
ordinal, etc and not just the first and second CUDA device in the
system. E.g.:
```sh
iree-compile --iree-hal-target-device=cuda[0] --iree-hal-target-device=cuda[1]
iree-run-module --device=cuda://UUID_A --device=cuda://UUID_B
```
Device targets in the compiler can now specify the ordinal of the device
in order to differentiate between multiple devices at runtime (the
`cuda[0]` and `cuda[1]` above indicate the first and second CUDA devices
provided to the runtime).

Major new attributes:
* `#hal.device.promise<@device>` is a reference to a device that will be
provided at a later stage. Frontends can use this as a placeholder for
devices that are specified on the command line without needing to say
what those devices are when exporting.
* `#hal.device.alias<"name">` specifies an `IREE::HAL::TargetDevice` in
the compiler (`vulkan`, `local`, `hip`, etc) and expands to a full
`#hal.device.target` based on target-specific flags.
* `#hal.device.select<[...]>` controls selection by enumerating each
device in turn and matching the first found.
* `#hal.device.fallback<@other_device>` provides a fallback reference
that the device will match if no other device matches. Note that having
two devices with the same target will create two copies at runtime; if
you want to share the existing device then the fallback mechanism must
be used.
* `#hal.device.affinity<@device>` (with an optional queue mask) is used
on ops to indicate on which device they should execute; see the sketch
below.
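
A minimal sketch of the affinity attribute in use (the device name is invented for illustration, the op-level `stream.affinity` key follows the convention discussed earlier in this review, and the bracketed queue mask is assumed syntax):

```mlir
// Pin an elementwise op to @gpu_device; the optional queue mask [0, 1]
// restricts execution to the first two queues on that device.
%sum = arith.addi %a, %b {
  stream.affinity = #hal.device.affinity<@gpu_device, [0, 1]>
} : tensor<4xi32>
```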

All of the above flags are just syntactic sugar that add the above
attributes to the program IR and it's possible for frontends to insert
these attributes or ops directly depending on use-case. In most cases
leaving placeholders in the IR such that the exact target can be
specified during compilation is ideal: this allows one output from the
frontend to be used with any number of targets and configurations.
Online compilers, though, may want to bake in their exact configuration
and can do so without the need for flags that may lose information. The
general flow of the `buildHALDeviceAssignmentPassPipeline`/`iree-opt
--iree-hal-device-assignment-pipeline` is:
1. `--iree-hal-target-device=` flags are parsed and a
`hal.device.targets` attribute is added to the module.
   * `--iree-hal-target-device=cpu_device=local` becomes
     `hal.device.targets = [#hal.device.alias<"local"> : !hal.device]`
   * `--iree-hal-target-device=cpu_device=local
     --iree-hal-target-device=gpu_device=cuda,hip` becomes
   ```mlir
   hal.device.targets = {
     cpu_device = #hal.device.alias<"local"> : !hal.device,
     gpu_device = #hal.device.select<[
       #hal.device.alias<"cuda"> : !hal.device,
       #hal.device.alias<"hip"> : !hal.device
     ]> : !hal.device
   }
   ```
2. The `hal.device.targets` attribute (if any) is expanded into
`util.global` ops for each device. These globals are initialized with
one of the supported attributes which are much later turned into
enumeration/selection logic. The above multi-device example becomes:
   ```mlir
   builtin.module attributes {
     stream.affinity.default = #hal.device.affinity<@cpu_device>
   } {
     util.global private @cpu_device = #hal.device.alias<"local"> : !hal.device
     util.global private @gpu_device = #hal.device.select<[
       #hal.device.alias<"cuda"> : !hal.device,
       #hal.device.alias<"hip"> : !hal.device
     ]> : !hal.device
   }
   ```
3. Any `#hal.device.promise` attributes will be changed to reference the
globals with the same name. This allows for retargeting of inputs by
letting a frontend specify named devices prior to them having been
passed on the command line (or inserted by some other pipeline).
4. Any `#hal.device.alias` attributes are converted to full
`#hal.device.target` attributes using the appropriate
`IREE::HAL::TargetDevice` implementation (see the sketch below).
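
A rough before/after sketch of steps 3 and 4, with names invented and target details elided in the same style as the flag example above (the exact attribute produced for a resolved promise is presumed here):

```mlir
// Step 3: a frontend-placed #hal.device.promise<@gpu_device> now resolves
// against the materialized @gpu_device global of the same name, e.g. an
// op-level affinity becomes #hal.device.affinity<@gpu_device>.

// Step 4: the alias initializer on a global is expanded into a full target
// by the registered IREE::HAL::TargetDevice implementation:
util.global private @cpu_device = #hal.device.target<"local", {...}, [#hal.executable.target<...>]> : !hal.device
```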

Upon completion of the pipeline there are globals initialized with
either a specific device target or a selection mechanism to pick between
targets. From that point onward devices are a structural part of the
program and can be referenced by symbol name via attributes like
`#hal.device.affinity`.

Programs are expected to specify the device affinity for all operations
either explicitly or implicitly. By default (as today) the first device
defined will be used but going forward we will want frontends to start
specifying devices. To that end the `flow.tensor.transfer` operation was
added to allow a tensor to have a device affinity assigned to it. A new
analysis is added that allows all tensors (or stream resources) and ops
interacting with them to be queried for which device they should be
placed on. For example, a frontend can specify multiple devices be used
in a computation by transferring the tensors used:
```mlir
util.func private @my_func(%arg0: tensor<4xi32>) -> tensor<4xi32> {
  %arg0_device_a = flow.tensor.transfer %arg0 : tensor<4xi32> to #hal.device.promise<@device_a>
  %compute_device_a = arith.addi %arg0_device_a, %arg0_device_a : tensor<4xi32>
  %transient_device_b = flow.tensor.transfer %compute_device_a : tensor<4xi32> to #hal.device.promise<@device_b>
  %compute_device_b = arith.muli %transient_device_b, %transient_device_b : tensor<4xi32>
  util.return %compute_device_b : tensor<4xi32>
}
```

To avoid copies there are also ways for frontends to indicate where
argument and result tensors are placed. The best way (in that it's most
general/powerful) is for the frontends to emit `hal.tensor.import`,
`hal.tensor.export`, and `hal.tensor.alias` ops directly as they all now
take affinities. When using the default ABI translation pass it's
possible to add arg/result attrs to public functions, e.g. `util.func
public @my_func(%arg0: tensor<2xi32> {iree.abi.affinity =
#hal.device.promise<@device_a>}) -> (tensor<2xi32> {iree.abi.affinity =
#hal.device.promise<@device_b>})`. Shorthand is provided to allow
specifying an `iree.abi.affinity` on functions themselves for when all
arguments and results are placed on the same device.
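
A minimal sketch of that function-level shorthand, with names invented for illustration:

```mlir
// All arguments and results of @my_func are placed on @device_a via one
// function-level attribute instead of per-argument/result attributes.
util.func public @my_func(%arg0: tensor<2xi32>) -> tensor<2xi32>
    attributes {iree.abi.affinity = #hal.device.promise<@device_a>} {
  util.return %arg0 : tensor<2xi32>
}
```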

Once devices are specified, materialized in the program as globals, and
referenced (via the magic default attribute, scoped attributes, or
explicit transfer operations), most of the mechanics are implementation
details of the stream and HAL dialect lowerings.
Partitioning, allocation, and scheduling in the stream dialect were
always affinity-aware and required only minor tweaks as part of this
work while the HAL TODOs for multi-device were implemented by memoizing
resources per-device and adding the machinery to enumerate and select
devices.

This was reviewed in the following chunks and tested in a roll-up PR
#17482:
* #17915
* #17917
* #17916
* #17918
* #17919
* #17920