[multi-device] Adding `#hal.device.affinity` and related attributes. #17915

benvanik · 2024-07-16T00:52:17Z

This replaces the existing #hal.affinity.queue placeholder with #hal.device.affinity as a way of specifying global device symbols that a particular affinity-aware op executes on. Devices are now backed by normal util.global ops (such as util.global private @my_gpu : !hal.device) and affinities reference them by symbol and optional queue affinity. The lookup logic for resolving devices is now much simpler and separate from device enumeration and selection.

Devices defined by globals are initialized with the existing #hal.device.target attribute that describes the device at runtime. For more complex device selection logic, the new #hal.device.select attribute expresses fallback selection, including references to other initialized devices. Initialization is provided via an attr interface so that downstream projects can override it with additional queries or priority behavior.

// A logical device that may be implemented by different implementations at runtime:
util.global private @gpu_device = #hal.device.select<[
  #hal.device.target<"cuda"> : !hal.device,
  #hal.device.target<"hip"> : !hal.device,
  #hal.device.target<"metal"> : !hal.device,
  #hal.device.target<"vulkan"> : !hal.device
]> : !hal.device

// Two heterogeneous devices one of which may not exist at runtime:
util.global private @required_device = #hal.device.target<"some_required_device"> : !hal.device
util.global private @optional_device = #hal.device.select<[
  #hal.device.target<"some_optional_device"> : !hal.device,
  #hal.device.fallback<@required_device> : !hal.device
]> : !hal.device
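
As a rough illustration of the selection semantics above (first match wins, with a fallback that reuses an already-initialized global), here is a minimal Python model. It is a sketch under stated assumptions, not the initializer code the compiler emits: the class names, the `resolve` helper, and the use of a plain set of driver names to stand in for runtime device enumeration are all hypothetical, and returning `None` when nothing matches is a choice made for the sketch only.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Set, Union


@dataclass(frozen=True)
class Target:
    """Stands in for #hal.device.target<"name"> (hypothetical model)."""
    name: str


@dataclass(frozen=True)
class Fallback:
    """Stands in for #hal.device.fallback<@symbol> (hypothetical model)."""
    symbol: str


@dataclass(frozen=True)
class Select:
    """Stands in for #hal.device.select<[...]> (hypothetical model)."""
    candidates: Sequence[Union[Target, Fallback]]


Spec = Union[Target, Fallback, Select]


def resolve(spec: Spec, available: Set[str],
            initialized: Dict[str, Optional[str]]) -> Optional[str]:
    """Returns the chosen device name, or None if nothing matches."""
    if isinstance(spec, Target):
        # Assumption: a target matches when a device of that name is available.
        return spec.name if spec.name in available else None
    if isinstance(spec, Fallback):
        # Reuse whatever the referenced global was already initialized to.
        return initialized.get(spec.symbol)
    # Select: try each candidate in order and take the first that matches.
    for candidate in spec.candidates:
        device = resolve(candidate, available, initialized)
        if device is not None:
            return device
    return None


# Mirrors the two examples above.
gpu_device = Select([Target("cuda"), Target("hip"),
                     Target("metal"), Target("vulkan")])

available = {"some_required_device"}
devices: Dict[str, Optional[str]] = {}
devices["required_device"] = resolve(
    Target("some_required_device"), available, devices)
devices["optional_device"] = resolve(
    Select([Target("some_optional_device"), Fallback("required_device")]),
    available, devices)
```

In this model `@optional_device` ends up sharing the device chosen for `@required_device` when the optional one is absent, rather than creating a second copy.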

Future changes add device global initialization and assignment; additional attributes such as #hal.device.promise (for referencing global devices prior to their definition) and #hal.device.alias (for retaining command-line-like default initialization functionality in IR); and an extension of #hal.device.target to support runtime selection of devices of the same type (such as multiple GPUs).

(NOTE: this is a staging PR for review - it's not expected this will pass CI)




**TLDR**: nothing should break, `--iree-hal-target-backends=` is deprecated, use `--iree-hal-target-device=` and appropriate target-specific flags instead.

This reworks the target device concept in the IREE pipeline - in some cases introducing the concept (flow and HAL) and in others replacing placeholder mechanisms around stream affinity. This builds upon prior work that added support for enumerating available devices via the HAL and providing multiple devices to the runtime tools by adding the ability to define devices, allowing for execution and storage resources to be assigned a device, and upgrading passes to support multiple devices.

"Multi-device" here means several things and all are accomplished with the same mechanism: a single device that may be one of multiple types (multiple CPU/GPU archs, CPU _or_ GPU, etc), multiple homogeneous devices (4 of the same exact GPUs accessed through the same runtime HAL driver), multiple heterogeneous devices (a CPU and a GPU/NPU/etc), and optional devices (a CPU with some portions offloaded to a GPU/NPU if it's compatible and available at runtime). In this way we can provide cross-compilation/targeting, multi-targeting, and multiple devices with one set of flags, compiler analysis, passes dealing with the devices, and runtime infrastructure.

Early warning: **it's strongly discouraged to use device information prior to codegen** - any pass using such information earlier on is a red flag that will receive pushback. IREE is designed first and foremost as a cross-compiler with multi-targeting at its core and radically changing program behavior near the frontend makes it nearly impossible to have configuration control over the compilation pipeline. Consider specializing on device prior to codegen tantamount to using C preprocessor macros based on operating system or architecture: it means that a problem has not been solved and a workaround has been taken. There are exceptionally few cases that require device information early, and those that do can do so in generic ways that do not disturb the debuggability of the program. For example, far better than preprocessor macros in C++ are function calls and if statements (as we can do in our programs), and even better than that are virtual interfaces (ops that are only lowered to one of multiple implementations later on).

That disclaimer out of the way: it's now possible to query device information after the input pipeline (global opt/preprocessing/flow). Upstream will push back against doing so in nearly all cases but it is a useful mechanism for downstream projects.

The key change here is that the `--iree-hal-target-backends=` compiler flag has been deprecated. It continues to work for now with the same behavior as before but usage will shift to the replacement `--iree-hal-target-device=` flag. A single instance of this flag defines a single device within the program and repeated uses of it will define new devices. Devices may be named ("my_device") or anonymous (in which case they will be assigned an ordinal like 0 or 1), and each device may be backed by one or more target devices (Vulkan, local host, HIP, etc). Each target device in the compiler (represented by `IREE::HAL::TargetDevice`) may have any number of backends with various configurations (multiple archs, different deployment formats, etc represented by one or more `IREE::HAL::ExecutableTargetAttr` values).

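
To make the flag-to-device mapping concrete, here is a small Python sketch that turns a list of `--iree-hal-target-device=` values into device descriptions, covering only the simple value forms that appear in this description (`local`, `vulkan,local`, `cuda[0]`, `some_name=vulkan`). It is an illustrative model, not the compiler's parser: `DeviceSpec`, `parse_target_device_flags`, and the choice to name anonymous devices after their position in the flag list are assumptions, and fully-specified `#hal.device.target<...>` values are deliberately rejected.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# One target entry: (target device name, optional runtime ordinal).
TargetEntry = Tuple[str, Optional[int]]


@dataclass
class DeviceSpec:
    """Hypothetical description of one device defined by one flag."""
    name: str                   # explicit name, or positional ordinal as text
    targets: List[TargetEntry]  # more than one entry means "select in order"


# Matches forms such as "vulkan" or "cuda[1]".
_TARGET_RE = re.compile(r"^(\w+)(?:\[(\d+)\])?$")


def parse_target_device_flags(values: List[str]) -> List[DeviceSpec]:
    devices = []
    for index, value in enumerate(values):
        if value.startswith("#"):
            # Full attribute syntax is out of scope for this sketch.
            raise ValueError(f"unsupported in this sketch: {value}")
        name, sep, rest = value.partition("=")
        if not sep:
            # Anonymous device. Assumption: the ordinal is the flag position.
            name, rest = str(index), value
        targets = []
        for item in rest.split(","):
            match = _TARGET_RE.match(item.strip())
            if match is None:
                raise ValueError(f"malformed target: {item!r}")
            ordinal = int(match.group(2)) if match.group(2) is not None else None
            targets.append((match.group(1), ordinal))
        devices.append(DeviceSpec(name, targets))
    return devices


parsed = parse_target_device_flags(["local", "gpu=vulkan,local", "cuda[1]"])
```

Each resulting `DeviceSpec` corresponds to one device in the program; an entry with several targets models the "pick the first available" behavior rather than several devices.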
Example flags:

```sh
# Two devices, one the local host device and the other a Vulkan device:
--iree-hal-target-device=local --iree-hal-target-device=vulkan

# One device selecting between Vulkan if available and otherwise use the local host device:
--iree-hal-target-device=vulkan,local

# Two CUDA devices selected by runtime ordinal; at runtime two --device=
# flags are required to configure both devices:
--iree-hal-target-device=cuda[0] --iree-hal-target-device=cuda[1]

# A fully-defined target specification:
--iree-hal-target-device=#hal.device.target<"cuda", {...}, [#hal.executable.target<...>]>

# Named device for defining a reference by #hal.device.promise<@some_name>:
--iree-hal-target-device=some_name=vulkan
```

The device metadata as specified in the compiler is used to produce enumeration code that executes at runtime and queries the available devices to find the appropriate matches. This means that if the program is compiled to target two CUDA devices then at runtime there must be two CUDA devices specified - the indirection allows for the compiled artifact to work with any two CUDA devices targeted by UUID, device ordinal, etc and not just the first and second CUDA device in the system. E.g. `iree-compile --iree-hal-target-device=cuda[0] --iree-hal-target-device=cuda[1]` and `iree-run-module --device=cuda://UUID_A --device=cuda://UUID_B`. Device targets in the compiler can now specify the ordinal of the device in order to differentiate between multiple devices at runtime (the `cuda[0]` and `cuda[1]` above indicate the first CUDA device and second CUDA device provided to the runtime).

Major new attributes:

* `#hal.device.promise<@device>` is a reference to a device that will be provided at a later stage. Frontends can use this as a placeholder for devices that are specified on the command line without needing to say what those devices are when exporting.
* `#hal.device.alias<"name">` specifies an `IREE::HAL::TargetDevice` in the compiler (`vulkan`, `local`, `hip`, etc) and expands to a full `#hal.device.target` based on target-specific flags.
* `#hal.device.select<[...]>` controls selection by enumerating each device in turn and matching the first found.
* `#hal.device.fallback<@other_device>` provides a fallback reference that the device will match if no other device matches. Note that having two devices with the same target will create two copies at runtime - if wanting to use the existing device then the fallback mechanism must be used.
* `#hal.device.affinity<@device>` (and optional queue mask) is used on ops to indicate on which device they should execute.

All of the above flags are just syntactic sugar that add the above attributes to the program IR and it's possible for frontends to insert these attributes or ops directly depending on use-case. In most cases leaving placeholders in the IR such that the exact target can be specified during compilation is ideal: this allows one output from the frontend to be used with any number of targets and configurations. Online compilers, though, may want to bake in their exact configuration and can do so without the need for flags that may lose information.

The general flow of the `buildHALDeviceAssignmentPassPipeline`/`iree-opt --iree-hal-device-assignment-pipeline` is:

1. `--iree-hal-target-device=` flags are parsed and a `hal.device.targets` attribute is added to the module.
   * `--iree-hal-target-device=cpu_device=local` becomes `hal.device.targets = [#hal.device.alias<"local"> : !hal.device]`
   * `--iree-hal-target-device=cpu_device=local --iree-hal-target-device=gpu_device=cuda,hip` becomes

     ```mlir
     hal.device.targets = {
       cpu_device = #hal.device.alias<"local"> : !hal.device,
       gpu_device = #hal.device.select<[#hal.device.alias<"cuda"> : !hal.device, #hal.device.alias<"hip"> : !hal.device]> : !hal.device
     }
     ```

2. The `hal.device.targets` attribute (if any) is expanded into `util.global` ops for each device. These globals are initialized with one of the supported attributes which are much later turned into enumeration/selection logic. The above multi-device example becomes:

   ```mlir
   builtin.module attributes {stream.affinity.default = #hal.device.affinity<@cpu_device>} {
     util.global private @cpu_device = #hal.device.alias<"local"> : !hal.device
     util.global private @gpu_device = #hal.device.select<[#hal.device.alias<"cuda"> : !hal.device, #hal.device.alias<"hip"> : !hal.device]> : !hal.device
   }
   ```

3. Any `#hal.device.promise` attributes will be changed to reference the globals with the same name. This allows for retargeting of inputs by letting a frontend specify named devices prior to them having been passed on the command line (or inserted by some other pipeline).
4. Any `#hal.device.alias` attributes are converted to full `#hal.device.target` attributes using the appropriate `IREE::HAL::TargetDevice` implementation.

Upon completion of the pipeline there are globals initialized with either a specific device target or a selection mechanism to pick between targets. From that point onward devices are a structural part of the program and can be referenced by symbol name via attributes like `#hal.device.affinity`.

Programs are expected to specify the device affinity for all operations either explicitly or implicitly. By default (as today) the first device defined will be used but going forward we will want frontends to start specifying devices. To that end the `flow.tensor.transfer` operation was added to allow a tensor to have a device affinity assigned to it. A new analysis is added that allows all tensors (or stream resources) and ops interacting with them to be queried for which device they should be placed on.

For example, a frontend can specify multiple devices be used in a computation by transferring the tensors used:

```mlir
util.func private @my_func(%arg0: tensor<4xi32>) -> tensor<4xi32> {
  %arg0_device_a = flow.tensor.transfer %arg0 : tensor<4xi32> to #hal.device.promise<@device_a>
  %compute_device_a = arith.addi %arg0_device_a, %arg0_device_a : tensor<4xi32>
  %transient_device_b = flow.tensor.transfer %compute_device_a : tensor<4xi32> to #hal.device.promise<@device_b>
  %compute_device_b = arith.muli %transient_device_b, %transient_device_b : tensor<4xi32>
  util.return %compute_device_b : tensor<4xi32>
}
```

To avoid copies there are also ways for frontends to indicate where argument and result tensors are placed. The best way (in that it's most general/powerful) is for the frontends to emit `hal.tensor.import`, `hal.tensor.export`, and `hal.tensor.alias` ops directly as they all now take affinities. When using the default ABI translation pass it's possible to add arg/result attrs to public functions, e.g. `util.func public @my_func(%arg0: tensor<2xi32> {iree.abi.affinity = #hal.device.promise<@device_a>}) -> (tensor<2xi32> {iree.abi.affinity = #hal.device.promise<@device_b>})`. Shorthand is provided to allow specifying an `iree.abi.affinity` on functions themselves for when all arguments and results are placed on the same device.

After the point devices are specified, materialized in the program as globals, and referenced either via the magic default attribute, scoped attributes, or explicit transfer operations most of the mechanics are implementation details of the stream and HAL dialect lowerings. Partitioning, allocation, and scheduling in the stream dialect were always affinity-aware and required only minor tweaks as part of this work while the HAL TODOs for multi-device were implemented by memoizing resources per-device and adding the machinery to enumerate and select devices.

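
The placement implied by the `flow.tensor.transfer` example can be pictured with a toy forward pass in Python. This is only a sketch of the idea that a transfer pins its result to a device and that consumers inherit placement from their operands; it is not the analysis added by this work, which is not described in detail here. The `Op` record, the `assign_affinities` function, and the single-device-per-value simplification are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class Op(NamedTuple):
    """Hypothetical, simplified view of one op in a straight-line function."""
    result: str
    name: str
    operands: List[str]
    transfer_to: Optional[str] = None  # device symbol for transfer ops


def assign_affinities(ops: List[Op], default_device: str) -> Dict[str, str]:
    """Maps each result value to the device it is assumed to live on."""
    placement: Dict[str, str] = {}
    for op in ops:
        if op.transfer_to is not None:
            # A transfer pins its result to the named device.
            device = op.transfer_to
        else:
            # Otherwise inherit from the first operand with a known device,
            # falling back to the module default (assumption of this sketch).
            device = next(
                (placement[v] for v in op.operands if v in placement),
                default_device)
        placement[op.result] = device
    return placement


# The ops of @my_func from the example above, in program order.
my_func = [
    Op("%arg0_device_a", "flow.tensor.transfer", ["%arg0"], "device_a"),
    Op("%compute_device_a", "arith.addi",
       ["%arg0_device_a", "%arg0_device_a"]),
    Op("%transient_device_b", "flow.tensor.transfer",
       ["%compute_device_a"], "device_b"),
    Op("%compute_device_b", "arith.muli",
       ["%transient_device_b", "%transient_device_b"]),
]
placement = assign_affinities(my_func, default_device="device_a")
```

Under these assumptions the add is placed on `device_a` and the multiply on `device_b`, matching the value names chosen in the example.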
This was reviewed in the following chunks and tested in a roll-up PR #17482:

* #17915
* #17917
* #17916
* #17918
* #17919
* #17920

benvanik added the `compiler/dialects` label (Relating to the IREE compiler dialects (flow, hal, vm)) Jul 16, 2024

benvanik requested review from stellaraccident and ScottTodd July 16, 2024 00:52

benvanik mentioned this pull request Jul 16, 2024

[Review] Adding multi-device support through the IREE compilation pipelines. #17482

Closed

benvanik marked this pull request as ready for review July 16, 2024 05:02

benvanik requested review from hanhanW and MaheshRavishankar as code owners July 16, 2024 05:02

Adding #hal.device.select and related attributes.

3f6d62d

These allow for device globals to be identified and initialized from available runtime devices. The new InitializeDevicesPass finds globals with the attributes set and builds the appropriate initializers as part of the HAL pipeline.

benvanik force-pushed the users/benvanik/multi-device-0 branch from 38af729 to 84fa5af Compare July 16, 2024 20:28

benvanik requested review from kuhar, qedawkins, Groverkss and antiagainst as code owners July 16, 2024 20:28

benvanik changed the base branch from shared/multi-device to main July 16, 2024 20:31

benvanik changed the base branch from main to shared/multi-device July 16, 2024 20:32

benvanik removed request for antiagainst, MaheshRavishankar, kuhar, hanhanW, qedawkins and Groverkss July 16, 2024 23:10

ScottTodd approved these changes Jul 17, 2024


Adding the #hal.device.affinity attr replacing #hal.affinity.queue.

5b57b1d

The queue affinity attr was added as a placeholder to test things but was never used/useful.

benvanik force-pushed the users/benvanik/multi-device-0 branch from 84fa5af to 5b57b1d Compare July 18, 2024 03:53

benvanik merged commit b0cbc37 into shared/multi-device Jul 18, 2024
13 of 14 checks passed

benvanik deleted the users/benvanik/multi-device-0 branch July 18, 2024 03:54

benvanik mentioned this pull request Jul 23, 2024

Merging multi-device branch to main. #17987

Merged
