Add initial support for WebGPU #6492
Conversation
All runtime functions are currently unimplemented, and the shader codegen just emits an empty compute shader.
Synchronize device access using a WebGpuContext object, as is done in the other GPU runtimes. Device initialization is asynchronous, so we rely on Emscripten's Asyncify mechanism to wait for it to complete.
Release the device and adapter.
WebGPU uses asynchronous callbacks for error handling, so we need to spin-lock while waiting for them to fire in order to determine success/failure status for any WebGPU APIs that we call.
Create a WGPUBuffer with Storage|CopyDst|CopySrc usages. We'll need a staging buffer to perform host<->device transfers, as a buffer used as a storage buffer cannot be mapped.
Use a staging buffer to copy data from the device, since we cannot map a buffer that is used as a storage buffer. This logic will need an overhaul in order to support buffers that represent sub-regions of larger buffers.
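A sketch of that copy-from-device path (pseudocode only; the actual runtime uses the webgpu.h C API, and the asynchronous map requires the same spin/yield handling as error callbacks):

```
copy_to_host(device_buffer, host_ptr, size):
    encoder = device.createCommandEncoder()
    encoder.copyBufferToBuffer(device_buffer, offset, staging_buffer, 0, size)
    queue.submit(encoder.finish())
    staging_buffer.mapAsync(READ)            # asynchronous: wait for the
                                             # map callback to fire
    memcpy(host_ptr, staging_buffer.getMappedRange(), size)
    staging_buffer.unmap()
```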
Just wait for all submitted commands to complete.
This implements enough of the WGSL codegen required to generate code for a 32-bit version of the apps/blur example. Buffer arguments are emitted as read_write storage buffers. Non-buffer arguments are collected into a structure and generated as a single uniform buffer. Workgroup sizes are currently required to be constant. This can be relaxed in the future by using pipeline-overridable constants.
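To make that scheme concrete, the emitted WGSL for a kernel with one buffer argument and two scalar arguments might look roughly like this (illustrative only; names, binding numbers, and the trivial body are made up, and real output from the backend will differ):

```wgsl
struct Args {
    width : i32,
    height : i32,
}

// Buffer arguments become read_write storage buffers.
@group(0) @binding(0) var<storage, read_write> buf : array<f32>;
// Non-buffer arguments are packed into a single uniform buffer.
@group(0) @binding(1) var<uniform> args : Args;

@compute @workgroup_size(16, 16)
fn my_kernel(@builtin(global_invocation_id) gid : vec3<u32>) {
    // Kernel body generated from the Halide IR goes here.
    buf[gid.y * u32(args.width) + gid.x] = 0.0;
}
```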
WGSL uses different syntax for type casts.
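A small illustration of the difference (variable names hypothetical): where C-like languages use prefix casts, WGSL uses constructor-style conversions and an explicit bitcast for reinterpretation.

```wgsl
// C-style: (float)x, (int)y   ->   WGSL:
let a : f32 = f32(x);           // value conversion from i32
let b : i32 = i32(y);           // value conversion from u32
let c : f32 = bitcast<f32>(n);  // reinterpret bits, no conversion
```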
Create a compute pipeline, create a uniform buffer for non-buffer arguments, and dispatch a compute command to the queue. Does not handle workgroup storage yet.
The queue handle received from wgpuDeviceGetQueue needs to be released, so it makes more sense to automatically get/release this handle in the WgpuContext constructor/destructor.
This requires a native implementation of WebGPU (e.g. Dawn or wgpu).
Link generator tests against the native WebGPU library if specified.
We have to make this a compile-time switch for now, as the supported APIs currently differ between Dawn and Emscripten. We should be able to remove all of these conditionals when the native API stabilizes.
Use Dawn's node bindings to run these tests, by pre-pending some initialization code to the script that nodejs will invoke.
Thanks for this -- I don't have time today for more than a quick skim, but wanted to provide a handful of (minor) bits of feedback, but the important thing to tell you right away is that including webgpu.h (or any other outside-of-Halide header file) in the runtime code is a no-no, at least for anything beyond initial experimentation. Wanted to be sure you were aware of that sooner rather than later.
dependencies/wasm/CMakeLists.txt (Outdated)
@@ -92,8 +92,16 @@ function(add_wasm_executable TARGET)
-s ASSERTIONS=1
-s ALLOW_MEMORY_GROWTH=1
-s ENVIRONMENT=node
-s USE_WEBGPU=1
-s ASYNCIFY
ASYNCIFY is a pretty nontrivial code change IIRC -- What in this PR requires it? I don't think we'd want to enable it for most tests.
WebGPU is a JavaScript API that makes use of asynchronous callbacks for things like mapping memory onto the host, device initialization, waiting for device commands to finish, and error checking. I don't think there's any way to implement this stuff without Asyncify, unfortunately.
We could certainly make the addition of this flag conditional on WebGPU being the target, though.
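Under Emscripten, the waiting pattern depends on Asyncify to suspend the WASM stack so the browser's event loop can deliver the callback. A sketch (assuming the standard emscripten_sleep API, which requires building with -s ASYNCIFY; not compilable outside an Emscripten toolchain):

```
#include <emscripten.h>

/* Flag set by the WebGPU callback when it fires. */
static volatile int done = 0;

static void wait_for_callback(void) {
    while (!done) {
        /* With -s ASYNCIFY, emscripten_sleep unwinds the WASM stack and
           yields to the JS event loop, letting the async callback run;
           without Asyncify this loop would deadlock. */
        emscripten_sleep(1);
    }
}
```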
It may be OK everywhere, I just need to re-read the details of the code transformation to decide if it compromises the value of our (other) wasm tests in any way. If it doesn't, no worries. (Definitely worth adding a comment about this flag being necessary for WebGPU though.)
Added a comment.
OK, after re-reading the details on Asyncify, it seems invasive enough that I'm inclined to suggest we should try to only use it when necessary, rather than across the board (e.g.: "As mentioned earlier, unoptimized builds with Asyncify can be large and slow. Build with optimizations (say, -O3) to get good results."). No need to worry about it at this point (and I'm more than willing to drop this point if the emcc folks tell me I'm overthinking it).
I've made the addition of -s ASYNCIFY conditional on the target including the webgpu feature for now.
src/CodeGen_WebGPU_Dev.cpp (Outdated)
void CodeGen_WebGPU_Dev::CodeGen_WGSL::visit(const IntImm *op) {
    internal_assert(op->type.bits() == 32)
        << "WGSL only supports 32-bit integers";
...well that is disappointing. No other int sizes? or unsigned?
At the moment, WGSL only supports 32-bit integers, signed and unsigned. Extensions to support 8- and 16-bit types are inevitable though, and potentially could make the core spec for v1 if we can demonstrate enough demand (bearing in mind that we'd have to emulate them on certain platforms).
All I can say is that most of the high-performance, well-scheduled chunks of Halide code I've seen in production make extensive use of 8- and 16-bit vector types. Their absence would be a big limitation for maximizing performance (and memory usage), not to mention making portable pipelines a lot more painful to write and schedule (since pipelines running on typical SIMD or DSP architectures won't want to leave the benefits of having smaller ints, and thus effectively-wider vectors, on the table...)
For other type-challenged backends (d3d12) I believe we emulate low bit width types in Halide, so you could take a look at those backends. Feel free to email me for help demonstrating enough demand.
This makes the generated code a little more human-readable.
This is a verbatim copy of the Emscripten version, just without the include directives.
Tweak atomic primitive usage to avoid the need for this.
(I took the liberty of pacifying clang-tidy and bringing up to date with top-of-tree; please let me know if you'd prefer I avoid doing changes of this sort here in the future, but these seemed harmless)
OK, really close now! The only things left I see failing:
This ends in "WGPU device lost" -- looks like a real (if obscure) bug.
(basically the same test, just one with LLVM backend, one with C++ backend)
Which platform/device is this on? This test is passing for me on macOS on Apple M1, AMD, and Intel GPUs, and also on a Linux/Vulkan device.
I'll take a look at this one shortly.
This is on my x86-64 Mac laptop, running top-of-tree Dawn (and dawn.node) built as of yesterday. Hmm... Let me double-check that I was running with
This fixes the generator_aot_gpu_multi_context_threaded tests.
I've made the staging buffer per-context, which fixes generator_aot_gpu_multi_context_threaded.
Looks good. correctness_multi_way_select still fails for me, full output:
Also, some of the Generator tests (running under Node) now fail with: RuntimeError: Aborted(RuntimeError: unreachable). "unreachable" may be due to ASYNCIFY_STACK_SIZE not being large enough (try increasing it)
We shouldn't be doing this inside a callback as we use async error checking routines.
Ugh, this is an internal error inside Apple's Metal Shading Language compiler. Are you using an AMD GPU on macOS 12.x or earlier by any chance? We hit a particular bug with this combo a lot, and there's unfortunately not a lot we can do about it. If it's the same one I'm thinking about then it did seem to be fixed in macOS 13.x though.
Should be fixed now. I was doing async stuff inside an async callback 🙃
...why, yes, I am. (I didn't even know Apple shipped Intel Macs with AMD GPUs...) If this is fixed in macOS 13.x then we can just ignore it, I guess.
Wait, not so fast: Unfortunately, our only Mac x86 buildbot is a Late-2013 Mac Pro... and pre-2017 Macs can't be upgraded to Ventura. Obviously we shouldn't let that prevent us from landing this, but we're gonna need to figure out how to get better test coverage. (The gotcha is where to find a fast x86 Mac we can replace it with...) #7389
OK, I'm declaring this Ready Enough To Land, at least in a known-beta-quality state.
Other folks on the review line, please either complete your review (or remove yourself from reviewership) at your earliest convenience so we can land this according to protocol :-)
@@ -154,6 +156,7 @@ set(RUNTIME_CXX_FLAGS
# Necessary for using virtual functions in the runtime code.
-fno-rtti
-Wall
-Wc++20-designator
Does this mean C++20 is required for our runtime now? IIRC, we have so far stuck with C++17 or earlier.
No, this is adding the warning to make sure we don't accidentally use C++20 designators (which I had done earlier in the PR).
Thanks for this huge chunk of work!
* [WebGPU] Add runtime stubs and codegen skeleton: All runtime functions are currently unimplemented, and the shader codegen just emits an empty compute shader.
* [WebGPU] Implement lazy device initialization: Synchronize device access using a WebGpuContext object, as is done in the other GPU runtimes. Device initialization is asynchronous, so we rely on Emscripten's Asyncify mechanism to wait for it to complete.
* [WebGPU] Implement device release: Release the device and adapter.
* [WebGPU] Add scoped error handling mechanism: WebGPU uses asynchronous callbacks for error handling, so we need to spin-lock while waiting for them to fire in order to determine success/failure status for any WebGPU APIs that we call.
* [WebGPU] Implement device malloc/free: Create a WGPUBuffer with Storage|CopyDst|CopySrc usages. We'll need a staging buffer to perform host<->device transfers, as a buffer used as a storage buffer cannot be mapped.
* [WebGPU] Implement basic host<->device copies: Use a staging buffer to copy data from the device, since we cannot map a buffer that is used as a storage buffer. This logic will need an overhaul in order to support buffers that represent sub-regions of larger buffers.
* [WebGPU] Implement halide_webgpu_device_sync: Just wait for all submitted commands to complete.
* [WebGPU] Implement shader compilation
* [WebGPU] Implement core of WGSL codegen: This implements enough of the WGSL codegen required to generate code for a 32-bit version of the apps/blur example. Buffer arguments are emitted as read_write storage buffers. Non-buffer arguments are collected into a structure and generated as a single uniform buffer. Workgroup sizes are currently required to be constant. This can be relaxed in the future by using pipeline-overridable constants.
* [WebGPU] Implement Cast node: WGSL uses different syntax for type casts.
* [WebGPU] Implement the float_from_bits() intrinsic
* [WebGPU] Implement run function: Create a compute pipeline, create a uniform buffer for non-buffer arguments, and dispatch a compute command to the queue. Does not handle workgroup storage yet.
* [WebGPU] Move queue into WgpuContext class: The queue handle received from wgpuDeviceGetQueue needs to be released, so it makes more sense to automatically get/release this handle in the WgpuContext constructor/destructor.
* [WebGPU] Add support for JIT: This requires a native implementation of WebGPU (e.g. Dawn or wgpu).
* [WebGPU] Enable the gpu_only AOT generator test: Link generator tests against the native WebGPU library if specified.
* [WebGPU] Add support for targeting dawn-native: We have to make this a compile-time switch for now, as the supported APIs currently differ between Dawn and Emscripten. We should be able to remove all of these conditionals when the native API stabilizes.
* [WebGPU] Add support for AOT tests when using WASM: Use Dawn's node bindings to run these tests, by pre-pending some initialization code to the script that nodejs will invoke.
* [WebGPU] Print explicit types for let declarations: This makes the generated code a little more human-readable.
* [WebGPU] Address first round of review comments
* [WebGPU] Add copy of webgpu.h: This is a verbatim copy of the Emscripten version, just without the include directives.
* [WebGPU] Add comment about ASYNCIFY requirement
* [WebGPU] Remove -Wno-atomic-alignment: Tweak atomic primitive usage to avoid the need for this.
* Pacify clang-tidy
* Fix more clang-tidy errors
* Only use ASYNCIFY for tests when targeting WebGPU
* Fix even more clang-tidy errors
* [WebGPU] Add basic support to Makefile
* [WebGPU] Don't wrap buffers inside structures: This requirement has been removed from the WGSL specification, and the corresponding implementation changes have now landed in Dawn.
* [WebGPU] Fix debug message tag
* [WebGPU] Update WGPUErrorFilter enum in header
* [WebGPU] Update WGSL attribute syntax: The WGSL specification recently changed attribute syntax from [[attribute]] to @attribute.
* [WebGPU] Add README_webgpu.md: Explains how to configure Halide to target WebGPU for both Emscripten and Dawn native. Also lists several known limitations.
* [WebGPU] Move native WebGPU library CMake logic: This is the correct place, otherwise the link line order is wrong.
* [WebGPU] Implement WGSL codegen for serial loops
* [WebGPU] Implement WGSL codegen for Allocate: Use array types for stack allocations. Leave GPU shared memory unimplemented for now.
* [WebGPU] Implement WGSL codegen for Select: Use the WGSL select builtin function, which supports bool vector conditions for component-wise selection too.
* [WebGPU] Mark 64-bit types as unsupported
* [WebGPU] Implement device_and_host_{malloc,free}: Just use the default implementations.
* [WebGPU] Fix WGSL codegen for boolean vectors
* [WebGPU] Implement f32 math intrinsics
* [WebGPU] Implement inverse and inverse sqrt
* [WebGPU] Fix up some errors in WGSL codegen
* [WebGPU] Implement logical and/or for bool vectors
* [WebGPU] Implement WGSL codegen for Broadcast node
* [WebGPU] Implement WGSL codegen for Ramp node
* [WebGPU] Emulate 8- and 16-bit integers: Use atomics to emulate storage, and widen the values to 32 bits when operating on them.
* [WebGPU] Avoid buffer name collisions: Buffers are declared as global variables in WGSL, so prefix them with the kernel name to avoid collisions.
* [WebGPU] Fix divide-by-power-of-two: WGSL requires the RHS of a shift to be unsigned.
* [WebGPU] Implement codegen for gpu_thread_barrier
* [WebGPU] Implement WGSL codegen for Evaluate: This fixes an issue with the halide_unused() intrinsic.
* [WebGPU] Add support for shared memory: This currently only supports statically-sized shared memory allocations; dynamically-sized allocations require using pipeline-overridable constants, which are not yet implemented in Dawn.
* [WebGPU] Fix 8/16-bit load/store emulation: Loads and stores that implicitly cast to/from 32 bits were casting when they should not have been, and not casting when they should.
* Use generic 64-bit support query in gpu_mixed_shared_mem_types: This is more robust than checking for specific APIs.
* [WebGPU] Fix object cleanup during teardown
* [WebGPU] Do not re-allocate device buffers
* [WebGPU] Mark maximum vector width as 4 in tests
* [WebGPU] Add functions to object lifetime tracker
* [WebGPU] Override FloatImm handling: We need to explicitly suffix the integer literal with a `u` to make it unsigned.
* [WebGPU] Scalarize predicated loads
* [WebGPU] Implement if_then_else intrinsic: This is generated when scalarizing predicated loads.
* [WebGPU] Enable gpu_free_sync test
* [WebGPU] Implement print_reinterpret: Use WGSL's bitcast<> operator.
* [WebGPU] Implement print_extern_call: This is just a regular function call in WGSL.
* Add missing include and namespace for isnan/isinf
* [WebGPU] Avoid short-circuiting operators: These can cause issues with WGSL's uniformity analysis.
* [WebGPU] Use commas for struct member separators
* [WebGPU] Update API headers and usages: Two API functions were renamed.
* [WebGPU] Use CodeGen_GPU_C base class for codegen: Introduce a new enum value for WGSL's vector syntax.
* [WebGPU] Add warning for emulated narrow integers
* [WebGPU] Update README with latest status
* [WebGPU] Add support for non-contiguous copies: Also adds support for buffer cropping/slicing.
* [WebGPU] Fix clang-tidy error
* [WebGPU] Use atomicCmpXchg for 8/16-bit emulation: Halide assumes that write-write data races are benign when both threads are writing the same value, but this is not true when those writes are implemented using atomicXor. We need to use atomicCompareExchangeWeak in a loop to perform this emulation instead. Unfortunately this makes things even slower :-(
* [WebGPU] Support non-32-bit parameter types: Expand them to 32 bits in the shader and the runtime.
* [WebGPU] Fix mixed types in buffers: The atomic emulation of narrow types shouldn't use the allocation type, since the same buffer may be re-used for multiple types. This means we also sometimes need to perform 32-bit accesses using atomics as well. Instead of using the buffer allocation type, we now pre-traverse the IR to check for accesses that will require emulation, and mark the corresponding buffer as such.
* [WebGPU] Show validation errors for failed maps
* [WebGPU] Round up buffer offsets and sizes: The WebGPU API requires that these are multiples of 4.
* [WebGPU] Update implementation status in README
* [WebGPU] Replace @stage(compute) with @compute
* [WebGPU] Polyfill the pow_f32 intrinsic: The pow builtin in WGSL only has the correct semantics for x>0, so we need to emulate the behavior for the other cases.
* [WebGPU] Skip the gpu_allocation_cache test: Allocation cache is not yet implemented for the WebGPU backend, and so this test takes forever.
* [WebGPU] Use builtins for inverse hyperbolics
* [WebGPU] Map rint() to round(): WGSL's round() builtin has round-to-nearest-even semantics.
* [WebGPU] Set device lost callback: This provides more information about events that cause the device to become inaccessible.
* [WebGPU] Use i32 for bool parameters: The bool type cannot be used in the uniform storage class in WGSL.
* [WebGPU] Raise limits for buffer size and workgroup storage
* [WebGPU] Update mini_webgpu.h: Disable clang-format for this file.
* [WebGPU] Add support for dynamic workgroups: This is implemented using pipeline-overridable constants.
* [WebGPU] Avoid using 'new' as an identifier: This is a reserved keyword in WGSL.
* [WebGPU] Do not merge workgroup allocations: Since we promote 8- and 16-bit integers to 32-bit in workgroup memory, merging multiple different GPUShared allocations into a single one can cause different regions to overlap, and data races ensue.
* [WebGPU] Fix Emscripten support
* [WebGPU] Use const for integer immediates: This avoids some MSL compiler ICEs with the code generated by Dawn, and also makes it easier for the WGSL compiler to constant-fold these values.
* [WebGPU] Squelch clang-tidy error
* [WebGPU] Note Dawn's dependency on go in README
* [WebGPU] Add links to Emscripten vs Dawn issue
* [WebGPU] Show an error when finding the WebGPU library fails
* [WebGPU] Add link to issue about Windows support
* [WebGPU] Rename roundUpToMultipleOf4 ...and use it in one place where we weren't.
* [WebGPU] Add links to wrap_native issue
* [WebGPU] Use debug_assert for some runtime errors
* [WebGPU] Stop using designated initializers: Add -Wc++20-designator flag to runtime build to prevent this from happening again.
* [WebGPU] Update mini_webgpu.h
* [WebGPU] Fix a validation issue with ToT Dawn: A bitcast that produces a NaN is invalid in WGSL, so implement the `nan()` intrinsic via `float_from_bits()`.
* Update README_webgpu.md
* Add is_finite_f32, is_inf_f32, is_nan_f32
* Update isinf, isnan, isfinite; add inf_f32, neg_inf_f32
* correctness_isnan should be skipped for WebGPU
* Update isnan.cpp
* Update atomics_gpu_8_bit.cpp
* Fix python_tutorial_lesson_10_aot_compilation_run
* Partial fix for generator_aot_acquire_release(): This adds the necessary (horrible hackery) to bring the WebGPU case in line with the other backends... but the test still fails, apparently due to the same copy-to-host bug we suspect for generator_aot_gpu_only. Pushing this anyway because it was annoying to write :-)
* [WebGPU] Fix AOT test build for non-WASM
* [WebGPU] Move README to root
* [WebGPU] Address review comments
* wip
* Update CMakeLists.txt
* Work-in-progress for generator_aot_gpu_multi_context_threaded
* [WebGPU] Use a per-context staging buffer: This fixes the generator_aot_gpu_multi_context_threaded tests.
* [WebGPU] Fix clang-format issue
* [WebGPU] Move staging buffer creation: We shouldn't be doing this inside a callback, as we use async error checking routines.
* Revert "Merge branch 'webgpu' of https://github.com/jrprice/Halide into pr/6492": This reverts commit 9d79ac6, reversing changes made to 0673e67.
* Skip correctness_multi_way_select on x86 Macs (for now)
* clang-format
* [WebGPU] Recommit 742db3f
This adds the core of a new WebGPU backend. There are still many holes in the functionality and rough edges around how this interacts with Emscripten, native WebGPU implementations, and testing. Nevertheless, I'm opening this after offline discussion with @steven-johnson to get some initial feedback on this backend (and to raise awareness that this is happening).
While a lot of the implementation still remains to be done, this PR provides enough to run a 32-bit version of apps/blur, the gpu_only AOT generator test, and the correctness_bounds JIT test. I've also got the HelloWasm app running with the render pipeline targeting the GPU (and the other two pipelines still using WASM CPU).
For testing, the AOT generator tests can be made to work for wasm-32-wasmrt-webgpu when using Node bindings for Dawn (Chromium's native implementation of WebGPU). Both the AOT tests and JIT tests can also work for host-webgpu when using Dawn. JIT is not currently supported for wasm-32-wasmrt-webgpu.
Unlike the other GPU backends, I'm not employing the dlopen/dlsym approach in the runtime for getting the API functions. I'm not sure how to make this work when using Emscripten, and since using Dawn directly is only really needed for testing purposes it doesn't seem /too/ onerous to require direct linking, but I'm open to opinions and suggestions here.
Another pain point right now is that the C++ API for WebGPU is not currently stable between different implementations, so there is a build-time switch to toggle between targeting Emscripten vs Dawn. I'm optimistic that this requirement will eventually go away.
There's still some patches that need to land in both Emscripten and Dawn before this backend will work for anyone else, so I'll leave this PR as a draft until those are resolved.
All feedback is very welcome! I've joined the Gitter room as well.