Add initial support for WebGPU #6492
Conversation
All runtime functions are currently unimplemented, and the shader codegen just emits an empty compute shader.
Synchronize device access using a WebGpuContext object, as is done in the other GPU runtimes. Device initialization is asynchronous, so we rely on Emscripten's Asyncify mechanism to wait for it to complete.
Release the device and adapter.
WebGPU uses asynchronous callbacks for error handling, so we need to spin-lock while waiting for them to fire in order to determine success/failure status for any WebGPU APIs that we call.
Create a WGPUBuffer with Storage|CopyDst|CopySrc usages. We'll need a staging buffer to perform host<->device transfers, as a buffer used as a storage buffer cannot be mapped.
Use a staging buffer to copy data from the device, since we cannot map a buffer that is used as a storage buffer. This logic will need an overhaul in order to support buffers that represent sub-regions of larger buffers.
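A sketch of that copy-from-device path (pseudocode only; the actual runtime uses the webgpu.h C API, and the asynchronous map requires the same spin/yield handling as error callbacks):

```
copy_to_host(device_buffer, host_ptr, size):
    encoder = device.createCommandEncoder()
    encoder.copyBufferToBuffer(device_buffer, offset, staging_buffer, 0, size)
    queue.submit(encoder.finish())
    staging_buffer.mapAsync(READ)            # asynchronous: wait for the
                                             # map callback to fire
    memcpy(host_ptr, staging_buffer.getMappedRange(), size)
    staging_buffer.unmap()
```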
Just wait for all submitted commands to complete.
This implements enough of the WGSL codegen required to generate code for a 32-bit version of the apps/blur example. Buffer arguments are emitted as read_write storage buffers. Non-buffer arguments are collected into a structure and generated as a single uniform buffer. Workgroup sizes are currently required to be constant. This can be relaxed in the future by using pipeline-overridable constants.
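To make that scheme concrete, the emitted WGSL for a kernel with one buffer argument and two scalar arguments might look roughly like this (illustrative only; names, binding numbers, and the trivial body are made up, and real output from the backend will differ):

```wgsl
struct Args {
    width : i32,
    height : i32,
}

// Buffer arguments become read_write storage buffers.
@group(0) @binding(0) var<storage, read_write> buf : array<f32>;
// Non-buffer arguments are packed into a single uniform buffer.
@group(0) @binding(1) var<uniform> args : Args;

@compute @workgroup_size(16, 16)
fn my_kernel(@builtin(global_invocation_id) gid : vec3<u32>) {
    // Kernel body generated from the Halide IR goes here.
    buf[gid.y * u32(args.width) + gid.x] = 0.0;
}
```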
WGSL uses different syntax for type casts.
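A small illustration of the difference (variable names hypothetical): where C-like languages use prefix casts, WGSL uses constructor-style conversions and an explicit bitcast for reinterpretation.

```wgsl
// C-style: (float)x, (int)y   ->   WGSL:
let a : f32 = f32(x);           // value conversion from i32
let b : i32 = i32(y);           // value conversion from u32
let c : f32 = bitcast<f32>(n);  // reinterpret bits, no conversion
```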
Create a compute pipeline, create a uniform buffer for non-buffer arguments, and dispatch a compute command to the queue. Does not handle workgroup storage yet.
The queue handle received from wgpuDeviceGetQueue needs to be released, so it makes more sense to automatically get/release this handle in the WgpuContext constructor/destructor.
This requires a native implementation of WebGPU (e.g. Dawn or wgpu).
Link generator tests against the native WebGPU library if specified.
We have to make this a compile-time switch for now, as the supported APIs currently differ between Dawn and Emscripten. We should be able to remove all of these conditionals when the native API stabilizes.
Use Dawn's node bindings to run these tests, by pre-pending some initialization code to the script that nodejs will invoke.
Thanks for this -- I don't have time today for more than a quick skim, but wanted to provide a handful of (minor) bits of feedback, but the important thing to tell you right away is that including webgpu.h (or any other outside-of-Halide header file) in the runtime code is a no-no, at least for anything beyond initial experimentation. Wanted to be sure you were aware of that sooner rather than later.
dependencies/wasm/CMakeLists.txt (Outdated)
@@ -92,8 +92,16 @@ function(add_wasm_executable TARGET)
-s ASSERTIONS=1
-s ALLOW_MEMORY_GROWTH=1
-s ENVIRONMENT=node
-s USE_WEBGPU=1
-s ASYNCIFY
ASYNCIFY is a pretty nontrivial code change IIRC -- What in this PR requires it? I don't think we'd want to enable it for most tests.
WebGPU is a JavaScript API that makes use of asynchronous callbacks for things like mapping memory onto the host, device initialization, waiting for device commands to finish, and error checking. I don't think there's any way to implement this stuff without Asyncify, unfortunately.
We could certainly make the addition of this flag conditional on WebGPU being the target, though.
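Under Emscripten, the waiting pattern depends on Asyncify to suspend the WASM stack so the browser's event loop can deliver the callback. A sketch (assuming the standard emscripten_sleep API, which requires building with -s ASYNCIFY; not compilable outside an Emscripten toolchain):

```
#include <emscripten.h>

/* Flag set by the WebGPU callback when it fires. */
static volatile int done = 0;

static void wait_for_callback(void) {
    while (!done) {
        /* With -s ASYNCIFY, emscripten_sleep unwinds the WASM stack and
           yields to the JS event loop, letting the async callback run;
           without Asyncify this loop would deadlock. */
        emscripten_sleep(1);
    }
}
```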
It may be OK everywhere, I just need to re-read the details of the code transformation to decide if it compromises the value of our (other) wasm tests in any way. If it doesn't, no worries. (Definitely worth adding a comment about this flag being necessary for WebGPU though.)
Added a comment.
OK, after re-reading the details on Asyncify, it seems invasive enough that I'm inclined to suggest we should try to only use it when necessary, rather than across the board (e.g.: "As mentioned earlier, unoptimized builds with Asyncify can be large and slow. Build with optimizations (say, -O3) to get good results."). No need to worry about it at this point (and I'm more than willing to drop this point if the emcc folks tell me I'm overthinking it).
I've made the addition of -s ASYNCIFY conditional on the target including the webgpu feature for now.
src/CodeGen_WebGPU_Dev.cpp (Outdated)
void CodeGen_WebGPU_Dev::CodeGen_WGSL::visit(const IntImm *op) {
    internal_assert(op->type.bits() == 32)
        << "WGSL only supports 32-bit integers";
...well that is disappointing. No other int sizes? or unsigned?
At the moment, WGSL only supports 32-bit integers, signed and unsigned. Extensions to support 8- and 16-bit types are inevitable though, and potentially could make the core spec for v1 if we can demonstrate enough demand (bearing in mind that we'd have to emulate them on certain platforms).
All I can say is that most of the high-performance, well-scheduled chunks of Halide code I've seen in production make extensive use of 8- and 16-bit vector types. Their absence would be a big limitation for maximizing performance (and memory usage), not to mention making portable pipelines a lot more painful to write and schedule (since pipelines running on typical SIMD or DSP architectures won't want to leave the benefits of having smaller ints, and thus effectively-wider vectors, on the table...)
For other type-challenged backends (d3d12) I believe we emulate low bit width types in Halide, so you could take a look at those backends. Feel free to email me for help demonstrating enough demand.
This makes the generated code a little more human-readable.
This is a verbatim copy of the Emscripten version, just without the include directives.
Tweak atomic primitive usage to avoid the need for this.
(I took the liberty of pacifying clang-tidy and bringing up to date with top-of-tree; please let me know if you'd prefer I avoid doing changes of this sort here in the future, but these seemed harmless)
OK, really close now! The only things left I see failing:
This ends in "WGPU device lost" -- looks like a real (if obscure) bug.
(basically the same test, just one with LLVM backend, one with C++ backend)
Which platform/device is this on? This test is passing for me on macOS on Apple M1, AMD, and Intel GPUs, and also on a Linux/Vulkan device.
I'll take a look at this one shortly.
This is on my x86-64 Mac laptop, running top-of-tree Dawn (and dawn.node) built as of yesterday. Hmm... Let me double-check that I was running with
This fixes the generator_aot_gpu_multi_context_threaded tests.
I've made the staging buffer per-context, which fixes generator_aot_gpu_multi_context_threaded.
Looks good. correctness_multi_way_select still fails for me, full output:
Also, some of the Generator tests (running under Node) now fail with: RuntimeError: Aborted(RuntimeError: unreachable). "unreachable" may be due to ASYNCIFY_STACK_SIZE not being large enough (try increasing it)
We shouldn't be doing this inside a callback as we use async error checking routines.
Ugh, this is an internal error inside Apple's Metal Shading Language compiler. Are you using an AMD GPU on macOS 12.x or earlier by any chance? We hit a particular bug with this combo a lot, and there's unfortunately not a lot we can do about it. If it's the same one I'm thinking about then it did seem to be fixed in macOS 13.x though.
Should be fixed now. I was doing async stuff inside an async callback 🙃
...why, yes, I am. (I didn't even know Apple shipped Intel Macs with AMD GPUs...) If this is fixed in macOS 13.x then we can just ignore it, I guess.
Wait, not so fast: Unfortunately, our only Mac x86 buildbot is a Late-2013 Mac Pro... and pre-2017 Macs can't be upgraded to Ventura. Obviously we shouldn't let that prevent us from landing this, but we're gonna need to figure out how to get better test coverage. (The gotcha is where to find a fast x86 Mac we can replace it with...) #7389
OK, I'm declaring this Ready Enough To Land, at least in a known-beta-quality state.
Other folks on the review line, please either complete your review (or remove yourself from reviewership) at your earliest convenience so we can land this according to protocol :-)
@@ -154,6 +156,7 @@ set(RUNTIME_CXX_FLAGS
# Necessary for using virtual functions in the runtime code.
-fno-rtti
-Wall
-Wc++20-designator
Does this mean C++20 is required for our runtime now? IIRC, we have so far stuck with C++17 or earlier.
No, this is adding the warning to make sure we don't accidentally use C++20 designators (which I had done earlier in the PR).
Thanks for this huge chunk of work!
* [WebGPU] Add runtime stubs and codegen skeleton: All runtime functions are currently unimplemented, and the shader codegen just emits an empty compute shader.
* [WebGPU] Implement lazy device initialization: Synchronize device access using a WebGpuContext object, as is done in the other GPU runtimes. Device initialization is asynchronous, so we rely on Emscripten's Asyncify mechanism to wait for it to complete.
* [WebGPU] Implement device release: Release the device and adapter.
* [WebGPU] Add scoped error handling mechanism: WebGPU uses asynchronous callbacks for error handling, so we need to spin-lock while waiting for them to fire in order to determine success/failure status for any WebGPU APIs that we call.
* [WebGPU] Implement device malloc/free: Create a WGPUBuffer with Storage|CopyDst|CopySrc usages. We'll need a staging buffer to perform host<->device transfers, as a buffer used as a storage buffer cannot be mapped.
* [WebGPU] Implement basic host<->device copies: Use a staging buffer to copy data from the device, since we cannot map a buffer that is used as a storage buffer. This logic will need an overhaul in order to support buffers that represent sub-regions of larger buffers.
* [WebGPU] Implement halide_webgpu_device_sync: Just wait for all submitted commands to complete.
* [WebGPU] Implement shader compilation
* [WebGPU] Implement core of WGSL codegen: This implements enough of the WGSL codegen required to generate code for a 32-bit version of the apps/blur example. Buffer arguments are emitted as read_write storage buffers. Non-buffer arguments are collected into a structure and generated as a single uniform buffer. Workgroup sizes are currently required to be constant. This can be relaxed in the future by using pipeline-overridable constants.
* [WebGPU] Implement Cast node: WGSL uses different syntax for type casts.
* [WebGPU] Implement the float_from_bits() intrinsic
* [WebGPU] Implement run function: Create a compute pipeline, create a uniform buffer for non-buffer arguments, and dispatch a compute command to the queue. Does not handle workgroup storage yet.
* [WebGPU] Move queue into WgpuContext class: The queue handle received from wgpuDeviceGetQueue needs to be released, so it makes more sense to automatically get/release this handle in the WgpuContext constructor/destructor.
* [WebGPU] Add support for JIT: This requires a native implementation of WebGPU (e.g. Dawn or wgpu).
* [WebGPU] Enable the gpu_only AOT generator test: Link generator tests against the native WebGPU library if specified.
* [WebGPU] Add support for targeting dawn-native: We have to make this a compile-time switch for now, as the supported APIs currently differ between Dawn and Emscripten. We should be able to remove all of these conditionals when the native API stabilizes.
* [WebGPU] Add support for AOT tests when using WASM: Use Dawn's node bindings to run these tests, by pre-pending some initialization code to the script that nodejs will invoke.
* [WebGPU] Print explicit types for let declarations: This makes the generated code a little more human-readable.
* [WebGPU] Address first round of review comments
* [WebGPU] Add copy of webgpu.h: This is a verbatim copy of the Emscripten version, just without the include directives.
* [WebGPU] Add comment about ASYNCIFY requirement
* [WebGPU] Remove -Wno-atomic-alignment: Tweak atomic primitive usage to avoid the need for this.
* Pacify clang-tidy
* Fix more clang-tidy errors
* Only use ASYNCIFY for tests when targeting WebGPU
* Fix even more clang-tidy errors
* [WebGPU] Add basic support to Makefile
* [WebGPU] Don't wrap buffers inside structures: This requirement has been removed from the WGSL specification, and the corresponding implementation changes have now landed in Dawn.
* [WebGPU] Fix debug message tag
* [WebGPU] Update WGPUErrorFilter enum in header
* [WebGPU] Update WGSL attribute syntax: The WGSL specification recently changed attribute syntax from [[attribute]] to @attribute.
* [WebGPU] Add README_webgpu.md: Explains how to configure Halide to target WebGPU for both Emscripten and Dawn native. Also lists several known limitations.
* [WebGPU] Move native WebGPU library CMake logic: This is the correct place, otherwise the link line order is wrong.
* [WebGPU] Implement WGSL codegen for serial loops
* [WebGPU] Implement WGSL codegen for Allocate: Use array types for stack allocations. Leave GPU shared memory unimplemented for now.
* [WebGPU] Implement WGSL codegen for Select: Use the WGSL select builtin function, which supports bool vector conditions for component-wise selection too.
* [WebGPU] Mark 64-bit types as unsupported
* [WebGPU] Implement device_and_host_{malloc,free}: Just use the default implementations.
* [WebGPU] Fix WGSL codegen for boolean vectors
* [WebGPU] Implement f32 math intrinsics
* [WebGPU] Implement inverse and inverse sqrt
* [WebGPU] Fix up some errors in WGSL codegen
* [WebGPU] Implement logical and/or for bool vectors
* [WebGPU] Implement WGSL codegen for Broadcast node
* [WebGPU] Implement WGSL codegen for Ramp node
* [WebGPU] Emulate 8- and 16-bit integers: Use atomics to emulate storage, and widen the values to 32 bits when operating on them.
* [WebGPU] Avoid buffer name collisions: Buffers are declared as global variables in WGSL, so prefix them with the kernel name to avoid collisions.
* [WebGPU] Fix divide-by-power-of-two: WGSL requires the RHS of a shift to be unsigned.
* [WebGPU] Implement codegen for gpu_thread_barrier
* [WebGPU] Implement WGSL codegen for Evaluate: This fixes an issue with the halide_unused() intrinsic.
* [WebGPU] Add support for shared memory: This currently only supports statically-sized shared memory allocations; dynamically-sized allocations require using pipeline-overridable constants, which are not yet implemented in Dawn.
* [WebGPU] Fix 8/16-bit load/store emulation: Loads and stores that implicitly cast to/from 32 bits were casting when they should not have been, and not casting when they should.
* Use generic 64-bit support query in gpu_mixed_shared_mem_types: This is more robust than checking for specific APIs.
* [WebGPU] Fix object cleanup during teardown
* [WebGPU] Do not re-allocate device buffers
* [WebGPU] Mark maximum vector width as 4 in tests
* [WebGPU] Add functions to object lifetime tracker
* [WebGPU] Override FloatImm handling: We need to explicitly suffix the integer literal with a `u` to make it unsigned.
* [WebGPU] Scalarize predicated loads
* [WebGPU] Implement if_then_else intrinsic: This is generated when scalarizing predicated loads.
* [WebGPU] Enable gpu_free_sync test
* [WebGPU] Implement print_reinterpret: Use WGSL's bitcast<> operator.
* [WebGPU] Implement print_extern_call: This is just a regular function call in WGSL.
* Add missing include and namespace for isnan/isinf
* [WebGPU] Avoid short-circuiting operators: These can cause issues with WGSL's uniformity analysis.
* [WebGPU] Use commas for struct member separators
* [WebGPU] Update API headers and usages: Two API functions were renamed.
* [WebGPU] Use CodeGen_GPU_C base class for codegen: Introduce a new enum value for WGSL's vector syntax.
* [WebGPU] Add warning for emulated narrow integers
* [WebGPU] Update README with latest status
* [WebGPU] Add support for non-contiguous copies: Also adds support for buffer cropping/slicing.
* [WebGPU] Fix clang-tidy error
* [WebGPU] Use atomicCmpXchg for 8/16-bit emulation: Halide assumes that write-write data races are benign when both threads are writing the same value, but this is not true when those writes are implemented using atomicXor. We need to use atomicCompareExchangeWeak in a loop to perform this emulation instead. Unfortunately this makes things even slower :-(
* [WebGPU] Support non-32-bit parameter types: Expand them to 32 bits in the shader and the runtime.
* [WebGPU] Fix mixed types in buffers: The atomic emulation of narrow types shouldn't use the allocation type, since the same buffer may be re-used for multiple types. This means we also sometimes need to perform 32-bit accesses using atomics as well. Instead of using the buffer allocation type, we now pre-traverse the IR to check for accesses that will require emulation, and mark the corresponding buffer as such.
* [WebGPU] Show validation errors for failed maps
* [WebGPU] Round up buffer offsets and sizes: The WebGPU API requires that these are multiples of 4.
* [WebGPU] Update implementation status in README
* [WebGPU] Replace @stage(compute) with @compute
* [WebGPU] Polyfill the pow_f32 intrinsic: The pow builtin in WGSL only has the correct semantics for x>0, so we need to emulate the behavior for the other cases.
* [WebGPU] Skip the gpu_allocation_cache test: Allocation cache is not yet implemented for the WebGPU backend, and so this test takes forever.
* [WebGPU] Use builtins for inverse hyperbolics
* [WebGPU] Map rint() to round(): WGSL's round() builtin has round-to-nearest-even semantics.
* [WebGPU] Set device lost callback: This provides more information about events that cause the device to become inaccessible.
* [WebGPU] Use i32 for bool parameters: The bool type cannot be used in the uniform storage class in WGSL.
* [WebGPU] Raise limits for buffer size and workgroup storage
* [WebGPU] Update mini_webgpu.h: Disable clang-format for this file.
* [WebGPU] Add support for dynamic workgroups: This is implemented using pipeline-overridable constants.
* [WebGPU] Avoid using 'new' as an identifier: This is a reserved keyword in WGSL.
* [WebGPU] Do not merge workgroup allocations: Since we promote 8- and 16-bit integers to 32-bit in workgroup memory, merging multiple different GPUShared allocations into a single one can cause different regions to overlap, and data races ensue.
* [WebGPU] Fix Emscripten support
* [WebGPU] Use const for integer immediates: This avoids some MSL compiler ICEs with the code generated by Dawn, and also makes it easier for the WGSL compiler to constant-fold these values.
* [WebGPU] Squelch clang-tidy error
* [WebGPU] Note Dawn's dependency on go in README
* [WebGPU] Add links to Emscripten vs Dawn issue
* [WebGPU] Show an error when finding the WebGPU library fails
* [WebGPU] Add link to issue about Windows support
* [WebGPU] Rename roundUpToMultipleOf4 ...and use it in one place where we weren't.
* [WebGPU] Add links to wrap_native issue
* [WebGPU] Use debug_assert for some runtime errors
* [WebGPU] Stop using designated initializers: Add -Wc++20-designator flag to runtime build to prevent this from happening again.
* [WebGPU] Update mini_webgpu.h
* [WebGPU] Fix a validation issue with ToT Dawn: A bitcast that produces a NaN is invalid in WGSL, so implement the `nan()` intrinsic via `float_from_bits()`.
* Update README_webgpu.md
* Add is_finite_f32, is_inf_f32, is_nan_f32
* Update isinf, isnan, isfinite; add inf_f32, neg_inf_f32
* correctness_isnan should be skipped for WebGPU
* Update isnan.cpp
* Update atomics_gpu_8_bit.cpp
* Fix python_tutorial_lesson_10_aot_compilation_run
* Partial fix for generator_aot_acquire_release(): This adds the necessary (horrible hackery) to bring the WebGPU case in line with the other backends... but the test still fails, apparently due to the same copy-to-host bug we suspect for generator_aot_gpu_only. Pushing this anyway because it was annoying to write :-)
* [WebGPU] Fix AOT test build for non-WASM
* [WebGPU] Move README to root
* [WebGPU] Address review comments
* wip
* Update CMakeLists.txt
* Work-in-progress for generator_aot_gpu_multi_context_threaded
* [WebGPU] Use a per-context staging buffer: This fixes the generator_aot_gpu_multi_context_threaded tests.
* [WebGPU] Fix clang-format issue
* [WebGPU] Move staging buffer creation: We shouldn't be doing this inside a callback, as we use async error checking routines.
* Revert "Merge branch 'webgpu' of https://github.com/jrprice/Halide into pr/6492": This reverts commit 9d79ac6, reversing changes made to 0673e67.
* Skip correctness_multi_way_select on x86 Macs (for now)
* clang-format
* [WebGPU] Recommit 742db3f
This adds the core of a new WebGPU backend. There are still many holes in the functionality and rough edges around how this interacts with Emscripten, native WebGPU implementations, and testing. Nevertheless, I'm opening this after offline discussion with @steven-johnson to get some initial feedback on this backend (and to raise awareness that this is happening).
While a lot of the implementation still remains to be done, this PR provides enough to run a 32-bit version of apps/blur, the gpu_only AOT generator test, and the correctness_bounds JIT test. I've also got the HelloWasm app running with the render pipeline targeting the GPU (and the other two pipelines still using WASM CPU).
For testing, the AOT generator tests can be made to work for wasm-32-wasmrt-webgpu when using Node bindings for Dawn (Chromium's native implementation of WebGPU). Both the AOT tests and JIT tests can also work for host-webgpu when using Dawn. JIT is not currently supported for wasm-32-wasmrt-webgpu.
Unlike the other GPU backends, I'm not employing the dlopen/dlsym approach in the runtime for getting the API functions. I'm not sure how to make this work when using Emscripten, and since using Dawn directly is only really needed for testing purposes it doesn't seem /too/ onerous to require direct linking, but I'm open to opinions and suggestions here.
Another pain point right now is that the C++ API for WebGPU is not currently stable between different implementations, so there is a build-time switch to toggle between targeting Emscripten vs Dawn. I'm optimistic that this requirement will eventually go away.
There's still some patches that need to land in both Emscripten and Dawn before this backend will work for anyone else, so I'll leave this PR as a draft until those are resolved.
All feedback is very welcome! I've joined the Gitter room as well.