
Add initial support for WebGPU #6492

Merged: 143 commits, Mar 13, 2023
Conversation

@jrprice (Contributor) commented Dec 10, 2021

This adds the core of a new WebGPU backend. There are still many holes in the functionality and rough edges around how this interacts with Emscripten, native WebGPU implementations, and testing. Nevertheless, I'm opening this after offline discussion with @steven-johnson to get some initial feedback on this backend (and to raise awareness that this is happening).

While a lot of the implementation still remains to be done, this PR provides enough to run a 32-bit version of apps/blur, the gpu_only AOT generator test, and the correctness_bounds JIT test. I've also got the HelloWasm app running with the render pipeline targeting the GPU (and the other two pipelines still using WASM CPU).

For testing, the AOT generator tests can be made to work for wasm-32-wasmrt-webgpu when using Node bindings for Dawn (Chromium's native implementation of WebGPU). Both the AOT tests and JIT tests can also work for host-webgpu when using Dawn. JIT is not currently supported for wasm-32-wasmrt-webgpu.

Unlike the other GPU backends, I'm not employing the dlopen/dlsym approach in the runtime for getting the API functions. I'm not sure how to make this work when using Emscripten, and since using Dawn directly is only really needed for testing purposes it doesn't seem /too/ onerous to require direct linking, but I'm open to opinions and suggestions here.

Another pain point right now is that the C++ API for WebGPU is not currently stable between different implementations, so there is a build-time switch to toggle between targeting Emscripten vs Dawn. I'm optimistic that this requirement will eventually go away.

There are still some patches that need to land in both Emscripten and Dawn before this backend will work for anyone else, so I'll leave this PR as a draft until those are resolved.

All feedback is very welcome! I've joined the Gitter room as well.

@jrprice jrprice marked this pull request as draft December 10, 2021 20:00
@steven-johnson (Contributor) left a comment:
Thanks for this -- I don't have time today for more than a quick skim, but wanted to provide a handful of (minor) bits of feedback, but the important thing to tell you right away is that including webgpu.h (or any other outside-of-Halide header file) in the runtime code is a no-no, at least for anything beyond initial experimentation. Wanted to be sure you were aware of that sooner rather than later.

@@ -92,8 +92,16 @@ function(add_wasm_executable TARGET)
-s ASSERTIONS=1
-s ALLOW_MEMORY_GROWTH=1
-s ENVIRONMENT=node
-s USE_WEBGPU=1
-s ASYNCIFY
steven-johnson (Contributor):
ASYNCIFY is a pretty nontrivial code change IIRC -- What in this PR requires it? I don't think we'd want to enable it for most tests.

jrprice (Contributor Author):
WebGPU is a Javascript API that makes use of asynchronous callbacks for things like mapping memory onto the host, device initialization, waiting for device commands to finish, and error checking. I don't think there's any way to implement this stuff without Asyncify, unfortunately.

We could certainly make the addition of this flag conditional of WebGPU being the target though.

steven-johnson (Contributor):
It may be OK everywhere, I just need to re-read the details of the code transformation to decide if it compromises the value of our (other) wasm tests in any way. If it doesn't, no worries. (Definitely worth adding a comment about this flag being necessary for WebGPU though.)

jrprice (Contributor Author):
Added a comment.

steven-johnson (Contributor):
OK, after re-reading the details on Asyncify, it seems invasive enough that I'm inclined to suggest we only use it when necessary, rather than across the board (e.g.: "As mentioned earlier, unoptimized builds with Asyncify can be large and slow. Build with optimizations (say, -O3) to get good results."). No need to worry about it at this point (and I'm more than willing to drop this point if the emcc folk tell me I'm overthinking it).

jrprice (Contributor Author):
I've made the addition of -s ASYNCIFY conditional on the target including the webgpu feature for now.

(Outdated review threads, now resolved: src/runtime/webgpu.cpp, src/runtime/CMakeLists.txt, src/CodeGen_WebGPU_Dev.cpp)

void CodeGen_WebGPU_Dev::CodeGen_WGSL::visit(const IntImm *op) {
internal_assert(op->type.bits() == 32)
<< "WGSL only supports 32-bit integers";
steven-johnson (Contributor):
...well that is disappointing. No other int sizes? or unsigned?

jrprice (Contributor Author):
At the moment, WGSL only supports 32-bit integers, signed and unsigned. Extensions to support 8- and 16-bit types are inevitable though, and potentially could make the core spec for v1 if we can demonstrate enough demand (bearing in mind that we'd have to emulate them on certain platforms).

steven-johnson (Contributor):
All I can say is that most of the high-performance, well-scheduled chunks of Halide code I've seen in production make extensive use of 8- and 16-bit vector types. Their absence would be a big limitation for maximizing performance (and memory usage), not to mention making portable pipelines a lot more painful to write and schedule (since pipelines running on typical SIMD or DSP architectures won't want to leave the benefits of having smaller ints, and thus effectively-wider vectors, on the table...)

Member:
For other type-challenged backends (d3d12) I believe we emulate low bit width types in Halide, so you could take a look at those backends. Feel free to email me for help demonstrating enough demand.

(Outdated review thread, now resolved: src/CodeGen_WebGPU_Dev.cpp)
@steven-johnson (Contributor):
(I took the liberty of pacifying clang-tidy and bringing up to date with top-of-tree; please let me know if you'd prefer I avoid doing changes of this sort here in the future, but these seemed harmless)

@steven-johnson (Contributor):
OK, really close now! The only things left I see failing:

233 - correctness_multi_way_select (Subprocess aborted)

This ends in "WGPU device lost" -- looks like a real (if obscure) bug.

587 - generator_aot_gpu_multi_context_threaded (Subprocess aborted)
588 - generator_aotcpp_gpu_multi_context_threaded (Subprocess aborted)

(basically the same test, just one with LLVM backend, one with C++ backend)
this is one that I already looked into and punted on -- there's clearly a threading issue of some sort, but not sure what.

@jrprice (Contributor Author) commented Mar 1, 2023:

233 - correctness_multi_way_select (Subprocess aborted)

This ends in "WGPU device lost" -- looks like a real (if obscure) bug.

Which platform/device is this on? This test is passing for me on macOS on Apple M1, AMD, and Intel GPUs, and also on a Linux/Vulkan device.

587 - generator_aot_gpu_multi_context_threaded (Subprocess aborted)
588 - generator_aotcpp_gpu_multi_context_threaded (Subprocess aborted)

I'll take a look at this one shortly.

@steven-johnson (Contributor):
Which platform/device is this on? This test is passing for me on macOS on Apple M1, AMD, and Intel GPUs, and also on a Linux/Vulkan device.

This is on my x86-64 Mac laptop, running top-of-tree Dawn (and dawn.node) built as of yesterday. Hmm... Let me double-check that I was running with HL_JIT_TARGET=host-webgpu rather than HL_JIT_TARGET=wasm-32-wasmrt-webgpu (though, if this was wrong, ~all of the correctness_xxx tests would be failing, not just this one)

@jrprice (Contributor Author) commented Mar 1, 2023:

I've made the staging buffer per-context, which fixes generator_aot_gpu_multi_context_threaded for me. Please take a look at 2e88fe2.

@steven-johnson (Contributor):
Looks good. correctness_multi_way_select still fails for me, full output:

Entering Pipeline f2
Target: x86-64-osx-avx-avx2-debug-f16c-fma-jit-sse41-user_context-webgpu
 Input (void const *) __user_context: 0x7ff7b596d5f0
 Output Buffer f2: buffer(0, 0x0, 0x0, 0, uint32)
WGPU: halide_webgpu_initialize_kernels (user_context: 0x7ff7b596d5f0, state_ptr: 0x10a76a6e0, program: 0x10a769000, size: 5143)
Caching compiled kernel: 0x7fde16706510 id 2 context 0x7fde4680fe00
Exiting Pipeline f2
WGPU: halide_webgpu_finalize_kernels (user_context: 0x7ff7b596d5f0, state_ptr: 0x2
Entering Pipeline f2
Target: x86-64-osx-avx-avx2-debug-f16c-fma-jit-sse41-user_context-webgpu
 Input (void const *) __user_context: 0x7ff7b596d5f0
 Output Buffer f2: buffer(0, 0x0, 0x7fde16706980, 0, uint32)
WGPU: halide_webgpu_initialize_kernels (user_context: 0x7ff7b596d5f0, state_ptr: 0x10a76a6e0, program: 0x10a769000, size: 5143)
WGPU: halide_webgpu_device_malloc (user_context: 0x7ff7b596d5f0, buf: 0x600002c38910)
      Allocated device buffer 0x600001920900
WGPU: halide_webgpu_run (user_context: 0x7ff7b596d5f0, entry: _kernel_f2_s0___outermost___outermost_v2___block_id_x, groups: 1x1x1, threads: 1x1x1, workgroup_mem: 0
WGPU: halide_webgpu_finalize_kernels (user_context: 0x7ff7b596d5f0, state_ptr: 0x2
Error: WGPU device lost (0): Error creating pipeline state Compiler encountered an internal error
 - While handling unexpected error type Internal when allowed errors are (Validation|DeviceLost).
    at Initialize (/Users/srj/GitHub/dawn/src/dawn/native/metal/ComputePipelineMTL.mm:54)
    at CreateComputePipeline (/Users/srj/GitHub/dawn/src/dawn/native/Device.cpp:1473)
WGPU: unknown error (5): GPU device disconnected
WGPU: unknown error (5): GPU device disconnected

@steven-johnson (Contributor):
Also, now some of the Generators tests (running under Node) now fail with:

RuntimeError: Aborted(RuntimeError: unreachable). "unreachable" may be due to ASYNCIFY_STACK_SIZE not being large enough (try increasing it)
at abort (/Users/srj/GitHub/Halide/build/release/test/generator/generator_aot_cleanup_on_error.js:1:13398)
at runAndAbortIfError (/Users/srj/GitHub/Halide/build/release/test/generator/generator_aot_cleanup_on_error.js:1:49522)
at /Users/srj/GitHub/Halide/build/release/test/generator/generator_aot_cleanup_on_error.js:1:53420
at callUserCallback (/Users/srj/GitHub/Halide/build/release/test/generator/generator_aot_cleanup_on_error.js:1:22442)
at Timeout._onTimeout (/Users/srj/GitHub/Halide/build/release/test/generator/generator_aot_cleanup_on_error.js:1:22545)
at listOnTimeout (node:internal/timers:557:17)
at processTimers (node:internal/timers:500:7)

@jrprice (Contributor Author) commented Mar 2, 2023:

correctness_multi_way_select still fails for me
...
Error creating pipeline state Compiler encountered an internal error

Ugh, this is an internal error inside Apple's Metal Shading Language compiler. Are you using an AMD GPU on macOS 12.x or earlier by any chance? We hit a particular bug with this combo a lot, and there's unfortunately not a lot we can do about it. If it's the same one I'm thinking about then it did seem to be fixed in macOS 13.x though.

Also, now some of the Generators tests (running under Node) now fail with:

RuntimeError: Aborted(RuntimeError: unreachable). "unreachable" may be due to ASYNCIFY_STACK_SIZE not being large enough (try increasing it)

Should be fixed now. I was doing async stuff inside an async callback 🙃

@steven-johnson (Contributor):
Ugh, this is an internal error inside Apple's Metal Shading Language compiler. Are you using an AMD GPU on macOS 12.x or earlier by any chance?

...why, yes, I am. (I didn't even know Apple shipped Intel macs with AMD GPUs...)

If this is fixed in macOS 13.x then we can just ignore it, I guess.

@steven-johnson (Contributor) commented Mar 2, 2023:

Ugh, this is an internal error inside Apple's Metal Shading Language compiler. Are you using an AMD GPU on macOS 12.x or earlier by any chance?

...why, yes, I am. (I didn't even know Apple shipped Intel macs with AMD GPUs...)

If this is fixed in macOS 13.x then we can just ignore it, I guess.

Wait, not so fast: Unfortunately, our only Mac x86 buildbot is a Late-2013 Mac Pro... and pre-2017 Macs can't be upgraded to Ventura. Obviously we shouldn't let that prevent us from landing this but we're gonna need to figure out how to get better test coverage. (The gotcha is where to find a fast x86 Mac we can replace it with...) #7389

@steven-johnson steven-johnson self-requested a review March 2, 2023 18:55
@steven-johnson (Contributor) left a comment:
OK, I'm declaring this Ready Enough To Land, at least in a known-beta-quality state.

Other folks on the review line, please either complete your review (or remove yourself from reviewership) at your earliest convenience so we can land this according to protocol :-)

@@ -154,6 +156,7 @@ set(RUNTIME_CXX_FLAGS
# Necessary for using virtual functions in the runtime code.
-fno-rtti
-Wall
-Wc++20-designator
steven-johnson (Contributor):
Does this mean C++20 is required for our runtime now? IIRC, we have so far stuck with C++17 or earlier.

jrprice (Contributor Author):
No, this is adding the warning to make sure we don't accidentally use C++20 designators (which I had done earlier in the PR).

@shoaibkamil (Contributor) left a comment:
Thanks for this huge chunk of work!

@steven-johnson steven-johnson added the release_notes For changes that may warrant a note in README for official releases. label Mar 13, 2023
@steven-johnson steven-johnson merged commit 078465c into halide:main Mar 13, 2023
@jrprice jrprice deleted the webgpu branch March 13, 2023 20:40
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
* [WebGPU] Add runtime stubs and codegen skeleton

All runtime functions are currently unimplemented, and the shader
codegen just emits an empty compute shader.

* [WebGPU] Implement lazy device initialization

Synchronize device access using a WebGpuContext object, as is done in
the other GPU runtimes.

Device initialization is asynchronous, so we rely on Emscripten's
Asyncify mechanism to wait for it to complete.

* [WebGPU] Implement device release

Release the device and adapter.

* [WebGPU] Add scoped error handling mechanism

WebGPU uses asynchronous callbacks for error handling, so we need to
spin-lock while waiting for them to fire in order to determine
success/failure status for any WebGPU APIs that we call.

* [WebGPU] Implement device malloc/free

Create a WGPUBuffer with Storage|CopyDst|CopySrc usages. We'll need a
staging buffer to perform host<->device transfers, as a buffer used as
a storage buffer cannot be mapped.

* [WebGPU] Implement basic host<->device copies

Use a staging buffer to copy data from the device, since we cannot map
a buffer that is used as a storage buffer.

This logic will need an overhaul in order to support buffers that
represent sub-regions of larger buffers.

* [WebGPU] Implement halide_webgpu_device_sync

Just wait for all submitted commands to complete.

* [WebGPU] Implement shader compilation

* [WebGPU] Implement core of WGSL codegen

This implements enough of the WGSL codegen required to generate code
for a 32-bit version of the apps/blur example.

Buffer arguments are emitted as read_write storage buffers.

Non-buffer arguments are collected into a structure and generated as a
single uniform buffer.

Workgroup sizes are currently required to be constant. This can be
relaxed in the future by using pipeline-overridable constants.

* [WebGPU] Implement Cast node

WGSL uses different syntax for type casts.

* [WebGPU] Implement the float_from_bits() intrinsic

* [WebGPU] Implement run function

Create a compute pipeline, create a uniform buffer for non-buffer
arguments, and dispatch a compute command to the queue.

Does not handle workgroup storage yet.

* [WebGPU] Move queue into WgpuContext class

The queue handle received from wgpuDeviceGetQueue needs to be
released, so it makes more sense to automatically get/release this
handle in the WgpuContext constructor/destructor.

* [WebGPU] Add support for JIT

This requires a native implementation of WebGPU (e.g. Dawn or wgpu).

* [WebGPU] Enable the gpu_only AOT generator test

Link generator tests against the native WebGPU library if specified.

* [WebGPU] Add support for targeting dawn-native

We have to make this a compile-time switch for now, as the supported
APIs currently differ between Dawn and Emscripten. We should be able
to remove all of these conditionals when the native API stabilizes.

* [WebGPU] Add support for AOT tests when using WASM

Use Dawn's node bindings to run these tests, by pre-pending some
initialization code to the script that nodejs will invoke.

* [WebGPU] Print explicit types for let declarations

This makes the generated code a little more human-readable.

* [WebGPU] Address first round of review comments

* [WebGPU] Add copy of webgpu.h

This is a verbatim copy of the Emscripten version, just without the
include directives.

* [WebGPU] Add comment about ASYNCIFY requirement

* [WebGPU] Remove -Wno-atomic-alignment

Tweak atomic primitive usage to avoid the need for this.

* pacify clang-tidy

* Fix more clang-tidy errors

* Only use ASYNCIFY for tests when targeting WebGPU

* Fix even more clang-tidy errors

* [WebGPU] Add basic support to Makefile

* [WebGPU] Don't wrap buffers inside structures

This requirement has been removed from the WGSL specification, and the
corresponding implementation changes have now landed in Dawn.

* [WebGPU] Fix debug message tag

* [WebGPU] Update WGPUErrorFilter enum in header

* [WebGPU] Update WGSL attribute syntax

The WGSL specification recently changed attribute syntax from
[[attribute]] to @attribute.

* [WebGPU] Add README_webgpu.md

Explains how to configure Halide to target WebGPU for both Emscripten
and Dawn native. Also lists several known limitations.

* [WebGPU] Move native WebGPU library CMake logic

This is the correct place, otherwise the link line order is wrong.

* [WebGPU] Implement WGSL codegen for serial loops

* [WebGPU] Implement WGSL codegen for Allocate

Use array types for stack allocations. Leave GPU shared memory
unimplemented for now.

* [WebGPU] Implement WGSL codegen for Select

Use the WGSL select builtin function, which supports bool vector
conditions for component-wise selection too.

* [WebGPU] Mark 64-bit types as unsupported

* [WebGPU] Implement device_and_host_{malloc,free}

Just use the default implementations.

* [WebGPU] Fixed WGSL codegen for boolean vectors

* [WebGPU] Implement f32 math intrinsics

* [WebGPU] Implement inverse and inverse sqrt

* [WebGPU] Fixup some errors in WGSL codegen

* [WebGPU] Implement logical and/or for bool vectors

* [WebGPU] Implement WGSL codegen for Broadcast node

* [WebGPU] Implement WGSL codegen for Ramp node

* [WebGPU] Emulate 8- and 16-bit integers

Use atomics to emulate storage, and widen the values to 32-bits when
operating on them.

* [WebGPU] Avoid buffer name collisions

Buffers are declared as global variables in WGSL, so prefix them with
the kernel name to avoid collisions.

* [WebGPU] Fix divide-by-power-of-two

WGSL requires the RHS of a shift to be unsigned.

* [WebGPU] Implement codegen for gpu_thread_barrier

* [WebGPU] Implement WGSL codegen for Evaluate

This fixes an issue with the halide_unused() intrinsic.

* [WebGPU] Add support for shared memory

This currently only supports statically-sized shared memory
allocations; dynamically-sized allocations require using
pipeline-overridable constants which are not yet implemented in Dawn.

* [WebGPU] Fix 8/16-bit load/store emulation

Loads and stores that implicitly casted to/from 32-bits were casting
when they should not have been, and not casting when they should.

* Use generic 64-bit support query in gpu_mixed_shared_mem_types

This is more robust than checking for specific APIs.

* [WebGPU] Fix object cleanup during teardown

* [WebGPU] Do not re-allocate device buffers

* [WebGPU] Mark maximum vector width as 4 in tests

* [WebGPU] Add functions to object lifetime tracker

* [WebGPU] Override FloatImm handling

We need to explicitly suffix the integer literal with a `u` to make it
unsigned.

* [WebGPU] Scalarize predicated loads

* [WebGPU] Implement if_then_else intrinsic

This is generated when scalarizing predicated loads.

* [WebGPU] Enable gpu_free_sync test

* [WebGPU] Implement print_reinterpret

Use WGSL's bitcast<> operator.

* [WebGPU] Implement print_extern_call

This is just a regular function call in WGSL.

* Add missing include and namespace for isnan/isinf

* [WebGPU] Avoid short-circuiting operators

These can cause issues with WGSL's uniformity analysis.

* [WebGPU] Use commas for struct member separators

* [WebGPU] Update API headers and usages

Two API functions were renamed.

* [WebGPU] Use CodeGen_GPU_C base class for codegen

Introduce a new enum value for WGSL's vector syntax.

* [WebGPU] Add warning for emulated narrow integers

* [WebGPU] Update README with latest status

* [WebGPU] Add support for non-contiguous copies

Also adds support for buffer cropping/slicing.

* [WebGPU] Fix clang-tidy error

* [WebGPU] Use atomicCmpXchg for 8/16-bit emulation

Halide assumes that write-write data-races are benign when both
threads are writing the same value, but this is not true when those
writes are implemented using atomicXor. We need to use
atomicCompareExchangeWeak in a loop to perform this emulation instead.

Unfortunately this makes things even slower :-(

* [WebGPU] Support non-32-bit parameter types

Expand them to 32-bits in the shader and the runtime.

* [WebGPU] Fix mixed types in buffers

The atomic emulation of narrow types shouldn't use the allocation
type, since the same buffer may be re-used for multiple types. This
means we also sometimes need to perform 32-bit accesses using atomics
as well.

Instead of using the buffer allocation type, we now pre-traverse the
IR to check for accesses that will require emulation, and mark the
corresponding buffer as such.

* [WebGPU] Show validation errors for failed maps

* [WebGPU] Round up buffer offsets and sizes

The WebGPU API requires that these are multiples of 4.

* [WebGPU] Update implementation status in README

* [WebGPU] Replace @stage(compute) with @compute

* [WebGPU] Polyfill the pow_f32 intrinsic

The pow builtin in WGSL only has the correct semantics for x>0, so we
need to emulate the behavior for the other cases.

* [WebGPU] Skip the gpu_allocation_cache test

Allocation cache is not yet implemented for the WebGPU backend, and so
this test takes forever.

* [WebGPU] Use builtins for inverse hyperbolics

* [WebGPU] Map rint() to round()

WGSL's round() builtin has round-to-nearest-even semantics.

* [WebGPU] Set device lost callback

This provides more information about events that cause the device to
become inaccessible.

* [WebGPU] Use i32 for bool parameters

The bool type cannot be used in the uniform storage class in WGSL.

* [WebGPU] Raise limits for buffer size and workgroup storage

* [WebGPU] Update mini_webgpu.h

Disable clang-format for this file.

* [WebGPU] Add support for dynamic workgroups

This is implemented using pipeline-overridable constants.

* [WebGPU] Avoid using 'new' as an identifier

This is a reserved keyword in WGSL.

* [WebGPU] Do not merge workgroup allocations

Since we promote 8- and 16-bit integers to 32-bit in workgroup memory,
merging multiple different GPUShared allocations into a single one can
cause different regions to overlap and data races ensue.

* [WebGPU] Fix Emscripten support

* [WebGPU] Use const for integer immediates

This avoids some MSL compiler ICEs with the code generated by Dawn,
and also makes it easier for the WGSL compiler to constant-fold these
values.

* [WebGPU] Squelch clang-tidy error

* [WebGPU] Note Dawn's dependency on go in README

* [WebGPU] Add links to Emscripten vs Dawn issue

* [WebGPU] Show error finding WebGPU library fails

* [WebGPU] Add link to issue about Windows support

* [WebGPU] Rename roundUpToMultipleOf4

...and use it in one place where we weren't.

* [WebGPU] Add links to wrap_native issue

* [WebGPU] Use debug_assert for some runtime errors

* [WebGPU] Stop using designated initializers

Add -Wc++20-designator flag to runtime build to prevent this from
happening again.

* [WebGPU] Update mini_webgpu.h

* [WebGPU] Fix a validation issue with ToT Dawn

A bitcast that produces a NaN is invalid in WGSL, so implement the
`nan()` intrinsic via `float_from_bits()`.

* Update README_webgpu.md

* Add is_finite_f32, is_inf_f32, is_nan_f32

* Update isinf, isnan, isfinite; add inf_f32, neg_inf_f32

* correctness_isnan should be skipped for WebGPU

* Update isnan.cpp

* Update atomics_gpu_8_bit.cpp

* Fix python_tutorial_lesson_10_aot_compilation_run

* Partial fix for generator_aot_acquire_release()

This adds the necessary (horrible) hackery to bring the WebGPU case in line with the other backends... but the test still fails, apparently due to the same copy-to-host bug we suspect for generator_aot_gpu_only. Pushing this anyway because it was annoying to write :-)

* [WebGPU] Fix AOT test build for non-WASM

* [WebGPU] Move README to root

* [WebGPU] Address review comments

* wip

* Update CMakeLists.txt

* Work-in-progress for generator_aot_gpu_multi_context_threaded

* [WebGPU] Use a per-context staging buffer

This fixes the generator_aot_gpu_multi_context_threaded tests.

* [WebGPU] Fix clang-format issue

* [WebGPU] Move staging buffer creation

We shouldn't be doing this inside a callback as we use async error
checking routines.

* Revert "Merge branch 'webgpu' of https://github.com/jrprice/Halide into pr/6492"

This reverts commit 9d79ac6, reversing
changes made to 0673e67.

* Skip correctness_multi_way_select on x86 Macs (for now)

* clang-format

* [WebGPU] Recommit 742db3f