[vulkan] Improve overall performance #7202
I became suspicious while working on performance tests for fast arctan. My test results vary a lot from run to run, and they seem to move in discrete increments:
Another run:
They all hover around 11.7 ms, and sometimes, when the test doesn't hit that time, it comes out at almost exactly double: 24 ms. Compare that to CUDA:
These get gradually slower in a regular way, and are about 20 times faster than Vulkan (or 40 times faster than the worst-case Vulkan outliers). I even suspect Vulkan is waiting on vsync or something similar... |
Hmm, perf shows calls to |
Testing this on main, using the existing atan methods with this trimmed-down version of your performance test:
|
So the current |
Running this, I'm getting the following on an NVIDIA RTX 3070 Ti ...
And for CUDA ...
However, the test is calling realize({dimx, dimy}), which compiles and caches the pipeline on the first call, and also allocates and caches the output buffer. That one-time overhead is significant for this type of test. |
If I change the benchmarking code to compile first, reuse existing buffer allocations, and sync the device inside the loop like so ...
The runtimes are much closer:
|
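The effect described in the comment above (one-time setup cost leaking into the measurement) can be illustrated without Halide or a GPU. The following is a minimal, library-free sketch, not the benchmark code from this thread: `FakePipeline`, `time_ms`, and `best_of_ms` are hypothetical names, and the 20 ms sleep is only an assumed stand-in for JIT compilation and output buffer allocation on the first call.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <limits>
#include <thread>

// Hypothetical stand-in for a JIT-compiled pipeline: the first call pays a
// one-time setup cost, every later call only does the steady-state work.
struct FakePipeline {
    int setup_count = 0;   // how many times the one-time setup ran
    double result = 0.0;   // output of the last run

    void run() {
        if (setup_count == 0) {
            // Assumed placeholder for compilation and buffer allocation.
            std::this_thread::sleep_for(std::chrono::milliseconds(20));
            ++setup_count;
        }
        // Steady-state work: a small deterministic computation.
        double acc = 0.0;
        for (int i = 0; i < 100000; ++i) {
            acc += i * 0.5;
        }
        result = acc;
    }
};

// Wall-clock time of a single call, in milliseconds.
inline double time_ms(const std::function<void()> &fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Runs `warmup` untimed calls first, then returns the fastest of `samples`
// timed calls, so one-time setup does not pollute the reported number.
inline double best_of_ms(const std::function<void()> &fn, int warmup, int samples) {
    for (int i = 0; i < warmup; ++i) {
        fn();
    }
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < samples; ++i) {
        best = std::min(best, time_ms(fn));
    }
    return best;
}
```

Timing the first `run()` with `time_ms` and then calling `best_of_ms` with at least one warm-up iteration gives two separate numbers: the cold-start cost and the steady-state cost. On a real GPU backend the timed callable would also need to wait for the device to finish (the device sync mentioned above); otherwise the timer only measures how long it takes to enqueue the work.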
Thanks a lot, will update the benchmark. Perhaps this fixes the WebGPU slowness as well... |
Specifically, reduce the number of wait calls, and remove any potential bottlenecks in the kernel submission. More importantly ... the performance_async_gpu test should pass!
Overall performance should be on par with other GPU backends such as OpenCL, Metal, and CUDA.