Kuramoto-Sivashinsky algorithm benchmark (original benchmark).
This benchmark is dominated by the cost of the FFT, which leads to worse results for OpenCL with CLFFT compared to the faster CUFFT. For the same reason, the multithreaded backend doesn't improve much over base, since both use the same FFT implementation. Result of the benchmarked PDE:
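To illustrate why the FFT dominates, here is a minimal sketch of one pseudospectral time step, assuming a standard semi-implicit formulation; the function and variable names are illustrative and not the benchmark's actual code:

```julia
using FFTW

# Minimal sketch (not the benchmark code): one semi-implicit pseudospectral step
# of the Kuramoto-Sivashinsky equation u_t = -u*u_x - u_xx - u_xxxx on a
# periodic domain. The FFT/IFFT calls per step are what dominates the runtime.
function ks_step(u, k, dt)
    û = fft(u)                              # transform to Fourier space
    N̂ = -0.5im .* k .* fft(u .^ 2)          # nonlinear term -(1/2) * d/dx(u^2)
    L = k .^ 2 .- k .^ 4                    # linear operator for -u_xx - u_xxxx
    û = (û .+ dt .* N̂) ./ (1 .- dt .* L)    # treat the stiff linear part implicitly
    return real(ifft(û))
end

n = 256
k = fftfreq(n, n)                           # integer wavenumbers in FFT order
u = 0.1 .* randn(n)
u = ks_step(u, k, 0.01)
```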
The Julia set benchmark. The unrolled variant uses generated functions to emit an unrolled version of the inner loop. This currently doesn't yield a speed-up, although it was quite a bit faster in initial tests. Why it slowed down needs further investigation; possibly N == 16 inner iterations is too large an unroll factor. Image of the benchmarked Julia set:
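As a rough illustration of the unrolling approach (a sketch, not the package's actual implementation; the function name and the Val-based iteration count are assumptions), a generated function can splice N copies of the inner iteration into straight-line code:

```julia
# Hypothetical sketch of unrolling the Julia-set inner loop with a generated
# function: for a fixed N, the loop body is emitted N times with no branch back.
@generated function juliaset_unrolled(z0::Complex{T}, c::Complex{T}, ::Val{N}) where {T, N}
    body = Expr(:block)
    for i in 1:N
        push!(body.args, quote
            abs2(z) > T(4) && return $(i - 1)  # point escaped: return iteration count
            z = z * z + c
        end)
    end
    return quote
        z = z0
        $body
        return N
    end
end

# count iterations until |z| > 2, with the loop unrolled 16 times
juliaset_unrolled(ComplexF32(0.3, 0.5), ComplexF32(-0.8, 0.156), Val(16))
```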
Black-Scholes is a nice benchmark for broadcasting performance. It does a moderately heavy calculation per array element, and each element is computed completely independently of the others. The CuArray package is a bit slower here compared to GPUArrays, which should be straightforward to fix; I suspect it's due to additional promotions between integer types in the indexing code.
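For reference, here is a simplified sketch of the kind of element-wise kernel involved, assuming the textbook Black-Scholes call-price formula rather than the benchmark's exact code; broadcasting it over whole arrays is what the benchmark measures:

```julia
using SpecialFunctions: erf

# Simplified sketch: scalar Black-Scholes call price, meant to be broadcast
# over arrays so every element is computed independently.
cndf(x) = 0.5f0 * (1f0 + erf(x / sqrt(2f0)))         # cumulative normal CDF

function blackscholes_call(S, K, r, σ, T)
    d1 = (log(S / K) + (r + 0.5f0 * σ^2) * T) / (σ * sqrt(T))
    d2 = d1 - σ * sqrt(T)
    return S * cndf(d1) - K * exp(-r * T) * cndf(d2)
end

# Broadcasting the scalar kernel; with a GPU array type the same dot call
# fuses into a single kernel (the plain Arrays here are just for illustration).
S = 1f0 .+ 100f0 .* rand(Float32, 10^6)
K = 1f0 .+ 100f0 .* rand(Float32, 10^6)
prices = blackscholes_call.(S, K, 0.01f0, 0.2f0, 1f0)
```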
Poincaré section of a chaotic neuronal network. OpenCL's dominance in this benchmark might be due to better use of vector intrinsics in Transpiler.jl, but this needs more investigation. Result of the calculation:
device | N = 10³ | N = 10⁹ |
---|---|---|
clarrays gpu | 4.928e-5 s (0.0827x) | 0.1 s (303.1x) |
clarrays cpu | 5.2626e-5 s (0.0775x) | 1.3 s (34.1x) |
gpuarrays threaded | 0.0003 s (0.0126x) | 7.3 s (6.1x) |
julia base | 4.078e-6 s (1.0x) | 44.4 s (1.0x) |
Mapreduce, e.g. sum!. Interestingly, for the sum benchmark the ArrayFire OpenCL backend is the fastest, while the GPUArrays OpenCL backend is the slowest. Since both target the same hardware, it should be possible to remove the slowdown for GPUArrays + OpenCL, and perhaps also to speed up the CUDA backends.
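For context, a minimal sketch of the kind of call this benchmark times; the array size is illustrative, and each backend would supply its own array type in place of the plain Array:

```julia
# Plain-Julia reference for the reduction benchmark; GPU backends provide the
# same interface for their own array types.
x   = rand(Float32, 10^7)
out = similar(x, 1)
sum!(out, x)               # in-place reduction, as in the benchmark name
mapreduce(abs2, +, x)      # the general mapreduce pattern that sum builds on
```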