[vulkan] Improve overall performance #7202

Open
derek-gerstmann opened this issue Dec 5, 2022 · 7 comments
@derek-gerstmann
Contributor

Specifically, reduce the number of wait calls, and remove any potential bottlenecks in the kernel submission path. More importantly ... the performance_async_gpu test should pass!

Overall performance should be on par with other GPU backends like OpenCL, Metal, CUDA, etc.
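
As a rough illustration (not from the original report), here is a minimal sketch of how per-dispatch submission overhead could be isolated from kernel work, assuming the JIT API and the halide_benchmark.h helpers used elsewhere in this thread:

#include "Halide.h"
#include "halide_benchmark.h"

using namespace Halide;
using namespace Halide::Tools;

int main(int argc, char **argv) {
    Target target = get_jit_target_from_environment();
    if (!target.has_gpu_feature()) {
        printf("[SKIP] No GPU feature in the JIT target.\n");
        return 0;
    }

    // A nearly empty kernel: any per-iteration time measured here is dominated
    // by submission, wait, and synchronization overhead rather than compute.
    Var x, y, xo, yo, xi, yi;
    Func trivial{"trivial"};
    trivial(x, y) = cast<float>(x + y);
    trivial.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);

    trivial.compile_jit(target);      // keep JIT compilation out of the timed loop
    Buffer<float> out(256, 256);      // reuse one output allocation

    Tools::BenchmarkConfig cfg = {0.2, 1.0};
    double t = benchmark([&]() {
        trivial.realize(out);
        out.device_sync();            // wait for the GPU to finish each dispatch
    }, cfg);
    printf("per-dispatch overhead: %f ms\n", t * 1e3);
    return 0;
}

Comparing this number across a Vulkan and a CUDA HL_JIT_TARGET would show how much of the gap is pure submission/wait overhead rather than codegen quality.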

@mcourteaux
Contributor

I got very suspicious while working on performance tests for fast arctan. My test results are highly variable, and they seem to come in discrete increments:

JIT compiling fast_atan2_4 for x86-64-linux-tune_znver1-avx-avx2-f16c-fma-jit-sse41-user_context-vk_v13-vulkan
                  atan: 0.176901 ns per atan
 fast_atan (MAE 1e-02): 0.173092 ns per atan ( 2.2% faster)  [per invokation: 11.616016 ms]
 fast_atan (MAE 1e-03): 0.172300 ns per atan ( 2.6% faster)  [per invokation: 11.562847 ms]
 fast_atan (MAE 1e-04): 0.172323 ns per atan ( 2.6% faster)  [per invokation: 11.564432 ms]
 fast_atan (MAE 1e-05): 0.172705 ns per atan ( 2.4% faster)  [per invokation: 11.590015 ms]
 fast_atan (MAE 1e-06): 0.173716 ns per atan ( 1.8% faster)  [per invokation: 11.657883 ms]

                  atan2: 0.182086 ns per atan2
 fast_atan2 (MAE 1e-02): 0.174972 ns per atan2 ( 3.9% faster)  [per invokation: 11.742171 ms]
 fast_atan2 (MAE 1e-03): 0.174859 ns per atan2 ( 4.0% faster)  [per invokation: 11.734573 ms]
 fast_atan2 (MAE 1e-04): 0.175999 ns per atan2 ( 3.3% faster)  [per invokation: 11.811114 ms]
 fast_atan2 (MAE 1e-05): 0.176096 ns per atan2 ( 3.3% faster)  [per invokation: 11.817596 ms]
 fast_atan2 (MAE 1e-06): 0.176075 ns per atan2 ( 3.3% faster)  [per invokation: 11.816217 ms]

Another run:

                  atan: 0.176924 ns per atan
 fast_atan (MAE 1e-02): 0.172724 ns per atan ( 2.4% faster)  [per invokation: 11.591305 ms]
 fast_atan (MAE 1e-03): 0.173269 ns per atan ( 2.1% faster)  [per invokation: 11.627858 ms]
 fast_atan (MAE 1e-04): 0.174131 ns per atan ( 1.6% faster)  [per invokation: 11.685726 ms]
 fast_atan (MAE 1e-05): 0.173564 ns per atan ( 1.9% faster)  [per invokation: 11.647658 ms]
 fast_atan (MAE 1e-06): 0.346123 ns per atan (-95.6% faster)  [per invokation: 23.227917 ms]

                  atan2: 0.182132 ns per atan2
 fast_atan2 (MAE 1e-02): 0.175971 ns per atan2 ( 3.4% faster)  [per invokation: 11.809239 ms]
 fast_atan2 (MAE 1e-03): 0.175526 ns per atan2 ( 3.6% faster)  [per invokation: 11.779378 ms]
 fast_atan2 (MAE 1e-04): 0.176735 ns per atan2 ( 3.0% faster)  [per invokation: 11.860472 ms]
 fast_atan2 (MAE 1e-05): 0.177133 ns per atan2 ( 2.7% faster)  [per invokation: 11.887211 ms]
 fast_atan2 (MAE 1e-06): 0.360196 ns per atan2 (-97.8% faster)  [per invokation: 24.172320 ms]

They are all hovering around this 11.7 ms time, and sometimes, when the test doesn't hit the expected performance, it lands at almost exactly double that: 24 ms. Compare that to CUDA:

                  atan: 0.014434 ns per atan
 fast_atan (MAE 1e-02): 0.007271 ns per atan (49.6% faster)  [per invokation: 0.487923 ms]
 fast_atan (MAE 1e-03): 0.007490 ns per atan (48.1% faster)  [per invokation: 0.502641 ms]
 fast_atan (MAE 1e-04): 0.007792 ns per atan (46.0% faster)  [per invokation: 0.522928 ms]
 fast_atan (MAE 1e-05): 0.008710 ns per atan (39.7% faster)  [per invokation: 0.584539 ms]
 fast_atan (MAE 1e-06): 0.009016 ns per atan (37.5% faster)  [per invokation: 0.605042 ms]

                  atan2: 0.014800 ns per atan2
 fast_atan2 (MAE 1e-02): 0.009493 ns per atan2 (35.9% faster)  [per invokation: 0.637034 ms]
 fast_atan2 (MAE 1e-03): 0.009774 ns per atan2 (34.0% faster)  [per invokation: 0.655949 ms]
 fast_atan2 (MAE 1e-04): 0.010010 ns per atan2 (32.4% faster)  [per invokation: 0.671784 ms]
 fast_atan2 (MAE 1e-05): 0.010671 ns per atan2 (27.9% faster)  [per invokation: 0.716130 ms]
 fast_atan2 (MAE 1e-06): 0.010944 ns per atan2 (26.1% faster)  [per invokation: 0.734416 ms]
Success!

These get gradually slower as the accuracy increases, and are about 20 times faster than Vulkan (or 40 times in the case of the worst-case outliers).

I'm even thinking Vulkan is waiting on vsync or something...

@mcourteaux
Contributor

mcourteaux commented Aug 12, 2024

Hmm, perf shows calls to _atanf, so maybe I'm not even using the GPU... One CPU thread goes to 100%, and NVIDIA-SMI doesn't show any activity. HL_DEBUG_CODEGEN=1 shows that codegen does effectively produce SPIR-V... I'm puzzled...
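
A quick sanity check for this (just a sketch, assuming the usual Target API) is to confirm whether HL_JIT_TARGET actually resolved to a Vulkan-enabled target, rather than silently falling back to the CPU schedule:

#include "Halide.h"
#include <cstdio>

using namespace Halide;

int main(int argc, char **argv) {
    Target target = get_jit_target_from_environment();

    // Print exactly what HL_JIT_TARGET resolved to.
    printf("JIT target: %s\n", target.to_string().c_str());

    // If no GPU feature is set, the test's has_gpu_feature() branch picks the
    // CPU schedule, which would explain _atanf in perf and one core at 100%.
    printf("Vulkan feature: %s\n",
           target.has_feature(Target::Vulkan) ? "yes" : "no");
    printf("Any GPU feature: %s\n",
           target.has_gpu_feature() ? "yes" : "no");
    return 0;
}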

@derek-gerstmann
Contributor Author

Testing this on main, using the existing atan methods with this trimmed-down version of your performance test:

#include "Halide.h"
#include "halide_benchmark.h"

#ifndef M_PI
#define M_PI 3.14159265358979310000
#endif

using namespace Halide;
using namespace Halide::Tools;

int main(int argc, char **argv) {
    Target target = get_jit_target_from_environment();
    if (target.arch == Target::WebAssembly) {
        printf("[SKIP] Performance tests are meaningless and/or misleading under WebAssembly interpreter.\n");
        return 0;
    }
    if (target.has_feature(Target::WebGPU)) {
        printf("[SKIP] WebGPU seems to perform bad, and fast_atan is not really faster in all scenarios.\n");
        return 0;
    }

    Var x, y;
    const int test_w = 256;
    const int test_h = 256;

    Expr t0 = x / float(test_w);
    Expr t1 = y / float(test_h);

    // To make sure we mostly time the computation of the arctan, and not memory bandwidth,
    // we will compute many arctans per output and sum them. In my testing, GPUs suffer more
    // from bandwidth with this test, so we give them more arctangents to compute per output.

    const int test_d = target.has_gpu_feature() ? 1024 : 64;
    RDom rdom{0, test_d};
    Expr off = rdom / float(test_d) - 0.5f;

    float range = -10.0f;
    Func atan_ref{"atan_ref"}, atan2_ref{"atan2_ref"};
    atan_ref(x, y) = sum(atan(-range * t0 + (1 - t0) * range + off));
    atan2_ref(x, y) = sum(atan2(-range * t0 + (1 - t0) * range + off, -range * t1 + (1 - t1) * range));

    Var xo, xi;
    Var yo, yi;
    if (target.has_gpu_feature()) {
        atan_ref.never_partition_all();
        atan2_ref.never_partition_all();
        atan_ref.gpu_tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::ShiftInwards);
        atan2_ref.gpu_tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::ShiftInwards);
    } else {
        atan_ref.vectorize(x, 8);
        atan2_ref.vectorize(x, 8);
    }

    Tools::BenchmarkConfig cfg = {0.2, 1.0};
    double scale = 1e9 / (double(test_w) * (test_h * test_d));

    // clang-format off
    double t_atan  = scale * benchmark([&]() {  atan_ref.realize({test_w, test_h}); }, cfg);
    double t_atan2 = scale * benchmark([&]() { atan2_ref.realize({test_w, test_h}); }, cfg);
    // clang-format on

    printf("                  atan: %f ns per atan\n", t_atan);
    printf("                  atan2: %f ns per atan2\n", t_atan2);
    printf("Success!\n");
    return 0;
}

@derek-gerstmann
Copy link
Contributor Author

derek-gerstmann commented Aug 12, 2024

> HL_SPIRV_DUMP_FILE=atan.spirv HL_JIT_TARGET="host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13" ./build/test/performance/performance_fast_atan 
>  spirv-dis atan.spirv
; SPIR-V
; Version: 1.2
; Generator: Khronos; 0
; Bound: 107
; Schema: 0
               OpCapability Shader
         %53 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %_kernel_atan2_ref_s0_v1_v9_block_id_y "_kernel_atan2_ref_s0_v1_v9_block_id_y" %k1_LocalInvocationId %k1_WorkgroupId
               OpExecutionMode %_kernel_atan2_ref_s0_v1_v9_block_id_y LocalSize 16 16 1
               OpName %_kernel_atan2_ref_s0_v1_v9_block_id_y "_kernel_atan2_ref_s0_v1_v9_block_id_y"
               OpName %k1_LocalInvocationId "k1_LocalInvocationId"
               OpName %k1_WorkgroupId "k1_WorkgroupId"
               OpName %k1_args_struct "k1_args_struct"
               OpName %k1_args_var "k1_args_var"
               OpName %k1_buffer_block1 "k1_buffer_block1"
               OpName %k1_atan2_ref "k1_atan2_ref"
               OpName %k1_sum_1_0 "k1_sum$1.0"
               OpName %k1_loop_idx_1 "k1_loop_idx$1"
               OpDecorate %k1_LocalInvocationId BuiltIn LocalInvocationId
               OpDecorate %k1_WorkgroupId BuiltIn WorkgroupId
               OpMemberDecorate %k1_args_struct 0 Offset 0
               OpMemberDecorate %k1_args_struct 1 Offset 4
               OpMemberDecorate %k1_args_struct 2 Offset 8
               OpMemberDecorate %k1_args_struct 3 Offset 12
               OpMemberDecorate %k1_args_struct 4 Offset 16
               OpMemberDecorate %k1_args_struct 5 Offset 20
               OpDecorate %k1_args_struct Block
               OpDecorate %k1_args_var DescriptorSet 0
               OpDecorate %k1_args_var Binding 0
               OpDecorate %_runtimearr_float ArrayStride 4
               OpDecorate %k1_buffer_block1 BufferBlock
               OpMemberDecorate %k1_buffer_block1 0 Offset 0
               OpDecorate %k1_atan2_ref DescriptorSet 0
               OpDecorate %k1_atan2_ref Binding 1
       %void = OpTypeVoid
          %4 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %v3uint = OpTypeVector %uint 3
%_ptr_Input_v3uint = OpTypePointer Input %v3uint
        %int = OpTypeInt 32 1
%k1_args_struct = OpTypeStruct %int %int %int %int %int %int
%_ptr_Uniform_k1_args_struct = OpTypePointer Uniform %k1_args_struct
%_ptr_Uniform_int = OpTypePointer Uniform %int
      %float = OpTypeFloat 32
%_runtimearr_float = OpTypeRuntimeArray %float
%k1_buffer_block1 = OpTypeStruct %_runtimearr_float
%_ptr_Uniform_k1_buffer_block1 = OpTypePointer Uniform %k1_buffer_block1
%_ptr_Function_float = OpTypePointer Function %float
%_ptr_Function_int = OpTypePointer Function %int
       %bool = OpTypeBool
%_ptr_Uniform_float = OpTypePointer Uniform %float
     %uint_0 = OpConstant %uint 0
     %uint_1 = OpConstant %uint 1
     %uint_2 = OpConstant %uint 2
     %uint_3 = OpConstant %uint 3
     %uint_4 = OpConstant %uint 4
     %uint_5 = OpConstant %uint 5
     %int_16 = OpConstant %int 16
    %int_n16 = OpConstant %int -16
    %float_0 = OpConstant %float 0
      %int_0 = OpConstant %int 0
   %float_80 = OpConstant %float 80
%float_0_078125 = OpConstant %float 0.078125
  %float_n10 = OpConstant %float -10
   %int_1024 = OpConstant %int 1024
%float_0_0009765625 = OpConstant %float 0.0009765625
%float_n10_5 = OpConstant %float -10.5
      %int_1 = OpConstant %int 1
%k1_LocalInvocationId = OpVariable %_ptr_Input_v3uint Input
%k1_WorkgroupId = OpVariable %_ptr_Input_v3uint Input
%k1_args_var = OpVariable %_ptr_Uniform_k1_args_struct Uniform
%k1_atan2_ref = OpVariable %_ptr_Uniform_k1_buffer_block1 Uniform
%_kernel_atan2_ref_s0_v1_v9_block_id_y = OpFunction %void None %4
          %5 = OpLabel
 %k1_sum_1_0 = OpVariable %_ptr_Function_float Function
%k1_loop_idx_1 = OpVariable %_ptr_Function_int Function
         %10 = OpLoad %v3uint %k1_LocalInvocationId None
         %12 = OpLoad %v3uint %k1_WorkgroupId None
         %19 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_0
         %20 = OpLoad %int %19 None
         %22 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_1
         %23 = OpLoad %int %22 None
         %25 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_2
         %26 = OpLoad %int %25 None
         %28 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_3
         %29 = OpLoad %int %28 None
         %31 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_4
         %32 = OpLoad %int %31 None
         %34 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_5
         %35 = OpLoad %int %34 None
         %41 = OpCompositeExtract %uint %12 1
         %42 = OpBitcast %int %41
         %43 = OpCompositeExtract %uint %12 0
         %44 = OpBitcast %int %43
         %45 = OpCompositeExtract %uint %10 1
         %46 = OpBitcast %int %45
         %47 = OpCompositeExtract %uint %10 0
         %48 = OpBitcast %int %47
         %50 = OpIMul %int %42 %int_16
         %52 = OpIAdd %int %23 %int_n16
         %54 = OpExtInst %int %53 SMin %50 %52
         %57 = OpIMul %int %44 %int_16
         %58 = OpIAdd %int %20 %int_n16
         %59 = OpExtInst %int %53 SMin %57 %58
               OpStore %k1_sum_1_0 %float_0 None
         %62 = OpIAdd %int %26 %59
         %63 = OpIAdd %int %62 %48
         %64 = OpConvertSToF %float %63
         %66 = OpFMul %float %64 %float_80
         %67 = OpIAdd %int %29 %54
         %68 = OpIAdd %int %67 %46
         %69 = OpConvertSToF %float %68
         %71 = OpFMul %float %69 %float_0_078125
         %73 = OpFAdd %float %71 %float_n10
         %76 = OpIAdd %int %int_0 %int_1024
               OpStore %k1_loop_idx_1 %int_0 None
               OpBranch %78
         %78 = OpLabel
               OpLoopMerge %82 %81 DontUnroll
               OpBranch %79
         %79 = OpLabel
         %83 = OpLoad %int %k1_loop_idx_1 None
         %85 = OpULessThan %bool %83 %76
               OpBranchConditional %85 %80 %82
         %80 = OpLabel
         %86 = OpConvertSToF %float %83
         %87 = OpFAdd %float %66 %86
         %89 = OpFMul %float %87 %float_0_0009765625
         %91 = OpFAdd %float %89 %float_n10_5
         %92 = OpExtInst %float %53 Atan2 %91 %73
         %93 = OpLoad %float %k1_sum_1_0 None
         %94 = OpFAdd %float %92 %93
               OpStore %k1_sum_1_0 %94 None
               OpBranch %81
         %81 = OpLabel
         %97 = OpLoad %int %k1_loop_idx_1 None
         %95 = OpIAdd %int %97 %int_1
               OpStore %k1_loop_idx_1 %95 None
               OpBranch %78
         %82 = OpLabel
         %98 = OpLoad %float %k1_sum_1_0 None
         %99 = OpIAdd %int %29 %54
        %100 = OpIAdd %int %99 %46
        %101 = OpIMul %int %100 %32
        %102 = OpIAdd %int %59 %35
        %103 = OpIAdd %int %101 %102
        %104 = OpIAdd %int %103 %48
        %106 = OpInBoundsAccessChain %_ptr_Uniform_float %k1_atan2_ref %uint_0 %104
               OpStore %106 %98 None
               OpReturn
               OpFunctionEnd

So the current atan2 is getting mapped to the GLSL.std.450 Atan2 extended instruction in SPIR-V (see %92).

@derek-gerstmann
Contributor Author

Running this, I'm getting the following on an NVIDIA RTX 3070 Ti ...

> HL_JIT_TARGET="host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13" ./build/test/performance/performance_fast_atan 
                  atan: 0.077941 ns per atan
                  atan2: 0.081444 ns per atan2
Success!

And for CUDA ...

> HL_JIT_TARGET="host-cuda" ./build/test/performance/performance_fast_atan 
                  atan: 0.005477 ns per atan
                  atan2: 0.007064 ns per atan2
Success!

However, the test is calling realize({dimx, dimy}), which will compile and cache the pipeline on the first call, and allocate and cache the output buffer. So the overhead is significant for this type of test.

@derek-gerstmann
Contributor Author

If I change the benchmarking code to compile first, reuse existing buffer allocations, and sync the device in the loop, like so ...

...

    atan_ref.compile_jit();
    atan2_ref.compile_jit();

    Buffer<float> atan_out(test_w, test_h);
    Buffer<float> atan2_out(test_w, test_h);

    Tools::BenchmarkConfig cfg = {0.2, 1.0};
    double scale = 1e9 / (double(test_w) * (test_h * test_d));

    // clang-format off
    double t_atan  = scale * benchmark([&]() {  atan_ref.realize(atan_out); atan_out.device_sync(); }, cfg);
    double t_atan2 = scale * benchmark([&]() { atan2_ref.realize(atan2_out); atan2_out.device_sync(); }, cfg);
    // clang-format on
...

The runtimes are much closer:

> HL_JIT_TARGET="host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13" ./build/test/performance/performance_fast_atan 
                  atan: 0.004023 ns per atan
                  atan2: 0.007173 ns per atan2
Success!
> HL_JIT_TARGET="x86-64-linux-tune_znver3-avx-avx2-f16c-fma-sse41-cuda" ./build/test/performance/performance_fast_atan 
                  atan: 0.005034 ns per atan
                  atan2: 0.006537 ns per atan2
Success!

@mcourteaux
Contributor

Thanks a lot, will update the benchmark. Perhaps this fixes the WebGPU slowness as well...
