
CUDA: mul_mat_vec_q tiling, refactor mul mat logic #5434

Merged

Conversation

JohannesGaessler
Collaborator

This PR does the following:

  • Refactor ggml_cuda_mul_mat and simplify the logic for choosing between different matrix multiplication kernels. This also fixes P100s not being treated as having good FP16 performance.
  • Add tiling in src0 rows to mul_mat_vec_q (a rough sketch of the idea follows this list). This increases arithmetic intensity and results in higher t/s for batch sizes > 1. Together with a reduction in the number of warps for batch sizes 5-8 (to reduce register pressure), this mostly fixes the performance regression sometimes seen when increasing the batch size. I increased the maximum batch size for mul_mat_vec_q to 8. There are still some cases where performance regresses slightly as the batch size increases if you get unlucky with occupancy, but these cases should be rare now.
  • Refactor mul_mat_vec_q to use as few registers as possible so that more are available for loop unrolling. This increases performance on average, but it is again possible to get unlucky with occupancy, so there are also some cases where performance is ~1% worse.
  • Remove the option to compile a mul_mat_vec_q kernel with variable ncols_y. With a batch size of 8 the kernel already seems to be hitting its limits, and making ncols_y purely a template parameter makes it possible to reduce compilation time.
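For readers unfamiliar with the technique, here is a hedged, minimal sketch of what tiling in src0 rows means, using plain floats and hypothetical names rather than the quantized blocks the real mul_mat_vec_q operates on: each block accumulates a small tile of rows at once, so every value loaded from the other operand is reused across the whole tile, which is what raises the arithmetic intensity. In the quantized kernels the same kind of reuse also applies across the up to 8 src1 columns (the batch), which is where the higher t/s for batch sizes > 1 comes from.

```cuda
// Illustrative sketch only, not the actual mul_mat_vec_q kernel.
#include <cuda_runtime.h>

#define WARP_SIZE 32

template <int NROWS> // number of src0 rows per block, i.e. the tile size
__global__ void mul_mat_vec_tiled(const float * __restrict__ A,   // nrows x ncols, row-major
                                  const float * __restrict__ y,   // ncols
                                  float       * __restrict__ dst, // nrows
                                  const int ncols, const int nrows) {
    const int row0 = blockIdx.x * NROWS; // first row of this block's tile
    const int tid  = threadIdx.x;

    float sum[NROWS] = {0.0f}; // one accumulator per row in the tile

    for (int col = tid; col < ncols; col += WARP_SIZE) {
        const float yv = y[col]; // loaded once, reused for all NROWS rows
#pragma unroll
        for (int i = 0; i < NROWS; ++i) {
            if (row0 + i < nrows) {
                sum[i] += A[(long) (row0 + i) * ncols + col] * yv;
            }
        }
    }

    // warp-level reduction; the kernel is launched with 32 threads per block
#pragma unroll
    for (int i = 0; i < NROWS; ++i) {
        for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
            sum[i] += __shfl_down_sync(0xffffffff, sum[i], offset);
        }
        if (tid == 0 && row0 + i < nrows) {
            dst[row0 + i] = sum[i];
        }
    }
}

static void mul_mat_vec_tiled_cuda(const float * A, const float * y, float * dst,
                                   const int ncols, const int nrows, cudaStream_t stream) {
    constexpr int NROWS = 4; // example tile size, not the value used in the PR
    const dim3 block(WARP_SIZE, 1, 1);
    const dim3 grid((nrows + NROWS - 1) / NROWS, 1, 1);
    mul_mat_vec_tiled<NROWS><<<grid, block, 0, stream>>>(A, y, dst, ncols, nrows);
}
```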

I'm currently too tired to re-run all of the performance tests; I'll post results tomorrow.

Because of the loop unrolling for the new kernels, the compilation time has increased. On master the command

make clean && time make LLAMA_CUBLAS=1 LLAMA_NO_CCACHE=1 main

reports 13.443 s; with this PR it's 18.691 s. It may make sense to add an option like LLAMA_FAST_COMPILE that reduces compilation time as much as possible.

@cebtenzzre
Collaborator

reports 13.443 s; with this PR it's 18.691 s. It may make sense to add an option like LLAMA_FAST_COMPILE that reduces compilation time as much as possible.

I hate to say it, but this is one of the downsides of splitting a repo into as few files as possible - compilation is very serial and inefficient. cmake is probably worse right now because it seems to be building and archiving static libraries before compiling the examples and tests that depend on them.

@slaren
Collaborator

slaren commented Feb 9, 2024

In this case the reason for the high compilation time is the number of template instantiations. I am not sure that anything short of putting each kernel in a different source file would help with that. I would still prefer to split the sources into more files, though, even if just to ease working with the code. I find it very hard to work with a 10k LOC file.

@Artefact2
Collaborator

Artefact2 commented Feb 9, 2024

This commit seems to fail badly on ROCm.

% ./llama-bench -m ../models/Chronomaid-Storytelling-13b-Q4_K_S.gguf 
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6750 XT, compute capability 10.3, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
Memory access fault by GPU node-1 (Agent handle: 0x59aef28eb440) on address 0x7a5895227000. Reason: Page not present or supervisor privilege.
% journalctl -b
Feb 09 23:39:51 Silmeria kernel: gmc_v10_0_process_interrupt: 371 callbacks suppressed
Feb 09 23:39:51 Silmeria kernel: amdgpu 0000:0a:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32772, for process llama-bench pid 4528 thread llama-bench pid 4528)
Feb 09 23:39:51 Silmeria kernel: amdgpu 0000:0a:00.0: amdgpu:   in page starting at address 0x00007a5895200000 from client 0x1b (UTCL2)
Feb 09 23:39:51 Silmeria kernel: amdgpu 0000:0a:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00801031
Feb 09 23:39:51 Silmeria kernel: amdgpu 0000:0a:00.0: amdgpu:          Faulty UTCL2 client ID: TCP (0x8)
Feb 09 23:39:51 Silmeria kernel: amdgpu 0000:0a:00.0: amdgpu:          MORE_FAULTS: 0x1
Feb 09 23:39:51 Silmeria kernel: amdgpu 0000:0a:00.0: amdgpu:          WALKER_ERROR: 0x0
Feb 09 23:39:51 Silmeria kernel: amdgpu 0000:0a:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Feb 09 23:39:51 Silmeria kernel: amdgpu 0000:0a:00.0: amdgpu:          MAPPING_ERROR: 0x0
Feb 09 23:39:51 Silmeria kernel: amdgpu 0000:0a:00.0: amdgpu:          RW: 0x0
(gdb) thread apply all bt

Thread 6 (Thread 0x7ffee82ff6c0 (LWP 5562) "Hostcall Listen"):
#0  __GI___ioctl (fd=4, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1  0x00007fffa31fd7f1 in ?? () from /opt/rocm/lib/libhsakmt.so.1
#2  0x00007fffa31f5a21 in hsaKmtWaitOnMultipleEvents_Ext () from /opt/rocm/lib/libhsakmt.so.1
#3  0x00007fffa31f629c in hsaKmtWaitOnEvent_Ext () from /opt/rocm/lib/libhsakmt.so.1
#4  0x00007fff39c65921 in rocr::core::InterruptSignal::WaitRelaxed (this=0x55556b28ad10, condition=HSA_SIGNAL_CONDITION_NE, compare_value=1, timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_BLOCKED) at /usr/src/debug/hsa-rocr/ROCR-Runtime-rocm-6.0.0/src/core/runtime/interrupt_signal.cpp:243
#5  0x00007fff39c6562e in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>, compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>) at /usr/src/debug/hsa-rocr/ROCR-Runtime-rocm-6.0.0/src/core/runtime/interrupt_signal.cpp:251
#6  0x00007fff39c58b41 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_NE, compare_value=1, timeout_hint=4000000, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /usr/src/debug/hsa-rocr/ROCR-Runtime-rocm-6.0.0/src/core/runtime/hsa.cpp:1220
#7  0x00007ffff6b21d32 in roc::Signal::Wait (timeout=4000000, c=device::Signal::Condition::Ne, value=1, this=<optimized out>) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/rocclr/device/rocm/rocsignal.cpp:43
#8  HostcallListener::consumePackets (this=0x55556b275d20) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/rocclr/device/devhostcall.cpp:282
#9  HostcallListener::Thread::run (this=<optimized out>, data=0x55556b275d20) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/rocclr/device/devhostcall.cpp:237
#10 0x00007ffff6ad5532 in amd::Thread::main (this=0x55556b275db8) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/rocclr/thread/thread.cpp:93
#11 amd::Thread::entry (thread=0x55556b275db8) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/rocclr/os/os_posix.cpp:351
#12 0x00007fffa32a955a in start_thread (arg=<optimized out>) at pthread_create.c:447
#13 0x00007fffa3326a3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 2 (Thread 0x7ffeec98f6c0 (LWP 5556) "llama-bench"):
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007fffa32ab393 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007fffa325a6c8 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fffa32424b8 in __GI_abort () at abort.c:79
#4  0x00007fff39c23514 in rocr::core::Runtime::VMFaultHandler (val=<optimized out>, arg=<optimized out>) at /usr/src/debug/hsa-rocr/ROCR-Runtime-rocm-6.0.0/src/core/runtime/runtime.cpp:1429
#5  0x00007fff39c7f642 in rocr::core::Runtime::AsyncEventsLoop () at /usr/include/c++/13.2.1/bits/stl_vector.h:1125
#6  0x00007fff39c27a6c in rocr::os::ThreadTrampoline (arg=<optimized out>) at /usr/src/debug/hsa-rocr/ROCR-Runtime-rocm-6.0.0/src/core/util/lnx/os_linux.cpp:80
#7  0x00007fffa32a955a in start_thread (arg=<optimized out>) at pthread_create.c:447
#8  0x00007fffa3326a3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 1 (Thread 0x7fffa367fc00 (LWP 5507) "llama-bench"):
#0  0x00007fffa32b97fb in __GI___libc_malloc (bytes=bytes@entry=600) at malloc.c:3347
#1  0x00007fffa34b089d in operator new (sz=sz@entry=600) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/new_op.cc:50
#2  0x00007ffff6a0b38b in amd::ReferenceCountedObject::operator new (size=<optimized out>, size=<optimized out>) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/rocclr/include/top.hpp:188
#3  ihipLaunchKernelCommand (command=@0x7fffffffbf68: 0x0, f=f@entry=0x555567bee390, globalWorkSizeX=globalWorkSizeX@entry=5120, globalWorkSizeY=globalWorkSizeY@entry=2, globalWorkSizeZ=globalWorkSizeZ@entry=1, blockDimX=blockDimX@entry=256, blockDimY=1, blockDimZ=1, sharedMemBytes=0, stream=0x555556968510, kernelParams=0x7fffffffc8d0, extra=0x0, startEvent=0x0, stopEvent=0x0, flags=0, params=0, gridId=0, numGrids=0, prevGridSum=0, allGridSum=0, firstDevice=0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_module.cpp:336
#4  0x00007ffff6a0bba1 in ihipModuleLaunchKernel (f=0x555567bee390, globalWorkSizeX=5120, globalWorkSizeY=2, globalWorkSizeZ=1, blockDimX=256, blockDimY=1, blockDimZ=1, sharedMemBytes=0, hStream=0x555556968510, kernelParams=0x7fffffffc8d0, extra=0x0, startEvent=0x0, stopEvent=0x0, flags=0, params=0, gridId=0, numGrids=0, prevGridSum=0, allGridSum=0, firstDevice=0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_module.cpp:393
#5  0x00007ffff6a2f0d4 in ihipLaunchKernel (hostFunction=0x555555d9c5d0 <quantize_q8_1(float const*, void*, int, int)>, gridDim=..., blockDim=..., args=0x7fffffffc8d0, sharedMemBytes=0, stream=0x555556968510, startEvent=0x0, stopEvent=0x0, flags=0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_platform.cpp:584
#6  0x00007ffff6a05f22 in hipLaunchKernel_common (hostFunction=0x555555d9c5d0 <quantize_q8_1(float const*, void*, int, int)>, gridDim=..., blockDim=..., args=0x7fffffffc8d0, sharedMemBytes=0, stream=<optimized out>) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_module.cpp:662
#7  0x00007ffff6a121bd in hipLaunchKernel (hostFunction=0x555555d9c5d0 <quantize_q8_1(float const*, void*, int, int)>, gridDim=..., blockDim=..., args=<optimized out>, sharedMemBytes=<optimized out>, stream=<optimized out>) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_module.cpp:669
--Type <RET> for more, q to quit, c to continue without paging--c
#8  0x00005555556f51a3 in __device_stub__quantize_q8_1(float const*, void*, int, int) ()
#9  0x00005555556f50a0 in quantize_row_q8_1_cuda (x=0x7ffd9fc00000, vy=0x7ffda5220000, kx=5120, ky=2, kx_padded=5120, stream=0x555556968510) at ggml-cuda.cu:6579
#10 0x00005555556eeddb in ggml_cuda_op_mul_mat (src0=0x555567cca7a0, src1=0x55556a2886f0, dst=0x55556a288a10, op=0x5555556f0770 <ggml_cuda_op_mul_mat_vec_q(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, ihipStream_t*)>, convert_src1_to_q8_1=true) at ggml-cuda.cu:9396
#11 0x00005555556d7d94 in ggml_cuda_mul_mat (src0=0x555567cca7a0, src1=0x55556a2886f0, dst=0x55556a288a10) at ggml-cuda.cu:9993
#12 0x00005555556d738a in ggml_cuda_compute_forward (params=0x7fffffffd8c8, tensor=0x55556a288a10) at ggml-cuda.cu:10622
#13 0x000055555570f15f in ggml_backend_cuda_graph_compute (backend=0x5555677b3a80, cgraph=0x55556a54bc28) at ggml-cuda.cu:11313
#14 0x000055555571a3fa in ggml_backend_graph_compute (backend=0x5555677b3a80, cgraph=0x55556a54bc28) at ggml-backend.c:256
#15 0x000055555571df74 in sched_compute_splits (sched=0x55556a54b8f0) at ggml-backend.c:1453
#16 0x000055555571e6ff in ggml_backend_sched_graph_compute (sched=0x55556a54b8f0, graph=0x55556a21b870) at ggml-backend.c:1579
#17 0x00005555555db894 in llama_decode_internal (lctx=..., batch=...) at llama.cpp:7308
#18 0x00005555555eb923 in llama_decode (ctx=0x555567be66a0, batch=...) at llama.cpp:11644
#19 0x0000555555745fd3 in test_prompt (ctx=0x555567be66a0, n_prompt=2, n_past=0, n_batch=512, n_threads=8) at examples/llama-bench/llama-bench.cpp:1111
#20 0x0000555555746701 in main (argc=3, argv=0x7fffffffe348) at examples/llama-bench/llama-bench.cpp:1212

Edit: fixed in 2bb97fc.

@slaren
Collaborator

slaren commented Feb 9, 2024

Some results
GPU Model Test t/s master t/s cuda-faster-mmvq-12 Speedup
RTX 3090 Ti llama 7B Q4_0 tg1 132.59 133.54 1.01
RTX 3090 Ti llama 7B Q4_0 pp2 258.57 260.18 1.01
RTX 3090 Ti llama 7B Q4_0 pp3 360.68 381.09 1.06
RTX 3090 Ti llama 7B Q4_0 pp4 429.91 489.20 1.14
RTX 3090 Ti llama 7B Q4_0 pp5 355.36 504.63 1.42
RTX 3090 Ti llama 7B Q4_0 pp6 425.44 577.27 1.36
RTX 3090 Ti llama 7B Q4_0 pp7 490.16 623.50 1.27
RTX 3090 Ti llama 7B Q4_0 pp8 560.78 691.22 1.23
RTX 3090 Ti llama 13B Q4_0 tg1 82.37 82.22 1.00
RTX 3090 Ti llama 13B Q4_0 pp2 160.77 160.70 1.00
RTX 3090 Ti llama 13B Q4_0 pp3 225.33 234.99 1.04
RTX 3090 Ti llama 13B Q4_0 pp4 269.30 301.56 1.12
RTX 3090 Ti llama 13B Q4_0 pp5 217.98 310.23 1.42
RTX 3090 Ti llama 13B Q4_0 pp6 261.57 352.30 1.35
RTX 3090 Ti llama 13B Q4_0 pp7 303.34 375.35 1.24
RTX 3090 Ti llama 13B Q4_0 pp8 345.41 416.29 1.21
RTX 3080 llama 7B Q4_0 tg1 112.12 110.22 0.98
RTX 3080 llama 7B Q4_0 pp2 217.45 218.05 1.00
RTX 3080 llama 7B Q4_0 pp3 299.76 315.32 1.05
RTX 3080 llama 7B Q4_0 pp4 354.66 410.39 1.16
RTX 3080 llama 7B Q4_0 pp5 306.18 430.59 1.41
RTX 3080 llama 7B Q4_0 pp6 363.81 491.54 1.35
RTX 3080 llama 7B Q4_0 pp7 422.88 516.66 1.22
RTX 3080 llama 7B Q4_0 pp8 481.83 547.08 1.14
RTX 3080 llama 13B Q4_0 tg1 67.23 66.62 0.99
RTX 3080 llama 13B Q4_0 pp2 131.38 131.61 1.00
RTX 3080 llama 13B Q4_0 pp3 182.14 191.19 1.05
RTX 3080 llama 13B Q4_0 pp4 211.87 242.29 1.14
RTX 3080 llama 13B Q4_0 pp5 168.92 254.30 1.51
RTX 3080 llama 13B Q4_0 pp6 202.04 289.77 1.43
RTX 3080 llama 13B Q4_0 pp7 233.61 299.19 1.28
RTX 3080 llama 13B Q4_0 pp8 265.20 314.00 1.18
Small script to automate using compare-llama-bench.py
#!/bin/bash

set -e
set -x

if [ $# -lt 2 ]; then
    echo "usage: ./scripts/compare-commits.sh <commit1> <commit2> [additional llama-bench arguments]"
    exit 1
fi

bench_args="${@:3}"

rm -f llama-bench.sqlite

git checkout $1
make clean && LLAMA_CUBLAS=1 make -j32 llama-bench
./llama-bench -o sql $bench_args | tee /dev/tty | sqlite3 llama-bench.sqlite

git checkout $2
make clean && LLAMA_CUBLAS=1 make -j32 llama-bench
./llama-bench -o sql $bench_args | tee /dev/tty | sqlite3 llama-bench.sqlite

./scripts/compare-llama-bench.py -b $1 -c $2

Example usage:

scripts/compare-commits.sh master cuda-faster-mmvq-12 -p 2,3,4,5,6,7,8 -n 1 -r 100

@ggerganov
Owner

ggerganov commented Feb 10, 2024

Results on V100, RTX 2060 and A100

bash scripts/compare-commits.sh master cuda-faster-mmvq-12 -p 2,3,4,5,6,7,8 -n 1 -r 100 -m $(echo models-mnt/open-llama/7B-v2/ggml-model-*.gguf | sed -e "s/ /,/g")
V100

Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes

GPU Model Test t/s master t/s cuda-faster-mmvq-12 Speedup
V100-PCIE-16GB llama 7B F16 pp2 88.54 88.54 1.00
V100-PCIE-16GB llama 7B F16 pp3 131.61 132.17 1.00
V100-PCIE-16GB llama 7B F16 pp4 175.00 175.30 1.00
V100-PCIE-16GB llama 7B F16 pp5 216.35 216.67 1.00
V100-PCIE-16GB llama 7B F16 pp6 251.15 252.34 1.00
V100-PCIE-16GB llama 7B F16 pp7 290.77 291.89 1.00
V100-PCIE-16GB llama 7B F16 pp8 331.40 332.58 1.00
V100-PCIE-16GB llama 7B F16 tg1 51.69 51.70 1.00
V100-PCIE-16GB llama 7B Q2_K_M pp2 164.05 174.92 1.07
V100-PCIE-16GB llama 7B Q2_K_M pp3 206.59 211.91 1.03
V100-PCIE-16GB llama 7B Q2_K_M pp4 232.01 246.09 1.06
V100-PCIE-16GB llama 7B Q2_K_M pp5 170.69 258.84 1.52
V100-PCIE-16GB llama 7B Q2_K_M pp6 205.15 279.95 1.36
V100-PCIE-16GB llama 7B Q2_K_M pp7 238.27 269.67 1.13
V100-PCIE-16GB llama 7B Q2_K_M pp8 270.43 284.50 1.05
V100-PCIE-16GB llama 7B Q2_K_M tg1 97.44 98.48 1.01
V100-PCIE-16GB llama 7B Q3_K_M pp2 155.60 159.43 1.02
V100-PCIE-16GB llama 7B Q3_K_M pp3 200.98 210.37 1.05
V100-PCIE-16GB llama 7B Q3_K_M pp4 238.34 249.48 1.05
V100-PCIE-16GB llama 7B Q3_K_M pp5 141.75 279.57 1.97
V100-PCIE-16GB llama 7B Q3_K_M pp6 169.70 302.38 1.78
V100-PCIE-16GB llama 7B Q3_K_M pp7 196.81 320.81 1.63
V100-PCIE-16GB llama 7B Q3_K_M pp8 222.65 320.36 1.44
V100-PCIE-16GB llama 7B Q3_K_M tg1 87.63 87.35 1.00
V100-PCIE-16GB llama 7B Q4_0 pp2 208.03 222.05 1.07
V100-PCIE-16GB llama 7B Q4_0 pp3 265.53 274.66 1.03
V100-PCIE-16GB llama 7B Q4_0 pp4 308.18 329.83 1.07
V100-PCIE-16GB llama 7B Q4_0 pp5 290.75 394.38 1.36
V100-PCIE-16GB llama 7B Q4_0 pp6 346.87 438.34 1.26
V100-PCIE-16GB llama 7B Q4_0 pp7 399.16 462.65 1.16
V100-PCIE-16GB llama 7B Q4_0 pp8 455.37 488.27 1.07
V100-PCIE-16GB llama 7B Q4_0 tg1 113.33 114.01 1.01
V100-PCIE-16GB llama 7B Q4_1 pp2 214.30 215.34 1.00
V100-PCIE-16GB llama 7B Q4_1 pp3 282.61 268.95 0.95
V100-PCIE-16GB llama 7B Q4_1 pp4 328.24 327.83 1.00
V100-PCIE-16GB llama 7B Q4_1 pp5 307.88 396.97 1.29
V100-PCIE-16GB llama 7B Q4_1 pp6 368.81 447.44 1.21
V100-PCIE-16GB llama 7B Q4_1 pp7 425.87 483.65 1.14
V100-PCIE-16GB llama 7B Q4_1 pp8 483.55 502.72 1.04
V100-PCIE-16GB llama 7B Q4_1 tg1 107.61 108.37 1.01
V100-PCIE-16GB llama 7B Q4_K_M pp2 180.23 188.33 1.04
V100-PCIE-16GB llama 7B Q4_K_M pp3 227.60 234.35 1.03
V100-PCIE-16GB llama 7B Q4_K_M pp4 268.04 267.35 1.00
V100-PCIE-16GB llama 7B Q4_K_M pp5 229.58 303.87 1.32
V100-PCIE-16GB llama 7B Q4_K_M pp6 273.90 316.38 1.16
V100-PCIE-16GB llama 7B Q4_K_M pp7 317.11 328.78 1.04
V100-PCIE-16GB llama 7B Q4_K_M pp8 360.99 306.84 0.85
V100-PCIE-16GB llama 7B Q4_K_M tg1 106.52 106.79 1.00
V100-PCIE-16GB llama 7B Q5_0 pp2 188.48 197.34 1.05
V100-PCIE-16GB llama 7B Q5_0 pp3 250.35 250.74 1.00
V100-PCIE-16GB llama 7B Q5_0 pp4 295.23 303.51 1.03
V100-PCIE-16GB llama 7B Q5_0 pp5 173.57 362.25 2.09
V100-PCIE-16GB llama 7B Q5_0 pp6 207.53 404.44 1.95
V100-PCIE-16GB llama 7B Q5_0 pp7 240.17 433.96 1.81
V100-PCIE-16GB llama 7B Q5_0 pp8 273.25 457.64 1.67
V100-PCIE-16GB llama 7B Q5_0 tg1 101.13 100.32 0.99
V100-PCIE-16GB llama 7B Q5_1 pp2 194.37 193.89 1.00
V100-PCIE-16GB llama 7B Q5_1 pp3 261.26 247.25 0.95
V100-PCIE-16GB llama 7B Q5_1 pp4 316.47 305.25 0.96
V100-PCIE-16GB llama 7B Q5_1 pp5 220.07 367.51 1.67
V100-PCIE-16GB llama 7B Q5_1 pp6 262.35 417.17 1.59
V100-PCIE-16GB llama 7B Q5_1 pp7 303.87 453.54 1.49
V100-PCIE-16GB llama 7B Q5_1 pp8 345.11 475.82 1.38
V100-PCIE-16GB llama 7B Q5_1 tg1 98.03 97.04 0.99
V100-PCIE-16GB llama 7B Q5_K_M pp2 170.46 180.16 1.06
V100-PCIE-16GB llama 7B Q5_K_M pp3 219.73 221.50 1.01
V100-PCIE-16GB llama 7B Q5_K_M pp4 257.19 256.52 1.00
V100-PCIE-16GB llama 7B Q5_K_M pp5 195.42 285.05 1.46
V100-PCIE-16GB llama 7B Q5_K_M pp6 233.21 311.29 1.33
V100-PCIE-16GB llama 7B Q5_K_M pp7 270.92 323.66 1.19
V100-PCIE-16GB llama 7B Q5_K_M pp8 307.21 305.84 1.00
V100-PCIE-16GB llama 7B Q5_K_M tg1 98.42 98.05 1.00
V100-PCIE-16GB llama 7B Q6_K pp2 158.32 166.40 1.05
V100-PCIE-16GB llama 7B Q6_K pp3 212.05 202.76 0.96
V100-PCIE-16GB llama 7B Q6_K pp4 254.27 249.03 0.98
V100-PCIE-16GB llama 7B Q6_K pp5 183.15 284.54 1.55
V100-PCIE-16GB llama 7B Q6_K pp6 219.96 312.81 1.42
V100-PCIE-16GB llama 7B Q6_K pp7 254.55 336.80 1.32
V100-PCIE-16GB llama 7B Q6_K pp8 289.57 343.74 1.19
V100-PCIE-16GB llama 7B Q6_K tg1 88.67 85.51 0.96
V100-PCIE-16GB llama 7B Q8_0 pp2 157.63 155.48 0.99
V100-PCIE-16GB llama 7B Q8_0 pp3 220.47 226.89 1.03
V100-PCIE-16GB llama 7B Q8_0 pp4 275.39 232.02 0.84
V100-PCIE-16GB llama 7B Q8_0 pp5 208.74 344.63 1.65
V100-PCIE-16GB llama 7B Q8_0 pp6 249.54 314.36 1.26
V100-PCIE-16GB llama 7B Q8_0 pp7 287.99 346.31 1.20
V100-PCIE-16GB llama 7B Q8_0 pp8 326.89 381.50 1.17
V100-PCIE-16GB llama 7B Q8_0 tg1 80.03 79.59 0.99
RTX 2060
GPU Model Test t/s master t/s cuda-faster-mmvq-12 Speedup
RTX 2060 SUPER llama 7B Q2_K_M pp2 58.75 80.00 1.36
RTX 2060 SUPER llama 7B Q2_K_M pp3 67.90 97.80 1.44
RTX 2060 SUPER llama 7B Q2_K_M pp4 72.91 109.34 1.50
RTX 2060 SUPER llama 7B Q2_K_M pp5 62.58 118.57 1.89
RTX 2060 SUPER llama 7B Q2_K_M pp6 74.86 125.29 1.67
RTX 2060 SUPER llama 7B Q2_K_M pp7 86.91 127.44 1.47
RTX 2060 SUPER llama 7B Q2_K_M pp8 99.01 125.91 1.27
RTX 2060 SUPER llama 7B Q2_K_M tg1 42.10 42.00 1.00
RTX 2060 SUPER llama 7B Q3_K_M pp2 64.35 86.20 1.34
RTX 2060 SUPER llama 7B Q3_K_M pp3 73.79 105.87 1.43
RTX 2060 SUPER llama 7B Q3_K_M pp4 77.76 118.37 1.52
RTX 2060 SUPER llama 7B Q3_K_M pp5 65.19 128.61 1.97
RTX 2060 SUPER llama 7B Q3_K_M pp6 77.93 134.87 1.73
RTX 2060 SUPER llama 7B Q3_K_M pp7 90.41 140.51 1.55
RTX 2060 SUPER llama 7B Q3_K_M pp8 102.97 136.35 1.32
RTX 2060 SUPER llama 7B Q3_K_M tg1 46.92 47.24 1.01
RTX 2060 SUPER llama 7B Q4_0 pp2 84.83 114.81 1.35
RTX 2060 SUPER llama 7B Q4_0 pp3 98.40 140.43 1.43
RTX 2060 SUPER llama 7B Q4_0 pp4 105.53 157.20 1.49
RTX 2060 SUPER llama 7B Q4_0 pp5 109.63 175.16 1.60
RTX 2060 SUPER llama 7B Q4_0 pp6 130.49 184.85 1.42
RTX 2060 SUPER llama 7B Q4_0 pp7 150.60 192.75 1.28
RTX 2060 SUPER llama 7B Q4_0 pp8 170.64 197.45 1.16
RTX 2060 SUPER llama 7B Q4_0 tg1 59.59 59.49 1.00
RTX 2060 SUPER llama 7B Q4_1 pp2 96.11 137.05 1.43
RTX 2060 SUPER llama 7B Q4_1 pp3 107.75 162.65 1.51
RTX 2060 SUPER llama 7B Q4_1 pp4 113.64 177.47 1.56
RTX 2060 SUPER llama 7B Q4_1 pp5 121.51 195.44 1.61
RTX 2060 SUPER llama 7B Q4_1 pp6 144.49 202.78 1.40
RTX 2060 SUPER llama 7B Q4_1 pp7 166.61 209.18 1.26
RTX 2060 SUPER llama 7B Q4_1 pp8 188.75 213.23 1.13
RTX 2060 SUPER llama 7B Q4_1 tg1 71.87 72.20 1.00
RTX 2060 SUPER llama 7B Q4_K_M pp2 79.71 106.06 1.33
RTX 2060 SUPER llama 7B Q4_K_M pp3 89.99 127.14 1.41
RTX 2060 SUPER llama 7B Q4_K_M pp4 92.47 139.47 1.51
RTX 2060 SUPER llama 7B Q4_K_M pp5 76.55 153.52 2.01
RTX 2060 SUPER llama 7B Q4_K_M pp6 91.32 156.83 1.72
RTX 2060 SUPER llama 7B Q4_K_M pp7 105.73 161.33 1.53
RTX 2060 SUPER llama 7B Q4_K_M pp8 120.18 157.17 1.31
RTX 2060 SUPER llama 7B Q4_K_M tg1 61.31 61.02 1.00
RTX 2060 SUPER llama 7B Q5_0 pp2 79.60 104.11 1.31
RTX 2060 SUPER llama 7B Q5_0 pp3 93.66 129.54 1.38
RTX 2060 SUPER llama 7B Q5_0 pp4 101.23 147.16 1.45
RTX 2060 SUPER llama 7B Q5_0 pp5 73.87 164.77 2.23
RTX 2060 SUPER llama 7B Q5_0 pp6 87.79 175.09 1.99
RTX 2060 SUPER llama 7B Q5_0 pp7 101.75 183.20 1.80
RTX 2060 SUPER llama 7B Q5_0 pp8 115.66 189.43 1.64
RTX 2060 SUPER llama 7B Q5_0 tg1 54.56 54.55 1.00
RTX 2060 SUPER llama 7B Q5_1 pp2 92.12 127.43 1.38
RTX 2060 SUPER llama 7B Q5_1 pp3 104.44 153.25 1.47
RTX 2060 SUPER llama 7B Q5_1 pp4 110.79 168.95 1.52
RTX 2060 SUPER llama 7B Q5_1 pp5 84.41 187.45 2.22
RTX 2060 SUPER llama 7B Q5_1 pp6 100.61 195.62 1.94
RTX 2060 SUPER llama 7B Q5_1 pp7 116.38 202.01 1.74
RTX 2060 SUPER llama 7B Q5_1 pp8 132.15 206.04 1.56
RTX 2060 SUPER llama 7B Q5_1 tg1 65.47 65.55 1.00
RTX 2060 SUPER llama 7B Q5_K_M pp2 75.95 98.08 1.29
RTX 2060 SUPER llama 7B Q5_K_M pp3 86.24 120.57 1.40
RTX 2060 SUPER llama 7B Q5_K_M pp4 90.16 134.54 1.49
RTX 2060 SUPER llama 7B Q5_K_M pp5 62.82 147.82 2.35
RTX 2060 SUPER llama 7B Q5_K_M pp6 75.02 152.62 2.03
RTX 2060 SUPER llama 7B Q5_K_M pp7 86.97 157.05 1.81
RTX 2060 SUPER llama 7B Q5_K_M pp8 98.91 147.40 1.49
RTX 2060 SUPER llama 7B Q5_K_M tg1 56.78 56.94 1.00
RTX 2060 SUPER llama 7B Q6_K pp2 72.16 92.78 1.29
RTX 2060 SUPER llama 7B Q6_K pp3 86.01 116.59 1.36
RTX 2060 SUPER llama 7B Q6_K pp4 93.80 131.79 1.40
RTX 2060 SUPER llama 7B Q6_K pp5 62.46 147.64 2.36
RTX 2060 SUPER llama 7B Q6_K pp6 74.64 158.85 2.13
RTX 2060 SUPER llama 7B Q6_K pp7 86.56 166.44 1.92
RTX 2060 SUPER llama 7B Q6_K pp8 98.49 167.56 1.70
RTX 2060 SUPER llama 7B Q6_K tg1 48.87 48.87 1.00
RTX 2060 SUPER llama 7B Q8_0 pp2 67.99 83.84 1.23
RTX 2060 SUPER llama 7B Q8_0 pp3 83.04 108.38 1.31
RTX 2060 SUPER llama 7B Q8_0 pp4 92.85 127.72 1.38
RTX 2060 SUPER llama 7B Q8_0 pp5 71.57 144.96 2.03
RTX 2060 SUPER llama 7B Q8_0 pp6 85.40 157.54 1.84
RTX 2060 SUPER llama 7B Q8_0 pp7 98.86 167.84 1.70
RTX 2060 SUPER llama 7B Q8_0 pp8 112.37 175.67 1.56
RTX 2060 SUPER llama 7B Q8_0 tg1 43.11 42.96 1.00
A100 SXM 80GB
GPU Model Test t/s master t/s cuda-faster-mmvq-12 Speedup
NVIDIA A100-SXM4-80GB llama 7B F16 pp2 150.76 149.17 0.99
NVIDIA A100-SXM4-80GB llama 7B F16 pp3 223.54 223.38 1.00
NVIDIA A100-SXM4-80GB llama 7B F16 pp4 296.48 296.25 1.00
NVIDIA A100-SXM4-80GB llama 7B F16 pp5 365.36 366.40 1.00
NVIDIA A100-SXM4-80GB llama 7B F16 pp6 439.07 438.45 1.00
NVIDIA A100-SXM4-80GB llama 7B F16 pp7 507.90 507.55 1.00
NVIDIA A100-SXM4-80GB llama 7B F16 pp8 575.19 575.26 1.00
NVIDIA A100-SXM4-80GB llama 7B F16 tg1 75.68 75.90 1.00
NVIDIA A100-SXM4-80GB llama 7B Q4_0 pp2 249.77 270.29 1.08
NVIDIA A100-SXM4-80GB llama 7B Q4_0 pp3 325.09 344.97 1.06
NVIDIA A100-SXM4-80GB llama 7B Q4_0 pp4 380.96 408.08 1.07
NVIDIA A100-SXM4-80GB llama 7B Q4_0 pp5 327.22 503.31 1.54
NVIDIA A100-SXM4-80GB llama 7B Q4_0 pp6 391.34 552.92 1.41
NVIDIA A100-SXM4-80GB llama 7B Q4_0 pp7 453.74 606.01 1.34
NVIDIA A100-SXM4-80GB llama 7B Q4_0 pp8 513.29 645.02 1.26
NVIDIA A100-SXM4-80GB llama 7B Q4_0 tg1 147.56 144.64 0.98
NVIDIA A100-SXM4-80GB llama 7B Q4_K_M pp2 210.76 224.22 1.06
NVIDIA A100-SXM4-80GB llama 7B Q4_K_M pp3 271.65 280.83 1.03
NVIDIA A100-SXM4-80GB llama 7B Q4_K_M pp4 324.97 324.34 1.00
NVIDIA A100-SXM4-80GB llama 7B Q4_K_M pp5 252.72 368.59 1.46
NVIDIA A100-SXM4-80GB llama 7B Q4_K_M pp6 301.31 390.93 1.30
NVIDIA A100-SXM4-80GB llama 7B Q4_K_M pp7 349.96 421.75 1.21
NVIDIA A100-SXM4-80GB llama 7B Q4_K_M pp8 397.32 392.84 0.99
NVIDIA A100-SXM4-80GB llama 7B Q4_K_M tg1 132.61 131.79 0.99
NVIDIA A100-SXM4-80GB llama 7B Q8_0 pp2 212.95 212.05 1.00
NVIDIA A100-SXM4-80GB llama 7B Q8_0 pp3 278.16 298.92 1.07
NVIDIA A100-SXM4-80GB llama 7B Q8_0 pp4 335.66 350.48 1.04
NVIDIA A100-SXM4-80GB llama 7B Q8_0 pp5 213.38 429.12 2.01
NVIDIA A100-SXM4-80GB llama 7B Q8_0 pp6 255.16 466.45 1.83
NVIDIA A100-SXM4-80GB llama 7B Q8_0 pp7 296.35 520.85 1.76
NVIDIA A100-SXM4-80GB llama 7B Q8_0 pp8 336.44 544.94 1.62
NVIDIA A100-SXM4-80GB llama 7B Q8_0 tg1 117.88 116.43 0.99
GPU Model Test t/s master t/s cuda-faster-mmvq-12 Speedup
NVIDIA A100-SXM4-80GB llama 34B F16 pp2 41.38 41.60 1.01
NVIDIA A100-SXM4-80GB llama 34B F16 pp3 61.81 62.11 1.00
NVIDIA A100-SXM4-80GB llama 34B F16 pp4 82.02 82.34 1.00
NVIDIA A100-SXM4-80GB llama 34B F16 pp5 101.54 102.04 1.00
NVIDIA A100-SXM4-80GB llama 34B F16 pp6 121.74 122.36 1.01
NVIDIA A100-SXM4-80GB llama 34B F16 pp7 141.30 142.10 1.01
NVIDIA A100-SXM4-80GB llama 34B F16 pp8 160.74 161.77 1.01
NVIDIA A100-SXM4-80GB llama 34B F16 tg1 20.15 20.28 1.01
NVIDIA A100-SXM4-80GB llama 34B Q4_0 pp2 84.95 101.70 1.20
NVIDIA A100-SXM4-80GB llama 34B Q4_0 pp3 104.58 126.68 1.21
NVIDIA A100-SXM4-80GB llama 34B Q4_0 pp4 117.25 147.78 1.26
NVIDIA A100-SXM4-80GB llama 34B Q4_0 pp5 97.76 165.73 1.70
NVIDIA A100-SXM4-80GB llama 34B Q4_0 pp6 116.50 184.20 1.58
NVIDIA A100-SXM4-80GB llama 34B Q4_0 pp7 134.95 196.43 1.46
NVIDIA A100-SXM4-80GB llama 34B Q4_0 pp8 153.24 203.31 1.33
NVIDIA A100-SXM4-80GB llama 34B Q4_0 tg1 52.42 53.38 1.02
NVIDIA A100-SXM4-80GB llama 34B Q4_K_M pp2 67.04 73.89 1.10
NVIDIA A100-SXM4-80GB llama 34B Q4_K_M pp3 81.53 89.68 1.10
NVIDIA A100-SXM4-80GB llama 34B Q4_K_M pp4 92.82 98.07 1.06
NVIDIA A100-SXM4-80GB llama 34B Q4_K_M pp5 69.55 101.31 1.46
NVIDIA A100-SXM4-80GB llama 34B Q4_K_M pp6 83.01 106.89 1.29
NVIDIA A100-SXM4-80GB llama 34B Q4_K_M pp7 96.22 114.08 1.19
NVIDIA A100-SXM4-80GB llama 34B Q4_K_M pp8 109.51 106.09 0.97
NVIDIA A100-SXM4-80GB llama 34B Q4_K_M tg1 46.01 46.76 1.02
NVIDIA A100-SXM4-80GB llama 34B Q8_0 pp2 63.65 65.29 1.03
NVIDIA A100-SXM4-80GB llama 34B Q8_0 pp3 81.48 93.98 1.15
NVIDIA A100-SXM4-80GB llama 34B Q8_0 pp4 95.94 109.43 1.14
NVIDIA A100-SXM4-80GB llama 34B Q8_0 pp5 60.11 136.84 2.28
NVIDIA A100-SXM4-80GB llama 34B Q8_0 pp6 71.86 136.30 1.90
NVIDIA A100-SXM4-80GB llama 34B Q8_0 pp7 83.40 152.17 1.82
NVIDIA A100-SXM4-80GB llama 34B Q8_0 pp8 94.80 161.81 1.71
NVIDIA A100-SXM4-80GB llama 34B Q8_0 tg1 34.94 35.34 1.01

@Artefact2
Collaborator

Artefact2 commented Feb 10, 2024

Edit: updated to reflect changes after 76a0128.

GPU Model Test t/s master t/s pr Speedup
RX 6750 XT llama 13B Q4_K_S pp1 34.60 34.48 1.00
RX 6750 XT llama 13B Q4_K_S pp2 58.07 57.96 1.00
RX 6750 XT llama 13B Q4_K_S pp3 71.74 72.00 1.00
RX 6750 XT llama 13B Q4_K_S pp4 79.94 69.98 0.88
RX 6750 XT llama 13B Q4_K_S pp5 41.94 75.04 1.79
RX 6750 XT llama 13B Q4_K_S pp6 50.29 75.24 1.50
RX 6750 XT llama 13B Q4_K_S pp7 58.58 76.32 1.30
RX 6750 XT llama 13B Q4_K_S pp8 66.30 75.84 1.14

@JohannesGaessler
Collaborator Author

JohannesGaessler commented Feb 10, 2024

My results:

GPU Model Batch size Test t/s b2110 t/s cuda-faster-mmvq-12 Speedup
RTX 3090 llama 7B Q2_K_M 1 pp512 104.87 104.54 1.00
RTX 3090 llama 7B Q2_K_M 2 pp512 180.04 197.06 1.09
RTX 3090 llama 7B Q2_K_M 3 pp512 230.69 244.44 1.06
RTX 3090 llama 7B Q2_K_M 4 pp512 264.60 296.03 1.12
RTX 3090 llama 7B Q2_K_M 5 pp512 151.74 333.70 2.20
RTX 3090 llama 7B Q2_K_M 6 pp512 182.40 372.55 2.04
RTX 3090 llama 7B Q2_K_M 7 pp512 211.51 388.74 1.84
RTX 3090 llama 7B Q2_K_M 8 pp512 237.83 411.63 1.73
RTX 3090 llama 7B Q3_K_S 1 pp512 99.08 98.39 0.99
RTX 3090 llama 7B Q3_K_S 2 pp512 173.08 191.13 1.10
RTX 3090 llama 7B Q3_K_S 3 pp512 226.19 239.33 1.06
RTX 3090 llama 7B Q3_K_S 4 pp512 256.83 288.21 1.12
RTX 3090 llama 7B Q3_K_S 5 pp512 135.17 328.73 2.43
RTX 3090 llama 7B Q3_K_S 6 pp512 162.96 370.99 2.28
RTX 3090 llama 7B Q3_K_S 7 pp512 189.62 388.38 2.05
RTX 3090 llama 7B Q3_K_S 8 pp512 217.31 411.25 1.89
RTX 3090 llama 7B Q4_0 1 pp512 133.30 132.36 0.99
RTX 3090 llama 7B Q4_0 2 pp512 255.03 257.19 1.01
RTX 3090 llama 7B Q4_0 3 pp512 327.89 364.65 1.11
RTX 3090 llama 7B Q4_0 4 pp512 376.89 449.99 1.19
RTX 3090 llama 7B Q4_0 5 pp512 314.85 471.48 1.50
RTX 3090 llama 7B Q4_0 6 pp512 375.46 512.72 1.37
RTX 3090 llama 7B Q4_0 7 pp512 431.75 545.70 1.26
RTX 3090 llama 7B Q4_0 8 pp512 492.88 564.13 1.14
RTX 3090 llama 7B Q4_1 1 pp512 125.06 124.14 0.99
RTX 3090 llama 7B Q4_1 2 pp512 242.06 243.08 1.00
RTX 3090 llama 7B Q4_1 3 pp512 333.16 347.79 1.04
RTX 3090 llama 7B Q4_1 4 pp512 386.49 442.75 1.15
RTX 3090 llama 7B Q4_1 5 pp512 312.85 449.88 1.44
RTX 3090 llama 7B Q4_1 6 pp512 372.42 519.18 1.39
RTX 3090 llama 7B Q4_1 7 pp512 428.98 551.23 1.28
RTX 3090 llama 7B Q4_1 8 pp512 489.17 576.17 1.18
RTX 3090 llama 7B Q4_K_S 1 pp512 128.57 128.03 1.00
RTX 3090 llama 7B Q4_K_S 2 pp512 215.26 232.36 1.08
RTX 3090 llama 7B Q4_K_S 3 pp512 260.32 281.56 1.08
RTX 3090 llama 7B Q4_K_S 4 pp512 297.08 315.09 1.06
RTX 3090 llama 7B Q4_K_S 5 pp512 213.26 344.28 1.61
RTX 3090 llama 7B Q4_K_S 6 pp512 255.48 365.24 1.43
RTX 3090 llama 7B Q4_K_S 7 pp512 295.25 376.35 1.27
RTX 3090 llama 7B Q4_K_S 8 pp512 337.00 382.78 1.14
RTX 3090 llama 7B Q5_0 1 pp512 114.97 114.96 1.00
RTX 3090 llama 7B Q5_0 2 pp512 221.84 224.29 1.01
RTX 3090 llama 7B Q5_0 3 pp512 293.43 319.36 1.09
RTX 3090 llama 7B Q5_0 4 pp512 343.44 395.16 1.15
RTX 3090 llama 7B Q5_0 5 pp512 190.53 437.49 2.30
RTX 3090 llama 7B Q5_0 6 pp512 227.72 470.48 2.07
RTX 3090 llama 7B Q5_0 7 pp512 263.02 497.16 1.89
RTX 3090 llama 7B Q5_0 8 pp512 301.11 524.95 1.74
RTX 3090 llama 7B Q5_1 1 pp512 110.57 110.21 1.00
RTX 3090 llama 7B Q5_1 2 pp512 215.07 215.68 1.00
RTX 3090 llama 7B Q5_1 3 pp512 300.93 309.78 1.03
RTX 3090 llama 7B Q5_1 4 pp512 357.76 394.02 1.10
RTX 3090 llama 7B Q5_1 5 pp512 227.96 424.77 1.86
RTX 3090 llama 7B Q5_1 6 pp512 270.60 472.59 1.75
RTX 3090 llama 7B Q5_1 7 pp512 312.60 505.26 1.62
RTX 3090 llama 7B Q5_1 8 pp512 358.38 538.93 1.50
RTX 3090 llama 7B Q5_K_S 1 pp512 114.87 114.56 1.00
RTX 3090 llama 7B Q5_K_S 2 pp512 198.63 211.88 1.07
RTX 3090 llama 7B Q5_K_S 3 pp512 245.66 265.23 1.08
RTX 3090 llama 7B Q5_K_S 4 pp512 284.78 299.73 1.05
RTX 3090 llama 7B Q5_K_S 5 pp512 166.82 324.17 1.94
RTX 3090 llama 7B Q5_K_S 6 pp512 199.68 348.82 1.75
RTX 3090 llama 7B Q5_K_S 7 pp512 230.96 360.74 1.56
RTX 3090 llama 7B Q5_K_S 8 pp512 264.08 382.89 1.45
RTX 3090 llama 7B Q6_K 1 pp512 100.58 99.02 0.98
RTX 3090 llama 7B Q6_K 2 pp512 174.96 188.78 1.08
RTX 3090 llama 7B Q6_K 3 pp512 214.58 248.29 1.16
RTX 3090 llama 7B Q6_K 4 pp512 258.43 298.17 1.15
RTX 3090 llama 7B Q6_K 5 pp512 160.14 327.65 2.05
RTX 3090 llama 7B Q6_K 6 pp512 192.18 347.11 1.81
RTX 3090 llama 7B Q6_K 7 pp512 222.59 367.12 1.65
RTX 3090 llama 7B Q6_K 8 pp512 254.48 382.59 1.50
RTX 3090 llama 7B Q8_0 1 pp512 87.99 87.79 1.00
RTX 3090 llama 7B Q8_0 2 pp512 172.17 170.96 0.99
RTX 3090 llama 7B Q8_0 3 pp512 245.41 247.56 1.01
RTX 3090 llama 7B Q8_0 4 pp512 305.27 321.88 1.05
RTX 3090 llama 7B Q8_0 5 pp512 198.83 386.36 1.94
RTX 3090 llama 7B Q8_0 6 pp512 237.56 431.32 1.82
RTX 3090 llama 7B Q8_0 7 pp512 274.28 476.62 1.74
RTX 3090 llama 7B Q8_0 8 pp512 313.71 444.36 1.42
RX 6800 llama 7B Q2_K_M 1 pp512 37.32 37.19 1.00
RX 6800 llama 7B Q2_K_M 2 pp512 59.68 63.05 1.06
RX 6800 llama 7B Q2_K_M 3 pp512 68.74 68.02 0.99
RX 6800 llama 7B Q2_K_M 4 pp512 70.51 78.75 1.12
RX 6800 llama 7B Q2_K_M 5 pp512 15.85 82.20 5.19
RX 6800 llama 7B Q2_K_M 6 pp512 18.98 85.83 4.52
RX 6800 llama 7B Q2_K_M 7 pp512 22.09 91.67 4.15
RX 6800 llama 7B Q2_K_M 8 pp512 25.20 93.60 3.71
RX 6800 llama 7B Q3_K_S 1 pp512 35.60 35.26 0.99
RX 6800 llama 7B Q3_K_S 2 pp512 57.08 60.60 1.06
RX 6800 llama 7B Q3_K_S 3 pp512 65.67 64.45 0.98
RX 6800 llama 7B Q3_K_S 4 pp512 66.75 75.52 1.13
RX 6800 llama 7B Q3_K_S 5 pp512 14.86 78.91 5.31
RX 6800 llama 7B Q3_K_S 6 pp512 17.81 82.49 4.63
RX 6800 llama 7B Q3_K_S 7 pp512 20.73 88.62 4.28
RX 6800 llama 7B Q3_K_S 8 pp512 23.65 90.73 3.84
RX 6800 llama 7B Q4_0 1 pp512 55.77 55.84 1.00
RX 6800 llama 7B Q4_0 2 pp512 104.72 105.24 1.01
RX 6800 llama 7B Q4_0 3 pp512 142.45 145.90 1.02
RX 6800 llama 7B Q4_0 4 pp512 148.88 168.23 1.13
RX 6800 llama 7B Q4_0 5 pp512 47.59 181.92 3.82
RX 6800 llama 7B Q4_0 6 pp512 56.98 179.13 3.14
RX 6800 llama 7B Q4_0 7 pp512 66.22 186.13 2.81
RX 6800 llama 7B Q4_0 8 pp512 75.60 188.91 2.50
RX 6800 llama 7B Q4_1 1 pp512 52.93 52.70 1.00
RX 6800 llama 7B Q4_1 2 pp512 101.82 101.08 0.99
RX 6800 llama 7B Q4_1 3 pp512 138.59 145.17 1.05
RX 6800 llama 7B Q4_1 4 pp512 153.78 171.55 1.12
RX 6800 llama 7B Q4_1 5 pp512 44.39 186.47 4.20
RX 6800 llama 7B Q4_1 6 pp512 53.15 180.34 3.39
RX 6800 llama 7B Q4_1 7 pp512 61.81 177.78 2.88
RX 6800 llama 7B Q4_1 8 pp512 70.54 191.20 2.71
RX 6800 llama 7B Q4_K_S 1 pp512 41.54 41.21 0.99
RX 6800 llama 7B Q4_K_S 2 pp512 69.36 69.12 1.00
RX 6800 llama 7B Q4_K_S 3 pp512 86.69 86.43 1.00
RX 6800 llama 7B Q4_K_S 4 pp512 98.13 91.29 0.93
RX 6800 llama 7B Q4_K_S 5 pp512 36.95 93.55 2.53
RX 6800 llama 7B Q4_K_S 6 pp512 44.26 100.04 2.26
RX 6800 llama 7B Q4_K_S 7 pp512 51.50 97.03 1.88
RX 6800 llama 7B Q4_K_S 8 pp512 58.80 101.18 1.72
RX 6800 llama 7B Q5_0 1 pp512 50.75 50.77 1.00
RX 6800 llama 7B Q5_0 2 pp512 94.68 95.50 1.01
RX 6800 llama 7B Q5_0 3 pp512 123.89 131.79 1.06
RX 6800 llama 7B Q5_0 4 pp512 152.71 150.08 0.98
RX 6800 llama 7B Q5_0 5 pp512 39.12 162.84 4.16
RX 6800 llama 7B Q5_0 6 pp512 46.88 165.97 3.54
RX 6800 llama 7B Q5_0 7 pp512 54.53 173.55 3.18
RX 6800 llama 7B Q5_0 8 pp512 62.24 180.50 2.90
RX 6800 llama 7B Q5_1 1 pp512 47.26 46.96 0.99
RX 6800 llama 7B Q5_1 2 pp512 90.61 90.70 1.00
RX 6800 llama 7B Q5_1 3 pp512 128.31 129.70 1.01
RX 6800 llama 7B Q5_1 4 pp512 153.13 156.90 1.02
RX 6800 llama 7B Q5_1 5 pp512 38.75 149.53 3.86
RX 6800 llama 7B Q5_1 6 pp512 46.36 149.81 3.23
RX 6800 llama 7B Q5_1 7 pp512 53.85 157.62 2.93
RX 6800 llama 7B Q5_1 8 pp512 61.41 170.94 2.78
RX 6800 llama 7B Q5_K_S 1 pp512 40.69 39.96 0.98
RX 6800 llama 7B Q5_K_S 2 pp512 67.55 67.17 0.99
RX 6800 llama 7B Q5_K_S 3 pp512 85.56 83.74 0.98
RX 6800 llama 7B Q5_K_S 4 pp512 89.91 89.35 0.99
RX 6800 llama 7B Q5_K_S 5 pp512 36.50 91.90 2.52
RX 6800 llama 7B Q5_K_S 6 pp512 43.71 94.94 2.17
RX 6800 llama 7B Q5_K_S 7 pp512 50.85 95.43 1.88
RX 6800 llama 7B Q5_K_S 8 pp512 58.04 100.00 1.72
RX 6800 llama 7B Q6_K 1 pp512 42.03 40.75 0.97
RX 6800 llama 7B Q6_K 2 pp512 71.45 70.03 0.98
RX 6800 llama 7B Q6_K 3 pp512 83.23 84.10 1.01
RX 6800 llama 7B Q6_K 4 pp512 94.47 94.64 1.00
RX 6800 llama 7B Q6_K 5 pp512 34.50 91.12 2.64
RX 6800 llama 7B Q6_K 6 pp512 41.34 97.24 2.35
RX 6800 llama 7B Q6_K 7 pp512 48.08 95.10 1.98
RX 6800 llama 7B Q6_K 8 pp512 54.87 98.73 1.80
RX 6800 llama 7B Q8_0 1 pp512 39.75 39.50 0.99
RX 6800 llama 7B Q8_0 2 pp512 77.31 77.66 1.00
RX 6800 llama 7B Q8_0 3 pp512 112.52 113.54 1.01
RX 6800 llama 7B Q8_0 4 pp512 145.41 145.58 1.00
RX 6800 llama 7B Q8_0 5 pp512 49.37 162.69 3.30
RX 6800 llama 7B Q8_0 6 pp512 59.11 167.74 2.84
RX 6800 llama 7B Q8_0 7 pp512 68.65 153.83 2.24
RX 6800 llama 7B Q8_0 8 pp512 78.44 149.39 1.90
P40 llama 7B Q2_K_M 1 pp512 46.14 45.86 0.99
P40 llama 7B Q2_K_M 2 pp512 47.63 51.93 1.09
P40 llama 7B Q2_K_M 3 pp512 60.66 66.59 1.10
P40 llama 7B Q2_K_M 4 pp512 69.11 80.19 1.16
P40 llama 7B Q2_K_M 5 pp512 32.73 89.18 2.73
P40 llama 7B Q2_K_M 6 pp512 39.21 95.79 2.44
P40 llama 7B Q2_K_M 7 pp512 45.66 105.42 2.31
P40 llama 7B Q2_K_M 8 pp512 52.15 110.68 2.12
P40 llama 7B Q3_K_S 1 pp512 44.53 44.08 0.99
P40 llama 7B Q3_K_S 2 pp512 46.98 50.70 1.08
P40 llama 7B Q3_K_S 3 pp512 59.63 65.28 1.09
P40 llama 7B Q3_K_S 4 pp512 68.34 79.74 1.17
P40 llama 7B Q3_K_S 5 pp512 32.13 88.03 2.74
P40 llama 7B Q3_K_S 6 pp512 38.50 94.31 2.45
P40 llama 7B Q3_K_S 7 pp512 44.84 104.68 2.33
P40 llama 7B Q3_K_S 8 pp512 51.22 110.23 2.15
P40 llama 7B Q4_0 1 pp512 56.03 56.18 1.00
P40 llama 7B Q4_0 2 pp512 57.61 61.33 1.06
P40 llama 7B Q4_0 3 pp512 77.67 85.10 1.10
P40 llama 7B Q4_0 4 pp512 93.15 99.33 1.07
P40 llama 7B Q4_0 5 pp512 51.43 109.35 2.13
P40 llama 7B Q4_0 6 pp512 61.48 119.47 1.94
P40 llama 7B Q4_0 7 pp512 71.56 136.09 1.90
P40 llama 7B Q4_0 8 pp512 81.76 146.96 1.80
P40 llama 7B Q4_1 1 pp512 53.76 54.33 1.01
P40 llama 7B Q4_1 2 pp512 57.89 60.14 1.04
P40 llama 7B Q4_1 3 pp512 78.39 85.49 1.09
P40 llama 7B Q4_1 4 pp512 93.70 99.48 1.06
P40 llama 7B Q4_1 5 pp512 52.91 109.74 2.07
P40 llama 7B Q4_1 6 pp512 63.31 123.05 1.94
P40 llama 7B Q4_1 7 pp512 73.66 133.98 1.82
P40 llama 7B Q4_1 8 pp512 84.12 143.23 1.70
P40 llama 7B Q4_K_S 1 pp512 50.57 50.74 1.00
P40 llama 7B Q4_K_S 2 pp512 52.59 55.98 1.06
P40 llama 7B Q4_K_S 3 pp512 68.48 76.67 1.12
P40 llama 7B Q4_K_S 4 pp512 79.16 87.57 1.11
P40 llama 7B Q4_K_S 5 pp512 48.76 94.77 1.94
P40 llama 7B Q4_K_S 6 pp512 58.36 98.64 1.69
P40 llama 7B Q4_K_S 7 pp512 67.95 106.44 1.57
P40 llama 7B Q4_K_S 8 pp512 77.64 110.11 1.42
P40 llama 7B Q5_0 1 pp512 46.78 46.71 1.00
P40 llama 7B Q5_0 2 pp512 52.55 54.40 1.04
P40 llama 7B Q5_0 3 pp512 71.39 77.92 1.09
P40 llama 7B Q5_0 4 pp512 87.87 92.61 1.05
P40 llama 7B Q5_0 5 pp512 48.04 101.96 2.12
P40 llama 7B Q5_0 6 pp512 57.53 115.38 2.01
P40 llama 7B Q5_0 7 pp512 67.02 128.91 1.92
P40 llama 7B Q5_0 8 pp512 76.49 135.95 1.78
P40 llama 7B Q5_1 1 pp512 47.42 47.39 1.00
P40 llama 7B Q5_1 2 pp512 53.38 55.31 1.04
P40 llama 7B Q5_1 3 pp512 74.02 78.81 1.06
P40 llama 7B Q5_1 4 pp512 90.20 94.59 1.05
P40 llama 7B Q5_1 5 pp512 50.87 103.23 2.03
P40 llama 7B Q5_1 6 pp512 60.77 119.00 1.96
P40 llama 7B Q5_1 7 pp512 70.75 129.42 1.83
P40 llama 7B Q5_1 8 pp512 80.86 139.49 1.73
P40 llama 7B Q5_K_S 1 pp512 43.18 43.02 1.00
P40 llama 7B Q5_K_S 2 pp512 48.59 51.10 1.05
P40 llama 7B Q5_K_S 3 pp512 63.97 70.66 1.10
P40 llama 7B Q5_K_S 4 pp512 75.91 83.20 1.10
P40 llama 7B Q5_K_S 5 pp512 44.85 90.29 2.01
P40 llama 7B Q5_K_S 6 pp512 53.69 95.14 1.77
P40 llama 7B Q5_K_S 7 pp512 62.51 102.60 1.64
P40 llama 7B Q5_K_S 8 pp512 71.44 106.62 1.49
P40 llama 7B Q6_K 1 pp512 35.69 35.56 1.00
P40 llama 7B Q6_K 2 pp512 42.62 45.75 1.07
P40 llama 7B Q6_K 3 pp512 56.62 64.39 1.14
P40 llama 7B Q6_K 4 pp512 68.53 77.78 1.13
P40 llama 7B Q6_K 5 pp512 45.81 82.52 1.80
P40 llama 7B Q6_K 6 pp512 54.76 93.50 1.71
P40 llama 7B Q6_K 7 pp512 63.79 104.36 1.64
P40 llama 7B Q6_K 8 pp512 72.97 105.85 1.45
P40 llama 7B Q8_0 1 pp512 35.76 35.52 0.99
P40 llama 7B Q8_0 2 pp512 44.17 45.28 1.03
P40 llama 7B Q8_0 3 pp512 62.10 65.53 1.06
P40 llama 7B Q8_0 4 pp512 77.46 75.20 0.97
P40 llama 7B Q8_0 5 pp512 51.14 92.18 1.80
P40 llama 7B Q8_0 6 pp512 61.26 94.09 1.54
P40 llama 7B Q8_0 7 pp512 71.24 106.27 1.49
P40 llama 7B Q8_0 8 pp512 81.43 118.83 1.46

I think I'll revert the part of the changes that causes a small regression for a batch size of 1. I did this to reduce register pressure, but for that batch size it does not seem to be beneficial on average.

@JohannesGaessler
Collaborator Author

I reverted part of the changes. This should fix the regression for a batch size of 1. It may be possible to squeeze out 1-2% more performance by utilizing bit shifts for pointer arithmetic but then you'd have to compile 8 times more kernels.

@ggerganov
Owner

ggerganov commented Feb 10, 2024

What is missing to get perfect scaling of the speed (i.e. S_b = b*S_1)?

Also, should we try to put an F16 switch in ggml_cuda_op_mul_mat_vec_q and see how it performs for b <= 8?

Nvm, it already scales almost perfectly for F16 when there is enough memory bandwidth:

Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes

| model        |       size |     params | backend | ngl | test |             t/s |
| ------------ | ---------: | ---------: | ------- | --: | ---- | --------------: |
| llama 7B F16 | 12.55 GiB  |     6.74 B | CUDA    |  99 | tg 1 |    76.17 ± 1.19 |
| llama 7B F16 | 12.55 GiB  |     6.74 B | CUDA    |  99 | pp 2 |   151.45 ± 2.02 |
| llama 7B F16 | 12.55 GiB  |     6.74 B | CUDA    |  99 | pp 3 |   225.81 ± 2.23 |
| llama 7B F16 | 12.55 GiB  |     6.74 B | CUDA    |  99 | pp 4 |   299.29 ± 1.98 |
| llama 7B F16 | 12.55 GiB  |     6.74 B | CUDA    |  99 | pp 5 |   370.48 ± 3.12 |
| llama 7B F16 | 12.55 GiB  |     6.74 B | CUDA    |  99 | pp 6 |   442.94 ± 4.98 |
| llama 7B F16 | 12.55 GiB  |     6.74 B | CUDA    |  99 | pp 7 |   514.47 ± 4.21 |
| llama 7B F16 | 12.55 GiB  |     6.74 B | CUDA    |  99 | pp 8 |   580.76 ± 3.36 |

build: 2bb97fc (2112)

@JohannesGaessler
Collaborator Author

I think that the biggest issue right now is that the CUDA grids are just too small. On my RTX 3090, for q8_0, a 4096x4096 weight matrix, and a batch size of 1, ~10% of the kernel runtime is lost to the initial latency when launching the kernel and another ~5% is lost to tail effects. That's why I think fusing the branching matrix multiplication kernels like the K/Q/V ones, which all use the same hidden state as input, would be beneficial. It would also allow you to save time on the conversion of the hidden state (something like 1% of the total runtime). A rough sketch of what such a fused launch could look like follows below.
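For illustration only, a minimal float sketch (hypothetical names, not code from this PR, and ignoring quantization) of such a fused launch: blockIdx.y selects which of the three weight matrices a block works on, so a single, roughly three times larger grid replaces three small launches and the launch latency and tail effects are paid only once.

```cuda
// Hypothetical sketch of fusing the Q/K/V mat-vec products that share one input x.
#include <cuda_runtime.h>

#define WARP_SIZE 32

struct fused_mat {
    const float * W;    // nrows x ncols, row-major
    float       * dst;  // nrows
    int           nrows;
};

struct fused_mats {
    fused_mat m[3];     // 0 = Q, 1 = K, 2 = V
};

// launch with dim3 grid(max_nrows, 3) and dim3 block(WARP_SIZE)
__global__ void mul_mat_vec_fused(const fused_mats mats, const float * __restrict__ x,
                                  const int ncols) {
    const fused_mat mat = mats.m[blockIdx.y]; // which of the three outputs this block computes
    const int row = blockIdx.x;
    if (row >= mat.nrows) {
        return; // the grid is sized for the largest of the three matrices
    }

    float sum = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += WARP_SIZE) {
        sum += mat.W[(long) row * ncols + col] * x[col];
    }
    // warp-level reduction; one 32-thread warp per block
    for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }
    if (threadIdx.x == 0) {
        mat.dst[row] = sum;
    }
}
```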

Other than that, you could potentially get better performance by fundamentally changing the data layout. If you were to separate the quantized data into blocks that contain only the quantized values or only the scales, you should be able to load it more efficiently. But in the prototypes where I tried this, the performance did not improve. Also, an identical data layout between backends makes the code much easier to work with.

Or write a completely new kernel since mul_mat_vec_q doesn't seem to scale well. But so far I have not been able to make something better.

@JohannesGaessler
Collaborator Author

When it comes to scaling in particular, the issue could also be related to the amount of compute. The scaling as the batch size increases is much worse on my P40s than on my RTX 3090, and I think the reason is that modern cards simply have much more compute relative to their memory bandwidth.

@JohannesGaessler
Collaborator Author

With the current version I get this performance:

GPU Model Batch size Test t/s b2110 t/s cuda-faster-mmvq-12 Speedup
RTX 3090 llama 7B Q2_K_M 1 pp512 104.87 105.29 1.00
RTX 3090 llama 7B Q2_K_M 2 pp512 180.04 190.78 1.06
RTX 3090 llama 7B Q2_K_M 3 pp512 230.69 250.60 1.09
RTX 3090 llama 7B Q2_K_M 4 pp512 264.60 284.61 1.08
RTX 3090 llama 7B Q2_K_M 5 pp512 151.74 341.47 2.25
RTX 3090 llama 7B Q2_K_M 6 pp512 182.40 367.78 2.02
RTX 3090 llama 7B Q2_K_M 7 pp512 211.51 377.19 1.78
RTX 3090 llama 7B Q2_K_M 8 pp512 237.83 406.71 1.71
RTX 3090 llama 7B Q3_K_S 1 pp512 99.08 99.27 1.00
RTX 3090 llama 7B Q3_K_S 2 pp512 173.08 181.52 1.05
RTX 3090 llama 7B Q3_K_S 3 pp512 226.19 242.38 1.07
RTX 3090 llama 7B Q3_K_S 4 pp512 256.83 276.10 1.08
RTX 3090 llama 7B Q3_K_S 5 pp512 135.17 333.92 2.47
RTX 3090 llama 7B Q3_K_S 6 pp512 162.96 359.21 2.20
RTX 3090 llama 7B Q3_K_S 7 pp512 189.62 368.78 1.94
RTX 3090 llama 7B Q3_K_S 8 pp512 217.31 397.42 1.83
RTX 3090 llama 7B Q4_0 1 pp512 133.30 133.34 1.00
RTX 3090 llama 7B Q4_0 2 pp512 255.03 259.65 1.02
RTX 3090 llama 7B Q4_0 3 pp512 327.89 365.69 1.12
RTX 3090 llama 7B Q4_0 4 pp512 376.89 433.86 1.15
RTX 3090 llama 7B Q4_0 5 pp512 314.85 492.87 1.57
RTX 3090 llama 7B Q4_0 6 pp512 375.46 518.17 1.38
RTX 3090 llama 7B Q4_0 7 pp512 431.75 544.65 1.26
RTX 3090 llama 7B Q4_0 8 pp512 492.88 569.05 1.15
RTX 3090 llama 7B Q4_1 1 pp512 125.06 125.05 1.00
RTX 3090 llama 7B Q4_1 2 pp512 242.06 244.52 1.01
RTX 3090 llama 7B Q4_1 3 pp512 333.16 347.10 1.04
RTX 3090 llama 7B Q4_1 4 pp512 386.49 427.97 1.11
RTX 3090 llama 7B Q4_1 5 pp512 312.85 479.48 1.53
RTX 3090 llama 7B Q4_1 6 pp512 372.42 527.78 1.42
RTX 3090 llama 7B Q4_1 7 pp512 428.98 557.31 1.30
RTX 3090 llama 7B Q4_1 8 pp512 489.17 579.48 1.18
RTX 3090 llama 7B Q4_K_S 1 pp512 128.57 128.05 1.00
RTX 3090 llama 7B Q4_K_S 2 pp512 215.26 227.61 1.06
RTX 3090 llama 7B Q4_K_S 3 pp512 260.32 279.55 1.07
RTX 3090 llama 7B Q4_K_S 4 pp512 297.08 314.50 1.06
RTX 3090 llama 7B Q4_K_S 5 pp512 213.26 344.82 1.62
RTX 3090 llama 7B Q4_K_S 6 pp512 255.48 359.98 1.41
RTX 3090 llama 7B Q4_K_S 7 pp512 295.25 371.54 1.26
RTX 3090 llama 7B Q4_K_S 8 pp512 337.00 389.45 1.16
RTX 3090 llama 7B Q5_0 1 pp512 114.97 115.96 1.01
RTX 3090 llama 7B Q5_0 2 pp512 221.84 225.69 1.02
RTX 3090 llama 7B Q5_0 3 pp512 293.43 310.50 1.06
RTX 3090 llama 7B Q5_0 4 pp512 343.44 381.37 1.11
RTX 3090 llama 7B Q5_0 5 pp512 190.53 441.30 2.32
RTX 3090 llama 7B Q5_0 6 pp512 227.72 469.65 2.06
RTX 3090 llama 7B Q5_0 7 pp512 263.02 498.46 1.90
RTX 3090 llama 7B Q5_0 8 pp512 301.11 523.58 1.74
RTX 3090 llama 7B Q5_1 1 pp512 110.57 110.96 1.00
RTX 3090 llama 7B Q5_1 2 pp512 215.07 216.84 1.01
RTX 3090 llama 7B Q5_1 3 pp512 300.93 308.70 1.03
RTX 3090 llama 7B Q5_1 4 pp512 357.76 386.55 1.08
RTX 3090 llama 7B Q5_1 5 pp512 227.96 427.41 1.87
RTX 3090 llama 7B Q5_1 6 pp512 270.60 478.14 1.77
RTX 3090 llama 7B Q5_1 7 pp512 312.60 508.13 1.63
RTX 3090 llama 7B Q5_1 8 pp512 358.38 534.49 1.49
RTX 3090 llama 7B Q5_K_S 1 pp512 114.87 114.31 1.00
RTX 3090 llama 7B Q5_K_S 2 pp512 198.63 212.01 1.07
RTX 3090 llama 7B Q5_K_S 3 pp512 245.66 258.59 1.05
RTX 3090 llama 7B Q5_K_S 4 pp512 284.78 299.15 1.05
RTX 3090 llama 7B Q5_K_S 5 pp512 166.82 328.37 1.97
RTX 3090 llama 7B Q5_K_S 6 pp512 199.68 344.99 1.73
RTX 3090 llama 7B Q5_K_S 7 pp512 230.96 358.06 1.55
RTX 3090 llama 7B Q5_K_S 8 pp512 264.08 381.06 1.44
RTX 3090 llama 7B Q6_K 1 pp512 100.58 100.73 1.00
RTX 3090 llama 7B Q6_K 2 pp512 174.96 189.92 1.09
RTX 3090 llama 7B Q6_K 3 pp512 214.58 254.95 1.19
RTX 3090 llama 7B Q6_K 4 pp512 258.43 304.35 1.18
RTX 3090 llama 7B Q6_K 5 pp512 160.14 334.40 2.09
RTX 3090 llama 7B Q6_K 6 pp512 192.18 362.71 1.89
RTX 3090 llama 7B Q6_K 7 pp512 222.59 380.24 1.71
RTX 3090 llama 7B Q6_K 8 pp512 254.48 399.92 1.57
RTX 3090 llama 7B Q8_0 1 pp512 87.99 88.35 1.00
RTX 3090 llama 7B Q8_0 2 pp512 172.17 171.86 1.00
RTX 3090 llama 7B Q8_0 3 pp512 245.41 249.58 1.02
RTX 3090 llama 7B Q8_0 4 pp512 305.27 323.12 1.06
RTX 3090 llama 7B Q8_0 5 pp512 198.83 387.22 1.95
RTX 3090 llama 7B Q8_0 6 pp512 237.56 445.77 1.88
RTX 3090 llama 7B Q8_0 7 pp512 274.28 480.16 1.75
RTX 3090 llama 7B Q8_0 8 pp512 313.71 449.42 1.43
RX 6800 llama 7B Q2_K_M 1 pp512 37.32 37.53 1.01
RX 6800 llama 7B Q2_K_M 2 pp512 59.68 59.94 1.00
RX 6800 llama 7B Q2_K_M 3 pp512 68.74 68.84 1.00
RX 6800 llama 7B Q2_K_M 4 pp512 70.51 70.61 1.00
RX 6800 llama 7B Q2_K_M 5 pp512 15.85 83.64 5.28
RX 6800 llama 7B Q2_K_M 6 pp512 18.98 88.06 4.64
RX 6800 llama 7B Q2_K_M 7 pp512 22.09 91.44 4.14
RX 6800 llama 7B Q2_K_M 8 pp512 25.20 95.04 3.77
RX 6800 llama 7B Q3_K_S 1 pp512 35.60 35.69 1.00
RX 6800 llama 7B Q3_K_S 2 pp512 57.08 57.13 1.00
RX 6800 llama 7B Q3_K_S 3 pp512 65.67 65.67 1.00
RX 6800 llama 7B Q3_K_S 4 pp512 66.75 66.80 1.00
RX 6800 llama 7B Q3_K_S 5 pp512 14.86 80.69 5.43
RX 6800 llama 7B Q3_K_S 6 pp512 17.81 85.67 4.81
RX 6800 llama 7B Q3_K_S 7 pp512 20.73 88.72 4.28
RX 6800 llama 7B Q3_K_S 8 pp512 23.65 93.18 3.94
RX 6800 llama 7B Q4_0 1 pp512 55.77 56.14 1.01
RX 6800 llama 7B Q4_0 2 pp512 104.72 104.99 1.00
RX 6800 llama 7B Q4_0 3 pp512 142.45 142.61 1.00
RX 6800 llama 7B Q4_0 4 pp512 148.88 149.48 1.00
RX 6800 llama 7B Q4_0 5 pp512 47.59 143.98 3.03
RX 6800 llama 7B Q4_0 6 pp512 56.98 184.83 3.24
RX 6800 llama 7B Q4_0 7 pp512 66.22 184.79 2.79
RX 6800 llama 7B Q4_0 8 pp512 75.60 187.01 2.47
RX 6800 llama 7B Q4_1 1 pp512 52.93 53.04 1.00
RX 6800 llama 7B Q4_1 2 pp512 101.82 102.03 1.00
RX 6800 llama 7B Q4_1 3 pp512 138.59 139.26 1.00
RX 6800 llama 7B Q4_1 4 pp512 153.78 154.22 1.00
RX 6800 llama 7B Q4_1 5 pp512 44.39 158.16 3.56
RX 6800 llama 7B Q4_1 6 pp512 53.15 179.05 3.37
RX 6800 llama 7B Q4_1 7 pp512 61.81 180.45 2.92
RX 6800 llama 7B Q4_1 8 pp512 70.54 193.65 2.75
RX 6800 llama 7B Q4_K_S 1 pp512 41.54 41.75 1.01
RX 6800 llama 7B Q4_K_S 2 pp512 69.36 69.78 1.01
RX 6800 llama 7B Q4_K_S 3 pp512 86.69 87.24 1.01
RX 6800 llama 7B Q4_K_S 4 pp512 98.13 92.46 0.94
RX 6800 llama 7B Q4_K_S 5 pp512 36.95 100.07 2.71
RX 6800 llama 7B Q4_K_S 6 pp512 44.26 100.72 2.28
RX 6800 llama 7B Q4_K_S 7 pp512 51.50 101.85 1.98
RX 6800 llama 7B Q4_K_S 8 pp512 58.80 101.24 1.72
RX 6800 llama 7B Q5_0 1 pp512 50.75 50.74 1.00
RX 6800 llama 7B Q5_0 2 pp512 94.68 94.88 1.00
RX 6800 llama 7B Q5_0 3 pp512 123.89 124.90 1.01
RX 6800 llama 7B Q5_0 4 pp512 152.71 153.78 1.01
RX 6800 llama 7B Q5_0 5 pp512 39.12 165.17 4.22
RX 6800 llama 7B Q5_0 6 pp512 46.88 167.20 3.57
RX 6800 llama 7B Q5_0 7 pp512 54.53 173.96 3.19
RX 6800 llama 7B Q5_0 8 pp512 62.24 171.33 2.75
RX 6800 llama 7B Q5_1 1 pp512 47.26 47.29 1.00
RX 6800 llama 7B Q5_1 2 pp512 90.61 90.41 1.00
RX 6800 llama 7B Q5_1 3 pp512 128.31 128.15 1.00
RX 6800 llama 7B Q5_1 4 pp512 153.13 152.69 1.00
RX 6800 llama 7B Q5_1 5 pp512 38.75 139.67 3.60
RX 6800 llama 7B Q5_1 6 pp512 46.36 157.52 3.40
RX 6800 llama 7B Q5_1 7 pp512 53.85 157.72 2.93
RX 6800 llama 7B Q5_1 8 pp512 61.41 163.68 2.67
RX 6800 llama 7B Q5_K_S 1 pp512 40.69 40.89 1.01
RX 6800 llama 7B Q5_K_S 2 pp512 67.55 67.77 1.00
RX 6800 llama 7B Q5_K_S 3 pp512 85.56 86.28 1.01
RX 6800 llama 7B Q5_K_S 4 pp512 89.91 90.67 1.01
RX 6800 llama 7B Q5_K_S 5 pp512 36.50 92.29 2.53
RX 6800 llama 7B Q5_K_S 6 pp512 43.71 95.64 2.19
RX 6800 llama 7B Q5_K_S 7 pp512 50.85 100.85 1.98
RX 6800 llama 7B Q5_K_S 8 pp512 58.04 100.61 1.73
RX 6800 llama 7B Q6_K 1 pp512 42.03 42.04 1.00
RX 6800 llama 7B Q6_K 2 pp512 71.45 71.41 1.00
RX 6800 llama 7B Q6_K 3 pp512 83.23 85.62 1.03
RX 6800 llama 7B Q6_K 4 pp512 94.47 94.39 1.00
RX 6800 llama 7B Q6_K 5 pp512 34.50 88.24 2.56
RX 6800 llama 7B Q6_K 6 pp512 41.34 95.14 2.30
RX 6800 llama 7B Q6_K 7 pp512 48.08 91.53 1.90
RX 6800 llama 7B Q6_K 8 pp512 54.87 96.21 1.75
RX 6800 llama 7B Q8_0 1 pp512 39.75 39.81 1.00
RX 6800 llama 7B Q8_0 2 pp512 77.31 77.58 1.00
RX 6800 llama 7B Q8_0 3 pp512 112.52 112.78 1.00
RX 6800 llama 7B Q8_0 4 pp512 145.41 145.88 1.00
RX 6800 llama 7B Q8_0 5 pp512 49.37 158.51 3.21
RX 6800 llama 7B Q8_0 6 pp512 59.11 172.57 2.92
RX 6800 llama 7B Q8_0 7 pp512 68.65 144.67 2.11
RX 6800 llama 7B Q8_0 8 pp512 78.44 145.77 1.86
P40 llama 7B Q2_K_M 1 pp512 46.14 46.20 1.00
P40 llama 7B Q2_K_M 2 pp512 47.63 50.32 1.06
P40 llama 7B Q2_K_M 3 pp512 60.66 64.04 1.06
P40 llama 7B Q2_K_M 4 pp512 69.11 73.18 1.06
P40 llama 7B Q2_K_M 5 pp512 32.73 88.55 2.71
P40 llama 7B Q2_K_M 6 pp512 39.21 97.94 2.50
P40 llama 7B Q2_K_M 7 pp512 45.66 105.41 2.31
P40 llama 7B Q2_K_M 8 pp512 52.15 110.61 2.12
P40 llama 7B Q3_K_S 1 pp512 44.53 44.50 1.00
P40 llama 7B Q3_K_S 2 pp512 46.98 49.27 1.05
P40 llama 7B Q3_K_S 3 pp512 59.63 61.93 1.04
P40 llama 7B Q3_K_S 4 pp512 68.34 71.20 1.04
P40 llama 7B Q3_K_S 5 pp512 32.13 87.30 2.72
P40 llama 7B Q3_K_S 6 pp512 38.50 97.08 2.52
P40 llama 7B Q3_K_S 7 pp512 44.84 104.90 2.34
P40 llama 7B Q3_K_S 8 pp512 51.22 112.36 2.19
P40 llama 7B Q4_0 1 pp512 56.03 56.01 1.00
P40 llama 7B Q4_0 2 pp512 57.61 58.65 1.02
P40 llama 7B Q4_0 3 pp512 77.67 83.84 1.08
P40 llama 7B Q4_0 4 pp512 93.15 101.41 1.09
P40 llama 7B Q4_0 5 pp512 51.43 111.66 2.17
P40 llama 7B Q4_0 6 pp512 61.48 122.76 2.00
P40 llama 7B Q4_0 7 pp512 71.56 137.33 1.92
P40 llama 7B Q4_0 8 pp512 81.76 147.84 1.81
P40 llama 7B Q4_1 1 pp512 53.76 53.88 1.00
P40 llama 7B Q4_1 2 pp512 57.89 59.95 1.04
P40 llama 7B Q4_1 3 pp512 78.39 84.69 1.08
P40 llama 7B Q4_1 4 pp512 93.70 102.39 1.09
P40 llama 7B Q4_1 5 pp512 52.91 112.26 2.12
P40 llama 7B Q4_1 6 pp512 63.31 126.74 2.00
P40 llama 7B Q4_1 7 pp512 73.66 135.81 1.84
P40 llama 7B Q4_1 8 pp512 84.12 146.73 1.74
P40 llama 7B Q4_K_S 1 pp512 50.57 50.58 1.00
P40 llama 7B Q4_K_S 2 pp512 52.59 56.28 1.07
P40 llama 7B Q4_K_S 3 pp512 68.48 76.28 1.11
P40 llama 7B Q4_K_S 4 pp512 79.16 88.23 1.11
P40 llama 7B Q4_K_S 5 pp512 48.76 92.38 1.89
P40 llama 7B Q4_K_S 6 pp512 58.36 100.27 1.72
P40 llama 7B Q4_K_S 7 pp512 67.95 106.88 1.57
P40 llama 7B Q4_K_S 8 pp512 77.64 112.27 1.45
P40 llama 7B Q5_0 1 pp512 46.78 46.71 1.00
P40 llama 7B Q5_0 2 pp512 52.55 54.11 1.03
P40 llama 7B Q5_0 3 pp512 71.39 77.54 1.09
P40 llama 7B Q5_0 4 pp512 87.87 94.97 1.08
P40 llama 7B Q5_0 5 pp512 48.04 103.74 2.16
P40 llama 7B Q5_0 6 pp512 57.53 113.00 1.96
P40 llama 7B Q5_0 7 pp512 67.02 127.25 1.90
P40 llama 7B Q5_0 8 pp512 76.49 131.84 1.72
P40 llama 7B Q5_1 1 pp512 47.42 47.45 1.00
P40 llama 7B Q5_1 2 pp512 53.38 54.87 1.03
P40 llama 7B Q5_1 3 pp512 74.02 78.41 1.06
P40 llama 7B Q5_1 4 pp512 90.20 95.56 1.06
P40 llama 7B Q5_1 5 pp512 50.87 106.16 2.09
P40 llama 7B Q5_1 6 pp512 60.77 120.37 1.98
P40 llama 7B Q5_1 7 pp512 70.75 129.80 1.83
P40 llama 7B Q5_1 8 pp512 80.86 140.95 1.74
P40 llama 7B Q5_K_S 1 pp512 43.18 43.04 1.00
P40 llama 7B Q5_K_S 2 pp512 48.59 51.52 1.06
P40 llama 7B Q5_K_S 3 pp512 63.97 70.47 1.10
P40 llama 7B Q5_K_S 4 pp512 75.91 82.70 1.09
P40 llama 7B Q5_K_S 5 pp512 44.85 88.60 1.98
P40 llama 7B Q5_K_S 6 pp512 53.69 96.24 1.79
P40 llama 7B Q5_K_S 7 pp512 62.51 103.65 1.66
P40 llama 7B Q5_K_S 8 pp512 71.44 108.90 1.52
P40 llama 7B Q6_K 1 pp512 35.69 35.42 0.99
P40 llama 7B Q6_K 2 pp512 42.62 45.82 1.07
P40 llama 7B Q6_K 3 pp512 56.62 63.95 1.13
P40 llama 7B Q6_K 4 pp512 68.53 76.28 1.11
P40 llama 7B Q6_K 5 pp512 45.81 83.52 1.82
P40 llama 7B Q6_K 6 pp512 54.76 93.91 1.71
P40 llama 7B Q6_K 7 pp512 63.79 104.99 1.65
P40 llama 7B Q6_K 8 pp512 72.97 113.58 1.56
P40 llama 7B Q8_0 1 pp512 35.76 35.73 1.00
P40 llama 7B Q8_0 2 pp512 44.17 45.66 1.03
P40 llama 7B Q8_0 3 pp512 62.10 65.70 1.06
P40 llama 7B Q8_0 4 pp512 77.46 78.25 1.01
P40 llama 7B Q8_0 5 pp512 51.14 93.88 1.84
P40 llama 7B Q8_0 6 pp512 61.26 96.41 1.57
P40 llama 7B Q8_0 7 pp512 71.24 109.15 1.53
P40 llama 7B Q8_0 8 pp512 81.43 119.29 1.46

Compared to the previous version the performance for a batch size of 1 should be the same as on master but the performance for larger batch sizes is slightly worse. I think prioritizing a batch size of 1 makes more sense since it's the most common use case.

@ggerganov
Owner

Yes, bs = 1 should remain optimal

So should we look for ways to improve the build time or should we merge it like this? I wish we didn't have to special case so many sizes and architectures, but it looks like this is how CUDA (GPU?) programming goes. I don't want the code to start taking 10 minutes to compile, so what options are there to improve this?

@JohannesGaessler
Collaborator Author

I wish we didn't have to special case so many sizes and architectures, but it looks like this is how CUDA (GPU?) programming goes.

The main issue is that to get good performance you have to do loop unrolling. The compiler can then optimize out a lot of conditional statements and rearrange the instructions in a better way, but this simply takes time, especially when it is done for multiple loop lengths; the sketch below illustrates the pattern.
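As a hedged illustration (hypothetical names, plain floats, not the kernel from this PR) of why this multiplies compile time: with the batch size as a template parameter the inner loop has a compile-time bound and can be fully unrolled, but the host-side dispatch then forces one instantiation per supported batch size, and in the real code this repeats for every quantization format.

```cuda
// Illustrative only: a float kernel where ncols_y is a template parameter.
#include <cuda_runtime.h>

#define WARP_SIZE 32

template <int ncols_y>
__global__ void mul_mat_vec_unrolled(const float * __restrict__ x,   // nrows x ncols_x (weights)
                                     const float * __restrict__ y,   // ncols_y x ncols_x (batch)
                                     float       * __restrict__ dst, // ncols_y x nrows
                                     const int ncols_x, const int nrows) {
    const int row = blockIdx.x;
    float sum[ncols_y] = {0.0f};

    for (int i = threadIdx.x; i < ncols_x; i += WARP_SIZE) {
        const float xv = x[(long) row * ncols_x + i];
#pragma unroll
        for (int j = 0; j < ncols_y; ++j) { // bound known at compile time -> fully unrolled
            sum[j] += xv * y[(long) j * ncols_x + i];
        }
    }

#pragma unroll
    for (int j = 0; j < ncols_y; ++j) {
        // warp-level reduction; launched with 32 threads per block
        for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
            sum[j] += __shfl_down_sync(0xffffffff, sum[j], offset);
        }
        if (threadIdx.x == 0) {
            dst[(long) j * nrows + row] = sum[j];
        }
    }
}

// The dispatch forces one instantiation of the kernel above per batch size; in
// the real code this is multiplied again by the number of quantization formats.
static void mul_mat_vec_unrolled_cuda(const float * x, const float * y, float * dst,
                                      const int ncols_x, const int nrows,
                                      const int ncols_y, cudaStream_t stream) {
    const dim3 block(WARP_SIZE, 1, 1);
    const dim3 grid(nrows, 1, 1);
    switch (ncols_y) {
        case 1: mul_mat_vec_unrolled<1><<<grid, block, 0, stream>>>(x, y, dst, ncols_x, nrows); break;
        case 2: mul_mat_vec_unrolled<2><<<grid, block, 0, stream>>>(x, y, dst, ncols_x, nrows); break;
        case 3: mul_mat_vec_unrolled<3><<<grid, block, 0, stream>>>(x, y, dst, ncols_x, nrows); break;
        case 4: mul_mat_vec_unrolled<4><<<grid, block, 0, stream>>>(x, y, dst, ncols_x, nrows); break;
        // cases 5 through 8 are analogous
        default: break;
    }
}
```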

The biggest reduction in compile time would be achieved simply by splitting the code into multiple files. Currently the compilation cannot be parallelized at all. But if the code were split based on e.g. the 14 different data formats, it could be compiled with 14 parallel jobs, which (given enough cores) should be much faster.

Or add a compile option that reduces compile time at the cost of performance.

@JohannesGaessler
Collaborator Author

As a data point, on my system with a Ryzen 5950X the compile time on master is 13.461 s; with this PR it's 18.945 s. Command used:

make clean && time make LLAMA_CUBLAS=1 LLAMA_NO_CCACHE=1 main

@ggerganov
Owner

But if you were to split the code based on e.g. the 14 different data formats you could compile the code with 14 parallel jobs which (given enough cores) should be much faster.

I guess it's more realistic if we moved the kernels into a separate header + source and built it multiple times for the various template specializations based on ifdefs. This way the source would not be scattered across a large number of files and it could still be built in parallel. But it would involve maintaining lists of the specializations.

@slaren
Collaborator

slaren commented Feb 11, 2024

To do that we would need to keep a list of specializations in the build scripts, litter the code with ifdefs for each combination of template parameters, and additionally create a different function for each combination of parameters so that the template can be linked externally. This would be an insane amount of complexity.

@ggerganov
Owner

Yeah, I already tried to prototype the idea and saw it does not make sense. (the main issue is actually the cases where the template argument is a function (e.g. dequantize_xxx) but it's not important)

@JohannesGaessler
Collaborator Author

What we could do instead is separate the code by kernel. Right now what takes up most of the time are the matrix multiplication kernels that deal with quantized data: dequantize_mul_mat_vec, mul_mat_vec_q, and mul_mat_q. They take up a lot of compilation time because you need separate variants for the different data formats and because they use a lot of loop unrolling. Compilation time would still be reduced by a lot this way, but you would only need to move the code and add three extra compiler invocations.

@JohannesGaessler
Collaborator Author

litter the code with ifdefs for each combination of template parameters, and additionally create a different function for each combination of parameters so that the template can be linked externally.

Wouldn't it be possible to define the templates in one file, include that file in the quantization-specific files (where the actual kernels get compiled), and to then include the quantization-specific files in ggml-cuda.cu?

@slaren
Collaborator

slaren commented Feb 11, 2024

Moving each kernel to a different source file, together with its _cuda launching function, should be enough to improve the compilation times. It would allow us to work on specific kernels without triggering a full recompilation. Some device functions would need to be in a common header, but that's fine. Ultimately the goal should be to improve code organization; the compilation time is just a bonus.
If it is ok for the Vulkan backend to use a different file for each kernel, it is for the CUDA backend too.
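A rough sketch of that layout, with purely hypothetical file and function names (this is not how the repository is actually organized): the templated kernel and shared device helpers live in a common header, each kernel or format gets its own translation unit containing its instantiations and its _cuda launch function, and the main file only needs a plain declaration, so the per-format files can be compiled in parallel and edited without recompiling everything.

```cuda
// ---- mmvq_common.cuh (hypothetical shared header) ---------------------------
// The template and shared __device__ helpers live here; including this header
// does not by itself instantiate any device code.
template <typename block_q, int ncols_y>
__global__ void mul_mat_vec_q_sketch(const void * vx, const void * vy, float * dst,
                                     const int ncols_x, const int nrows_x) {
    // ... kernel body shared by all quantization formats ...
}

// ---- mmvq_q4_0.cu (hypothetical per-format translation unit) ----------------
// #include "mmvq_common.cuh"
// The q4_0 instantiations are compiled here, in parallel with the other
// formats' files; editing this file only recompiles this file.
void mul_mat_vec_q4_0_cuda(const void * vx, const void * vy, float * dst,
                           const int ncols_x, const int nrows_x, const int ncols_y,
                           cudaStream_t stream) {
    // switch (ncols_y) { case 1: mul_mat_vec_q_sketch<block_q4_0, 1><<<...>>>(...); break; ... }
}

// ---- ggml-cuda.cu ------------------------------------------------------------
// Only the plain host-side declaration is needed here:
// void mul_mat_vec_q4_0_cuda(const void *, const void *, float *, int, int, int, cudaStream_t);
```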

Owner

@ggerganov ggerganov left a comment


Ok, let's consider reorganizing the CUDA backend into multiple files

@ggerganov ggerganov requested a review from slaren February 11, 2024 16:50
(4 review comment threads on ggml-cuda.cu, resolved)
@JohannesGaessler JohannesGaessler merged commit 3bdc4cd into ggerganov:master Feb 11, 2024
46 of 53 checks passed
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* CUDA: mul_mat_vec_q tiling, refactor mul mat logic

Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>