ggml : make GeLU faster and more accurate on CPU
This change makes GeLU go 8x faster on AVX2, 3x faster on Apple Silicon, and 2x faster on Threadripper. GeLU is one of the most widely used activation functions, appearing in models such as Whisper and Gemma, where speeding it up yields a noticeable improvement in performance, since the GeLU op is usually the most time-consuming operation after matrix multiplication.

In addition to improving performance, this change also improves accuracy. On ARM64 and AMD64 systems we no longer rely on a 16-bit lookup table; GeLU is now computed with SIMD instead. The GeLU lookup table is still here, except it has been converted from fp16 to bf16. That may align inference more closely with training, but more importantly it avoids the two extra lookups into the fp16 table. This change should therefore have a positive impact on performance for platforms like OpenPOWER and RISC-V too.
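For reference, below is a minimal sketch of the tanh-based GeLU approximation evaluated directly in float, the way a SIMD kernel computes it lane by lane instead of consulting an fp16 lookup table. The constants and function names mirror the ones conventionally used for this approximation and are shown here as illustrative assumptions, not the exact committed kernels.

```c
#include <math.h>

// Coefficients of the tanh approximation of GeLU (illustrative values,
// matching the commonly used formulation 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3))).
#define GELU_COEF_A    0.044715f
#define SQRT_2_OVER_PI 0.79788456080286535588f

static inline float gelu_approx_f32(float x) {
    return 0.5f * x * (1.0f + tanhf(SQRT_2_OVER_PI * x * (1.0f + GELU_COEF_A * x * x)));
}

// Scalar reference loop; a vectorized version evaluates the same
// expression on 8 floats per AVX2 iteration (or 4 per NEON iteration)
// using a SIMD tanh/exp routine instead of tanhf().
static void gelu_f32(const float *x, float *y, int n) {
    for (int i = 0; i < n; i++) {
        y[i] = gelu_approx_f32(x[i]);
    }
}
```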