ggml : make GeLU faster and more accurate on CPU
This change makes GeLU go 8x faster on AVX2, 3x faster on Apple Silicon, and 2x faster on Threadripper. GeLU is one of the most widely used activation functions, appearing in models such as Whisper and Gemma, where speeding it up yields a noticeable improvement in performance, since the GeLU op is usually the most time-consuming operation after matrix multiplication.

In addition to improving performance, this change also improves accuracy. On ARM64 and AMD64 systems we no longer rely on a 16-bit lookup table; GeLU is now computed with SIMD instead. The GeLU lookup table is still here, except it has been converted from fp16 to bf16. That may align inference more closely with training, but more importantly it avoids the two extra lookups into the fp16 table. This change should therefore have a positive impact on performance for platforms like OpenPOWER and RISC-V too.
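For reference, below is a minimal sketch of the tanh-based GeLU approximation evaluated directly in float, the way a SIMD kernel computes it lane by lane instead of consulting an fp16 lookup table. The constants and function names mirror the ones conventionally used for this approximation and are shown here as illustrative assumptions, not the exact committed kernels.

```c
#include <math.h>

// Coefficients of the tanh approximation of GeLU (illustrative values,
// matching the commonly used formulation 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3))).
#define GELU_COEF_A    0.044715f
#define SQRT_2_OVER_PI 0.79788456080286535588f

static inline float gelu_approx_f32(float x) {
    return 0.5f * x * (1.0f + tanhf(SQRT_2_OVER_PI * x * (1.0f + GELU_COEF_A * x * x)));
}

// Scalar reference loop; a vectorized version evaluates the same
// expression on 8 floats per AVX2 iteration (or 4 per NEON iteration)
// using a SIMD tanh/exp routine instead of tanhf().
static void gelu_f32(const float *x, float *y, int n) {
    for (int i = 0; i < n; i++) {
        y[i] = gelu_approx_f32(x[i]);
    }
}
```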