Add QuickGELU (lookup-table based) #561

Open
PallHaraldsson opened this issue Jan 22, 2024 · 1 comment


PallHaraldsson commented Jan 22, 2024

Motivation and description

See here:
https://github.com/ggerganov/ggml/pull/254/files

I think we may need QuickGELU for compatibility, not just as an optimization, since it is not the same function as GELU.

It's probably meant as an optimization (it's an approximation), but then why does ggml keep both definitions?

https://zeta.apac.ai/en/latest/zeta/nn/modules/quickgeluactivation/

The QuickGELUActivation class is a part of the Neural Network(NN) module that applies a Gaussian Error Linear Unit (GELU) approximation. [..] The approximate version of GELU used in this class is fast although somewhat less accurate than the standard GELU activation. [..]
"""Applies GELU approximation that is fast but somewhat inaccurate. See: https://github.com/hendrycks/GELUs"""

ggerganov/ggml#253

I'm implementing CLIP in GGML, and it turns out that we need the Quick GELU activation instead of GELU.

Also used with:
https://github.com/facebookresearch/MetaCLIP

MetaCLIP is trained w/ face blurred images.
@inproceedings{xu2023metaclip,
title={Demystifying CLIP Data}

They have two 128 KB lookup tables, one each for GELU and QuickGELU, indexed by Float16 (65,536 entries of 2 bytes each), but no table for ggml_gelu_quick_f32.

I thought lookup tables had gone out of favor (for CPUs and GPUs), since computing the function was faster, but apparently not, so the table is most likely faster, at least in this case. I really don't think they would do this unless it helped (I believe ggml is the most optimized and most used library of its kind), at least on CPUs. So maybe consider tables for other activation functions too?

I'm not sure, but lookup tables probably do not make sense on GPUs, where latency is not as big of a deal and threading compensates. I think the code there may apply only to CPUs. Can anyone confirm, or is it also used on GPUs?

Would it make sense to have a table for 8-bit floats too? And maybe to use that, or some other small table, for Float16 with some extra computation?

I think I could implement this (in the same way as there), i.e. just the activations (a starting point, not everything they use the tables for).

I also see this comment there: "initialize GELU, Quick GELU, SILU and EXP F32 tables". I didn't think FP32 tables(!?) were used, nor a table for EXP. I also see the unrelated GGML_OP_SILU_BACK and GGML_OP_ALIBI.

And FYI, the 2016 GELU paper was updated in 2023 for some reason:

https://arxiv.org/abs/1606.08415
[v1] Mon, 27 Jun 2016 19:20:40 UTC (435 KB) [..]
[v3] Sun, 11 Nov 2018 07:40:32 UTC (3,013 KB) [..]
[v5] Tue, 6 Jun 2023 01:53:32 UTC (3,016 KB)

Possible Implementation

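// ggml.c defines the coefficient with the minus sign already folded in.
static const float GELU_QUICK_COEF = -1.702f;

inline static float ggml_gelu_quick_f32(float x) {
    return x*(1.0f/(1.0f+expf(GELU_QUICK_COEF*x)));
}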

In Julia that would be:

@inline gelu_quick(x) = x*(one(x)/(one(x)+exp(-1.702f0*x)))

PallHaraldsson commented Jan 22, 2024

Note: I meant to post this to https://github.com/FluxML/NNlib.jl, but it might not matter (the same people read both?). It could be moved there, or added here and/or there.

I followed a link and ended up here by accident. I think the activations here may be legacy, or just the most-needed ones, so this may not be wanted here? NNlib is usable from Flux, so it would be better there, where it is also available to Lux etc.

FYI, unrelated: while 4-bit quantization is mainstream (or getting there), with 2-bit available and even 1-bit (Microsoft's BitNet), I also see this. It is "lossless" and "post-training quantization", which I guess is the advantage over BitNet, which is not post-training:

ggerganov/llama.cpp#5063

In addition to the IQ2_XXS, IQ2_XS, Q2_K_S (and now Q3_K_S via PR ggerganov/llama.cpp#5060) that were recently added to llama.cpp

In this case, the improved Q2_K (pre-SOTA) and the Q3_K_S are competing with each other.
In Ppl/s terms (and Hellaswag terms), the best bang for our buck between both is this Q2_K, because the gain in size clearly goes way beyond the bump of perplexity in percentage, and 1k more context at equal VRAM usage is quite a boner. [..]

https://github.com/SqueezeAILab/SqueezeLLM

SqueezeLLM: Dense-and-Sparse Quantization
https://arxiv.org/abs/2306.07629v2

In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing model weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: [..] When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline.

Do we want to support whatever quantization formats they have, or at least such emerging ones?

@mcabbott mcabbott transferred this issue from FluxML/Flux.jl Jan 22, 2024