Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDA implementation of ggml_clamp #545

Closed
jploski opened this issue Sep 30, 2023 · 3 comments
Closed

CUDA implementation of ggml_clamp #545

jploski opened this issue Sep 30, 2023 · 3 comments

Comments

@jploski
Copy link
Contributor

jploski commented Sep 30, 2023

ggml_clamp currently lacks a CUDA implementation, which is rather trivial, but something I ran into while porting MPT to llama.cpp. I have a ready implementation and will create a PR for it, referencing this issue.

@jploski
Copy link
Contributor Author

jploski commented Sep 30, 2023

Hmm, I'm a bit confused as the internals (e.g. of ggml_scale) on ggml master branch look different than on llama.cpp master. llama.cpp's implementation of ggml-cuda.cu seems more recent (and cleaner looking, too). Will ggml eventually catch up to llama.cpp? (In that case I would contribute the patch on llama.cpp instead.)

jploski added a commit to jploski/llama.cpp that referenced this issue Sep 30, 2023
@slaren
Copy link
Collaborator

slaren commented Sep 30, 2023

Most of the development of the CUDA backend happens in the llama.cpp repository, and the changes are merged here regularly. Opening the PR in the llama.cpp repository is fine, the changes will eventually end here. I see that the ggml_clamp implementation is already in your MPT PR in llama.cpp, so there is no need to open another PR here.

@jploski
Copy link
Contributor Author

jploski commented Sep 30, 2023

Most of the development of the CUDA backend happens in the llama.cpp repository, and the changes are merged here regularly. Opening the PR in the llama.cpp repository is fine, the changes will eventually end here. I see that the ggml_clamp implementation is already in your MPT PR in llama.cpp, so there is no need to open another PR here.

Thanks for clarifying - I thought it worked exactly the other way around.

@jploski jploski closed this as completed Sep 30, 2023
ggerganov added a commit to ggerganov/llama.cpp that referenced this issue Oct 10, 2023
* CUDA: added support for ggml_clamp (see also: ggerganov/ggml#545)

* mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt

* mpt : protect against "clip_qkv": null in mpt-7b

* mpt : quick fix to avoid "Strange model" warning when quantizing MPT models

* mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)

* mpt : standardized all tensor names to follow GGUF spec

* mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code

* mpt : fixed comment s/gptneox/mpt/

* mpt : remove tabs, trailing whitespace

* mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt

* mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252

* comment out n_past instead of marking it unused

* mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]

* mpt : remove unused tokenizer_json in convert script

* ggml : remove obsolete n_past assert in ggml_alibi

* llama : print clam_kqv and max_alibi_bias hparams

---------

Co-authored-by: Cebtenzzre <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
github-actions bot pushed a commit to KerfuffleV2/ggml-sys-bleedingedge that referenced this issue Oct 10, 2023
== Relevant log messages from source repo:

commit f5f9121de140eff558f13b5c5e78a3a3b6b94377
Author: Jan Ploski <[email protected]>
Date:   Tue Oct 10 09:50:23 2023 +0200

    llm : add MPT support (#3417)

    * CUDA: added support for ggml_clamp (see also: ggerganov/ggml#545)

    * mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt

    * mpt : protect against "clip_qkv": null in mpt-7b

    * mpt : quick fix to avoid "Strange model" warning when quantizing MPT models

    * mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)

    * mpt : standardized all tensor names to follow GGUF spec

    * mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code

    * mpt : fixed comment s/gptneox/mpt/

    * mpt : remove tabs, trailing whitespace

    * mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt

    * mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252

    * comment out n_past instead of marking it unused

    * mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]

    * mpt : remove unused tokenizer_json in convert script

    * ggml : remove obsolete n_past assert in ggml_alibi

    * llama : print clam_kqv and max_alibi_bias hparams

    ---------

    Co-authored-by: Cebtenzzre <[email protected]>
    Co-authored-by: Georgi Gerganov <[email protected]>
CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this issue Dec 18, 2023
Current status: Working, except for the latest GPTQ-for-LLaMa format
  that includes `g_idx`.  This turns out to require changes to GGML, so
  for now it only works if you use the `--outtype` option to dequantize it
  back to f16 (which is pointless except for debugging).

  I also included some cleanup for the C++ code.

  This script is meant to replace all the existing conversion scripts
  (including the ones that convert from older GGML formats), while also
  adding support for some new formats.  Specifically, I've tested with:

  - [x] `LLaMA` (original)
  - [x] `llama-65b-4bit`
  - [x] `alpaca-native`
  - [x] `alpaca-native-4bit`
  - [x] LLaMA converted to 'transformers' format using
        `convert_llama_weights_to_hf.py`
  - [x] `alpaca-native` quantized with `--true-sequential --act-order
        --groupsize 128` (dequantized only)
  - [x] same as above plus `--save_safetensors`
  - [x] GPT4All
  - [x] stock unversioned ggml
  - [x] ggmh

  There's enough overlap in the logic needed to handle these different
  cases that it seemed best to move to a single script.

  I haven't tried this with Alpaca-LoRA because I don't know where to find
  it.

  Useful features:

  - Uses multiple threads for a speedup in some cases (though the Python
    GIL limits the gain, and sometimes it's disk-bound anyway).

  - Combines split models into a single file (both the intra-tensor split
    of the original and the inter-tensor split of 'transformers' format
    files).  Single files are more convenient to work with and more
    friendly to future changes to use memory mapping on the C++ side.  To
    accomplish this without increasing memory requirements, it has some
    custom loading code which avoids loading whole input files into memory
    at once.

  - Because of the custom loading code, it no longer depends in PyTorch,
    which might make installing dependencies slightly easier or faster...
    although it still depends on NumPy and sentencepiece, so I don't know
    if there's any meaningful difference.  In any case, I also added a
    requirements.txt file to lock the dependency versions in case of any
    future breaking changes.

  - Type annotations checked with mypy.

  - Some attempts to be extra user-friendly:

      - The script tries to be forgiving with arguments, e.g. you can
        specify either the model file itself or the directory containing
        it.

      - The script doesn't depend on config.json / params.json, just in
        case the user downloaded files individually and doesn't have those
        handy.  But you still need tokenizer.model and, for Alpaca,
        added_tokens.json.

      - The script tries to give a helpful error message if
        added_tokens.json is missing.
CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this issue Dec 18, 2023
after ggerganov#545 we do not need torch, tqdm and requests in the dependencies
YuMJie pushed a commit to YuMJie/powerinfer that referenced this issue Oct 25, 2024
* CUDA: added support for ggml_clamp (see also: ggerganov/ggml#545)

* mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt

* mpt : protect against "clip_qkv": null in mpt-7b

* mpt : quick fix to avoid "Strange model" warning when quantizing MPT models

* mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)

* mpt : standardized all tensor names to follow GGUF spec

* mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code

* mpt : fixed comment s/gptneox/mpt/

* mpt : remove tabs, trailing whitespace

* mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt

* mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252

* comment out n_past instead of marking it unused

* mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]

* mpt : remove unused tokenizer_json in convert script

* ggml : remove obsolete n_past assert in ggml_alibi

* llama : print clam_kqv and max_alibi_bias hparams

---------

Co-authored-by: Cebtenzzre <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants