Add special token modification capability #6778

Closed
wants to merge 142 commits

Conversation

CISC (Contributor) commented Apr 20, 2024

To be able to fix/amend special tokens in a GGUF let's add two new arguments:

  • --special-token <name> <value> where <name> can be bos, eos, prefix, middle, etc. while <value> is the token value, f.ex. "<|fim▁begin|>"
  • --special-token-by-id <name> <id> where <id> is the ID of the token, f.ex. 32006

So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following:

gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<|fim▁begin|>" --special-token middle "<|fim▁end|>" --special-token suffix "<|fim▁hole|>"

(yes, fim_end is the middle token, because completion is a prefix/suffix/middle sequence (where middle is unfilled))
or

gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<fim_prefix>" --special-token middle "<fim_middle>" --special-token suffix "<fim_suffix>"

etc...

NB: The tokens have to exist already; attempting to set a non-existent token name/ID will be ignored (with a warning), while a non-existent token value will fail (with an error).
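To make the prefix/suffix/middle naming concrete, here is a minimal, hypothetical Python sketch of how a fill-in-middle prompt is typically assembled from these three special tokens (the `build_fim_prompt` helper and the DeepSeek-style token strings are illustrative only, not part of gguf-new-metadata.py):

```python
# Illustrative sketch: shows why "<|fim▁end|>" plays the *middle* role.
# A FIM prompt is laid out as prefix marker, code before the hole, suffix marker,
# code after the hole, then the middle marker; the model generates the missing middle.

FIM_TOKENS = {  # assumed DeepSeek-style markers, matching the example above
    "prefix": "<|fim▁begin|>",
    "suffix": "<|fim▁hole|>",
    "middle": "<|fim▁end|>",
}

def build_fim_prompt(code_before: str, code_after: str) -> str:
    """Assemble a prefix/suffix/middle (PSM) fill-in-middle prompt."""
    return (FIM_TOKENS["prefix"] + code_before +
            FIM_TOKENS["suffix"] + code_after +
            FIM_TOKENS["middle"])  # generation continues from here

if __name__ == "__main__":
    print(build_fim_prompt("def add(a, b):\n", "    return result\n"))
```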

CISC and others added 30 commits April 20, 2024 08:33
To be able to fix/amend special tokens in a GGUF let's add two new arguments:
* `--special-token <name> <value>` where `<name>` can be bos, eos, prefix, middle, etc. while `<value>` is the token value, f.ex. `"<|fim▁begin|>"`
* `--special-token-by-id <name> <id>` where `<id>` is the ID of the token, f.ex. 32006

So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following:
```bash
python3 gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<|fim▁begin|>" --special-token middle "<|fim▁hole|>" --special-token suffix "<|fim▁end|>"
```
* common : disable get_math_cpu_count() until Android CI gets fixed

* common : another try
* Support Llama 3 conversion

The tokenizer is BPE.

* style

* Accept suggestion

Co-authored-by: Sourab Mangrulkar <[email protected]>

* llama : add llama_token_is_eog()

ggml-ci

* llama : auto-detect more EOT tokens when missing in KV data

* convert : replacing EOS token is a hack

* llama : fix codegemma EOT token + add TODOs

* llama : fix model type string for 8B model

---------

Co-authored-by: Sourab Mangrulkar <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
This change removes printf() logging so llava-cli is shell scriptable.
* added fedora to list of distros that may need the package (the packages have the same name on Fedora)

* how to add clblast that is available in the fedora repos
* Added llama-3 chat template

* Update llama.cpp

Co-authored-by: Samuel Tallet <[email protected]>

* Update llama.cpp

Co-authored-by: Samuel Tallet <[email protected]>

* Update tests/test-chat-template.cpp

Co-authored-by: Samuel Tallet <[email protected]>

* Added EOS stop sequence according to ggml-org#6751 (comment)

* Removed adding of BOS token before first message

* Removed bos token from expected output from llama-3

* Update tests/test-chat-template.cpp

Co-authored-by: Rene Leonhardt <[email protected]>

* Update tests/test-chat-template.cpp

Co-authored-by: Rene Leonhardt <[email protected]>

* Added <|end_of_text|> as another stop token

* Reverted last change of adding the end_of_text stop word for llama 3

---------

Co-authored-by: Wouter Tichelaar <[email protected]>
Co-authored-by: Samuel Tallet <[email protected]>
Co-authored-by: Rene Leonhardt <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
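For context on the chat-template commits above, a hedged Python sketch of how a Llama 3-style prompt is laid out, assuming the publicly documented `<|start_header_id|>`/`<|end_header_id|>`/`<|eot_id|>` markers; the actual formatting lives in llama.cpp's C++ chat-template code, and this helper is purely illustrative:

```python
# Assumed Llama 3 instruct layout: each message is wrapped as
# <|start_header_id|>role<|end_header_id|>\n\ncontent<|eot_id|>,
# followed by an empty assistant header to cue the reply.
# <|eot_id|> serves as the stop token; per the commit notes above,
# BOS (<|begin_of_text|>) is not prepended here.

def format_llama3(messages: list[dict[str, str]]) -> str:
    out = []
    for m in messages:
        out.append(f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>")
    out.append("<|start_header_id|>assistant<|end_header_id|>\n\n")  # model answers from here
    return "".join(out)

if __name__ == "__main__":
    print(format_llama3([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]))
```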
* make : fix common dep on llama.h

* llama : add option to render special tokens

* readme : add API change notice

ggml-ci

* swift : fix build
* `build`: generate hex dumps of server assets on the fly

* build: workaround lack of -n on gnu xxd

* build: don't use xxd in cmake

* build: don't call xxd from build.zig

* build: more idiomatic hexing

* build: don't use xxd in Makefile (od hackery instead)

* build: avoid exceeding max cmd line limit in makefile hex dump

* build: hex dump assets at cmake build time (not config time)
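To illustrate what "hex dumps of server assets" means in the build commits above, here is a small, hypothetical Python equivalent of an `xxd -i`-style dump that renders a file as a C byte array so web assets can be compiled into the server binary; it is not the project's actual CMake/Makefile/zig step:

```python
# Hypothetical stand-in for `xxd -i`: emit a C unsigned char array for a file.

def to_c_array(path: str, symbol: str) -> str:
    with open(path, "rb") as f:
        data = f.read()
    body = ", ".join(f"0x{b:02x}" for b in data)
    return (f"unsigned char {symbol}[] = {{{body}}};\n"
            f"unsigned int {symbol}_len = {len(data)};\n")

if __name__ == "__main__":
    # assumes an index.html file exists next to this script
    print(to_c_array("index.html", "index_html"))
```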
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/1042fd8b148a9105f3c0aca3a6177fd1d9360ba5?narHash=sha256-3sbWO1mbpWsLepZGbWaMovSO7ndZeFqDSdX0hZ9nVyw%3D' (2024-04-10)
  → 'github:NixOS/nixpkgs/5c24cf2f0a12ad855f444c30b2421d044120c66f?narHash=sha256-XtTSSIB2DA6tOv%2Bl0FhvfDMiyCmhoRbNB%2B0SeInZkbk%3D' (2024-04-19)
Latest gcc complains here:
/home/airlied/devel/llama.cpp/ggml-alloc.c: In function ‘ggml_gallocr_new_n’:
/home/airlied/devel/llama.cpp/ggml-alloc.c:374:59: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
  374 |     ggml_gallocr_t galloc = (ggml_gallocr_t)calloc(sizeof(struct ggml_gallocr), 1);
      |                                                           ^~~~~~
/home/airlied/devel/llama.cpp/ggml-alloc.c:374:59: note: earlier argument should specify number of elements, later size of each element

and a bunch more.

calloc is specified to take nmemb first, then size, so realign the code.

In a couple of places there was a `* x, 1`, so I fixed those to use calloc properly.
* llamafile : improve sgemm.cpp

- Re-enable by default
- Fix issue described in ggml-org#6716
- Make code more abstract, elegant, and maintainable
- Faster handling of weirdly shaped `m` and `n` edge cases

* Address review comments

* Help clang produce fma instructions

* Address review comments
…ag activated (ggml-org#6767)

* Fix FP32/FP16 build instructions

* Fix typo

* Recommended build instruction

Co-authored-by: Neo Zhang Jianyu <[email protected]>

* Recommended build instruction

Co-authored-by: Neo Zhang Jianyu <[email protected]>

* Recommended build instruction

Co-authored-by: Neo Zhang Jianyu <[email protected]>

* Add comments in Intel GPU linux

---------

Co-authored-by: Anas Ahouzi <[email protected]>
Co-authored-by: Neo Zhang Jianyu <[email protected]>
* add explicit phi3 support

* add explicit phi3 support

* remove unused code

* convert : add BOS token

* llama : match EOT token <|end|>

* llama : minor / style

* llama : tabs -> spaces

* convert : fix lint checks

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* add support of codeqwen due to tokenizer

* override load_hparams

* fix typo

* fix load_params

* convert : fix whitespace

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* Add phi 3 chat template & tests

* test : fix chat template result

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* Server: add tests for consistent results

* sampling: separate rng per sampling context
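A quick, illustrative Python sketch of the "separate rng per sampling context" idea: each context owns its own seeded RNG, so concurrent requests cannot perturb each other's sampling and results stay reproducible for a given seed (the class and method names here are hypothetical, not llama.cpp's actual C++ sampling code):

```python
import random

class SamplingContext:
    """Hypothetical sketch: one RNG per sampling context for reproducible results."""
    def __init__(self, seed: int):
        self.rng = random.Random(seed)  # private RNG, independent of the global one

    def sample(self, probs: list[float]) -> int:
        # draw a token index proportionally to its probability
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]

# two contexts with the same seed produce the same draws, independently of each other
a, b = SamplingContext(42), SamplingContext(42)
assert [a.sample([0.1, 0.7, 0.2]) for _ in range(5)] == [b.sample([0.1, 0.7, 0.2]) for _ in range(5)]
```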
…g#6860)

* fix: revert showing control tokens by default

* feat: revert changes to default behavior of llama_token_to_piece; provide overridden declaration to receive "bool special" param to toggle showing control tokens

* feat: use the overridden declaration of llama_token_to_piece from common/common.cpp to specify "false" so that control tokens are not shown in chat completion responses

* common : simplify

---------

Co-authored-by: Georgi Gerganov <[email protected]>
ggerganov and others added 27 commits May 8, 2024 09:14
* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32

The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16,
which in practice still covers 99.71% of Mistral 7b v0.2's weights;
however, there is currently no way other than fp32 to represent the
remaining values (a short conversion sketch follows these commit notes).

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16

This change fixes that by adding a bf16 data type to GGML. Support
for CPU inference has been implemented along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves by somewhere around -0.0024 to -0.0046 compared to using fp16.

* Remove GGML code that's not needed

* Minimize the GGML API surface area for BF16

* Remove bf16 luts

* Make the GGML header look nicer

* Fix documentation

* Apply ggerganov's fixes for test-backend-ops

* Add BF16 code for new ggml_validate_row_data() function
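The notes above describe the bf16-to-fp32 widening as a simple 16-bit left shift; here is a minimal Python sketch of that bit manipulation, using only the standard `struct` module to illustrate the encodings (GGML's actual conversion kernels are in C):

```python
import struct

def bf16_to_fp32(bits: int) -> float:
    """Widen a 16-bit bfloat16 pattern to float32: same sign and exponent,
    mantissa padded with zeros, i.e. shift left by 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def fp32_to_bf16(x: float) -> int:
    """Truncate a float32 to bfloat16 by dropping the low 16 bits
    (real converters typically round-to-nearest-even instead of truncating)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

assert bf16_to_fp32(0x3F80) == 1.0  # 0x3F80 is 1.0 in bfloat16
assert fp32_to_bf16(1.0) == 0x3F80
# fp16 has only 5 exponent bits, so many bf16 values (e.g. 1e30) cannot be
# represented in fp16 at all, which is the information-loss issue described above.
print(bf16_to_fp32(fp32_to_bf16(1e30)))  # still representable in bf16
```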
* compare-llama-bench.py: add missing basicConfig

* compare-llama-bench.py: Add line break between error message and print_help()

* Add regular print() markdown table
* Add BPE pre-tokenization for DBRX.

* Add vocab GGUFs.

* Remove test.

* Remove GGUFs.
* Add BPE pre-tokenization for Qwen2.

* minor : fixes

---------

Co-authored-by: Ren Xuancheng <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
…7027)

An example of how this might be used in the style of baby-llama will be attached with this PR.
* basic avx implementation

* style

* combine denibble with load

* reduce 256 to 128 (and back!) conversions

* sse load

* Update sgemm.cpp

* oops

oops
…org#7078)

* fix: use `malloc` instead of `posix_memalign` in `ggml-metal.m` to make it not crash Electron proccesses

* fix: typo

* fix: use `vm_allocate` instead of `posix_memalign`

* fix: don't call `newBufferWithBytesNoCopy` with `NULL` when `ggml_metal_host_malloc` returns `NULL`

* fix: use `vm_allocate` only on macOS
* Added themes support with two sample themes and a favicon.

* Newline

* Newline

* Newline

* Trailing whitespace

* Increased opacity for contrast

* Increase opacity.

Check actions cancelled for some other priority job and I can't seem to manually re-run them, so MOAR OPACITY

* Opacity action trigger.

Trying to re-trigger the cancelled action.

* One more opacity adjustment

This Actions pipeline is failing for random issues.

* Delete examples/server/themes/buttons_top/completion.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/buttons_top/index.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/wild/completion.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/buttons_top/json-schema-to-grammar.mjs

This will be served from the static string built-in to server.

* Delete examples/server/themes/wild/index.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/wild/json-schema-to-grammar.mjs

This will be served from the static string built-in to server.

* Replaced underscore.
* DRAFT: Introduction of CUDA Graphs to LLama.cpp

* Fix issues raised in comments

* Tidied to now only use CUDA runtime (not mixed with driver calls)

* disable for multi-gpu and batch size > 1

* Disable CUDA graphs for old GPU arch and with env var

* added missing CUDA_CHECKs

* Addressed comments

* further addressed comments

* limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake

* Added more comprehensive graph node checking

* With mechanism to fall back if graph capture fails

* Revert "With mechanism to fall back if graph capture fails"

This reverts commit eb9f15f.

* Fall back if graph capture fails and address other comments

* - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS

- rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS

- updated Makefile build to enable CUDA graphs

- removed graph capture failure checking in ggml_cuda_error
  using a global variable to track this is not thread safe, but I am also not satisfied with checking an error by string
  if this is necessary to work around some issues with graph capture with e.g. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context

- fixed several resource leaks

- fixed issue with zero node graphs

- changed fixed size arrays to vectors

- removed the count of the number of evaluations before starting capture, and instead changed the capture mode to relaxed

- removed the check for multiple devices so that it is still possible to use a single device; instead, checks for split buffers to disable CUDA graphs with -sm row

- changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX

- code style fixes

- things to look into
  - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional
  - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes

* fix build without cuda graphs

* remove outdated comment

* replace minimum cc value with a constant

---------

Co-authored-by: slaren <[email protected]>
* convert-hf : begin refactoring write_tensor

* convert : upgrade to sentencepiece v0.2.0

* convert-hf : remove unused n_dims in extra_*_tensors

* convert-hf : simplify MoE weights stacking

* convert-hf : flake8 linter doesn't like semicolons

* convert-hf : allow unusual model part names

For example, loading `model-00001-of-00001.safetensors` now works.

* convert-hf : fix stacking MoE expert tensors

`torch.stack` and `torch.cat` don't do the same thing (see the sketch after these commit notes).

* convert-hf : fix Mamba conversion

Tested to work even with a SentencePiece-based tokenizer.

* convert : use a string for the SentencePiece tokenizer path

* convert-hf : display tensor shape

* convert-hf : convert norms to f32 by default

* convert-hf : sort model part names

`os.listdir` is said to list files in arbitrary order.
Sorting the file names should let "model-00009-of-00042.safetensors"
be loaded before "model-00010-of-00042.safetensors".

* convert-hf : use an ABC for Model again

It seems Protocol can't be used as a statically type-checked ABC,
because its subclasses also can't be instantiated. (why did it seem to work?)

At least there's still a way to throw an error when forgetting to define
the `model_arch` property of any registered Model subclasses.

* convert-hf : use a plain class for Model, and forbid direct instantiation

There are no abstract methods used anyway,
so using ABC isn't really necessary.

* convert-hf : more consistent formatting of cmdline args

* convert-hf : align the message logged for converted tensors

* convert-hf : fix Refact conversion

* convert-hf : save memory with lazy evaluation

* convert-hf : flake8 doesn't like lowercase L as a variable name

* convert-hf : remove einops requirement for InternLM2

* convert-hf : faster model parts loading

Instead of pre-loading them all into a dict, iterate on the tensors
in the model parts progressively as needed in Model.write_tensors

Conversion for some architectures relies on checking for the presence
of specific tensor names, so for multi-part models, the weight map is read
from the relevant json file to quickly get these names up-front.

* convert-hf : minor changes for consistency

* gguf-py : add tqdm as a dependency

It's small, and used for a progress bar
in GGUFWriter.write_tensors_to_file
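As a side note on the "fix stacking MoE expert tensors" item above, a tiny PyTorch sketch of the difference referenced there: `torch.stack` adds a new experts dimension while `torch.cat` only extends an existing one, so the two produce differently shaped tensors (the shapes below are made up for illustration):

```python
import torch

# three hypothetical per-expert weight tensors of shape (out_features, in_features)
experts = [torch.randn(4, 8) for _ in range(3)]

stacked = torch.stack(experts, dim=0)     # (3, 4, 8): adds a new leading experts axis
concatenated = torch.cat(experts, dim=0)  # (12, 8): merely extends dimension 0

assert stacked.shape == (3, 4, 8)
assert concatenated.shape == (12, 8)
```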
To be able to fix/amend special tokens in a GGUF let's add two new arguments:
* `--special-token <name> <value>` where `<name>` can be bos, eos, prefix, middle, etc. while `<value>` is the token value, f.ex. `"<|fim▁begin|>"`
* `--special-token-by-id <name> <id>` where `<id>` is the ID of the token, f.ex. 32006

So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following:
```bash
gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<|fim▁begin|>" --special-token middle "<|fim▁end|>" --special-token suffix "<|fim▁hole|>"
```
(yes, fim_end is the `middle` token, because completion is a `prefix`/`suffix`/`middle` sequence (where `middle` is unfilled))
or
```bash
gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<fim_prefix>" --special-token middle "<fim_middle>" --special-token suffix "<fim_suffix>"
```
etc...

NB: The tokens have to exist already; attempting to set a non-existent token name/ID will be ignored (with a warning), while a non-existent token value will fail (with an error).
github-actions bot commented May 9, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 543 iterations 🚀

Details for performance-related PRs only
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8626.79ms p(95)=20741.52ms fails=, finish reason: stop=477 truncated=66
  • Prompt processing (pp): avg=96.28tk/s p(95)=386.26tk/s
  • Token generation (tg): avg=32.71tk/s p(95)=45.52tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=modify-special-tokens-metadata commit=144d99a00ae148d5a8421f24a301b0ce0a5b6eb9

[Benchmark charts omitted: Mermaid xy-charts of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 543 iterations".]
