Add special token modification capability #6778

Closed
wants to merge 142 commits

Conversation

CISC (Contributor) commented Apr 20, 2024

To be able to fix/amend special tokens in a GGUF let's add two new arguments:

  • --special-token <name> <value> where <name> can be bos, eos, prefix, middle, etc. while <value> is the token value, f.ex. "<|fim▁begin|>"
  • --special-token-by-id <name> <id> where <id> is the ID of the token, f.ex. 32006

So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following:

gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<|fim▁begin|>" --special-token middle "<|fim▁end|>" --special-token suffix "<|fim▁hole|>"

(yes, fim_end is the middle token, because completion is a prefix/suffix/middle sequence (where middle is unfilled))
or

gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<fim_prefix>" --special-token middle "<fim_middle>" --special-token suffix "<fim_suffix>"

etc...

NB: The tokens have to exist already; attempting to set a non-existent token name/ID will be ignored (with a warning), while a non-existent token value will fail (with an error).
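To make the prefix/suffix/middle naming concrete, here is a minimal, hypothetical Python sketch of how a fill-in-middle prompt is typically assembled from these three special tokens (the `build_fim_prompt` helper and the DeepSeek-style token strings are illustrative only, not part of gguf-new-metadata.py):

```python
# Illustrative sketch: shows why "<|fim▁end|>" plays the *middle* role.
# A FIM prompt is laid out as prefix marker, code before the hole, suffix marker,
# code after the hole, then the middle marker; the model generates the missing middle.

FIM_TOKENS = {  # assumed DeepSeek-style markers, matching the example above
    "prefix": "<|fim▁begin|>",
    "suffix": "<|fim▁hole|>",
    "middle": "<|fim▁end|>",
}

def build_fim_prompt(code_before: str, code_after: str) -> str:
    """Assemble a prefix/suffix/middle (PSM) fill-in-middle prompt."""
    return (FIM_TOKENS["prefix"] + code_before +
            FIM_TOKENS["suffix"] + code_after +
            FIM_TOKENS["middle"])  # generation continues from here

if __name__ == "__main__":
    print(build_fim_prompt("def add(a, b):\n", "    return result\n"))
```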

CISC and others added 30 commits April 20, 2024 08:33
To be able to fix/amend special tokens in a GGUF let's add two new arguments:
* `--special-token <name> <value>` where `<name>` can be bos, eos, prefix, middle, etc. while `<value>` is the token value, f.ex. `"<|fim▁begin|>"`
* `--special-token-by-id <name> <id>` where `<id>` is the ID of the token, f.ex. 32006

So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following:
```bash
python3 gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<|fim▁begin|>" --special-token middle "<|fim▁hole|>" --special-token suffix "<|fim▁end|>"
```
* common : disable get_math_cpu_count() until Android CI gets fixed

* common : another try
* Support Llama 3 conversion

The tokenizer is BPE.

* style

* Accept suggestion

Co-authored-by: Sourab Mangrulkar <[email protected]>

* llama : add llama_token_is_eog()

ggml-ci

* llama : auto-detect more EOT tokens when missing in KV data

* convert : replacing EOS token is a hack

* llama : fix codegemma EOT token + add TODOs

* llama : fix model type string for 8B model

---------

Co-authored-by: Sourab Mangrulkar <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
This change removes printf() logging so llava-cli is shell scriptable.
* added fedora to list of distros that may need the package (the packages have the same name on Fedora)

* how to add clblast that is available in the fedora repos
* Added llama-3 chat template

* Update llama.cpp

Co-authored-by: Samuel Tallet <[email protected]>

* Update llama.cpp

Co-authored-by: Samuel Tallet <[email protected]>

* Update tests/test-chat-template.cpp

Co-authored-by: Samuel Tallet <[email protected]>

* Added EOS stop sequence according to ggml-org#6751 (comment)

* Removed adding of BOS token before first message

* Removed bos token from expected output from llama-3

* Update tests/test-chat-template.cpp

Co-authored-by: Rene Leonhardt <[email protected]>

* Update tests/test-chat-template.cpp

Co-authored-by: Rene Leonhardt <[email protected]>

* Added <|end_of_text|> as another stop token

* Reverted last change of adding the end_of_text stop word for llama 3

---------

Co-authored-by: Wouter Tichelaar <[email protected]>
Co-authored-by: Samuel Tallet <[email protected]>
Co-authored-by: Rene Leonhardt <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
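For context on the chat-template commits above, a hedged Python sketch of how a Llama 3-style prompt is laid out, assuming the publicly documented `<|start_header_id|>`/`<|end_header_id|>`/`<|eot_id|>` markers; the actual formatting lives in llama.cpp's C++ chat-template code, and this helper is purely illustrative:

```python
# Assumed Llama 3 instruct layout: each message is wrapped as
# <|start_header_id|>role<|end_header_id|>\n\ncontent<|eot_id|>,
# followed by an empty assistant header to cue the reply.
# <|eot_id|> serves as the stop token; per the commit notes above,
# BOS (<|begin_of_text|>) is not prepended here.

def format_llama3(messages: list[dict[str, str]]) -> str:
    out = []
    for m in messages:
        out.append(f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>")
    out.append("<|start_header_id|>assistant<|end_header_id|>\n\n")  # model answers from here
    return "".join(out)

if __name__ == "__main__":
    print(format_llama3([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]))
```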
* make : fix common dep on llama.h

* llama : add option to render special tokens

* readme : add API change notice

ggml-ci

* swift : fix build
* `build`: generate hex dumps of server assets on the fly

* build: workaround lack of -n on gnu xxd

* build: don't use xxd in cmake

* build: don't call xxd from build.zig

* build: more idiomatic hexing

* build: don't use xxd in Makefile (od hackery instead)

* build: avoid exceeding max cmd line limit in makefile hex dump

* build: hex dump assets at cmake build time (not config time)
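To illustrate what "hex dumps of server assets" means in the build commits above, here is a small, hypothetical Python equivalent of an `xxd -i`-style dump that renders a file as a C byte array so web assets can be compiled into the server binary; it is not the project's actual CMake/Makefile/zig step:

```python
# Hypothetical stand-in for `xxd -i`: emit a C unsigned char array for a file.

def to_c_array(path: str, symbol: str) -> str:
    with open(path, "rb") as f:
        data = f.read()
    body = ", ".join(f"0x{b:02x}" for b in data)
    return (f"unsigned char {symbol}[] = {{{body}}};\n"
            f"unsigned int {symbol}_len = {len(data)};\n")

if __name__ == "__main__":
    # assumes an index.html file exists next to this script
    print(to_c_array("index.html", "index_html"))
```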
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/1042fd8b148a9105f3c0aca3a6177fd1d9360ba5?narHash=sha256-3sbWO1mbpWsLepZGbWaMovSO7ndZeFqDSdX0hZ9nVyw%3D' (2024-04-10)
  → 'github:NixOS/nixpkgs/5c24cf2f0a12ad855f444c30b2421d044120c66f?narHash=sha256-XtTSSIB2DA6tOv%2Bl0FhvfDMiyCmhoRbNB%2B0SeInZkbk%3D' (2024-04-19)
Latest gcc complains here:
/home/airlied/devel/llama.cpp/ggml-alloc.c: In function ‘ggml_gallocr_new_n’:
/home/airlied/devel/llama.cpp/ggml-alloc.c:374:59: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
  374 |     ggml_gallocr_t galloc = (ggml_gallocr_t)calloc(sizeof(struct ggml_gallocr), 1);
      |                                                           ^~~~~~
/home/airlied/devel/llama.cpp/ggml-alloc.c:374:59: note: earlier argument should specify number of elements, later size of each element

and a bunch more.

calloc is specified to take nmemb first, then size, so realign the code.

In a couple of places there was a `* x, 1`, so I fixed those to use calloc properly.
* llamafile : improve sgemm.cpp

- Re-enable by default
- Fix issue described in ggml-org#6716
- Make code more abstract, elegant, and maintainable
- Faster handling of weirdly shaped `m` and `n` edge cases

* Address review comments

* Help clang produce fma instructions

* Address review comments
…ag activated (ggml-org#6767)

* Fix FP32/FP16 build instructions

* Fix typo

* Recommended build instruction

Co-authored-by: Neo Zhang Jianyu <[email protected]>

* Recommended build instruction

Co-authored-by: Neo Zhang Jianyu <[email protected]>

* Recommended build instruction

Co-authored-by: Neo Zhang Jianyu <[email protected]>

* Add comments in Intel GPU linux

---------

Co-authored-by: Anas Ahouzi <[email protected]>
Co-authored-by: Neo Zhang Jianyu <[email protected]>
* add explicit phi3 support

* add explicit phi3 support

* remove unused code

* convert : add BOS token

* llama : match EOT token <|end|>

* llama : minor / style

* llama : tabs -> spaces

* convert : fix lint checks

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* add support of codeqwen due to tokenizer

* override load_hparams

* fix typo

* fix load_params

* convert : fix whitespace

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* Add phi 3 chat template & tests

* test : fix chat template result

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* Server: add tests for consistent results

* sampling: separate rng per sampling context
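A quick, illustrative Python sketch of the "separate rng per sampling context" idea: each context owns its own seeded RNG, so concurrent requests cannot perturb each other's sampling and results stay reproducible for a given seed (the class and method names here are hypothetical, not llama.cpp's actual C++ sampling code):

```python
import random

class SamplingContext:
    """Hypothetical sketch: one RNG per sampling context for reproducible results."""
    def __init__(self, seed: int):
        self.rng = random.Random(seed)  # private RNG, independent of the global one

    def sample(self, probs: list[float]) -> int:
        # draw a token index proportionally to its probability
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]

# two contexts with the same seed produce the same draws, independently of each other
a, b = SamplingContext(42), SamplingContext(42)
assert [a.sample([0.1, 0.7, 0.2]) for _ in range(5)] == [b.sample([0.1, 0.7, 0.2]) for _ in range(5)]
```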
…g#6860)

* fix: revert showing control tokens by default

* feat: revert changes to default behavior of llama_token_to_piece; provide overridden declaration to receive "bool special" param to toggle showing control tokens

* feat: use the overridden declaration of llama_token_to_piece from common/common.cpp to specify "false" so that control tokens are not shown in chat completion responses

* common : simplify

---------

Co-authored-by: Georgi Gerganov <[email protected]>
ggerganov and others added 27 commits May 8, 2024 09:14
* Introduce bfloat16 support

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as
their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That
makes conversion relatively straightforward, even in the absence of
hardware support. For example, converting brain16 to binary32 means
simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32

The issue is that converting bf16 to fp16 can result in information
loss. Only 13% of bf16 numbers can be precisely represented in fp16,
which in practice still covers 99.71% of Mistral 7b v0.2's weights;
however, there is currently no way other than fp32 to represent the
remaining values (a short conversion sketch follows these commit notes).

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16

This change fixes that by adding a bf16 data type to GGML. Support
for CPU inference has been implemented along with optimizations for
the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2
improves by somewhere around -0.0024 to -0.0046 compared to using fp16.

* Remove GGML code that's not needed

* Minimize the GGML API surface area for BF16

* Remove bf16 luts

* Make the GGML header look nicer

* Fix documentation

* Apply ggerganov's fixes for test-backend-ops

* Add BF16 code for new ggml_validate_row_data() function
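The notes above describe the bf16-to-fp32 widening as a simple 16-bit left shift; here is a minimal Python sketch of that bit manipulation, using only the standard `struct` module to illustrate the encodings (GGML's actual conversion kernels are in C):

```python
import struct

def bf16_to_fp32(bits: int) -> float:
    """Widen a 16-bit bfloat16 pattern to float32: same sign and exponent,
    mantissa padded with zeros, i.e. shift left by 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def fp32_to_bf16(x: float) -> int:
    """Truncate a float32 to bfloat16 by dropping the low 16 bits
    (real converters typically round-to-nearest-even instead of truncating)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

assert bf16_to_fp32(0x3F80) == 1.0  # 0x3F80 is 1.0 in bfloat16
assert fp32_to_bf16(1.0) == 0x3F80
# fp16 has only 5 exponent bits, so many bf16 values (e.g. 1e30) cannot be
# represented in fp16 at all, which is the information-loss issue described above.
print(bf16_to_fp32(fp32_to_bf16(1e30)))  # still representable in bf16
```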
* compare-llama-bench.py: add missing basicConfig

* compare-llama-bench.py: Add line break between error message and print_help()

* Add regular print() markdown table
* Add BPE pre-tokenization for DBRX.

* Add vocab GGUFs.

* Remove test.

* Remove GGUFs.
* Add BPE pre-tokenization for Qwen2.

* minor : fixes

---------

Co-authored-by: Ren Xuancheng <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
…7027)

An example of how this might be used in the style of baby-llama will be attached with this PR.
* basic avx implementation

* style

* combine denibble with load

* reduce 256 to 128 (and back!) conversions

* sse load

* Update sgemm.cpp

* oops

oops
…org#7078)

* fix: use `malloc` instead of `posix_memalign` in `ggml-metal.m` to make it not crash Electron proccesses

* fix: typo

* fix: use `vm_allocate` instead of `posix_memalign`

* fix: don't call `newBufferWithBytesNoCopy` with `NULL` when `ggml_metal_host_malloc` returns `NULL`

* fix: use `vm_allocate` only on macOS
* Added themes support with two sample themes and a favicon.

* Newline

* Newline

* Newline

* Trailing whitespace

* Increased opacity for contrast

* Increase opacity.

Check actions cancelled for some other priority job and I can't seem to manually re-run them, so MOAR OPACITY

* Opacity action trigger.

Trying to re-trigger the cancelled action.

* One more opacity adjustment

This Actions pipeline is failing for random issues.

* Delete examples/server/themes/buttons_top/completion.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/buttons_top/index.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/wild/completion.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/buttons_top/json-schema-to-grammar.mjs

This will be served from the static string built-in to server.

* Delete examples/server/themes/wild/index.js

This will be served from the static string built-in to server.

* Delete examples/server/themes/wild/json-schema-to-grammar.mjs

This will be served from the static string built-in to server.

* Replaced underscore.
* DRAFT: Introduction of CUDA Graphs to LLama.cpp

* Fix issues raised in comments

* Tidied to now only use CUDA runtime (not mixed with driver calls)

* disable for multi-gpu and batch size > 1

* Disable CUDA graphs for old GPU arch and with env var

* added missing CUDA_CHECKs

* Addressed comments

* further addressed comments

* limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake

* Added more comprehensive graph node checking

* With mechanism to fall back if graph capture fails

* Revert "With mechanism to fall back if graph capture fails"

This reverts commit eb9f15f.

* Fall back if graph capture fails and address other comments

* - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS

- rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS

- updated Makefile build to enable CUDA graphs

- removed graph capture failure checking in ggml_cuda_error
  using a global variable to track this is not thread safe, but I am also not satisfied with checking an error by string
  if this is necessary to work around some issues with graph capture with e.g. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context

- fixed several resource leaks

- fixed issue with zero node graphs

- changed fixed size arrays to vectors

- removed the count of the number of evaluations before starting capture, and instead changed the capture mode to relaxed

- removed the check for multiple devices so that it is still possible to use a single device; instead, checks for split buffers to disable CUDA graphs with -sm row

- changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX

- code style fixes

- things to look into
  - VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional
  - possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes

* fix build without cuda graphs

* remove outdated comment

* replace minimum cc value with a constant

---------

Co-authored-by: slaren <[email protected]>
* convert-hf : begin refactoring write_tensor

* convert : upgrade to sentencepiece v0.2.0

* convert-hf : remove unused n_dims in extra_*_tensors

* convert-hf : simplify MoE weights stacking

* convert-hf : flake8 linter doesn't like semicolons

* convert-hf : allow unusual model part names

For example, loading `model-00001-of-00001.safetensors` now works.

* convert-hf : fix stacking MoE expert tensors

`torch.stack` and `torch.cat` don't do the same thing (see the sketch after these commit notes).

* convert-hf : fix Mamba conversion

Tested to work even with a SentencePiece-based tokenizer.

* convert : use a string for the SentencePiece tokenizer path

* convert-hf : display tensor shape

* convert-hf : convert norms to f32 by default

* convert-hf : sort model part names

`os.listdir` is said to list files in arbitrary order.
Sorting the file names should let "model-00009-of-00042.safetensors"
be loaded before "model-00010-of-00042.safetensors".

* convert-hf : use an ABC for Model again

It seems Protocol can't be used as a statically type-checked ABC,
because its subclasses also can't be instantiated. (why did it seem to work?)

At least there's still a way to throw an error when forgetting to define
the `model_arch` property of any registered Model subclasses.

* convert-hf : use a plain class for Model, and forbid direct instantiation

There are no abstract methods used anyway,
so using ABC isn't really necessary.

* convert-hf : more consistent formatting of cmdline args

* convert-hf : align the message logged for converted tensors

* convert-hf : fix Refact conversion

* convert-hf : save memory with lazy evaluation

* convert-hf : flake8 doesn't like lowercase L as a variable name

* convert-hf : remove einops requirement for InternLM2

* convert-hf : faster model parts loading

Instead of pre-loading them all into a dict, iterate on the tensors
in the model parts progressively as needed in Model.write_tensors

Conversion for some architectures relies on checking for the presence
of specific tensor names, so for multi-part models, the weight map is read
from the relevant json file to quickly get these names up-front.

* convert-hf : minor changes for consistency

* gguf-py : add tqdm as a dependency

It's small, and used for a progress bar
in GGUFWriter.write_tensors_to_file
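As a side note on the "fix stacking MoE expert tensors" item above, a tiny PyTorch sketch of the difference referenced there: `torch.stack` adds a new experts dimension while `torch.cat` only extends an existing one, so the two produce differently shaped tensors (the shapes below are made up for illustration):

```python
import torch

# three hypothetical per-expert weight tensors of shape (out_features, in_features)
experts = [torch.randn(4, 8) for _ in range(3)]

stacked = torch.stack(experts, dim=0)     # (3, 4, 8): adds a new leading experts axis
concatenated = torch.cat(experts, dim=0)  # (12, 8): merely extends dimension 0

assert stacked.shape == (3, 4, 8)
assert concatenated.shape == (12, 8)
```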
To be able to fix/amend special tokens in a GGUF let's add two new arguments:
* `--special-token <name> <value>` where `<name>` can be bos, eos, prefix, middle, etc. while `<value>` is the token value, f.ex. `"<|fim▁begin|>"`
* `--special-token-by-id <name> <id>` where `<id>` is the ID of the token, f.ex. 32006

So, in order to f.ex. add fill-in-middle tokens to a GGUF you would do the following:
```bash
gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<|fim▁begin|>" --special-token middle "<|fim▁end|>" --special-token suffix "<|fim▁hole|>"
```
(yes, fim_end is the `middle` token, because completion is a `prefix`/`suffix`/`middle` sequence (where `middle` is unfilled))
or
```bash
gguf-new-metadata.py input.gguf output.gguf --special-token prefix "<fim_prefix>" --special-token middle "<fim_middle>" --special-token suffix "<fim_suffix>"
```
etc...

NB: The tokens have to exist already; attempting to set a non-existent token name/ID will be ignored (with a warning), while a non-existent token value will fail (with an error).
github-actions bot commented May 9, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 543 iterations 🚀

Details for performance-related PRs only
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8626.79ms p(95)=20741.52ms fails=, finish reason: stop=477 truncated=66
  • Prompt processing (pp): avg=96.28tk/s p(95)=386.26tk/s
  • Token generation (tg): avg=32.71tk/s p(95)=45.52tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=modify-special-tokens-metadata commit=144d99a00ae148d5a8421f24a301b0ce0a5b6eb9

[Benchmark charts omitted: Mermaid xy-charts of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 543 iterations".]
