Add BPE pre-tokenization for DBRX. #7132

Merged 4 commits from dranger003:bpe-dbrx into ggml-org:master on May 8, 2024

Conversation

@dranger003 (Contributor) commented May 7, 2024

Closes #7074.

The pre-tokenization regex is identical to llama-3's, so I re-used the same split.

https://huggingface.co/databricks/dbrx-instruct/blob/main/tokenizer.json
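
For illustration only (not part of this PR): a minimal sketch of how this llama-3 / cl100k-style split behaves, using Python's third-party `regex` module for the `\p{...}` classes. The pattern is transcribed here for convenience and should be treated as an approximation; the authoritative version is the one in tokenizer.json and in llama.cpp's BPE pre-tokenizer.

```python
# Illustrative sketch, not the PR's code: the llama-3 style split pattern
# that DBRX reuses, applied with the third-party `regex` module.
import regex

SPLIT_PAT = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"     # common English contractions
    r"|[^\r\n\p{L}\p{N}]?\p{L}+"        # optional leading symbol + letters
    r"|\p{N}{1,3}"                      # digits in groups of at most three
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*"       # punctuation runs
    r"|\s*[\r\n]+"                      # newlines
    r"|\s+(?!\S)"                       # trailing whitespace
    r"|\s+"                             # any other whitespace
)

print(regex.findall(SPLIT_PAT, "Write an essay about AI in 2024!"))
# expected roughly: ['Write', ' an', ' essay', ' about', ' AI', ' in', ' ', '202', '4', '!']
```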

./build/bin/test-tokenizer-0 models/ggml-vocab-dbrx.gguf
...
Tests passed
Output:
$ ./build/bin/main -ngl 41 -c 4096 -s 0 -e -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite an essay about AI.<|im_end|>\n<|im_start|>assistant\n" -m /md0/models/databricks/ggml-dbrx-instruct-iq3_s.gguf
Log start
main: build = 2803 (b6aa6702)
main: built with cc (GCC) 13.2.1 20240417 for x86_64-pc-linux-gnu
main: seed  = 0
llama_model_loader: loaded meta data with 25 key-value pairs and 323 tensors from /md0/models/databricks/ggml-dbrx-instruct-iq3_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = dbrx
llama_model_loader: - kv   1:                               general.name str              = dbrx
llama_model_loader: - kv   2:                           dbrx.block_count u32              = 40
llama_model_loader: - kv   3:                        dbrx.context_length u32              = 32768
llama_model_loader: - kv   4:                      dbrx.embedding_length u32              = 6144
llama_model_loader: - kv   5:                   dbrx.feed_forward_length u32              = 10752
llama_model_loader: - kv   6:                  dbrx.attention.head_count u32              = 48
llama_model_loader: - kv   7:               dbrx.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                        dbrx.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:                   dbrx.attention.clamp_kqv f32              = 8.000000
llama_model_loader: - kv  10:                          general.file_type u32              = 26
llama_model_loader: - kv  11:                          dbrx.expert_count u32              = 16
llama_model_loader: - kv  12:                     dbrx.expert_used_count u32              = 4
llama_model_loader: - kv  13:          dbrx.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = dbrx
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 100257
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 100257
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 100257
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 100277
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:   40 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq3_s:  201 tensors
llm_load_vocab: special tokens definition check successful ( 96/100352 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = dbrx
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 100352
llm_load_print_meta: n_merges         = 100000
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 6144
llm_load_print_meta: n_head           = 48
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 8.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 10752
llm_load_print_meta: n_expert         = 16
llm_load_print_meta: n_expert_used    = 4
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 16x12B
llm_load_print_meta: model ftype      = IQ3_S - 3.4375 bpw
llm_load_print_meta: model params     = 131.60 B
llm_load_print_meta: model size       = 52.89 GiB (3.45 BPW)
llm_load_print_meta: general.name     = dbrx
llm_load_print_meta: BOS token        = 100257 '<|endoftext|>'
llm_load_print_meta: EOS token        = 100257 '<|endoftext|>'
llm_load_print_meta: UNK token        = 100257 '<|endoftext|>'
llm_load_print_meta: PAD token        = 100277 '<|pad|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 100279 '<|im_end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
  Device 2: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.68 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =   252.66 MiB
llm_load_tensors:      CUDA0 buffer size = 18699.84 MiB
llm_load_tensors:      CUDA1 buffer size = 18699.84 MiB
llm_load_tensors:      CUDA2 buffer size = 16510.80 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   224.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   224.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.38 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   516.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   516.01 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =   516.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    44.02 MiB
llama_new_context_with_model: graph nodes  = 2246
llama_new_context_with_model: graph splits = 4

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0


<|endoftext|><|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Write an essay about AI.<|im_end|>
<|im_start|>assistant
Artificial intelligence, or AI, has been a topic of great interest and debate in recent years. The potential for AI to revolutionize a way we live, work, and communicate is enormous. However, it is important to consider both the advantages and disadvantages of AI before embracing it wholeheartedly.

On the plus side, AI has the potential to greatly improve efficiency and productivity in a variety of fields. For instance, in a factory setting, AI can streamline a production line, ensuring that each component is produced and assembled as efficiently as possible. As a result, a company can produce more widgets a any given hour than it otherwise might. Furthermore, AI can also bring about a dramatic reduction in a errors and a resulting increase in a quality. In a service industry, AI can use data to anticipate and meet a customer's needs before they even realize what those needs are.

On a minus side, though, there are a few potential drawbacks to AI that we should at least consider. First, AI can lead to a reduction in a human employment. As AI grow more sophisticated, they may be able to perform a task previously done by a human worker. This can lead to a displacement of a worker, as a machine take over a job. Second, AI can also lead to a reduction in a privacy. Since AI can analyze a great deal of data, they may be able to make a prediction about a person's behavior, intent, or feeling. This can feel a invasive and even a violation of a privacy.

In conclusion, AI has both a potential to improve and a challenge our lives. However, it is important to balance a advantages and disadvantages before putting AI to work for us. By considering a potential impact of AI, we can help to guide a responsible and a beneficial development and implementation of a technology. Thank you.<|im_end|> [end of text]

llama_print_timings:        load time =   17066.24 ms
llama_print_timings:      sample time =      15.70 ms /   368 runs   (    0.04 ms per token, 23440.98 tokens per second)
llama_print_timings: prompt eval time =     444.10 ms /    26 tokens (   17.08 ms per token,    58.55 tokens per second)
llama_print_timings:        eval time =   13533.10 ms /   367 runs   (   36.87 ms per token,    27.12 tokens per second)
llama_print_timings:       total time =   14215.50 ms /   393 tokens
Log end

github-actions bot commented May 8, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 555 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8430.39ms p(95)=21123.17ms fails=, finish reason: stop=494 truncated=61
  • Prompt processing (pp): avg=93.73tk/s p(95)=382.7tk/s
  • Token generation (tg): avg=33.14tk/s p(95)=47.76tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=bpe-dbrx commit=6c90dda02170cf42f5d5c154536634fd897e5284

[Benchmark charts omitted: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing, each plotted over the 10-minute, 555-iteration run on Standard_NC4as_T4_v3.]

@ggerganov (Member) commented May 8, 2024

Any idea why the conversion fails for me:

Nvm - I had to pull the repo
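
(Editor's note, hedged: the conversion error above is most likely the converter not yet recognizing DBRX's tokenizer. convert-hf-to-gguf.py identifies the pre-tokenizer by hashing the token ids produced for a fixed check string and mapping that digest to a name such as "dbrx"; this PR's branch adds that mapping, so an older checkout fails. A rough Python sketch of the idea follows; the function name and digest below are placeholders, not the real values — see get_vocab_base_pre in convert-hf-to-gguf.py for the actual logic.)

```python
# Rough sketch of the pre-tokenizer detection idea used by convert-hf-to-gguf.py;
# the digest below is a placeholder, not a real value.
from hashlib import sha256

def guess_pre_tokenizer(tokenizer, check_text: str) -> str:
    # Hash the token ids produced for a fixed check string.
    chkhsh = sha256(str(tokenizer.encode(check_text)).encode()).hexdigest()
    known = {
        "<placeholder-digest>": "dbrx",   # mapping added for DBRX
        # ... entries for llama-3, gpt-2, command-r, etc. ...
    }
    if chkhsh not in known:
        raise NotImplementedError("BPE pre-tokenizer not recognized - update the convert script")
    return known[chkhsh]
```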

@@ -84,6 +84,7 @@ llama_test(test-tokenizer-0 NAME test-tokenizer-0-starcoder ARGS ${CMAKE
 llama_test(test-tokenizer-0 NAME test-tokenizer-0-gpt-2 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-gpt-2.gguf)
 llama_test(test-tokenizer-0 NAME test-tokenizer-0-refact ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-refact.gguf)
 llama_test(test-tokenizer-0 NAME test-tokenizer-0-command-r ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-command-r.gguf)
+llama_test(test-tokenizer-0 NAME test-tokenizer-0-dbrx ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-dbrx.gguf)
Review comment from ggerganov (Member) on the added test entry:

Let's add tests only for new types of pre-tokenizer in order to keep the binary data in the repo small. Remove the models/ggml-vocab-dbrx.* and let's merge

@ggerganov merged commit 4cd621c into ggml-org:master on May 8, 2024
56 of 61 checks passed
@dranger003 deleted the bpe-dbrx branch on January 3, 2025
Successfully merging this pull request may close the following issue:

DBRX GGUF conversion no longer working (#7074)
2 participants