fix : lookup word in vocab before doing BPE merges #7193
Conversation
Force-pushed from 4ba2e5c to 63207d1
This change not only fixed the llama3 tokenization, but also improved performance by a factor of 4: ./tests/test-tokenizer-0.sh llama-bpe ./build/wikitext-2-raw/wiki.train.raw
We now tokenize
Which parameter in the tokenizer config determines this behaviour?
@ggerganov
Let's merge after the CI is green
Force-pushed from 0f48f9e to 0c9a0ae
llama.cpp (outdated diff)
// If the whole word is already a single token in the vocab, emit it directly
// and skip the byte-level BPE merges for this word.
if (ignore_merges && vocab.token_to_id.find(word) != vocab.token_to_id.end()) {
    llm_symbol sym;
    sym.text = word.c_str();
    sym.n    = word.size();
    sym.prev = final_prev_index;
    sym.next = -1;
    // link the new symbol to the previous one in the final symbol list
    if (final_prev_index != -1) {
        symbols_final[final_prev_index].next = symbols_final.size();
    }
    symbols_final.emplace_back(sym);
    final_prev_index = symbols_final.size() - 1;
    continue;
}
Let's apply @jaime-m-p's suggestion here, to reduce the code duplication in this loop:
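(The suggestion itself is not quoted in this thread. Purely as an illustration of the kind of de-duplication meant, a small helper that links a new symbol into symbols_final could absorb the repeated bookkeeping; the helper name and exact shape below are assumptions, not the actual patch. llm_symbol, symbols_final and final_prev_index are the names already used in the file under review.)

// Hypothetical helper, not the actual suggestion from the review:
// create a symbol for `text`/`n` and link it into symbols_final,
// keeping the prev/next indices consistent.
static void add_final_symbol(std::vector<llm_symbol> & symbols_final,
                             int & final_prev_index,
                             const char * text, size_t n) {
    llm_symbol sym;
    sym.text = text;
    sym.n    = n;
    sym.prev = final_prev_index;
    sym.next = -1;
    if (final_prev_index != -1) {
        symbols_final[final_prev_index].next = (int) symbols_final.size();
    }
    symbols_final.emplace_back(sym);
    final_prev_index = (int) symbols_final.size() - 1;
}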
For llama-3, I found there is an inconsistency between llama.cpp's tokenizer and Huggingface's tokenizers. Example:
llama.cpp:
Huggingface's tokenizers with tokenizer.json from llama-3:
After comparing the implementations, it seems that Huggingface's tokenizers will first try to look up a split word in the vocabulary, and push it to the result tokens if found; if not, it will try to merge the word at byte level instead. In llama.cpp, we always do the byte-level merge, hence the inconsistency.
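To make the intended order concrete, here is a minimal sketch of that "lookup first, then merge" flow (the function names, the token_to_id map type and byte_pair_encode are assumptions for illustration, not the actual llama.cpp or tokenizers code):

#include <string>
#include <unordered_map>
#include <vector>

// Placeholder for the usual byte-level BPE merge loop (assumption, not a real function).
std::vector<int> byte_pair_encode(const std::string & word,
                                  const std::unordered_map<std::string, int> & token_to_id);

// Sketch of the lookup-first behaviour described above.
std::vector<int> tokenize_word(const std::string & word,
                               const std::unordered_map<std::string, int> & token_to_id) {
    // 1. If the whole split word is already a vocab entry, return that single token id.
    auto it = token_to_id.find(word);
    if (it != token_to_id.end()) {
        return { it->second };
    }
    // 2. Otherwise, fall back to byte-level BPE merges on the word.
    return byte_pair_encode(word, token_to_id);
}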
This is a simple fix to the problem: just look the word up before doing the merge.
PS: I have checked with tiktoken and it seems they do the same thing at src/lib.rs:228 in CoreBPE::_encode_native
PPS: I searched the tokenizer.json of all BPE models (some are license-walled, so I checked their variants) and it seems that llama-3 is the only one doing this?