tests : add test-tokenizer-0.sh #7036

Merged May 4, 2024 (15 commits)

@ggerganov (Owner) commented May 2, 2024

Add more extensive tokenizer test that takes a text file, tokenizes it using transformers and llama.cpp and compares the results.

# run once
python3 convert-hf-to-gguf-update.py <hf_token>

# tests OK
./tests/test-tokenizer-0.sh llama-spm ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh llama-bpe ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh gpt-2     ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh phi-3     ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh starcoder ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh falcon    ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh refact    ./build/wikitext-2-raw/wiki.train.raw

# tests Fail
./tests/test-tokenizer-0.sh deepseek-llm   ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh deepseek-coder ./build/wikitext-2-raw/wiki.train.raw
./tests/test-tokenizer-0.sh mpt            ./build/wikitext-2-raw/wiki.train.raw

Need to find the reason why the tokenization differs in the Fail cases. For example, DeepSeek models fail like this:

make -j tests/test-tokenizer-0 && ./tests/test-tokenizer-0 ./models/ggml-vocab-deepseek-coder.gguf

src: 'Führer'
res: 'Führer'
tok: 37 2864 71 6247 
main : failed test:    'Führer'
main : detokenized to: 'Führer' instead of 'Führer'
main : expected tokens:     37 'F',  32009 'ü',     71 'h',   6247 'rer', 
main : got tokens:          37 'F',   2864 'ü',     71 'h',   6247 'rer', 
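
The failure above comes down to a single differing token ID. As a minimal sketch of that comparison step (the helper name `first_mismatch` is hypothetical and this is not how the test script itself is implemented), the two ID sequences from the log can be checked like this:

```python
from itertools import zip_longest


def first_mismatch(expected, got):
    """Return the index of the first differing token ID, or None if both match."""
    for i, (e, g) in enumerate(zip_longest(expected, got)):
        if e != g:
            return i
    return None


# Token IDs copied from the failure log above.
expected = [37, 32009, 71, 6247]  # reference tokenization
got = [37, 2864, 71, 6247]        # llama.cpp tokenization

idx = first_mismatch(expected, got)
if idx is not None:
    print(f"mismatch at position {idx}: expected {expected[idx]}, got {got[idx]}")
```

`zip_longest` is used so that a length difference also counts as a mismatch rather than being silently truncated.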

Added script for generating the unicode ranges in unicode-data.cpp:

python3 scripts/gen-unicode-data.py
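
For readers unfamiliar with what "generating unicode ranges" involves, here is a rough sketch. It assumes the ranges are derived from Unicode general categories (here: any category starting with `N`, i.e. numbers) and that the output is a list of inclusive `{start, end}` pairs. Both are assumptions for illustration; the actual logic and output format of `gen-unicode-data.py` may differ.

```python
import unicodedata


def number_ranges(limit=0x110000):
    """Collapse code points whose general category starts with 'N' into inclusive ranges."""
    ranges = []
    start = prev = None
    for cp in range(limit):
        if unicodedata.category(chr(cp)).startswith("N"):
            if start is None:
                start = cp  # open a new range
            prev = cp
        elif start is not None:
            ranges.append((start, prev))  # close the current range
            start = None
    if start is not None:
        ranges.append((start, prev))
    return ranges


# Print the Latin-1 portion as C++-style initializer rows (hypothetical format).
for lo, hi in number_ranges(0x100):
    print(f"{{0x{lo:06X}, 0x{hi:06X}}},")
```

Calling `number_ranges()` without an argument scans the full code space, which takes noticeably longer than the Latin-1 slice shown here.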

@ggerganov ggerganov force-pushed the gg/add-tokenizer-test-script branch from 9998b08 to ce7d3a0 Compare May 2, 2024 05:52
github-actions bot (Contributor) commented May 2, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 547 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8554.04ms p(95)=20486.45ms fails=, finish reason: stop=484 truncated=63
  • Prompt processing (pp): avg=99.29tk/s p(95)=413.58tk/s
  • Token generation (tg): avg=33.5tk/s p(95)=49.64tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/add-tokenizer-test-script commit=7e11d409fa2fc1868fa04c5e02d905b8499f2a66

[Charts omitted. Four time-series plots titled "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 547 iterations", each over the x-axis interval 1714801796 --> 1714802424: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing.]

@ggerganov (Owner, Author) commented:

I think there is a bug in the way we handle added tokens. I'm experimenting with DeepSeek-Coder:

https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base

llama.cpp tokenizes ü to 2864 which is OK, but there is also the added token 32009 which transformers tokenizer selects instead:

    {
      "id": 32009,
      "content": "ü",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },

If I remove this added token from the tokenizer.config then the transformers tokenization also outputs 2864. So this means we are not handling the added tokens in the same way.

Any ideas how to fix this?

@CISC (Contributor) commented May 3, 2024

> If I remove this added token from the tokenizer.config then the transformers tokenization also outputs 2864. So this means we are not handling the added tokens in the same way.

I believe the issue is this: added tokens are always looked up first.

> Any ideas how to fix this?

AFAICT the only way to fix this is to add the added tokens to the GGUF separately, which will be especially complicated if the added tokens are merged into the middle of the existing vocab (otherwise, just adding an index pointing to the beginning of the added tokens would be enough).
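
To make "added tokens are always looked up first" concrete, here is a simplified model of that behavior: the input is split on the added-token strings before any ordinary tokenization happens. This is only a sketch under stated assumptions. The helper `split_on_added_tokens` is hypothetical, it ignores the `normalized`, `lstrip`, `rstrip` and `single_word` flags, and it is not the transformers implementation.

```python
import re


def split_on_added_tokens(text, added_tokens):
    """Split text into (piece, token_id) pairs; token_id is None for ordinary text."""
    if not added_tokens:
        return [(text, None)]
    # Try longer added tokens first so overlapping entries prefer the longest match.
    alternation = "|".join(
        re.escape(tok) for tok in sorted(added_tokens, key=len, reverse=True)
    )
    # The capturing group makes re.split keep the matched added tokens.
    pieces = re.split(f"({alternation})", text)
    return [(p, added_tokens.get(p)) for p in pieces if p]


added = {"ü": 32009}  # the added token quoted earlier in this thread
print(split_on_added_tokens("Führer", added))
```

Only the pieces whose ID is `None` would then go through the regular pre-tokenizer and BPE merges, which is why the reference tokenizer never reaches token 2864 for this input.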

@ggerganov (Owner, Author) commented:

From what I've found, the problem seems to be that, as part of the pre-tokenization, we perform a byte-to-unicode mapping here:

llama.cpp/unicode.cpp

Lines 213 to 217 in 3275e60

std::string encoded_token;
for (char & c : text_utf) {
    encoded_token += unicode_byte_to_utf8(c);
}
bpe_encoded_words.emplace_back(encoded_token);

llama.cpp/unicode.cpp

Lines 151 to 173 in 3275e60

static std::unordered_map<uint8_t, std::string> unicode_byte_to_utf8_map() {
    std::unordered_map<uint8_t, std::string> map;
    for (int ch = u'!'; ch <= u'~'; ++ch) {
        assert(0 <= ch && ch < 256);
        map[ch] = unicode_cpt_to_utf8(ch);
    }
    for (int ch = u'¡'; ch <= u'¬'; ++ch) {
        assert(0 <= ch && ch < 256);
        map[ch] = unicode_cpt_to_utf8(ch);
    }
    for (int ch = u'®'; ch <= u'ÿ'; ++ch) {
        assert(0 <= ch && ch < 256);
        map[ch] = unicode_cpt_to_utf8(ch);
    }
    auto n = 0;
    for (int ch = 0; ch < 256; ++ch) {
        if (map.find(ch) == map.end()) {
            map[ch] = unicode_cpt_to_utf8(256 + n);
            ++n;
        }
    }
    return map;
}

This converts the string ü (UTF-8 bytes 0xC3 0xBC) to the string Ã¼. This new string is exactly the token 2864, which detokenizes to ü via the llama_decode_text() function. The problem is that we don't even consider the token 32009, because ü is not present in the pre-tokenized string.
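
The round trip described here can be reproduced in a few lines of Python. This is a sketch that mirrors the C++ excerpt above (printable bytes map to themselves, the rest to 256 + n); the names `encode_bytes` and `decode_bytes` are made up for illustration and do not correspond to llama.cpp functions.

```python
def bytes_to_unicode():
    """Byte -> printable character table, following the C++ excerpt above."""
    visible = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # '¡' .. '¬'
        + list(range(0xAE, 0x100))  # '®' .. 'ÿ'
    )
    table = {b: chr(b) for b in visible}
    n = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + n)  # running counter for non-printable bytes
            n += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def encode_bytes(text):
    """Map every UTF-8 byte of text to its printable stand-in character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def decode_bytes(mapped):
    """Invert encode_bytes: stand-in characters back to bytes, then UTF-8 decode."""
    return bytes(CHAR_TO_BYTE[c] for c in mapped).decode("utf-8")


mapped = encode_bytes("ü")
print(mapped)                # Ã¼
print(decode_bytes(mapped))  # ü
```

After this mapping the pre-tokenized text contains the two characters Ã and ¼ and no ü, so a vocabulary lookup for the literal string ü can no longer match.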

@teleprint-me (Contributor) commented:

Why is the upper limit set to 256? Isn't that the ASCII range?

The range of valid Unicode code points is from U+0000 to U+10FFFF, which covers more than 1 million code points.

@ggerganov force-pushed the gg/add-tokenizer-test-script branch from 74fa6cd to 5f30e30 on May 4, 2024 05:12
@ggerganov (Owner, Author) commented:

> Why is the upper limit set to 256? Isn't that the ASCII range?

This seems to be some strategy to reduce the vocab size:

https://github.com/openai/gpt-2/blob/master/src/encoder.py#L8-L28

@ggerganov ggerganov merged commit 92139b9 into master May 4, 2024
58 of 63 checks passed
@ggerganov ggerganov deleted the gg/add-tokenizer-test-script branch May 4, 2024 05:32
@teleprint-me (Contributor) commented May 4, 2024

I find it fascinating how we have a tendency to over-complicate simple ideas. I'm all too guilty of this myself.

# simplified function definition
from functools import lru_cache


@lru_cache()
def bytes_to_unicode(size: int = 256) -> dict[int, str]:
    """
    Generate a dictionary mapping each byte value to a printable Unicode character.

    :param size: The number of byte values to map (default 256, i.e. every value of one 8-bit byte).

    :return: A dictionary mapping byte values to their respective Unicode characters.
    """

    # byte values that are already printable and map to themselves:
    # (ord("!"), ord("~") + 1); (ord("¡"), ord("¬") + 1); (ord("®"), ord("ÿ") + 1);
    visible = set(range(33, 127)) | set(range(161, 173)) | set(range(174, 256))

    mapping: dict[int, str] = {}
    for byte in range(size):
        if byte in visible:
            # printable bytes keep their own code point
            mapping[byte] = chr(byte)
        else:
            # remaining bytes are shifted by `size` into a printable range;
            # note the GPT-2 reference uses a running counter (256 + n) here instead,
            # so the two agree for bytes 0-32 but not for 127-160 and 173
            mapping[byte] = chr(byte + size)
    return mapping

where the upper limit can be defined as upper_limit = 2**8 = 256. This should be extendable by choice. So probably allow a variable upper limit depending on the size of the input to reduce time complexity.
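
One detail worth checking is whether shifting by a fixed offset (`byte + size`) gives the same table as the running counter (`256 + n`) used in the C++ excerpt quoted earlier in this thread. The sketch below builds both variants side by side and lists the bytes where they disagree. The function names are hypothetical and chosen only for this comparison.

```python
VISIBLE = set(range(33, 127)) | set(range(161, 173)) | set(range(174, 256))


def offset_mapping(size=256):
    """Non-printable bytes are shifted by a fixed offset (byte + size)."""
    return {b: chr(b) if b in VISIBLE else chr(b + size) for b in range(size)}


def counter_mapping(size=256):
    """Non-printable bytes are numbered consecutively (size + n)."""
    mapping, n = {}, 0
    for b in range(size):
        if b in VISIBLE:
            mapping[b] = chr(b)
        else:
            mapping[b] = chr(size + n)
            n += 1
    return mapping


offset, counter = offset_mapping(), counter_mapping()
differing = [b for b in range(256) if offset[b] != counter[b]]
print(len(differing), differing[0], differing[-1])  # 35 127 173
```

The two tables coincide for bytes 0 to 32, where the counter happens to equal the byte value, and diverge once the first printable run has been skipped. A vocabulary built with one table would therefore not match text encoded with the other for those bytes.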

output

Get the mapping:

mapping = bytes_to_unicode()
gpt_mapping = gpt_bytes_to_unicode()
for key in mapping.keys():
    assert mapping[key] == gpt_mapping[key]

from pprint import pprint  # pretty print output
pprint(mapping)

Mapping output:

18:30:48 | ~
  λ python -i /tmp/bytes_to_unicode.py
{0: 'Ā',
 1: 'ā',
 2: 'Ă',
 3: 'ă',
 4: 'Ą',
 5: 'ą',
 6: 'Ć',
 7: 'ć',
 8: 'Ĉ',
 9: 'ĉ',
 10: 'Ċ',
 11: 'ċ',
 12: 'Č',
 13: 'č',
 14: 'Ď',
 15: 'ď',
 16: 'Đ',
 17: 'đ',
 18: 'Ē',
 19: 'ē',
 20: 'Ĕ',
 21: 'ĕ',
 22: 'Ė',
 23: 'ė',
 24: 'Ę',
 25: 'ę',
 26: 'Ě',
 27: 'ě',
 28: 'Ĝ',
 29: 'ĝ',
 30: 'Ğ',
 31: 'ğ',
 32: 'Ġ',
 33: '!',
 34: '"',
 35: '#',
 36: '$',
 37: '%',
 38: '&',
 39: "'",
 40: '(',
 41: ')',
 42: '*',
 43: '+',
 44: ',',
 45: '-',
 46: '.',
 47: '/',
 48: '0',
 49: '1',
 50: '2',
 51: '3',
 52: '4',
 53: '5',
 54: '6',
 55: '7',
 56: '8',
 57: '9',
 58: ':',
 59: ';',
 60: '<',
 61: '=',
 62: '>',
 63: '?',
 64: '@',
 65: 'A',
 66: 'B',
 67: 'C',
 68: 'D',
 69: 'E',
 70: 'F',
 71: 'G',
 72: 'H',
 73: 'I',
 74: 'J',
 75: 'K',
 76: 'L',
 77: 'M',
 78: 'N',
 79: 'O',
 80: 'P',
 81: 'Q',
 82: 'R',
 83: 'S',
 84: 'T',
 85: 'U',
 86: 'V',
 87: 'W',
 88: 'X',
 89: 'Y',
 90: 'Z',
 91: '[',
 92: '\\',
 93: ']',
 94: '^',
 95: '_',
 96: '`',
 97: 'a',
 98: 'b',
 99: 'c',
 100: 'd',
 101: 'e',
 102: 'f',
 103: 'g',
 104: 'h',
 105: 'i',
 106: 'j',
 107: 'k',
 108: 'l',
 109: 'm',
 110: 'n',
 111: 'o',
 112: 'p',
 113: 'q',
 114: 'r',
 115: 's',
 116: 't',
 117: 'u',
 118: 'v',
 119: 'w',
 120: 'x',
 121: 'y',
 122: 'z',
 123: '{',
 124: '|',
 125: '}',
 126: '~',
 127: 'ſ',
 128: 'ƀ',
 129: 'Ɓ',
 130: 'Ƃ',
 131: 'ƃ',
 132: 'Ƅ',
 133: 'ƅ',
 134: 'Ɔ',
 135: 'Ƈ',
 136: 'ƈ',
 137: 'Ɖ',
 138: 'Ɗ',
 139: 'Ƌ',
 140: 'ƌ',
 141: 'ƍ',
 142: 'Ǝ',
 143: 'Ə',
 144: 'Ɛ',
 145: 'Ƒ',
 146: 'ƒ',
 147: 'Ɠ',
 148: 'Ɣ',
 149: 'ƕ',
 150: 'Ɩ',
 151: 'Ɨ',
 152: 'Ƙ',
 153: 'ƙ',
 154: 'ƚ',
 155: 'ƛ',
 156: 'Ɯ',
 157: 'Ɲ',
 158: 'ƞ',
 159: 'Ɵ',
 160: 'Ơ',
 161: '¡',
 162: '¢',
 163: '£',
 164: '¤',
 165: '¥',
 166: '¦',
 167: '§',
 168: '¨',
 169: '©',
 170: 'ª',
 171: '«',
 172: '¬',
 173: 'ƭ',
 174: '®',
 175: '¯',
 176: '°',
 177: '±',
 178: '²',
 179: '³',
 180: '´',
 181: 'µ',
 182: '¶',
 183: '·',
 184: '¸',
 185: '¹',
 186: 'º',
 187: '»',
 188: '¼',
 189: '½',
 190: '¾',
 191: '¿',
 192: 'À',
 193: 'Á',
 194: 'Â',
 195: 'Ã',
 196: 'Ä',
 197: 'Å',
 198: 'Æ',
 199: 'Ç',
 200: 'È',
 201: 'É',
 202: 'Ê',
 203: 'Ë',
 204: 'Ì',
 205: 'Í',
 206: 'Î',
 207: 'Ï',
 208: 'Ð',
 209: 'Ñ',
 210: 'Ò',
 211: 'Ó',
 212: 'Ô',
 213: 'Õ',
 214: 'Ö',
 215: '×',
 216: 'Ø',
 217: 'Ù',
 218: 'Ú',
 219: 'Û',
 220: 'Ü',
 221: 'Ý',
 222: 'Þ',
 223: 'ß',
 224: 'à',
 225: 'á',
 226: 'â',
 227: 'ã',
 228: 'ä',
 229: 'å',
 230: 'æ',
 231: 'ç',
 232: 'è',
 233: 'é',
 234: 'ê',
 235: 'ë',
 236: 'ì',
 237: 'í',
 238: 'î',
 239: 'ï',
 240: 'ð',
 241: 'ñ',
 242: 'ò',
 243: 'ó',
 244: 'ô',
 245: 'õ',
 246: 'ö',
 247: '÷',
 248: 'ø',
 249: 'ù',
 250: 'ú',
 251: 'û',
 252: 'ü',
 253: 'ý',
 254: 'þ',
 255: 'ÿ'}
>>> 

I'm looking into it though.

nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
* tests : add test-tokenizer-0.sh

* unicode : add all unicode number ranges

* starcoder : fix pre-tokenizer

* tests : add test that fails with DeepSeek tokenizers

* falcon : fix regex

* unicode : regenerate unicode tables

* refact : add tokenizer model

* lint : fix

* tests : disable failing tests

ggml-ci

* refact : add tests files

ggml-ci

* convert : print -> logging

ggml-ci

* lint : fix

* unicode : digit -> number

* phi-3 : update
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 7, 2024
@DOGEwbx commented May 8, 2024

> I think there is a bug in the way we handle added tokens. I'm experimenting with DeepSeek-Coder:
>
> https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base
>
> llama.cpp tokenizes ü to 2864 which is OK, but there is also the added token 32009 which transformers tokenizer selects instead:
>
>     {
>       "id": 32009,
>       "content": "ü",
>       "single_word": false,
>       "lstrip": false,
>       "rstrip": false,
>       "normalized": true,
>       "special": false
>     },
>
> If I remove this added token from the tokenizer.config then the transformers tokenization also outputs 2864. So this means we are not handling the added tokens in the same way.
>
> Any ideas how to fix this?

@ggerganov

Hi, I found that the current llama.cpp cannot pass the unit tests for DeepSeek models. The problem you mentioned looks like the issue that Hugging Face tokenizers solved in huggingface/tokenizers#1392.

For the newly published DeepSeek v2 and DeepSeek-Coder v1.5, these added tokens have been removed.

@ggerganov (Owner, Author) commented:

@DOGEwbx Thanks - will try deepseek-coder v1.5 then. DS v2 will probably take some time to support #7118

Labels
high priority Very important issue