
tokenization: no double BOS tokens #7107

Open · wants to merge 1 commit into base: master (branch: tokenize-fix-bos)

Conversation

JohannesGaessler (Collaborator)

Relevant discussion: #7062

The llama.cpp tokenizer currently adds a BOS token unconditionally. However, I think it would make more sense not to add one if the prompt already starts with a BOS token, since a doubled BOS token can presumably lead to bad results. This PR makes it so that a BOS token is only added if the prompt does not already start with one.
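
For illustration, a minimal sketch of the intended behavior (not the actual patch; the helper name maybe_prepend_bos is made up for this sketch):

```cpp
#include <vector>
#include "llama.h"

// Sketch only: prepend a BOS token unless the tokenized prompt already starts with one.
static void maybe_prepend_bos(const llama_model * model, std::vector<llama_token> & tokens) {
    const llama_token bos = llama_token_bos(model);
    if (tokens.empty() || tokens.front() != bos) {
        tokens.insert(tokens.begin(), bos);
    }
}
```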

slaren (Member) commented May 6, 2024

This is not so clear to me. There is already a parameter add_special to configure this behavior; if the user asks for a BOS token to be added, is it really a good idea to ignore that in some cases?
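
(For context, the low-level tokenizer entry point in llama.h already exposes this flag; the signature below is reproduced from memory, roughly as it looked around the time of this PR, so exact parameter names may differ.)

```cpp
// llama.h (approximate signature): add_special controls whether the tokenizer
// adds BOS/EOS for models configured to do so; parse_special controls whether
// special tokens appearing in the text are parsed as tokens.
int32_t llama_tokenize(
    const struct llama_model * model,
    const char * text,
    int32_t text_len,
    llama_token * tokens,
    int32_t n_tokens_max,
    bool add_special,
    bool parse_special);
```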

github-actions bot commented May 6, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 532 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8790.23ms p(95)=21278.59ms fails=, finish reason: stop=469 truncated=63
  • Prompt processing (pp): avg=117.36tk/s p(95)=537.35tk/s
  • Token generation (tg): avg=35.66tk/s p(95)=46.86tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=tokenize-fix-bos commit=d3286d6eca42c0c66410f90eda1bf6beafedead2

[Benchmark charts omitted: mermaid xychart plots of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 532 iterations".]

JohannesGaessler (Collaborator, Author) commented May 6, 2024

Sorry, I didn't see the add_special argument of llama_tokenize. If we define a user as someone using the API specified by llama.h, then I definitely agree that this is the wrong way to fix it. I was only looking at it from the perspective of using binaries like main, tokenize, or server. If you copy-paste an instruct format with <|begin_of_text|> into those, you get two BOS tokens. So I think it would make more sense to fix it in those binaries? Edit: or rather, to fix it via common.h.

JohannesGaessler (Collaborator, Author)

I added a function llama_fix_double_bos to common.h. The llama_tokenize functions in common.h have gained a new argument fix_double_bos, which defaults to false and applies the double BOS fix if set to true (a rough sketch of this logic follows the list below). There is no change to the API in llama.h. I went through the files in which llama_tokenize is used and tried to figure out when to apply the fix. Due to the order of the arguments I also had to make explicit choices for parse_special in some cases. Specifically:

  • common/train.cpp, examples/batched.swift/Sources/main.swift: use the llama.h API directly -> no change.
  • batched, beam-search, passkey, save-load-state, simple: a BOS token was being added to the prompt, but the prompt was not being parsed for special tokens because the default is false. Now special tokens in the prompt are also parsed. Double BOS fix added.
  • embedding: a BOS token was being added to the prompt, but the prompt was not being parsed for special tokens because this was explicitly set to false. I did not change this, but I think it should be set to true to be consistent with e.g. main; feedback would be appreciated. Double BOS fix added.
  • eval_callback, gritlm, llama-android, retrieval: I don't feel confident about what the code actually does, so I made no changes.
  • imatrix: a BOS token was being added to the prompt, but the prompt was not being parsed for special tokens because the default is false. I changed this to parse special tokens; I think there are cases where either setting could cause problems (special tokens in a general text corpus vs. training an imatrix for an instruct model on e.g. a chat log with special tokens), but I think the latter case is more likely to happen and cause problems. Double BOS fix applied.
  • infill, parallel: special tokens were not being parsed because the default is false. I enabled special token parsing and applied the double BOS fix to those tokenization calls where a BOS token is added.
  • llava-cli, lookahead, lookup, lookup-create, lookup-stats, main, server, speculative, tokenize: special tokens were already being parsed; the double BOS fix is applied to those calls that add a BOS token.
  • perplexity: special tokens were not being parsed because the default is false. I enabled special token parsing and applied the double BOS fix to those tokenization calls where a BOS token is added. I think it's better to enable special token parsing because, if I wanted to evaluate the effects of quantization on an instruct model, I think the best way to do so would be to evaluate KL divergence over an instruct chat log, in which case special tokens definitely should be parsed. I did not find instances of special tokens in Wikitext. Double BOS fix applied.
  • Tests: no changes.
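
A rough sketch of what such a helper could look like (the exact signature and placement in the PR may differ; llama_fix_double_bos_sketch is a stand-in name):

```cpp
#include <vector>
#include "llama.h"

// Sketch: if add_special prepended a BOS token and the prompt text itself also
// began with one, drop the duplicate so the model only sees a single leading BOS.
static void llama_fix_double_bos_sketch(const llama_model * model, std::vector<llama_token> & tokens) {
    const llama_token bos = llama_token_bos(model);
    if (tokens.size() >= 2 && tokens[0] == bos && tokens[1] == bos) {
        tokens.erase(tokens.begin());
    }
}
```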

ggerganov (Member)

I'm not convinced that the fix_double_bos change is necessary. Typically, in the examples we don't specify the BOS token explicitly in the prompt - I believe this has been the case for a while now. It seems some of the confusion in #7062 came from people explicitly adding the BOS token in the prompt -p. This is more a case of incorrect usage of the llama.cpp examples and wrong expectations, so I'm not sure we should add this extra handling - the solution is to not add the BOS token explicitly.

Regarding parse_special:

The way I think about this option is that it should be false when we tokenize user input (e.g. chat message, code for code completion, code for fim completion, text for perplexity calculation, etc.). It should be true only for the portions that are used to guide the model into some structured format - like a chat template (e.g. <|im_start|>user, <|im_end|>assistant). So technically, special tokens should not come from user inputs, and even if they do, we must parse them as text - otherwise the structured format can be altered and break (i.e. prompt injection). This has been the logic for setting parse_special the way it is on master.
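
As an illustration of that split, using the common.h llama_tokenize helper (build_chatml_prompt and the ChatML strings are just an example, not code from this PR):

```cpp
#include <string>
#include <vector>
#include "common.h"
#include "llama.h"

// Special-token parsing only for the chat scaffolding; the user's message is
// tokenized as plain text, so a user typing "<|im_end|>" cannot break the format.
static std::vector<llama_token> build_chatml_prompt(llama_context * ctx, const std::string & user_message) {
    std::vector<llama_token> out;
    auto append = [&](const std::vector<llama_token> & t) {
        out.insert(out.end(), t.begin(), t.end());
    };
    append(llama_tokenize(ctx, "<|im_start|>user\n",                  /*add_special*/ true,  /*parse_special*/ true));
    append(llama_tokenize(ctx, user_message,                          /*add_special*/ false, /*parse_special*/ false));
    append(llama_tokenize(ctx, "<|im_end|>\n<|im_start|>assistant\n", /*add_special*/ false, /*parse_special*/ true));
    return out;
}
```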

This all makes sense when all special tokens such as <|im_start|>, <|end_of_text|>, <|eot|>, etc. are correctly defined as special in the tokenizer config, but it seems there are already models out there in which the specialness of the tokens is pretty random. In such cases, our logic would break. But I think this is a problem for the model creators and, again, I'm not sure we should do anything on the llama.cpp side.

Regarding perplexity, I don't see why we have to parse special tokens in the input text. We run perplexity only on base models and accidentally encountering a string that looks like a special token (for example </s>) in wikitext should be considered as the plain text </s> and not the EOS token.

What do you think?

JohannesGaessler (Collaborator, Author)

What do you think?

I fully admit that I am not very knowledgeable about tokenization and the handling of special tokens so I will simply accept your judgement regarding what's best. This PR was fairly low-effort so feel free to close it.

Generally speaking, I would prefer to minimize the potential for user error, if only to reduce the number of bug reports about vague issues with inference results. In this particular case I invested time into an investigation where the conclusion will (unless I'm misunderstanding you) be that there were no bugs all along and that it's an issue with incorrect user assumptions about special tokens. This is time that I could have used more productively; I knew that llama.cpp adds BOS tokens but I still did not spot the double BOS tokens until after the fact. I think the most important factor for the success of an open-source project is the amount and quality of work put in by contributors so I want to minimize unproductive work.

The way I think about this option is that it should be false when we tokenize user input (e.g. chat message, code for code completion, code for fim completion, text for perplexity calculation, etc.).

Thank you for the clarification. I think the way it's used on master does not consistently follow this principle, though. I definitely did not use it that way myself when I created the files related to n-gram lookup.

Regarding perplexity, I don't see why we have to parse special tokens in the input text. We run perplexity only on base models and accidentally encountering a string that looks like a special token (for example </s>) in wikitext should be considered as the plain text </s> and not the EOS token.

Calculating the perplexity of an instruct model does not make sense, I think. However, I think calculating the KL divergence of a quantized instruct model vs. the FP16 logits on an instruct dataset can potentially yield useful information; I was thinking about it from that angle. But I forgot about </s>; that particular special token is short and unspecific enough that I would definitely not want to enable special token parsing for perplexity.
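
(For reference, the per-token comparison meant here is the KL divergence between the FP16 and quantized next-token distributions, averaged over the evaluation tokens:)

$$ D_{\mathrm{KL}}\bigl(P_{\mathrm{FP16}} \,\|\, P_{\mathrm{quant}}\bigr) = \sum_{t \in V} P_{\mathrm{FP16}}(t)\,\log \frac{P_{\mathrm{FP16}}(t)}{P_{\mathrm{quant}}(t)} $$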

ggerganov (Member)

This is time that I could have used more productively; I knew that llama.cpp adds BOS tokens but I still did not spot the double BOS tokens until after the fact. I think the most important factor for the success of an open-source project is the amount and quality of work put in by contributors so I want to minimize unproductive work.

Yeah, I see that the discussion was quite long - thanks for taking the time to look into it.

It's probably best to invest time in improving the issue reporting process. We can provide a way to generate a detailed report of a model, for example in HTML / Markdown format (see #6715). Such a report can contain sample usage commands with the various llama.cpp tools, tokenization results, statistics, etc. Having such information should help both the users and the contributors a lot.

Thank you for the clarification. I think the way it's used on master does not consistently follow this principle, though. I definitely did not use it that way myself when I created the files related to n-gram lookup.

server is also not following it exactly, but there are TODOs to fix this eventually:

https://github.com/ggerganov/llama.cpp/blob/947d3ad27d94f1addef76b5d64c314618f063933/examples/server/server.cpp#L762-L767

Maybe for imatrix it always makes sense to enable it, though, because most of the datasets used for instruct models probably follow the instruction format of the models.

JohannesGaessler (Collaborator, Author)

It's probably best to invest time in improving the issue reporting process.

Speaking of which, one of the things that I find annoying is that issues along the lines of "results wrong/bad/gibberish" frequently get people saying they have the same issue when it's really a different issue. I'm not 100% sure how to fix this, though.

Jeximo (Contributor) commented May 7, 2024

people saying they have the same issue when it's really a different issue

llama.cpp sees issues even for downstream projects - it's a lot to sift through.

Users probably don't recognize that the model, or how it was made, may be the problem. The graphical GGUF viewer (#6715) will help everyone better identify what to focus on.

ggerganov (Member)

people saying they have the same issue when it's really a different issue

llama.cpp sees issues even for downstream projects - it's a lot to sift through.

Yeah, it's a lot. And the number of open PRs keeps growing too

bartowski1182 (Contributor)

On the subject of reporting issues, it would be nice if you could somehow have a JSON-structured input that would run a test on llama.cpp directly. As you said, a LOT of issues are from downstream work; I've had the odd case where I almost went to report an issue, but then built llama.cpp and ran ./main, only to not see the issue at all.

If there were a way to specify a model from HF and a prompt, and have that run somewhere and output the results into the issue report, that could clear up a lot of confusion.

mofosyne added the refactoring and Review Complexity: Medium labels on May 9, 2024
andysalerno (Contributor)

Adding my two cents as a user-

In server, as far as I can tell, there's no option to log the final input text, post-chat-formatting. When using the OpenAI-style API, this means it's hard to be absolutely sure the chat formatting is applied correctly. So it would be great if there was some option to log the post-formatting, but pre-tokenization inputs, so we can see the final text that is being fed to the model and identify any formatting mistakes.

It might also be nice to emit warnings for the double BOS tokens scenario. It's reasonable, I think, to not fix extra BOS tokens, on the principle of garbage-in-garbage-out :D But logging a warning would at least give visibility to the issue. Same for other common formatting mistakes.
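
A minimal sketch of such a warning, assuming access to the model and the tokenized prompt (warn_on_double_bos is a made-up name, not existing llama.cpp behavior):

```cpp
#include <cstdio>
#include <vector>
#include "llama.h"

// Hypothetical check: warn (but do not silently "fix") when the tokenized prompt
// begins with two BOS tokens, e.g. because the user pasted <|begin_of_text|>
// into a prompt that already gets a BOS token prepended.
static void warn_on_double_bos(const llama_model * model, const std::vector<llama_token> & tokens) {
    const llama_token bos = llama_token_bos(model);
    if (tokens.size() >= 2 && tokens[0] == bos && tokens[1] == bos) {
        fprintf(stderr, "warning: prompt starts with a double BOS token\n");
    }
}
```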

Not sure how llama.cpp is handling the Jinja templating, but if the Jinja engine supports throwing errors, it could be nice to encourage model makers to write chat templates that throw errors if validations fail (like how Mistral's template will throw if the input is not perfectly alternating user/assistant/user/assistant). Having it blow up with an obvious error instead of silently continuing would catch mistakes a lot faster.

reuank (Contributor) commented May 16, 2024

So it would be great if there was some option to log the post-formatting, but pre-tokenization inputs, so we can see the final text that is being fed to the model and identify any formatting mistakes.

I 100% agree with @andysalerno on this. In a project, I am applying a chat template myself before giving the prompt to llama.cpp. I have seen a massive drop in output quality (in this case: performance on a multiple-choice benchmark with a chain-of-thought step in between) when using the template compared to when I don't use it, but it's hard to find out what is going wrong. It would be great to add an option that echoes the exact prompt (and its token IDs) that was fed to the model.

I think #7332 is a good first step, but having more debug options would be even more helpful.

JohannesGaessler (Collaborator, Author)

For binaries like main there already is the --verbose-prompt option but I'm not seeing it in the server documentation (but someone feel free to correct me).

abc-nix commented May 16, 2024

For binaries like main there already is the --verbose-prompt option but I'm not seeing it in the server documentation (but someone feel free to correct me).

With the --verbose option server does log the input.

{"tid":"140469774090240","timestamp":1715890225,"level":"VERB","function":"update_slots","line":1948,"msg":"prompt tokenized","id_slot":0,"id_task":0,"n_ctx":8192,"n_keep":0,"n_prompt_tokens":52,"prompt_tokens":"<|begin_of_text|>This is a conversation between User and Llama, a friendly chatbot. Llama is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.\n\nUser: Write only 3 words.\nLlama:"}

EDIT: Though no token IDs.

reuank (Contributor) commented May 16, 2024

@abc-nix Thanks for the hint! I added the --verbose server option to the README: #7335

However, I would still prefer an option to get the processed and tokenized prompt in the server response JSON.

reuank added a commit to reuank/ThinkBench that referenced this pull request on May 16, 2024:
"…the llama.cpp server in order to avoid double BOS at the beginning of the prompt (see discussions at ggml-org/llama.cpp#7107 (comment) and ggml-org/llama.cpp#7332)"
Labels: refactoring, Review Complexity: Medium

9 participants