tokenization: no double BOS tokens #7107
base: master
Conversation
This is not so clear to me. There is already a parameter …
Sorry, I didn't see the argument …
Force-pushed from 149603e to 3d1b1ef.
Force-pushed from 3d1b1ef to d3286d6.
I added a function …
I'm not convinced that the …

Regarding …

The way I think about this option is that it should be …

This all makes sense when all special tokens such as …

Regarding …

What do you think?
I fully admit that I am not very knowledgeable about tokenization and the handling of special tokens, so I will simply accept your judgement regarding what's best. This PR was fairly low-effort, so feel free to close it.

Generally speaking, I would prefer to minimize the potential for user error, if only to reduce the number of bug reports about vague issues with inference results. In this particular case I invested time into an investigation whose conclusion will (unless I'm misunderstanding you) be that there were no bugs all along, only incorrect user assumptions about special tokens. This is time that I could have used more productively; I knew that llama.cpp adds BOS tokens, but I still did not spot the double BOS tokens until after the fact. I think the most important factor for the success of an open-source project is the amount and quality of work put in by contributors, so I want to minimize unproductive work.
Thank you for the clarification. I think the way it's used on master does not consistently follow this principle, though; I definitely did not use it that way myself when I created the files related to n-gram lookup.
Calculating the perplexity of an instruct model does not make sense, I think. However, calculating the KL divergence of a quantized instruct model vs. the FP16 logits on an instruct dataset can potentially yield useful information; that is the use case I had in mind. But I forgot about …
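For context, the quantity in question is the standard KL divergence between the FP16 and quantized next-token distributions at each position, averaged over the evaluation dataset (this is just the definition, not llama.cpp-specific code):

$$
D_{\mathrm{KL}}\left(P_{\text{fp16}} \,\|\, P_{\text{quant}}\right) = \sum_{t \in V} P_{\text{fp16}}(t) \, \log \frac{P_{\text{fp16}}(t)}{P_{\text{quant}}(t)}
$$

where $P_{\text{fp16}}$ and $P_{\text{quant}}$ are the softmaxed logits of the FP16 and quantized model at a given position and $V$ is the vocabulary.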
Yeah, I see that the discussion was quite long - thanks for taking the time to look into it. It's probably best to invest time in improving the issue reporting process. We can provide a way to generate a detailed report of a model, for example in HTML / Markdown format (see #6715). Such a report could contain sample usage commands with the various …
Maybe for …
Speaking of which, one of the things that I find annoying is that issues along the lines of "results wrong/bad/gibberish" frequently get people saying they have the same issue when it's really a different one. I'm not 100% sure how to fix this, though.
Probably users don't recognize that the model, or how it was made, may be the problem. The graphical GGUF viewer (#6715) will help everyone better identify what to focus on.
Yeah, it's a lot. And the number of open PRs keeps growing too.
On the subject of reporting issues, it would be nice if you could somehow have a JSON-structured input that would run a test on llama.cpp directly. As you said, a LOT of issues are from downstream work; I've had the odd case where I almost went to report an issue, but then built llama.cpp and ran ./main only to not see the issue at all. If there were a way to specify a model from HF and a prompt, and have that run somewhere and output the results into the issue report, that could clear up a lot of confusion.
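Something like the following could serve as that kind of structured repro. Every field name here is hypothetical, purely to illustrate the idea; it is not an existing llama.cpp format:

```json
{
  "model": "hf://some-org/some-model-GGUF/some-model-Q4_K_M.gguf",
  "prompt": "<s>[INST] minimal prompt that reproduces the issue [/INST]",
  "n_predict": 64,
  "seed": 42,
  "temperature": 0.0,
  "observed": "gibberish output",
  "expected": "a coherent answer"
}
```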
Adding my two cents as a user: in …

It might also be nice to emit warnings for the double BOS token scenario. It's reasonable, I think, not to fix extra BOS tokens, on the principle of garbage-in-garbage-out :D But logging a warning would at least give visibility to the issue. Same for other common formatting mistakes.

Not sure how llama.cpp is handling the Jinja templating, but if the Jinja engine supports throwing errors, it could be nice to encourage model makers to write chat templates that throw errors if validations fail (like how Mistral's template will throw if the input is not perfectly alternating user/assistant/user/assistant). Having it blow up with an obvious error instead of silently continuing would catch mistakes a lot faster.
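A rough sketch of the kind of warning being suggested; the names here are placeholders, not llama.cpp's actual logging or tokenizer API:

```cpp
#include <cstdio>
#include <vector>

typedef int llama_token; // stand-in for the real type from llama.h

// After tokenization with add_bos enabled, warn if the result begins with two
// BOS tokens, i.e. the prompt text itself already contained a BOS.
static void warn_on_double_bos(const std::vector<llama_token> & tokens, llama_token bos_id) {
    if (tokens.size() >= 2 && tokens[0] == bos_id && tokens[1] == bos_id) {
        fprintf(stderr,
                "warning: prompt starts with two BOS tokens (id %d); "
                "the prompt text likely already contained a BOS while add_bos was enabled\n",
                bos_id);
    }
}
```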
I 100% agree with @andysalerno on this. In a project, I am applying a chat template myself before giving the prompt to llama.cpp. I have seen a massive drop in output quality (in this case: the performance on a multiple-choice benchmark with a chain-of-thought step in between) when using the template compared to when I don't use it, but it's hard to find out what is going wrong. It would be great to add an option that echoes the exact prompt (and its token ids) that was fed to the model. I think #7332 is a good first step, but having more debug options would be even more helpful.
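For illustration, such a debug echo could be as simple as printing the token ids next to the final prompt string. This is only a sketch with a stand-in token type, not the actual llama.cpp API:

```cpp
#include <cstdio>
#include <string>
#include <vector>

typedef int llama_token; // stand-in for the real type from llama.h

// Print the exact prompt and token ids that will be fed to the model, so that
// chat-template mistakes (missing/extra special tokens, etc.) become visible.
static void dump_prompt_tokens(const std::string & prompt, const std::vector<llama_token> & tokens) {
    fprintf(stderr, "prompt: '%s'\n", prompt.c_str());
    fprintf(stderr, "tokens:");
    for (const llama_token t : tokens) {
        fprintf(stderr, " %d", t);
    }
    fprintf(stderr, "\n");
}
```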
For binaries like …
With the …
EDIT: Though no token IDs.
…the llama.cpp server in order to avoid double BOS at the beginning of the prompt (see discussions at ggml-org/llama.cpp#7107 (comment) and ggml-org/llama.cpp#7332)
Relevant discussion: #7062
The llama.cpp tokenizer currently adds a BOS token unconditionally. However, I think it would make more sense not to do this if the prompt already starts with a BOS token, since a double BOS can presumably lead to bad results. This PR makes it so that a BOS token is only added if the prompt does not already start with one.
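A minimal sketch of the idea (not the exact diff, and the helper name is illustrative rather than part of the llama.cpp API):

```cpp
#include <vector>

typedef int llama_token; // stand-in for the real type from llama.h

// Prepend BOS only if the tokenized prompt does not already begin with it.
static void maybe_prepend_bos(std::vector<llama_token> & tokens, llama_token bos_id) {
    if (tokens.empty() || tokens.front() != bos_id) {
        tokens.insert(tokens.begin(), bos_id);
    }
}
```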