Token based (or sequence of token based) repetition penalty exclusion #26902
Comments
An addendum to this, for transformers at least and maybe TGI as well: if this system works, perhaps it could be added as a tokenizer config setting, similar to the chat templates, so users don't have to implement it themselves for whatever inference stack they use.
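Purely as a sketch of that idea (nothing like this exists in transformers today, and the field name below is invented for illustration), such a setting could live next to the chat template in the tokenizer config:

```python
# Hypothetical addition to what tokenizer_config.json carries; the key name
# "repetition_penalty_excluded_tokens" is invented for illustration only.
tokenizer_config_addition = {
    "repetition_penalty_excluded_tokens": ["<|im_start|>", "<|im_end|>"],
}
```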
for reference, hf transformers' code for this is even simpler:

```python
class RepetitionPenaltyLogitsProcessor(LogitsProcessor):
    def __init__(self, penalty: float):
        if not isinstance(penalty, float) or not (penalty > 0):
            raise ValueError(f"`penalty` has to be a strictly positive float, but is {penalty}")
        self.penalty = penalty

    @add_start_docstrings(LOGITS_PROCESSOR_INPUTS_DOCSTRING)
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        score = torch.gather(scores, 1, input_ids)
        # if score < 0 then repetition penalty has to be multiplied to reduce the token probabilities
        score = torch.where(score < 0, score * self.penalty, score / self.penalty)
        scores.scatter_(1, input_ids, score)
        return scores
```
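To make the effect concrete, here is a small toy example (made-up numbers, not taken from the thread) of what that gather/penalize/scatter does to the logits of tokens that already appear in the prompt:

```python
import torch

penalty = 1.2
scores = torch.tensor([[2.0, 1.0, -1.0, 0.5]])  # logits for a 4-token vocab
input_ids = torch.tensor([[0, 2]])              # tokens 0 and 2 already appeared

score = torch.gather(scores, 1, input_ids)                        # [[ 2.0, -1.0]]
score = torch.where(score < 0, score * penalty, score / penalty)  # [[ 1.6667, -1.2]]
scores.scatter_(1, input_ids, score)

print(scores)  # tensor([[ 1.6667,  1.0000, -1.2000,  0.5000]])
```

Positive logits of previously seen tokens are divided by the penalty and negative ones are multiplied by it, so every seen token becomes less likely, regardless of whether it is a role marker, a stop token, or ordinary text.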
This is a really good point that I hadn't thought of! This code is mostly @gante's area but I think I can see how to implement this, so if you're busy let me know and I can take it!
Hey, also I should add: it would be really beneficial if we could select whole sections of a prompt format to omit as well. Thinking more on this: for, let's say, RAG or memories, or other cases where you have context that the model should be drawing on as closely as possible, it seems repetition penalty would interfere with its ability to properly recite from source material in context. This seems doable, but probably harder to implement. There are also much harder problems to deal with, like the following:
This may be the model being dumb, or it may be because of repetition penalty. For solving this one I really have no idea, since it's so dynamic and situational.
Yes, we generally need a better solution than simple (stupid) repetition penalty. There are many use cases where the LLM needs to repeat information verbatim, especially when generating code. ChatGPT/GPT-4 does that extremely well, where you talk back and forth to iterate over the code. Local LLMs wouldn't be able to handle that with regular repetition penalty settings. Or imagine a situation like this (disregarding that LLMs aren't calculators ;)): 0+2=2, 2+0=2, 3-1=2, 4-2=2, 1+1= With all those 2's already in the prompt, the answer is likely not what we'd expect.
Hey folks! 👋 Before further discussion, a recap of the repetition penalty:
Based on what I know, I would need stronger evidence, well beyond a few examples (which might be false positives, since LLMs typically sample), to support your claims. I also agree with the two comments that are skeptical about this being a problem that the "improved" repetition penalty would solve (1, 2). However, this does not prevent you from using it with a custom logits processor passed to generate. In conclusion, I'm skeptical but open to being convinced otherwise with evidence :)
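For anyone who wants to try this today, without any changes to the library, a custom logits processor can be passed to generate. The sketch below is illustrative only: the class name and the wrap-and-restore approach are my own assumptions, not existing transformers API; only LogitsProcessor, LogitsProcessorList, RepetitionPenaltyLogitsProcessor, and the logits_processor argument of generate are real.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
    RepetitionPenaltyLogitsProcessor,
)


class ExcludeTokensFromRepetitionPenalty(LogitsProcessor):
    """Apply the stock repetition penalty, then restore the original scores of
    the excluded token ids so they are never penalized (illustrative sketch)."""

    def __init__(self, penalty: float, excluded_token_ids):
        self.inner = RepetitionPenaltyLogitsProcessor(penalty)
        self.excluded_token_ids = list(excluded_token_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        original = scores[:, self.excluded_token_ids].clone()  # snapshot excluded columns
        scores = self.inner(input_ids, scores)                 # penalize everything as usual
        scores[:, self.excluded_token_ids] = original          # undo the penalty for excluded ids
        return scores


tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM works for the demo
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("USER: Hi!\nASSISTANT:", return_tensors="pt")

# Exclude the tokenizer's special tokens; leave `repetition_penalty` unset so the
# stock processor is not added a second time by `generate`.
processors = LogitsProcessorList(
    [ExcludeTokensFromRepetitionPenalty(1.18, tokenizer.all_special_ids)]
)
output = model.generate(**inputs, logits_processor=processors, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```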
Thanks for the recap! Since I'm doing a lot of model comparisons and tests with multi-turn chats, I use deterministic settings (do_sample=false with oobabooga's text-generation-webui, or temperature=0, top_k=1, top_p=0, top_a=0 with llama.cpp/koboldcpp) to eliminate as many random factors as possible. I'm using repetition penalty 1.18, range 0, no slope. My intention is to test what the model considers the most likely generation, which isn't perfect of course, but short of running an infinite number of generations and picking the average, it's the best I could come up with. Always open to better ideas, though. That's just so you know my setup and the settings under which I've been observing, for many months, the issues I consider to be caused by repetition penalty. If you think no repetition penalty would be better (now that llama.cpp's tokenizer bug that messed up EOS and other special tokens is fixed - ggml-org/llama.cpp#3538 - which could have contributed to the excessive repetition issues so many Llama 2 models exhibited), I'd happily test going without repetition penalty.
@StefanDanielSchwarz thank you for your reply :) In a deterministic setup you should see improvements with a moderate repetition penalty like yours, as it is common for the model to repeat concepts (or even get into never-ending loops). The best would be a blind test with sampling, as is done in lmsys' chatbot arena. After a few hundred evals, it should be clear whether excluding special tokens from the repetition penalty makes a difference or not. Keep in mind that, if the difference is minimal, less code is better!
@gante But how would you handle a use case like code generation? Imagine a pair-programmer/co-pilot scenario, which I use a lot with ChatGPT/GPT-4: describe what program you want, the LLM gives you the code, you tell it what to change, and after a lot of back-and-forth, it's usable. The slightest repetition penalty could ruin that, so we'd probably need a way to exempt code blocks from repetition penalty. Same for RAG/retrieved memories, as @teknium1 mentioned.
@StefanDanielSchwarz It may indeed, but this is the first time I'm reading about this issue (keeping in mind that I'm tagged in everything generation-related). Regardless of my opinion, we need to see clear results before we add new code to transformers.
@gante Thanks for your feedback again. I agree with you about clear results being required. Hopefully this discussion has raised awareness of this (potential) issue. So far I've been quite happy with repetition penalty 1.18 and my deterministic settings, and the problems I noticed might be attributed to other factors like improper tokenization, quantization, or model-specific quirks. So I'll keep my eyes open and hope others do the same, so that if there is an actual issue, it will eventually be proven and fixed. Thank you all for your attention and please do keep up the great work! 😎👍
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Feature request
I started this issue in TGI, but it applies to all inference code that has some form of repetition penalty, so I will paste my feature request notes from there here as well; the original is at huggingface/text-generation-inference#1170.
Hello, I would like to propose a feature that allows you to set a list of tokens, or even token sequences, that can be excluded from repetition penalty calculations.
The reasoning for this is that, given a multi-turn prompt format such as:
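For illustration (this is a generic stand-in, not the exact example from the original issue), a multi-turn format of this kind looks something like:

```
USER: <first user message>
ASSISTANT: <first model reply></s>
USER: <second user message>
ASSISTANT:
```

The role markers, and the end-of-sequence token that closes each assistant turn, recur in every turn, so a plain per-token repetition penalty keeps pushing their scores down as the conversation grows.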
Or, even worse, a format like ChatML, which in the now-standard case uses <|im_end|> as a stopping token included in every turn. Given that these tokens all appear in every turn, it seems only logical that, especially when turns are only a few tokens long, repetition penalty will destroy the validity of these prompt formats.
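To make this concrete, a ChatML conversation looks like the following (placeholder messages); the wrapper tokens <|im_start|> and <|im_end|> and the role names repeat in every single turn:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hi, how can I help?<|im_end|>
<|im_start|>user
What time is it?<|im_end|>
<|im_start|>assistant
```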
While I haven't noticed this using Hermes 2, that may be solely because it gives long responses; if the average turn length is only a few tokens, the problem may become more prominent.
Motivation
Your contribution
The following is TGI's code. I haven't looked at transformers' code, but I assume the principles are the same:
I'm thinking we take the input_ids before they get scored and simply replace them with input_ids that have any excluded tokens removed, based on some setting holding a list of token ids (or token strings mapped to ids).
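A minimal sketch of that idea, written against the transformers snippet quoted earlier rather than TGI's code (the helper name is made up, and rather than actually removing ids from input_ids, which would break the tensor's rectangular shape, this version keeps the original score for excluded positions):

```python
import torch


def repetition_penalty_with_exclusions(scores, input_ids, penalty, excluded_ids):
    """Penalize previously seen tokens, except those listed in `excluded_ids`.

    scores: (batch, vocab) logits; input_ids: (batch, seq) token ids;
    excluded_ids: 1-D LongTensor of token ids to leave untouched.
    """
    score = torch.gather(scores, 1, input_ids)
    penalized = torch.where(score < 0, score * penalty, score / penalty)
    # Keep the original, unpenalized score wherever the token is excluded,
    # e.g. role markers or <|im_end|> in a chat format.
    keep_original = torch.isin(input_ids, excluded_ids)
    scores.scatter_(1, input_ids, torch.where(keep_original, score, penalized))
    return scores
```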