sampling : add XTC sampler #9742
Conversation
Adds the XTC sampler; it is not activated by default, but ships with recommended settings by default.
To be more in line with the original implementation, chance is calculated once at the beginning.
I'm not very familiar with the code base, but XTC is a very interesting idea to me so I took a read through to see how you implemented it, and left a few comments in case they are helpful. In case not, feel free to ignore. :)
I have tested this with a number of models of different sizes/types via text-generation-webui (oobabooga) on the day of release.
What is the theory behind the […]? Here's an example: imagine the input is […].
With the original XTC algorithm and […]. Now let's say you set […]. That doesn't seem right.
These examples and explanations are true for baseline models, but may not hold for finetuned models that were trained against clichéd phrases. Consider a model which, as a result of finetuning, is now less likely to output the phrase in your example: less likely doesn't mean "erased entirely", and such models may still be able to generate it, but with lower probability. In such a case XTC will remove a more probable (supposedly less clichéd) variant and increase the chance of choosing a previously less probable cliché. In the long run this can steer such a finetuned model back toward its baseline, "undoing" the effect of finetuning to some extent.

At the same time, XTC may still be useful for such a model if a middle range is carefully set. I would assume that it is possible to cut out a range where penalized clichés may still appear, further improving the model's capabilities. In practice, I have tested only a few such models, but I was getting better (as in, less typical) responses with a lowered upper threshold. You can look at various benchmarks that estimate creativity and, more relevant in our case, so-called "slop", which is exactly the type of phrases we want to avoid using XTC. It's worth testing XTC with such models too.

As a note, while testing the new fixed random, I've encountered refusals on […]
Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.
So the purpose of […]? I suspect that this parameter is useless in most cases, actively harmful unless extremely carefully calibrated for each model and input, and above all else, confusing and very difficult to interpret for most users.

The standard XTC threshold gives you a straightforward knob to tweak, with a clearly defined meaning: the probability that you consider "viable". By contrast, with […]. IMO, […]
No, it's a measure of a model's perceived creativity. If a model is already creative enough, more clichés will be less probable, and the range of tokens we perceive as undesirable will have a lower ceiling.

Look at it from a broader perspective: every model needs a direction to create an output a user might need. To achieve that, we can perform internal work and external work on that model. Internal work is training/finetuning: it affects the internal data of the model, and the results of such work are "baked" into the model itself. While not usually done, it is possible for any well-made model to output coherently (almost) without samplers. External work is sampling algorithms, RAG, etc. They are used to redirect the model towards a more desirable output for every input a user might have. We can also say that prompt engineering falls into this category (as opposed to creating training data pairs, for example, which would be internal work).

Now, let's look at XTC again: the initial idea was to remove the most "boring", repetitive tokens from consideration. This means that we see the results of the internal work done on the model as "incomplete" relative to what we consider the "creativity" of a model. To make it "complete", we use XTC and get exactly (or almost exactly) what we need. However, if the internal work done on a model has already accounted for that creativity being "incomplete", and further training was performed, that model by itself will be closer to what we want to achieve. We see the base output of such a model as "more creative" in comparison to another (usually, the base/instruct version) model. This also means that undesirable tokens will have lower probabilities during sampling. With the original XTC, however, we will still penalize top tokens only. There are no mechanisms to control its behavior towards upper top tokens, because raising the bar affects lower top tokens, and changing the probabilities affects all top tokens.

This means that we don't differentiate between models that are more creative and models that are less creative: a lower threshold cannot reflect that, as it controls lower top tokens, not the upper ones. By introducing […]
I would argue that functionality is king: giving users more options and more control over the tools they use is more important than keeping algorithms simple. KISS is important, but it should not hurt usability. In this case: original XTC is a sampler that works with ranges of tokens from the top only, with […]
1. Scan tokens from the top till the first non-penalizable one.
2. Remove the last captured token (the least probable above threshold).
3. Shift all tokens to override the remaining penalizable ones.
4. Penalize and put them at the bottom.
I have reworked the algorithm, given that std::sort is no longer used. It should be ready now.
After tinkering with the algorithm more, I've tested two options. The more elegant one is to iterate through all tokens from back to front, which makes capturing the last top token easier:

```cpp
std::vector<llama_token_data> tokens_left;
bool last_captured = false;
for (int i = (cur_p->size - 1); i >= 0; --i) {
    if (cur_p->data[i].p < ctx->threshold || cur_p->data[i].p > ctx->threshold_max) {
        tokens_left.emplace_back(cur_p->data[i]);
    } else if (!last_captured) {
        // keep the least probable token inside the penalizable range
        tokens_left.emplace_back(cur_p->data[i]);
        last_captured = true;
    }
}
if (last_captured) {
    int to_remove = cur_p->size - tokens_left.size();
    if (to_remove >= 1 && tokens_left.size() >= ctx->min_keep) {
        memcpy(cur_p->data, tokens_left.data(), tokens_left.size()*sizeof(llama_token_data));
        cur_p->size = tokens_left.size();
        // since tokens are in reverse order, we'll need to sort them later
        cur_p->sorted = false;
    }
}
```

The problem is, it requires sorting later, which will be performed by […]. The second option iterates front to back, keeping the order intact:

```cpp
std::vector<llama_token_data> tokens_left;
bool last_captured = false;
for (size_t i = 0; i < cur_p->size; ++i) {
    if (cur_p->data[i].p < ctx->threshold || cur_p->data[i].p > ctx->threshold_max) {
        // capturing the last top token
        if (!last_captured && i > 0 && cur_p->data[i-1].p >= ctx->threshold) {
            tokens_left.emplace_back(cur_p->data[i-1]);
            last_captured = true;
        }
        tokens_left.emplace_back(cur_p->data[i]);
    }
}
// in case all candidates are penalizable
if (cur_p->data[cur_p->size - 1].p >= ctx->threshold) {
    tokens_left.emplace_back(cur_p->data[cur_p->size - 1]);
}
if (tokens_left.size() < ctx->min_keep) {
    return;
}
size_t to_remove = cur_p->size - tokens_left.size();
if (to_remove >= 1) {
    memcpy(cur_p->data, tokens_left.data(), tokens_left.size()*sizeof(llama_token_data));
    cur_p->size = tokens_left.size();
}
```

However, the second one feels bulky. Which one is more acceptable? @slaren can I request your help here?
If I understand the algorithm correctly, this implementation seems too complicated. Since the tokens are sorted by probability, you should only need to first find the index of the last token to remove, and then move the remaining tokens to the top of the list. The implementation in textgen-webui also seems to completely disable itself if any special tokens would be removed, which I imagine is important.
It will be slightly more complex due to the addition of the max threshold, but yes, I understand. I have also added XTC into […]
It's not a part of the original idea, and I haven't encountered any problems with special tokens in llama.cpp specifically - that's why I left it out for now. If any problems are found later, I'll definitely add it. Thank you!
Wouldn't doing so bias the responses to be shorter, on average? If EOT is a special token and can be elevated in probability by XTC when other (more likely) tokens are removed, but never decreased in probability by itself being removed, it seems like responses will be shorter. I've tried this implementation a bit now and it does occasionally ramble a bit at the end of a response, especially when generating code (not a great use for XTC but entertaining at least). That might be caused by removing EOT, so perhaps there's good reason to avoid it. But I would definitely prefer an occasional ramble over shortened responses.
That's an interesting proposal, but it sounds like a completely new sampler, rather than an addition to XTC. I recommend submitting that mechanism as a separate sampler under a new name. It's pretty much the inverse of XTC in the way it removes tokens. |
Yes, I tested more in the meantime and see potential in two ideas: […]
examples/main/README.md
By removing top tokens XTC can improve the variety of answers, break writing clichés and inhibit repetition, since clichés and repeated phrases are usually more likely to appear. By keeping the last token above the threshold, XTC ensures that the answer is still coherent. XTC is meant to be used for creative tasks, but feel free to experiment with different settings for different models.
Being experimental and unique, XTC is disabled by default. The recommended combination of samplers is Min-P followed by XTC on its default settings: `--sampling-seq mx --min-p 0.02 -xtc-p 0.5`.
Is it still necessary to pass an explicit sampler chain via `--sampling-seq` in order to activate XTC? I thought that it is now in the sampler chain by default, and disabled by having `xtc_probability` set to 0.
@p-e-w While XTC is included in the sampler queue by default, it is placed after all other truncating samplers. As such, the recommended combination of samplers, as per your words in oobabooga/text-generation-webui#6335, requires passing the sampler chain explicitly.
I don't get it. The order seems correct by default (XTC after truncation, which Min-P is). And all samplers are set to neutral (off) by default, right? So what does `--sampling-seq mx` do that wouldn't happen otherwise?
@p-e-w Not all samplers: Top-K is 40 by default and Top-P is 0.95. So, either they need to be set to 0 and 1.0 respectively, or (which is easier and more logical) the sampling queue should be limited to only the samplers we need.
😮 I had no idea! So llama.cpp users are getting crap samplers from the Stone Age without even realizing it. That's terrible. I would have expected a clean slate that samples from the raw model distribution unless parameters are set explicitly.
Anyway, this means your example command is correct, of course.
If I remember correctly, this topic has already been brought up some time ago in other PRs. However, since in most cases llama.cpp is used as a library through another app or as a server, this issue mostly affects llama-cli users. You can look at `index-new.html` (the new server UI): it has different "default" values with `top_k` and `top_p` turned off, and I assume any other frontend will have a payload with all parameters set as needed.
But yes, this is an issue.
Llama-cli user here via scripts... thanks for bringing this up at any rate. It improved my (custom) benchmark scores just by adjusting settings to disable Top-K and Top-P.
I've seen the default params adjusted a few times for llama-cli; this feels like it would be a good change.
```cpp
    delete (llama_sampler_xtc *) smpl->ctx;
}

static void llama_sampler_xtc_reset(struct llama_sampler * smpl) {
```
What is the purpose of this function?
AFAIK this is necessary to properly reset the seed and maintain repeatability, as recommended by @slaren earlier.
src/llama-sampling.cpp (outdated)

```cpp
int pos_last = 0;

for (size_t i = 0; i < cur_p->size; ++i) {
    if (cur_p->data[i].p - ctx->threshold >= -1e-5) {
```
Why this epsilon instead of a regular comparison?
I added it after running tests - they were failing due to a precision problem.
Ugh, no. Never change correct code to satisfy tests. If the code is semantically correct, and the tests don't pass, the tests need to be adapted to account for things such as floating point shenanigans. But changing the code itself is always wrong, unless of course the code has a bug.
The correct way to express this condition is

```cpp
if (cur_p->data[i].p >= ctx->threshold) {
```

Nothing else will do. And the tests need to work with that, or the tests are wrong.
I'm not sure it should be considered a problem with the tests (the precision problem is a wider topic), but alright.
My basic point is that the tests serve the code, not the other way round. Code is only ever changed in response to failing tests if the code is found to have a bug.
I had a similar problem with tests for an experimental sampler I wrote for llama.cpp a while ago, and I was able to work around it by using a different set of token probabilities in the tests that I constructed specifically so that after probability renormalization, the resulting values were exactly representable in floating point.
I did another full review of the code and while some parts are quite subtle, I'm now pretty confident that the implementation is correct (though I did not run it myself yet).
@slaren: This is ready to go as far as I'm concerned.
```cpp
}

if (cur_p->size - pos_last >= ctx->min_keep && pos_last > 0) {
    cur_p->data += pos_last;
```
This may potentially break 3rd-party code that expects this pointer to be unchanged (e.g. to free it after sampling). I don't think this is necessarily a problem, but we should make it clear that this pointer may be changed by the samplers, and applications should not rely on it being unchanged.
* Initial XTC commit: Adds XTC sampler, not activated by default, but recommended settings by default.
* Cleanup
* Simplified chances calculation: To be more in line with the original implementation, chance is calculated once at the beginning.
* First fixes by comments: Still need to look into sorting
* Fixed trailing backspaces
* Fixed RNG to be reproducible: Thanks to @slaren for directions
* Fixed forgotten header
* Moved `min_keep`: Moved from conditions to a simple check at the end.
* Fixed broken randomization: Thanks to @slaren for explanation
* Swapped sorting for a custom algorithm: Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.
* Algorithm rework: 1. Scan tokens from top till the first non-penalizable 2. Remove the last captured token (the least probable above threshold) 3. Shift all tokens to override the remaining penalizable 4. Penalize and put them at the bottom.
* Added XTC to `test-sampling`
* Simplified algorithm and more tests
* Updated info in common and args
* Merged back lost commits in common and arg
* Update dump info in common
* Fixed incorrect min_keep check
* Added XTC to README
* Renamed parameters, fixed info and defaults
* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off
* Initial server support
* Added XTC to server UIs
* Fixed labels in old server UI
* Made algorithm safer and more readable
* Removed xtc_threshold_max
* Fixed arg after update
* Quick fixes by comments
* Simplified algorithm since threshold_max is removed
* Renamed random distribution
* Fixed tests and outdated README
* Small fixes
This is my implementation of the XTC sampler as described by @p-e-w in oobabooga/text-generation-webui#6335; the main difference is the `threshold_max` parameter, which seems to help with finetunes. An additional "thank you" goes to @LostRuins, who implemented XTC a bit earlier than me (the probabilities + logits system is a bit confusing at first).

Technically, this sampler, with probability `xtc_p`, penalizes every top token above the threshold `xtc_t` and below the upper threshold `xtc_t_max`, keeping the least probable of them. This means that at least one "most probable" token is always left untouched, which should keep things stable. For more details and explanations, it is better to read @p-e-w's original PR, as it is very thorough and comprehensive.
Additionally, I have not encountered the problems with `eos` tokens that were described in discussions of the original implementation, so I have not added any checks for them.

Being a unique sampler, XTC is not included in the queue by default (like the Mirostat samplers).
In my experience, the effect varies between models and even sizes of models. XTC improves creativity greatly, but may break models in specific cases (for example, the multi-language capabilities of Nemo can be slightly worsened for inflective languages). With larger models XTC may require a very careful choice of parameters - I have seen more refusals in Mistral Large 123b with default XTC settings, for example (although not with my implementation, but through Kobold.cpp).
In general, however, XTC is a really interesting idea and a sampler worth testing.