sampling : add XTC sampler #9742

Merged (49 commits) on Oct 15, 2024

Conversation

@MaggotHATE (Contributor) commented Oct 4, 2024

This is my implementation of the XTC sampler as described by @p-e-w in oobabooga/text-generation-webui#6335; the main difference is the threshold_max parameter, which seems to help with finetunes. An additional "thank you" goes to @LostRuins, who implemented XTC a bit earlier than me (the probabilities + logits system is a bit confusing at first).

XTC is a novel sampler that turns truncation on its head: Instead of pruning the least likely tokens, under certain circumstances, it removes the most likely tokens from consideration.

Technically, this sampler:

  • rolls against probability xtc_p to decide whether to apply XTC at all;
  • searches for all tokens with probabilities above the threshold xtc_t and below the upper threshold xtc_t_max;
  • penalizes and removes all such tokens except the least probable of them (a sketch follows below).

This means that at least one "most probable" token is always left untouched, which should keep things stable. For more details and explanations it is better to read @p-e-w 's original PR as it is very thorough and comprehensive.
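
As an illustrative sketch of the idea only (self-contained and simplified; `token_prob`, `xtc_sketch` and the parameter names are made up for this example, and min_keep / special-token handling is omitted):

    #include <cstddef>
    #include <random>
    #include <vector>

    struct token_prob { int id; float p; };

    // Sketch of XTC: with probability xtc_p, remove every candidate whose probability
    // lies in [xtc_t, xtc_t_max], except the least probable one of them.
    // `cands` is assumed to be sorted by probability in descending order.
    static void xtc_sketch(std::vector<token_prob> & cands,
                           float xtc_p, float xtc_t, float xtc_t_max, std::mt19937 & rng) {
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        if (dist(rng) >= xtc_p) {
            return; // dice roll failed: leave the candidates untouched
        }

        // find the contiguous run of candidates with xtc_t <= p <= xtc_t_max
        std::size_t first = 0;
        while (first < cands.size() && cands[first].p > xtc_t_max) { ++first; }
        std::size_t last = first;
        while (last < cands.size() && cands[last].p >= xtc_t) { ++last; } // one past the run

        if (last - first >= 2) {
            // drop all in-range candidates except the least probable one (index last - 1)
            cands.erase(cands.begin() + first, cands.begin() + (last - 1));
        }
        // renormalization of the remaining probabilities happens in a later sampling step
    }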

Additionally, I have not encountered the problems with eos tokens that were described in discussions of the original implementation, so I have not added any checks for them.

Being a unique sampler, XTC is not included in the sampling queue by default (just like the Mirostat samplers).

In my experience, the effect varies between models and even between sizes of the same model. XTC improves creativity greatly, but may break models in specific cases (for example, the multilingual capabilities of Nemo can be slightly worsened for inflective languages). With larger models XTC may require a very careful choice of parameters - I have seen more refusals on Mistral Large 123b with default XTC settings, for example (although not with my implementation, but through Kobold.cpp).

In general, however, XTC is a really interesting idea and a sampler worth testing.

Adds XTC sampler, not activated by default, but recommended settings by default.
To be more in line with the original implementation, chance is calculated once at the beginning.
@pnb (Contributor) left a comment:

I'm not very familiar with the code base, but XTC is a very interesting idea to me so I took a read through to see how you implemented it, and left a few comments in case they are helpful. In case not, feel free to ignore. :)

@MaggotHATE marked this pull request as draft October 4, 2024 15:19
@David-AU-github commented:

I have tested this with a number of models of different sizes and types via Text Generation WebUI (oobabooga) on the day of release.
This works very well and is a strong addition to llama.cpp.

@p-e-w commented Oct 5, 2024

What is the theory behind the xtc_t_max parameter? It seems to me that in many cases, that parameter would do the opposite of what XTC is supposed to do: Reduce creativity compared to not using XTC at all.

Here's an example: Imagine the input is ... shiver down his, and the predictions from the model are

spine 0.65
back 0.15
throat 0.12
...

With the original XTC algorithm and xtc_t set to 0.1, spine and back will be eliminated, leaving throat and the rest of the distribution. The behavior is more creative because the most clichéd continuations are gone.

Now let's say you set xtc_t to 0.1 again, and you also set the xtc_t_max parameter from this PR to 0.6. In that case, the algorithm will eliminate only back, leaving spine because it is above the upper threshold. The probability mass of back is distributed to the remaining tokens after renormalization, and the most clichéd continuation (spine), the one we want to avoid above all else, is now even more likely than it would have been if we had done nothing at all.
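
To make the renormalization effect concrete (just restating the numbers above): removing back frees 0.15 of probability mass, so after renormalization spine rises from 0.65 to 0.65 / (1 - 0.15) ≈ 0.76.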

That doesn't seem right.

@MaggotHATE (Contributor, Author) replied:

Here's an example: Imagine the input is ... shiver down his

This example and explanation hold for baseline models, but may not apply to finetuned models that were trained against clichéd phrases.

Consider a model which, as a result of finetuning, is now less likely to output the phrase in your example: less likely doesn't mean "erased entirely", and such a model may still generate it, just with lower probability. In that case XTC will remove a more probable (supposedly less clichéd) variant and increase the chance of choosing a previously less probable cliché. In the long run this can steer such a finetuned model back toward its baseline, "undoing" the effect of finetuning to some extent.

At the same time, XTC may still be useful for such a model if a middle range is carefully set. I would assume that it is possible to cut out a range where penalized clichés may still appear, further improving the model's capabilities. In practice I have tested only a few such models, but I was getting better (as in, less typical) responses with a lowered upper threshold.

You can look at various benchmarks that estimate creativity and, more relevant in our case, so-called "slop", which is exactly the type of phrasing we want to avoid by using XTC. It's worth testing XTC with such models too.

As a note: while testing the newly fixed RNG, I encountered refusals on mini-magnum-12b-v1.1.Q8_0 that never happened even on baseline Nemo Instruct, with parameters M=0.020->xtc=0.20-0.99 (99% chance); for statistics I was counting tokens that came through the XTC sampler and those that were removed: 611/37275 (1.64%). While this is an extreme example (99% chance), it shows that it is possible to get an undesirable result in some corner cases. The addition of the upper threshold simply makes XTC a bit more flexible without breaking it (the default value is still 1.0).

Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.
@p-e-w commented Oct 6, 2024

Consider a model which, as a result of finetuning, is now less likely to output the phrase in your example: less likely doesn't mean "erased entirely", and such a model may still generate it, just with lower probability. In that case XTC will remove a more probable (supposedly less clichéd) variant and increase the chance of choosing a previously less probable cliché. In the long run this can steer such a finetuned model back toward its baseline, "undoing" the effect of finetuning to some extent.

So the purpose of xtc_t_max is to act as a countermeasure so that the finetune's anti-cliché countermeasures don't get overridden by XTC, which is itself an anti-cliché countermeasure? That's highly convoluted.

I suspect that this parameter is useless in most cases, actively harmful unless extremely carefully calibrated for each model and input, and above all else, confusing and very difficult to interpret for most users. The standard XTC threshold gives you a straightforward knob to tweak, with a clearly defined meaning: The probability that you consider "viable". By contrast, with xtc_t_max, you have to guess to what degree the finetune has already suppressed clichés, and that degree might not even be uniform across all clichés. Essentially, xtc_t_max means nothing at all. It's just a value that you can fiddle with in the hope that you will like the output better.

IMO, xtc_t_max is a philosophical mismatch for XTC, injecting complexity and volatility into an otherwise simple and straightforward sampler. I considered several additional parameters while designing XTC, but in the end decided against including any of them, because simplicity is king, and interpretability is what makes parameters useful in the first place.

@MaggotHATE (Contributor, Author) replied:

So the purpose of xtc_t_max is to act as a countermeasure so that the finetune's anti-cliché countermeasures don't get overridden by XTC, which is itself an anti-cliché countermeasure?

No, it is a measure of a model's perceived creativity. If a model is already creative enough, clichés will be less probable, and the range of tokens we perceive as undesirable will have a lower ceiling.

Look at it from a broader perspective: every model needs direction to produce the output a user wants. To achieve that, we can perform internal work and external work on the model.

Internal work is training/finetuning: it affects the internal data of the model, and the results of such work are "baked" into the model itself. While it is not usually done, any well-made model can output coherently (almost) without samplers.

External work is sampling algorithms, RAG, etc. They are used to steer the model towards a more desirable output for every input a user might have. We can also say that prompt engineering falls into this category (as opposed to creating training data pairs, for example, which would be internal work).

Now, let's look at XTC again: the initial idea was to remove the most "boring", repetitive tokens from consideration. This means that we see the results of the internal work done on the model as "incomplete" with respect to what we consider the model's "creativity". To make it "complete", we use XTC and get exactly (or almost exactly) what we need.

However, if the internal work done on a model has already accounted for that "incomplete" creativity, and further training was performed, that model will by itself be closer to what we want to achieve. We see the base output of such a model as "more creative" in comparison to another model (usually the base/instruct version). This also means that undesirable tokens will have lower probabilities during sampling.

With the original XTC, however, we still penalize only the top tokens. There is no mechanism to control its behavior towards the uppermost tokens, because raising the threshold affects the lower top tokens, and changing the probabilities affects all top tokens. This means that we don't differentiate between models that are more creative and models that are less creative: the lower threshold cannot reflect that, as it controls the lower top tokens, not the upper ones.

By introducing xtc_t_max we give the user more control over this perceived creativity, making it easier to find per-model settings. It has always been the case that a preset may be sufficient for one model but not for another. Considering the large number of models nowadays, xtc_t_max is an important addition and useful flexibility.

because simplicity is king

I would argue that functionality is king: giving users more options and more control over the tools they use is more important than keeping algorithms simple. KISS is important, but it should not hurt usability. In this case: the original XTC is a sampler that works only with ranges of tokens from the top, while with xtc_t_max it works with any range of tokens, from the top or in the middle. Users who don't need to fiddle with it will keep it at the default, while those who want to experiment will be able to try middle ranges. The possibilities are doubled.

1. Scan tokens from the top till the first non-penalizable one
2. Remove the last captured token (the least probable above the threshold)
3. Shift all tokens to override the remaining penalizable ones
4. Penalize and put them at the bottom.
@MaggotHATE marked this pull request as ready for review October 6, 2024 11:27
@MaggotHATE (Contributor, Author) commented:

I have reworked the algorithm; std::sort is no longer used. It should be ready now.

@MaggotHATE requested a review from slaren October 6, 2024 11:31
@MaggotHATE (Contributor, Author) commented Oct 7, 2024

After tinkering with the algorithm more, I've tested two options.

The more elegant one is to iterate through all tokens from back to front, which makes capturing the last top token easier:

    std::vector<llama_token_data> tokens_left;
    bool last_captured = false;

    // iterate from the least probable token to the most probable one
    for (int i = (cur_p->size - 1); i >= 0; --i) {
        if (cur_p->data[i].p < ctx->threshold || cur_p->data[i].p > ctx->threshold_max) {
            // outside the penalizable range: keep the token
            tokens_left.emplace_back(cur_p->data[i]);
        } else if (!last_captured) {
            // first in-range token seen from the back, i.e. the least probable one above
            // the threshold: keep it; all later (more probable) in-range tokens are dropped
            tokens_left.emplace_back(cur_p->data[i]);
            last_captured = true;
        }
    }

    if (last_captured) {
        int to_remove = cur_p->size - tokens_left.size();

        if (to_remove >= 1 && tokens_left.size() >= ctx->min_keep) {
            memcpy(cur_p->data, tokens_left.data(), tokens_left.size()*sizeof(llama_token_data));

            cur_p->size = tokens_left.size();
            // since tokens are in reverse order, we'll need to sort them later
            cur_p->sorted = false;
        }
    }

The problem is that it leaves the tokens in reverse order, so they need to be sorted again later (softmax will do it). That is expensive, and we can avoid it by going from the front:

    std::vector<llama_token_data> tokens_left;
    bool last_captured = false;

    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].p < ctx->threshold || cur_p->data[i].p > ctx->threshold_max) {
            // capturing the last top token: if the previous token was still inside the
            // penalizable range, it is the least probable one above the threshold, so keep it
            if (!last_captured && i > 0 && cur_p->data[i-1].p >= ctx->threshold && cur_p->data[i-1].p <= ctx->threshold_max) {
                tokens_left.emplace_back(cur_p->data[i-1]);
                last_captured = true;
            }

            // outside the penalizable range: keep the token
            tokens_left.emplace_back(cur_p->data[i]);
        }
    }

    // in case all remaining candidates are penalizable, keep the very last (least probable) one
    if (cur_p->data[cur_p->size - 1].p >= ctx->threshold && cur_p->data[cur_p->size - 1].p <= ctx->threshold_max) {
        tokens_left.emplace_back(cur_p->data[cur_p->size - 1]);
    }

    if (tokens_left.size() < ctx->min_keep) return;

    size_t to_remove = cur_p->size - tokens_left.size();

    if (to_remove >= 1) {
        memcpy(cur_p->data, tokens_left.data(), tokens_left.size()*sizeof(llama_token_data));
        cur_p->size = tokens_left.size();
    }

However, the second one feels bulky. Which one is more acceptable? @slaren can I request your help here?

@slaren (Member) commented Oct 7, 2024

If I understand the algorithm correctly, this implementation seems too complicated. Since the tokens are sorted by probability, you should only need to first find the index of the last token to remove, and then move the remaining tokens to the top of the list. The implementation in textgen-webui also seems to completely disable itself if any special tokens would be removed, which I imagine is important.
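
A minimal sketch of that simpler approach (illustrative only: it ignores the upper threshold and the probability roll, and mirrors what the merged version appears to do based on the diff excerpts further down):

    // cur_p->data is sorted by probability, descending
    size_t pos_last = 0;

    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].p >= ctx->threshold) {
            pos_last = i; // index of the last (least probable) token above the threshold
        } else {
            break;
        }
    }

    // drop everything before the last above-threshold token by advancing the data pointer
    if (cur_p->size - pos_last >= ctx->min_keep && pos_last > 0) {
        cur_p->data += pos_last;
        cur_p->size -= pos_last;
    }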

@MaggotHATE (Contributor, Author) commented Oct 7, 2024

you should only need to first find the index of the last token to remove, and then move the remaining tokens to the top of the list

It will be slightly more complex due to the addition of the max threshold, but yes, I understand. I have also added XTC to test-sampling as a self-check.

disable itself if any special tokens would be removed

It's not a part of the original idea, and I haven't encountered any problems with special tokens in llama.cpp specifically - that's why I left it out for now. If any problems are found later, I'll definitely add it. Thank you!

@pnb (Contributor) commented Oct 7, 2024

... The implementation in textgen-webui also seems to completely disable itself if any special tokens would be removed, which I imagine is important.

Wouldn't doing so bias the responses to be shorter, on average? If EOT is a special token and can be elevated in probability by XTC when other (more likely) tokens are removed, but never decreased in probability by itself being removed, it seems like responses will be shorter.

I've tried this implementation a bit now and it does occasionally ramble a bit at the end of a response, especially when generating code (not a great use for XTC but entertaining at least). That might be caused by removing EOT, so perhaps there's good reason to avoid it. But I would definitely prefer an occasional ramble over shortened responses.

@p-e-w commented Oct 14, 2024

This can be done by penalizing exactly one top token if threshold is above 0.5, I will add it as a separate PR once this gets merged.

That's an interesting proposal, but it sounds like a completely new sampler, rather than an addition to XTC. I recommend submitting that mechanism as a separate sampler under a new name. It's pretty much the inverse of XTC in the way it removes tokens.

@MaggotHATE (Contributor, Author) commented Oct 14, 2024

I recommend submitting that mechanism as a separate sampler under a new name.

Yes, I tested more in the meantime and see potential in two ideas:

  1. A simple range-based removal would allow penalizing from 1.0 down to 0.8, for example, doing exactly that.
  2. A top-token disposition check would test whether the difference between the first two tokens is higher than X (for example, 0.65) and, if true, remove the top token. This would allow removing only the most "dominating" tokens while keeping those with more variety (for example, candidates at 40%, 40%, 10%, etc.) - or vice versa if, again, a range is defined (see the sketch below).
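
An illustrative sketch of the second idea only (not part of this PR; `gap` is a made-up parameter name):

    // remove the single top token when it dominates the runner-up by more than `gap`
    // (e.g. gap = 0.65); cur_p->data is sorted by probability, descending
    if (cur_p->size >= 2 && cur_p->data[0].p - cur_p->data[1].p > gap) {
        cur_p->data += 1;
        cur_p->size -= 1;
    }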


By removing top tokens, XTC can improve the variety of answers, break writing clichés and inhibit repetition, since clichés and repeated phrases are usually more likely to appear. By keeping the last token above the threshold, XTC ensures that the answer is still coherent. XTC is meant to be used for creative tasks, but feel free to experiment with different settings for different models.

Being experimental and unique, XTC is disabled by default. The recommended combination of samplers is Min-P followed by XTC on its default settings: `--sampling-seq mx --min-p 0.02 -xtc-p 0.5`.
@p-e-w commented:

Is it still necessary to pass an explicit sampler chain via --sampling-seq in order to activate XTC? I thought that it is now in the sampler chain by default, and disabled by having xtc_probability set to 0.

@MaggotHATE (Contributor, Author) replied:

@p-e-w While XTC is included in the sampler queue by default, it is placed after all other truncating samplers. As such, the recommended combination of samplers, as per your words in oobabooga/text-generation-webui#6335, requires passing the sampler chain explicitly.

@p-e-w replied:

I don't get it. Seems like the order is correct by default (XTC after truncation, which Min-P is). And all samplers are set to neutral (off) by default, right? So what does --sampling-seq mx do that wouldn't happen otherwise?

@MaggotHATE (Contributor, Author) replied:

@p-e-w Not all samplers - Top K is 40 by default and Top P is 0.95. So either they need to be set to 0 and 1.0 respectively, or (which is easier and more logical) the sampling queue should be limited to only the samplers we need.
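
For illustration, the two options might look like this (hypothetical invocations that reuse the flags quoted elsewhere in this thread; the model path is a placeholder):

    # limit the queue to Min-P followed by XTC
    llama-cli -m model.gguf --sampling-seq mx --min-p 0.02 -xtc-p 0.5
    # or keep the default queue but neutralize Top-K and Top-P explicitly
    llama-cli -m model.gguf --top-k 0 --top-p 1.0 --min-p 0.02 -xtc-p 0.5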

@p-e-w replied:

😮 I had no idea! So llama.cpp users are getting crap samplers from the Stone Age without even realizing it. That's terrible. I would have expected a clean slate that samples from the raw model distribution unless parameters are set explicitly.

Anyway, this means your example command is correct, of course.

@MaggotHATE (Contributor, Author) replied:

If I remember correctly, this topic has already been brought up some time ago in other PRs. However, since in most cases llama.cpp is used as a library through another app or as a server, this issue mostly affects llama-cli users. Look at index-new.html (the new server UI): it has different "default" values, with top_k and top_p turned off, and I assume any other frontend will send a payload with all parameters set as needed.

But yes, this is an issue.

@strawberrymelonpanda (Contributor) commented Oct 15, 2024:

Llama-cli user here via scripts... thanks for bringing this up at any rate. It improved my (custom) benchmark scores just by adjusting settings to disable Top K and Top P.

I've seen the default params adjusted a few times for llama-cli; it feels like this would be a good change.

delete (llama_sampler_xtc *) smpl->ctx;
}

static void llama_sampler_xtc_reset(struct llama_sampler * smpl) {

What is the purpose of this function?

@MaggotHATE (Contributor, Author) replied:

AFAIK this is necessary to properly reset the seed and maintain repeatability, as recommended by @slaren earlier.

int pos_last = 0;

for (size_t i = 0; i < cur_p->size; ++i) {
if (cur_p->data[i].p - ctx->threshold >= -1e-5) {

Why this epsilon instead of a regular comparison?

@MaggotHATE (Contributor, Author) replied:

I added it after running tests - they were failing due to a precision problem.


Ugh, no. Never change correct code to satisfy tests. If the code is semantically correct, and the tests don't pass, the tests need to be adapted to account for things such as floating point shenanigans. But changing the code itself is always wrong, unless of course the code has a bug.

The correct way to express this condition is

if (cur_p->data[i].p >= ctx->threshold) {

Nothing else will do. And the tests need to work with that, or the tests are wrong.

@MaggotHATE (Contributor, Author) replied:

I'm not sure it should be considered a problem with the tests (the precision problem is a wider topic), but alright.


My basic point is that the tests serve the code, not the other way round. Code is only ever changed in response to failing tests if the code is found to have a bug.

I had a similar problem with tests for an experimental sampler I wrote for llama.cpp a while ago, and I was able to work around it by using a different set of token probabilities in the tests that I constructed specifically so that after probability renormalization, the resulting values were exactly representable in floating point.
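
For illustration (not the actual values used): a distribution like 0.5, 0.25, 0.125, 0.125 has that property, since every value is a negative power of two; removing the 0.5 entry and renormalizing the rest yields 0.5, 0.25, 0.25, which are again exactly representable.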

@p-e-w left a comment:

I did another full review of the code and while some parts are quite subtle, I'm now pretty confident that the implementation is correct (though I did not run it myself yet).

@slaren: This is ready to go as far as I'm concerned.

}

if (cur_p->size - pos_last >= ctx->min_keep && pos_last > 0) {
cur_p->data += pos_last;
A project member commented:

This may potentially break 3rd party code that expects this pointer to be unchanged (eg. to free it after sampling). I don't think this is necessarily a problem, but we should make it clear that this pointer may be changed by the samplers, and applications should not rely on it being unchanged.
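
A minimal sketch of the implication for application code (hypothetical names; it assumes the application allocated the candidates array itself):

    // save the original pointer before sampling: samplers such as XTC may advance cur_p.data
    llama_token_data * original = cur_p.data;
    // ... run the sampler chain ...
    free(original); // release the saved pointer (however it was allocated), not cur_p.data, which may have moved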

@slaren merged commit fbc98b7 into ggml-org:master Oct 15, 2024
52 checks passed
drollings pushed a commit to drollings/llama.cpp that referenced this pull request Oct 18, 2024
* Initial XTC commit

Adds XTC sampler, not activated by default, but recommended settings by default.

* Cleanup

* Simplified chances calculation

To be more in line with the original implementation, chance is calculated once at the beginning.

* First fixes by comments

Still need to look into sorting

* Fixed trailing backspaces

* Fixed RNG to be reproducible

Thanks to @slaren for directions

* Fixed forgotten header

* Moved `min_keep` 

Moved from conditions to a simple check at the end.

* Fixed broken randomization

Thanks to @slaren for explanation

* Swapped sorting for a custom algorithm

Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.

* Algorithm rework

1. Scan tokens from the top till the first non-penalizable one
2. Remove the last captured token (the least probable above the threshold)
3. Shift all tokens to override the remaining penalizable ones
4. Penalize and put them at the bottom.

* Added XTC to `test-sampling`

* Simplified algorithm and more tests

* Updated info in common and args

* Merged back lost commits in common and arg

* Update dump info in common

* Fixed incorrect min_keep check

* Added XTC to README

* Renamed parameters, fixed info and defaults

* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off

* Initial server support

* Added XTC to server UIs

* Fixed labels in old server UI

* Made algorithm safer and more readable

* Removed xtc_threshold_max

* Fixed arg after update

* Quick fixes by comments

* Simplified algorithm since threshold_max is removed

* Renamed random distribution

* Fixed tests and outdated README

* Small fixes
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Labels: examples, server, testing
Projects: none yet
9 participants