-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
llama : remove Tail-Free sampling #10071
Conversation
Yes please. Despite its reasonable-sounding theoretical foundations, TFS has not stood the test of time and the collective experience of tens of thousands of power users has shown that it is soundly beaten by Min-P in practice, at a fraction of its complexity. FWIW, I believe that Typical and Mirostat should meet the same fate. Judging from presets posted in forums and model cards, they're very rarely used nowadays, and Mirostat in particular is difficult to build an intuition for. The lesson appears to be that samplers using entropy/perplexity as a control metric are overlooking one or multiple key properties that strongly determine output quality. A combination of Min-P and XTC can do anything those samplers can do and more, and do it in a way that is comprehensible to a human operator thinking about probability distributions. Heck, even Top-P and Top-K (for values > 1) should probably be deprecated and show a warning when used. Min-P fixes the conceptual flaws of both of them and is flat out superior. AFAIK, those samplers originate from the early OpenAI days and were then incorporated into HF Transformers, after which everyone else cargo culted them into every frontend and backend. Their continued availability (and the fact that they are presented at the very top of most frontends' sampler settings) really hurts the experience of new users, who probably shouldn't be using any sampler other than Temperature, Min-P, XTC, and DRY. |
@p-e-w Thanks for the insights. I will use this opportunity to ask you about what do you think is the best sampling strategy when the LLM is used for fill-in-the-middle (FIM) tasks (e.g. code completion). Do you have any experience and recommendations? My intuition is that one would generally want greedy sampling in this case (since we are not looking for variety and/or creativity), but with some conservative way to terminate early as soon as the LLM is not "confident" about the next token. Do you agree and if yes, what would be a good approach to model this during sampling? |
Although I too am very interested in FIM, I don't have any practical experience with it yet. That being said, coding appears to be the primary application of FIM, and with coding one would very rarely want anything other than the model's "best guess", i.e. greedy sampling. Note that paradoxically, greedily sampling the most likely token at each position does not guarantee that the most likely sequence of tokens is being sampled, but this cannot be fixed with the way we currently do sampling. The termination strategy is indeed the crux of the matter. While the FIM models I've seen do generate EOS (or a special end-of-infill token) when the infill is "complete", for coding, you often don't want them to go that far. Intuitively, you may want the model to infill a unit of code, for example an indentation block. As such, I expect that a hybrid heuristic composed of bracket/indentation matching and a maximum output length could work in practice. Stopping when the confidence is low (e.g. measured by some entropy function on the distribution) might be a double-edged sword. Consider this infill task: int i = 0;
int j = 0;
for ([FIM]
if (j < i) {
...
} It's likely that there will be one or two loops involving Instead, a frontend might use a block-based termination strategy (e.g. stop when encountering a |
I think better methods for evaluating samplers are sorely needed. The in my opinion biggest difficulty lies in acquiring enough data to make statistically significant statements. I see two ways that could maybe work:
Deprecation warnings would be fine I guess but if we advertise our server as OpenAI compatible we definitely should not remove them. |
This reminds me of the main method described in Chain-of-Thought Reasoning without Prompting paper. Sampling multiple paths with later evaluation might become a useful inference strategy in future, let alone testing existing samplers. |
Personally I'd like to see them turned off by default in the llama-cli defaults. I benchmark models / settings against a personal real-world use-case test (coding, translation, math, RAG), and setting top_k to 0 and top_p to 1.0 improved my top benchmark scores in multiple models. |
I'm adding breaking change label because this PR touches |
I will vote to disable all samplers, or at least set them to default values that will not cut any token. As the person who first implemented the TFS sampler in llama.cpp, I would say that Min P, XTC, and temperature samplers are all you need. |
* feat: sync llama.cpp * fix: fix submodule update - as part of llama.cpp sync * chore: remove unnecessary comment * chore(example): revert unnecessary changes * feat: sync llama.cpp * fix: remove tfs_z ref: ggerganov/llama.cpp#10071 * fix(cpp): skip gpu device if n_gpu_layers <= 0 ref: ggerganov/llama.cpp#10132 --------- Co-authored-by: Jhen-Jie Hong <[email protected]>
Top-k along with grammar is a poweful combination to get small models to give better results in Yes/no answers. If this is a maintainability issue, only then removing these make sense. Otherwise they are well understood and not everyone just wants to chat with llm. We are looking at programmatic call use. |
Supersedes #9604