Add Classifier-Free Guidance sampling #24536
Comments
cc @gante |
Hey @Vermeille 👋 I have the impression that our MusicGen PR (still open, expected to get merged soon) introduces the bulk of the logic to make it happen -- see this file. It is the same thing with a slightly different code implementation, correct? In the MusicGen PR, the model does a forward pass with 2x the batch size, where half of the batch corresponds to the unprompted tokens |
Indeed @gante ! I don't fully get how the 2x batch size thing works, but if it does, it's cool.
|
cc @sanchit-gandhi, who's probably better equipped to comment on potential differences :) |
Hey @Vermeille - thanks for the comprehensive write-up! Just a clarifying question: in your implementation, how do you construct the token ids for the model based on the conditional ids and the un-conditional ones? You mention:
Which suggests you concatenate them together in the same batch item? In MusicGen (and also the HF Diffusers library for models like Stable Diffusion), we construct our input ids by concatenating the input ids for the conditional prompt and the un-conditional prompt along the batch dimension: `input_ids = torch.concatenate([conditional_ids, unconditional_ids], dim=0)`. This is what's referred to by the 2x batch size 'trick' (concatenating the conditional prompt and unconditional prompt over the batch dim). There's no restriction on how these unconditional ids are formed - they can be from a 'null' input, or from a negative prompt. So we can do negative prompting in exactly the way you've described. When we run our model forward, the logits for the first half of the batch correspond to the conditional prompt, and the second half to the unconditional prompt (or negative prompt if we use one). By splitting along the batch dim, we can partition the conditional logits and the unconditional ones: `conditional_logits, unconditional_logits = torch.split(logits, batch_size // 2)` -> we then perform our weighted sum over the conditional and unconditional logits for CFG. Hope that explains how the 2x batch size trick works - would be keen to hear whether this aligns with how you've run CFG in your experiments. Regarding implementing a new logits processor, we'd probably want to add it when the time comes for integrating the model you've worked on into `transformers`. Have you trained a new model that uses this processor? Or built on top of an existing one? (if it's the latter, then adding the CFG logits processor standalone makes sense, otherwise let's integrate it all in one go) |
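For illustration, here is a minimal sketch of the 2x batch size trick described above (the `model` call and tensor names are placeholders, not the actual MusicGen code):

```python
import torch

def cfg_next_token_logits(model, conditional_ids, unconditional_ids, guidance_scale):
    # One forward pass over a doubled batch: first half conditional,
    # second half unconditional (or negative-prompt) ids.
    # Assumes both prompts are padded to the same sequence length.
    input_ids = torch.concatenate([conditional_ids, unconditional_ids], dim=0)
    logits = model(input_ids).logits[:, -1, :]  # next-token logits for both halves

    # Split the doubled batch back into its two halves along the batch dim.
    batch_size = input_ids.shape[0]
    conditional_logits, unconditional_logits = torch.split(logits, batch_size // 2)

    # Weighted sum for classifier-free guidance.
    return unconditional_logits + guidance_scale * (conditional_logits - unconditional_logits)
```

The same next token is then sampled from the guided logits and appended to both halves of the batch.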
Thank you for your detailed answer @sanchit-gandhi ! The part I'm the most unclear about regarding the 2x batch trick is how the sampling happens. Do you actually sample the same continuation token for the conditional and unconditional branch, or do they diverge in their own directions (which would be weird imho)? Regarding the integration, there is no need to train models to support CFG; it works out of the box. The paper will be out in a few days, but as you can see in the figures, we employed it with LLaMA models, all Pythias, the GPT-2 family, and even GPT4All. We don't train a new model. It's meant to be an addition to the .generate() method that is totally model agnostic and needs neither training nor finetuning. Hence the PR with the standalone logits processor :) |
Maybe this helps! Pre-processing: concatenate the conditional and unconditional (or negative) prompt ids along the batch dimension.
Forward pass: run a single forward pass over this 2x-sized batch.
CFG: split the logits back into their conditional and unconditional halves and take the weighted sum.
Sampling: sample one next token from the guided distribution and append it to both halves of the batch, so the two branches never diverge.
How have you been getting the conditional and unconditional logits in your experiments? Through two forward passes? (one with the conditional inputs and then a second with the unconditional ones). This batch size concatenation trick means you only have to run one forward pass, but with 2x the batch size. The only pain point I see is getting this to work in `generate`.
Very cool indeed! Would be nice to have this as a standalone PR then as suggested |
Thank you!
I'm happy to address the changes that have to be made to contribute this into the lib :) |
Awesome - feel free to open a PR and tag myself and @gante! How do you do it without the 2x batch size trick? Do you do two forward passes? Just asking in case there's a simpler way we can integrate this! |
(catching up on the paper and thinking a bit about usage experience -- will comment tomorrow with specific suggestions, but I think @Vermeille's suggested implementation above will be pretty close to a great user experience with minimal compute overhead) |
Here is an alternative implementation we used for some of our other experiments in the paper, for your consideration; it was designed with Hugging Face's typical API in mind. |
Yes. Two consecutive passes. Which is indeed not that great wrt latency. |
Would be great to have both the 2x batch size and the two forward passes, since the 2x batch size is better for throughput but the two forward passes are much better for VRAM usage, as the paper outlines (unless I misunderstood). |
So given you already have this (https://github.com/huggingface/transformers/blob/main/src/transformers/generation/logits_process.py#L1070), what do you want me to add or change in the PR? |
This is correct: our focus was on getting the best results for a fixed amount of VRAM in our experiments. Hence it didn't occur to us to simply 2x the batch size. I agree that having this be togglable is a good idea and don't have any preference about the default. |
The application to LLMs seems more of a situational sampling technique. With smaller conditional generative models like MusicGen, trained from-scratch with (explicit) condition dropout, it's practically part of the model. MusicGen isn't the first AR Transformer here, last year's DALL-E Mega already did it (itself inspired by https://twitter.com/RiversHaveWings/status/1478093658716966912 ), and in these models it's essential for performance. So I'd expect "batch size 1 dramatically underutilizes available resources" to be the more common case.
Depending on model and hardware, "biggest batch size that fits" isn't necessarily optimal. On decent hardware, you can hit optimal compute utilisation before VRAM limits with batched inference in smaller models. Normalizing the summands, then interpolating with the original scores is intriguing. If adding this to the CFG implementation that's now in Transformers is still being considered, this would be unexpected as default behavior though. In diffusion models, it's not applicable, and in sequence prediction, I've only seen people combine the unnormalized scores. |
This is a technique we borrowed from *Common Diffusion Noise Schedules and Sample Steps are Flawed*, where they call it CFG Rescale. You can see Imagen doing a similar normalizing trick too.
That's what we started with, and our results were a little bit worse. |
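For concreteness, here is a sketch of the renormalize-then-interpolate variant being discussed (the `rescale_factor` name follows later comments in this thread; treat this as an illustration rather than the paper's exact formulation):

```python
from torch.nn import functional as F

def rescaled_cfg(conditional_scores, unconditional_scores, guidance_scale, rescale_factor=0.7):
    cond = F.log_softmax(conditional_scores, dim=-1)
    uncond = F.log_softmax(unconditional_scores, dim=-1)

    # Plain CFG in log-probability space.
    guided = uncond + guidance_scale * (cond - uncond)

    # Renormalize the guided scores, then interpolate with the conditional ones
    # to pull the distribution back towards the unguided model.
    guided = F.log_softmax(guided, dim=-1)
    return rescale_factor * guided + (1 - rescale_factor) * cond
```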
This method is interesting to implement from an engineering and maintenance point of view! The simplest approach would be to proceed as @Vermeille suggested: add a logits processor that calls a model forward pass for the unconditional part of the input. It would be a small self-contained piece of code, which means low long-term maintenance on our end. On the negative side, we have the 2x latency, which is more impactful than the extra VRAM (IMO). If we go the 2x batch size route, we need to implement a function to expand the inputs to twice the batch size, and we have a plan to reorganize `generate` that would make that easier. How about we go with @Vermeille's proposal now, which will make CFG sampling available this week with low overhead on our end, and we implement the 2x batch size version after the `generate` reorganization? |
Expect a PR in a few hours. Thank you for your interest and answers! |
@gante There is a name clash for the arguments to .generate(). For this PR, unless instructed otherwise before I submit it, |
@Vermeille Adding more (and partially redundant) parameterization is highly undesirable, and we'd want to favor the more general case (yours). You also have the additional requirement of renormalizing the logits before applying your logits processor. Fortunately, we haven't officially released a version with the clashing argument yet. Let's try to fit everything together -- here's my suggestion:
This way the two strategies can coexist, share the argument, and not clash 🤗 |
Great! Thank you for the walkthrough. On it. |
Wait @gante, integrating it after the LogitNormalization is not something we want: all the prior processing (temperature, top_p, etc.) will be applied only to the conditional branch and not the unconditional one, and will be executed before computing the CFG logits. To be fair, we haven't tested this transformation order, but being asymmetrical like this scares me. And it is even invalid: top-k/p may not even select the same tokens in both branches, so that will misbehave. I'm afraid I can't do that. CFG has to happen as one of the first logits processors. |
@Vermeille looking at your code example above, I didn't notice it already had normalization inside the processor. My bad -- feel free to add it as the 1st one :) (will edit my comment above accordingly, for clarity) |
So this is the code I got to get it working. It is just a hack, but if you want to play with it, use this code:

```python
import torch
from torch.nn import functional as F

from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import (
    LogitsProcessorList,
    LogitsWarper,
    TemperatureLogitsWarper,
    TopPLogitsWarper,
)

device = 'cuda' if torch.cuda.is_available() else 'cpu'


class CFGLogits(LogitsWarper):
    """Classifier-free guidance warper: contrasts the conditional scores with
    the logits of a second (unconditional / negative-prompt) forward pass."""

    def __init__(self, cfg, inputs, model, verbose=True):
        self.cfg = cfg          # guidance strength
        self.inputs = inputs    # ids fed to the unconditional branch (e.g. last prompt token or a negative prompt)
        self.model = model
        self.out = None         # cached output of the unconditional branch
        self.verbose = verbose

    def __call__(self, input_ids, scores):
        if self.cfg == 1:
            return F.log_softmax(scores, dim=-1)
        scores = F.log_softmax(scores, dim=-1)
        # Run (and cache) the unconditional branch, reusing past_key_values
        # so only the newly sampled token is fed on subsequent calls.
        if self.out is None:
            self.out = self.model(self.inputs.to(device), use_cache=True)
        else:
            self.out = self.model(
                input_ids[:, -1:],
                use_cache=True,
                past_key_values=self.out.past_key_values,
            )
        unconditional_logits = F.log_softmax(self.out.logits[0][-1:], dim=-1)
        # CFG: move the conditional distribution away from the unconditional one.
        out = self.cfg * (scores - unconditional_logits) + unconditional_logits
        out = F.log_softmax(out, dim=-1)
        # Blend back with the original scores to smooth the distribution.
        return 0.7 * out + 0.3 * scores


tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")
model.to(device)

prompt = "Salve, dispiculi."
inputs = tokenizer(prompt, return_tensors='pt')

outputs = model.generate(
    input_ids=inputs['input_ids'].to(device),
    attention_mask=inputs['attention_mask'].to(device),
    max_new_tokens=125,
    logits_processor=LogitsProcessorList([
        # inputs_cfg is usually the last token of the prompt, but there are
        # possibilities of negative prompting that are explored in the paper
        CFGLogits(3, inputs['input_ids'], model),
        TemperatureLogitsWarper(0.8),
        TopPLogitsWarper(0.95),
    ]),
    do_sample=True,
)
print(tokenizer.decode(outputs[0]))
```

This worked on my end. |
@grantCelley Pythia models are trained on English. I'm really confused by what you're trying to achieve there. |
I was just trying to get it to work. Also, it does continue in Latin for a little while, which is interesting, then drifts into a Romance language. But it just showed how to do it. I didn't realize that you had updated the original code block. |
Ok this helped - generation for the same amount of tokens takes longer now, is this expected?
- Vanilla / no CFG: 512 tokens / 3 min
- CFG, neg_token = last token, cfg_scale=1.5: 512 tokens / 5 min
- CFG, neg_token = last token, cfg_scale=1.25: 512 tokens / 5 min
|
@grantCelley Shouldn't a negative prompt of 'Latin' prohibit Latin output? Or do I misunderstand the concept of negative prompts? |
Yes, there are two forward passes per token now.
You are correct |
It is hard to say in certain terms what a negative prompt does. I had it generate a poem and specified the negative prompt as 'happy', and it used somewhat gloomy language, and vice versa - so it "does" work, but beyond that I think only further experimentation will tell. |
Yes. Neg prompts in language are somewhat harder to pull off than in vision. Especially because the continuation should be kinda grammatical with the neg prompt too. Not gonna lie, we were under time constraints and having a clear neg prompt methodology was unsolved in that time frame. But we're working on it, and the example in the first post works.
Hard to say yet, but it should depend on the guidance strength (decrease the rescale_factor as you increase the guidance strength)
from the paper:
|
Thanks for explanation. |
This implies to me that the two things should be separate concepts, with separate implementations... but if you (reasonably) wanted to use both focus-on-first-prompt and negative prompting, it would be computationally expensive to do them separately. That said, I do feel a little like the 'adding them back in' is a fudge factor, trying to reduce the effect slightly. But I don't understand the math notation in the original paper very well, so I'm very cautious about that. |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
It is merged, feel free to install from `main`! |
Can you provide sample code on how to use classifier free guidance? |
Here are the docs @sersoage - you can enable CFG by passing `guidance_scale` to `generate`.
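For anyone landing here later, a minimal usage sketch (assuming the merged API: CFG is enabled when `guidance_scale > 1` is passed to `generate`, with an optional `negative_prompt_ids`; double-check the linked docs for the exact signature):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")

inputs = tokenizer("Today, a dragon flew over Paris, France,", return_tensors="pt")
negative = tokenizer("A sad story:", return_tensors="pt")  # optional negative prompt

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    guidance_scale=1.5,                         # values > 1 turn on CFG
    negative_prompt_ids=negative["input_ids"],  # omit to fall back to the default (last prompt token, per the discussion above)
)
print(tokenizer.decode(outputs[0]))
```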
thanks!!! |
@Vermeille Thanks for the code! |
@sakrat-az We set the last token of the prompt as the negative prompt |
|
That doesn't exist. You necessarily need something negative or unconditional in the equation. The last token of the prompt is the closest way I've found to emulate an unconditional prompt. |
Can you check my code once? I wanted to focus on the word "France". @Vermeille |
Then you did the opposite. France has to be only in the positive prompt if you want to focus on it. Here you try to sample away from France. |
@Vermeille, but I have also changed the code in the `__call__()` function. Did you check that too? |
# Description

Implement classifier-free guidance function based on vLLM. The author of this paper implements this function in huggingface-transformers: huggingface/transformers#24536. The pseudo-code:

```
conditional_logits = log_softmax(model(positive_prompt))
unconditional_logits = log_softmax(model(negative_prompt))
logits = unconditional_logits + cfg_scale * (conditional_logits - unconditional_logits)
next_token = do_sample(logits)
positive_prompt.append(next_token)
negative_prompt.append(next_token)
```

Usage in FlagScale can reference `tests/unit_tests/test_classifier_free_guidance.py`.
Hey, how do you convert log(softmax(x)) to probabilities? If we apply exp() to the logits, we often get inf values. |
okay, I have my answer: transformers/src/transformers/generation/utils.py Line 3296 in 125de41
|
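For reference, a sketch of the safe conversion (the general recipe, not the exact code at the linked line): exponentiating a `log_softmax` output is bounded, whereas exponentiating raw logits can overflow.

```python
import torch
from torch.nn import functional as F

logits = torch.randn(1, 50257) * 50          # raw scores straight from the LM head
log_probs = F.log_softmax(logits, dim=-1)    # all values are <= 0
probs = log_probs.exp()                      # safe: results lie in [0, 1] and sum to 1

# By contrast, logits.exp() can overflow to inf for large positive logits,
# which is why the conversion should go through log_softmax (or softmax) first.
```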
EDIT: ===========================
As I see many people copy-pasting this initial code that was meant to be a basis for discussion, here is a cleaner version (yet not perfect! We're still doing improvement rounds with the huggingface team to improve it! Check the state of the PR #24654 until it's merged!).
===============================
Feature request
Hello!

I wish to contribute CFG sampling. I'm working with EleutherAI and @StellaAthena and will have a paper about it by Friday. CFG brings non-trivial improvements on many standard benchmarks. It contrasts the logits for the next token $P(w_t|w_{..t}, prompt)$ to those of the input deprived of the prompt, $P(w_t|w_{..t})$, by defining

$$\log P_{\text{cfg}}(w_t|w_{..t}, prompt) = \log P(w_t|w_{..t}) + \gamma \left( \log P(w_t|w_{..t}, prompt) - \log P(w_t|w_{..t}) \right)$$

where $\gamma$ is the guidance strength. And then we can blend $\log P_{\text{cfg}}$ with $\log P(w|w_{..t}, prompt)$ to smoothen that distribution a bit, but it's optional.
Motivation
My current implementation is the `CFGLogits` logits warper shown in the code block earlier in this thread.
I am not familiar enough with the design guidelines of HF to know if this implementation as a LogitsWarper is satisfactory.
Just a few figures supporting the claims: [benchmark figures omitted]
Your contribution
I can contribute the code but I need to be guided as I don't know the exact design guidelines and overall architecture of HF.
Thank you for your time!