Reconsider pooled_prompt_embeds for SDXL #4341

Closed
pdoane opened this issue Jul 28, 2023 · 16 comments
Labels: stale (Issues that haven't received updates)

Comments

@pdoane
Contributor

pdoane commented Jul 28, 2023

The current API with Compel uses both the prompt_embeds and pooled_prompt_embeds parameters, which differs from other diffusion pipelines. The new Auto pipelines might be interesting to use, but currently I don't see how I can use them because of this API divergence.

If prompt_embeds for the SDXL pipeline accepted a tuple, user code would be unchanged. E.g.:

prompt_embeds = compel(prompt)
image = pipeline(prompt_embeds=prompt_embeds[0], pooled_prompt_embeds=prompt_embeds[1]).images[0]

becomes:

prompt_embeds = compel(prompt)
image = pipeline(prompt_embeds=prompt_embeds).images[0]
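
A rough sketch of what the unpacking could look like inside the SDXL pipeline (hypothetical, not current diffusers behavior):

if isinstance(prompt_embeds, tuple):
    prompt_embeds, pooled_prompt_embeds = prompt_embeds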
@pdoane
Contributor Author

pdoane commented Jul 28, 2023

It may also be worth considering the tuple approach for prompt_2/negative_prompt_2.

@pdoane
Contributor Author

pdoane commented Jul 31, 2023

Generating embeds for separate prompts wasn't what I was expecting:

import torch
from compel import Compel, ReturnedEmbeddingsType

compel1 = Compel(
    tokenizer=pipe.tokenizer,
    text_encoder=pipe.text_encoder,
    returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
    requires_pooled=False,
)

compel2 = Compel(
    tokenizer=pipe.tokenizer_2,
    text_encoder=pipe.text_encoder_2,
    returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
    requires_pooled=True,
)

conditioning1 = compel1(prompt1)
conditioning2, pooled = compel2(prompt2)
conditioning = torch.cat((conditioning1, conditioning2), dim=-1)

image = pipe(prompt_embeds=conditioning, pooled_prompt_embeds=pooled, num_inference_steps=30).images[0]

This is interesting, as prompt_embeds is a combination of data from both prompts. A tuple form of the API could look like this:

prompt = (prompt1, prompt2)
prompt_embeds = (conditioning1, (conditioning2, pooled))  # ([1, 77, 768], ([1, 77, 1280], [1, 2048]))

both of which could be lists for batch mode.

@patrickvonplaten
Contributor

I understand the problem! I'm not a huge fan of allowing tuples of two conceptually different tensors (one is a sequence, one is pooled), but on the other hand a unified API is super important.

What do you think here @williamberman @yiyixuxu @sayakpaul @pcuenca ?

Another option could be to just not pass the pooled embeds and create them inside the pipeline if only the prompt embeds are passed. I think we can create pooled embeddings from the prompt embeddings

@sayakpaul
Member

Not a huge fan of allowing tuples of two conceptually different tensors (one is a sequence, one is pooled), but on the other hand a unified API is super important.

Tend to agree with @patrickvonplaten here. I think if we accepted prompt_embeds as a tuple, it might introduce subtle conceptual inconsistencies, which is also something we don't want.

Another option could be to just not pass the pooled embeds and create them inside the pipeline if only the prompt embeds are passed. I think we can create pooled embeddings from the prompt embeddings

I think so too. We need to be very careful about this part:

The pooled_prompt_embeds are always computed from the LAST hidden states of the text encoder.

@patrickvonplaten
Contributor

Yes, that means we push it through the last layer and then pool - it might be problematic though if the pooled embeds are scaled beforehand.
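
For context, a minimal sketch of where the two tensors come from in SDXL's second text encoder, using transformers' CLIPTextModelWithProjection (the checkpoint and subfolder names are assumptions for illustration):

import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Assumed SDXL base checkpoint layout (illustrative only)
tokenizer_2 = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer_2")
text_encoder_2 = CLIPTextModelWithProjection.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder_2")

tokens = tokenizer_2("a photo of a cat", padding="max_length", max_length=tokenizer_2.model_max_length, return_tensors="pt")
with torch.no_grad():
    out = text_encoder_2(tokens.input_ids, output_hidden_states=True)

penultimate = out.hidden_states[-2]  # what SDXL concatenates into prompt_embeds
pooled = out.text_embeds             # pooled + projected output of the LAST layer -> pooled_prompt_embeds

Deriving the pooled tensor from penultimate-layer prompt_embeds would mean pushing them through the final layer and the projection again, which is where the scaling concern above comes in.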

@sayakpaul
Member

Hmm then it becomes risky.

@pdoane
Contributor Author

pdoane commented Aug 17, 2023

What about formalizing an interface for prompt conversion? The public API doesn't currently work correctly when prompt embeds are combined with the cross attention scale:

prompt_embeds = compel(...)  # will use a scale of 1.0 (nothing has set the text encoder scale yet)
pipe(prompt_embeds, cross_attention_kwargs={"scale": 0.5})
prompt_embeds = compel(...)  # will use a scale of 0.5, left over from the previous pipeline call
pipe(prompt_embeds, cross_attention_kwargs={"scale": 1.0})  # but these embeds were built with 0.5

The prompt embeds are generated with the last value set for cross attention scale. There's an easy workaround that bypasses the API:

pipe._lora_scale = 0.5
prompt_embeds = compel(...)  # will use a scale of 0.5
pipe(prompt_embeds)
pipe._lora_scale = 1.0
prompt_embeds = compel(...) # will use a scale of 1.0
pipe(prompt_embeds)

This is fixable, of course, but maybe the better long-term direction is to assume that prompts will always be provided to the pipeline, so that conversions can be handled internally with the correct parameters and ordering.

Clip Skip is another area where the current interface is a bit awkward and could be improved.

For reference, this is what I'm currently doing for Compel with multiple prompts. It seems to work but I haven't had any feedback on it:

compel = Compel(
  tokenizer=pipe.tokenizer,
  text_encoder=pipe.text_encoder,
  returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
  requires_pooled=False,
)

compel2 = Compel(
  tokenizer=pipe.tokenizer_2,
  text_encoder=pipe.text_encoder_2,
  returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
  requires_pooled=True,
)

prompt1_embeds = compel(prompt)
prompt2_embeds, pooled_prompt_embeds = compel2(prompt2)
prompt_embeds = torch.cat((prompt1_embeds, prompt2_embeds), dim=-1)
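
These are then passed to the pipeline the same way as in the earlier snippet:

image = pipe(prompt_embeds=prompt_embeds, pooled_prompt_embeds=pooled_prompt_embeds, num_inference_steps=30).images[0]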

@sayakpaul
Member

@pdoane it would be helpful for us to see this issue in isolation, i.e., ideally without Compel.

@pdoane
Contributor Author

pdoane commented Aug 19, 2023

It isn’t a Compel issue, and there isn’t anything wrong unless you consider how a user would generate prompt embeds. The scale passed in cross_attention_kwargs modifies the text encoder. When diffusers generates prompt embeds, it has access to that information and can do it correctly.

For a user generating prompt embeds, there is no way to set the text encoder LoRA scale outside of generating an image which is obviously too late.

This is why formalizing an interface for prompt conversion could be useful. It can be called at the right point in time and isolate the complexity in a way that enables the universal pipelines to be used with prompt weighting.

@williamberman
Contributor

Sorry, I'm a bit lost here. Is the question what's the best way to pass pre-calculated prompt embeddings to a pipeline when it uses multiple prompt embeddings, i.e. pooled and non-pooled? If so, what's wrong with adding another keyword argument of the appropriate name?

@pdoane
Contributor Author

pdoane commented Aug 24, 2023

@williamberman - The discussion started about custom prompt embeddings (e.g. using Compel) with the Auto pipelines and SDXL.

Most recently, I've suggested formalizing an interface for prompt conversion which I think is the best solution to this, and also addresses another issue with LoRAs (see damian0815/compel#42 - this is a diffusers problem and not a Compel issue).

@williamberman
Contributor

So I'm not tremendously familiar with auto pipelines and Compel. But is the problem that there's no way to pass a particular set of prompt embeddings to an auto pipeline? I think it might be pretty hard to come up with a good interface that handles all combinations of prompt embeddings. And if you're looking for that specific a use case, maybe an auto pipeline isn't the right solution?

@pdoane
Contributor Author

pdoane commented Aug 24, 2023

Let me review where I think we are at with the discussion:

There are 25 pipelines that can be used through the Auto interface, which is a good sign of API regularity/consistency. Dedicated features like SDXL dual prompts require the user to know the specific pipeline class, but common features work across the set.

For most pipelines, prompt_embeds is the tensor representation of the prompt. SDXL works quite a bit differently. There are two prompt parameters (the second is optional and matches the first if not provided). There are also two prompt embedding parameters: prompt_embeds is a concatenation of the penultimate hidden states from both text encoders, and pooled_prompt_embeds is the pooled output of the second text encoder.

Prior to SDXL, there was a natural correspondence between these variables:

prompt_embeds = my_prompt_weighting(prompt)

Whereas SDXL requires this:

conditioning1 = my_prompt_weighting1(prompt1)
conditioning2, pooled = my_prompt_weighting2(prompt2)
conditioning = torch.cat((conditioning1, conditioning2), dim=-1)

My initial suggestion was to support passing everything in a single parameter:

prompt_embeds = (conditioning1, (conditioning2, pooled))  # ([1, 77, 768], ([1, 77, 1280], [1, 2048]))

This would allow existing user code to be unchanged, isolating the complexity to the prompt weighting implementation. However, there was a concern about mixing the different types of tensors together.

Another suggestion was to generate the pooled embeds inside the pipeline, which is a simpler design, but it removes prompt weighting for them, which isn't great.

The last suggestion, which is what I recommend at this point, is to introduce a Prompt Weighting object interface to diffusers:

pipe.prompt_weighter = MyPromptWeighter()
pipe(prompt, ...)

In this approach, both prompt_embeds and pooled_prompt_embeds are no longer needed (could be deprecated too if desired). The pipeline implementation would call the prompt weighter using the prompt as input.

This also fixes another problem with the API. The current design for LoRAs does not work with prompt weighting as information is not available at the right times.
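
As a rough illustration (the names here are hypothetical, not an existing diffusers API), the interface could be as small as:

from typing import Optional, Protocol, Tuple
import torch

class PromptWeighter(Protocol):
    # Hypothetical interface: the pipeline would call this while encoding the prompt,
    # after the LoRA scale / clip-skip settings have been applied to the text
    # encoders, so embeddings are always built with the correct parameters and ordering.
    def __call__(self, pipe, prompt: str, negative_prompt: Optional[str] = None) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        """Return (prompt_embeds, pooled_prompt_embeds or None for pipelines without pooled embeds)."""
        ...

pipe.prompt_weighter = MyPromptWeighter()  # e.g. a Compel-backed implementation
image = pipe(prompt, cross_attention_kwargs={"scale": 0.5}).images[0]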

To your specific questions:

But is the problem there's no way to pass a particular set of prompt embeddings to an auto pipeline?

You can pass any parameter that the underlying pipeline accepts. The problem is with the semantics. Users need to do different things to pass prompt embeds based on the pipeline type, which conflicts a bit with the goal of an Auto pipeline.

I think it might be pretty hard to come up with a good interface to handle all combinations of prompt embeddings

Switching to a tuple (or a dictionary) so that prompt_embeds represents all the necessary data seems like a reasonable approach? It's not the direction I'd push for but it covers much more than what is possible today.

The prompt weighting object sidesteps the problem, which is attractive for a number of reasons.

And if you're looking for that specificity of use case, maybe an auto pipeline isn't the right solution?

I'm not currently using auto pipelines (for reasons outlined here), but I don't think prompt weighting is a niche feature and it seems worth exploring alternatives?

At the very least, I would like to see a solution that addresses the problem with LoRAs and prompt weighting, and maybe that could help with this issue too.

@williamberman
Contributor

You can pass any parameter that the underlying pipeline accepts. The problem is with the semantics. Users need to do different things to pass prompt embeds based on the pipeline type, which conflicts a bit with the goal of an Auto pipeline.

If you can pass any parameter that the underlying pipeline accepts, I think we don't have to make any changes here? I'm not really sure I see the benefit of using a single interface. I think maybe we see the auto pipeline differently. I see the auto pipeline more as a nice-to-have for when you don't want to choose a class to fulfill a particular objective. It doesn't necessarily have to pull all the interfaces together under one interface.

@pdoane
Contributor Author

pdoane commented Aug 25, 2023

If you can pass any parameter that the underlying pipeline accepts, I think we don't have to make any changes here?

Diffusers is broken for LoRAs and prompt weighting, so something hopefully will be done there?

With regard to the original topic - you are correct that nothing is fundamentally broken by having pooled_prompt_embeds which is why this is titled "Reconsider...". The choices have made prompt weighting more challenging to use and hopefully a discussion of alternatives is useful?

It doesn't necessarily have to pull all the interfaces together under one interface.

No one on the thread is suggesting this, e.g. SDXL is unique with its cropped coords functionality.

In theory, prompt weighting is an external concern - it's why Compel mostly works as an independent library. But as pipeline complexity increases and more features are added, it's increasingly clear that the pipeline needs to be more involved in that process.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label on Oct 18, 2023