Implement framewise encoding/decoding in LTX Video VAE #10488

Merged
merged 8 commits into from
Jan 13, 2025

rootonchair
Contributor

What does this PR do?

Fixes #10333

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@rootonchair
Contributor Author

For decode, original:

org_output.mp4
org_output2.mp4
org_output3.mp4
org_output4.mp4

framewise:

output.mp4
output2.mp4
output3.mp4
output4.mp4

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w a-r-r-o-w self-requested a review January 7, 2025 19:35
Member

@a-r-r-o-w a-r-r-o-w left a comment

Wow, this is awesome @rootonchair! So cool

Just some questions and asks:

  • Did you verify that the expected number of frames is the same with framewise enabled vs disabled?
  • Is there any numerical difference between the tensors with framewise enabled vs disabled? A small absmax difference is usually okay/expected due to order of tensor operations changing, but since all matmul operations are on the embedding dimension, it should be very small and not affected by how many frames are being encoded/decoded at once
  • Let's try to make vae encoding work with framewise as well
  • Let's enable framewise encoding/decoding as default, by setting the values of the flags that enable them to True

This will also help reduce the memory requirements for training LTX significantly so really appreciate you looking into this :)

I'm sure it works as expected as the videos look great, so will try and get back to you quickly after a sanity check. Thank you!
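On the absmax check raised in the questions above: a signed mean can hide real discrepancies because positive and negative errors cancel, whereas the maximum absolute difference cannot. The following is a minimal pure-Python sketch of that distinction. `mean_diff` and `absmax_diff` are hypothetical helper names, and the sample values are invented for illustration, not taken from the VAE.

```python
def mean_diff(a, b):
    """Signed mean of the element-wise difference (errors can cancel)."""
    return sum(x - y for x, y in zip(a, b)) / len(a)


def absmax_diff(a, b):
    """Largest absolute element-wise difference (errors cannot cancel)."""
    return max(abs(x - y) for x, y in zip(a, b))


# Illustrative values only: two elements differ by +/-0.25, the rest match.
reference = [0.5, 0.5, 0.5, 0.5]
candidate = [0.75, 0.25, 0.5, 0.5]

print(mean_diff(reference, candidate))    # 0.0
print(absmax_diff(reference, candidate))  # 0.25
```

For torch tensors `a` and `b`, the corresponding expressions would be `torch.mean(a - b)` and `(a - b).abs().max()`.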

@@ -1114,6 +1116,53 @@ def encode(
        if not return_dict:
            return (posterior,)
        return AutoencoderKLOutput(latent_dist=posterior)

    def blend_t(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
Member

Would move this down a few methods to where blend_h and blend_v are located
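For readers unfamiliar with the blend helpers mentioned here, the idea behind a temporal blend is a linear cross-fade over the frames where two neighbouring tiles overlap. The sketch below shows that idea on plain Python lists, with one float standing in for each frame. It is an assumption about the general technique, not the code from this PR: `blend_temporal` is a hypothetical name, and unlike the `blend_t` in the diff (which returns `b`), it returns a new list and leaves its inputs untouched.

```python
def blend_temporal(a, b, blend_extent):
    """Cross-fade the last frames of tile `a` into the first frames of tile `b`.

    `a` and `b` are lists with one value per frame. Returns a copy of `b`
    whose first `blend_extent` frames are linearly mixed with the tail of `a`.
    """
    blend_extent = min(len(a), len(b), blend_extent)
    blended = list(b)
    for x in range(blend_extent):
        weight = x / blend_extent  # 0 at the seam start, approaching 1 at its end
        blended[x] = a[-blend_extent + x] * (1 - weight) + b[x] * weight
    return blended


previous_tile = [0.0, 0.0, 0.0, 0.0]
next_tile = [1.0, 1.0, 1.0, 1.0]
print(blend_temporal(previous_tile, next_tile, 4))  # [0.0, 0.25, 0.5, 0.75]
print(blend_temporal(previous_tile, next_tile, 2))  # [0.0, 0.5, 1.0, 1.0]
```

The weight ramps from the previous tile to the next one, so the seam between tiles has no abrupt jump.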

            )
        return b

    def _temporal_tiled_decode(self, z: torch.Tensor, temb: Optional[torch.Tensor], return_dict: bool = True) -> Union[DecoderOutput, torch.Tensor]:
Member

Let's remove the debug statements from this method and move it below tiled_decode.

I think we would also need to implement tiled_encode. Happy to help with the changes if needed 🤗
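As a rough illustration of what framewise (temporally tiled) processing involves, the sketch below splits a frame sequence into overlapping windows, processes each window independently, keeps only the leading `stride` frames of each result, and concatenates them. It is a simplified assumption of the general approach, not the implementation in this PR: `temporal_tiles` and `process_framewise` are hypothetical names, the tile sizes are arbitrary, and the blending of overlapping frames is omitted for brevity.

```python
def temporal_tiles(num_frames, tile_size, stride):
    """Return (start, stop) index pairs of overlapping temporal windows."""
    if not 0 < stride <= tile_size:
        raise ValueError("stride must be in the range 1..tile_size")
    return [
        (start, min(start + tile_size, num_frames))
        for start in range(0, num_frames, stride)
    ]


def process_framewise(frames, tile_size, stride, fn):
    """Apply `fn` to each window and stitch the results back together."""
    output = []
    for start, stop in temporal_tiles(len(frames), tile_size, stride):
        processed = fn(frames[start:stop])
        # Keep only the non-overlapping head; the next window covers the rest.
        output.extend(processed[:stride])
    return output[: len(frames)]


frames = list(range(21))
result = process_framewise(frames, tile_size=6, stride=4, fn=lambda tile: tile)
print(len(result))       # 21
print(result == frames)  # True
```

With an identity `fn`, the stitched output equals the input, which is the frame-count check asked about earlier in the review; peak memory then depends on `tile_size` rather than on the full sequence length.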

@rootonchair
Contributor Author

Hi @a-r-r-o-w, thank you for your questions.

> Did you verify that the expected number of frames is the same with framewise enabled vs disabled?

Yes, the number of frames remains the same with framewise enabled and disabled.

> Is there any numerical difference between the tensors with framewise enabled vs disabled? A small absmax difference is usually okay/expected due to order of tensor operations changing, but since all matmul operations are on the embedding dimension, it should be very small and not affected by how many frames are being encoded/decoded at once

I have just completed the encoding part. I will proceed to the sanity check and then enable framewise encoding/decoding by default.

@rootonchair
Contributor Author

rootonchair commented Jan 8, 2025

For the decoding part, the output does not change at all. For the encoding part, there is a small difference in mean of about -8.285045623779297e-06 between the outputs.

Below is my testing script

import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
#prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
#prompt = "The camera pans across a cityscape of tall buildings with a circular building in the center. The camera moves from left to right, showing the tops of the buildings and the circular building in the center. The buildings are various shades of gray and white, and the circular building has a green roof. The camera angle is high, looking down at the city. The lighting is bright, with the sun shining from the upper left, casting shadows from the buildings. The scene is computer-generated imagery."
#prompt = "A clear, turquoise river flows through a rocky canyon, cascading over a small waterfall and forming a pool of water at the bottom.The river is the main focus of the scene, with its clear water reflecting the surrounding trees and rocks. The canyon walls are steep and rocky, with some vegetation growing on them. The trees are mostly pine trees, with their green needles contrasting with the brown and gray rocks. The overall tone of the scene is one of peace and tranquility."
#prompt = "Two police officers in dark blue uniforms and matching hats enter a dimly lit room through a doorway on the left side of the frame. The first officer, with short brown hair and a mustache, steps inside first, followed by his partner, who has a shaved head and a goatee. Both officers have serious expressions and maintain a steady pace as they move deeper into the room. The camera remains stationary, capturing them from a slightly low angle as they enter. The room has exposed brick walls and a corrugated metal ceiling, with a barred window visible in the background. The lighting is low-key, casting shadows on the officers' faces and emphasizing the grim atmosphere. The scene appears to be from a film or television show."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

# Baseline: decode all frames in a single pass
pipe.vae.use_framewise_decoding = False

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type='pt',
).frames[0]

# Same prompt and seed, now decoding framewise
pipe.vae.use_framewise_decoding = True
video2 = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type='pt',
).frames[0]
print(video2.size())
print(f"Diff: {torch.mean(video - video2)}")

print("Test encoding")

generator = torch.Generator(device="cuda").manual_seed(42)
dummy_input = torch.rand((1, 3, 161, 512//8, 768//8), device="cuda", dtype=torch.bfloat16, generator=generator)

vae = pipe.vae
vae.use_framewise_encoding = False
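posterior = vae.encode(dummy_input).latent_dist
# Reseed before each sample so both draws use the same noise
z = posterior.sample(generator=torch.Generator(device="cuda").manual_seed(42))
vae.use_framewise_encoding = True
posterior = vae.encode(dummy_input).latent_dist
z2 = posterior.sample(generator=torch.Generator(device="cuda").manual_seed(42))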
print(f"Diff: {torch.mean(z-z2)}")

Member

@a-r-r-o-w a-r-r-o-w left a comment

Thank you for adding support for this @rootonchair! I've verified for all frame counts up to 257 that encoding/decoding shows a negligible difference with the changes here.

@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu January 9, 2025 06:13
@a-r-r-o-w
Member

@yiyixuxu Do you want to give this a look too? We're flipping the default value for use_framewise_encoding/decoding here, but it should be okay. The flag is expected to default to True whenever a framewise implementation is available.

@a-r-r-o-w
Member

@rootonchair Could you run make style here for the tests to pass?

@rootonchair
Contributor Author

@a-r-r-o-w thank you for reviewing. I have run make style

@yiyixuxu
Collaborator

yiyixuxu commented Jan 9, 2025

ohh I don't think we should flip the default
see https://huggingface.co/docs/diffusers/en/conceptual/philosophy#usability-over-performance

other changes look good to me

@a-r-r-o-w
Member

But we've been using framewise encoding/decoding by default wherever possible in past VAE integrations @yiyixuxu. I think it would be nice to maintain that consistency, no? For example, both CogVideoX and Hunyuan use this by default

@yiyixuxu
Collaborator

yiyixuxu commented Jan 9, 2025

@a-r-r-o-w
we really should not have done that though
let's make sure not to do that moving forward

Member

@a-r-r-o-w a-r-r-o-w left a comment

Thanks @rootonchair! Just one more thing to do in accordance with YiYi's comment. We can merge after this change

src/diffusers/models/autoencoders/autoencoder_kl_ltx.py (outdated, resolved)
@yiyixuxu yiyixuxu merged commit 794f7e4 into huggingface:main Jan 13, 2025
12 checks passed
4 participants