
Add QKV fusion to the Hunyuan Video transformer #10407

Open · wants to merge 2 commits into main

Conversation

Ednaordinary

What does this PR do?

This adds QKV fusion to Hunyuan Video. At the moment, this gives minimal/no improvement:

|            | QKV    | No QKV |
|------------|--------|--------|
| Time (sec) | 522.18 | 547.21 |
| VRAM (GiB) | 4.17   | 3.88   |

The biggest improvement is expected in combination with torchao, though that currently errors out because torchao tensors do not support concatenation:

```
NotImplementedError: AffineQuantizedTensor dispatch: attempting to run unimplemented operator/function: func=<OpOverload(op='aten.cat', overload='default')>, types=(<class 'torchao.dtypes.affine_quantized_tensor.AffineQuantizedTensor'>,), arg_types=(<class 'list'>,), kwarg_types={}
```
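Fusing QKV amounts to concatenating the three projection weight matrices into a single larger projection, which is exactly the `aten.cat` call that torchao's `AffineQuantizedTensor` does not implement. A minimal sketch of the idea with plain tensors (illustrative only, not the diffusers implementation):

```python
import torch
import torch.nn as nn

hidden = 64
x = torch.randn(2, hidden)

# Three separate projections, as in unfused attention.
to_q = nn.Linear(hidden, hidden, bias=False)
to_k = nn.Linear(hidden, hidden, bias=False)
to_v = nn.Linear(hidden, hidden, bias=False)

# Fusion: concatenate the weight matrices along the output dimension.
# This torch.cat is the operation that fails on quantized tensors.
to_qkv = nn.Linear(hidden, 3 * hidden, bias=False)
to_qkv.weight.data = torch.cat(
    [to_q.weight.data, to_k.weight.data, to_v.weight.data], dim=0
)

# One matmul now produces q, k, and v; results match the unfused path.
q, k, v = to_qkv(x).chunk(3, dim=-1)
assert torch.allclose(q, to_q(x), atol=1e-6)
assert torch.allclose(k, to_k(x), atol=1e-6)
assert torch.allclose(v, to_v(x), atol=1e-6)
```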

BitsAndBytes also errors out (relevant but somewhat dated discussion):

```
RuntimeError: Only Tensors of floating point and complex dtype can require gradients
```
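That failure is a generic PyTorch restriction rather than anything specific to this PR: only floating-point and complex tensors may require gradients, and `nn.Parameter` defaults to `requires_grad=True`, so wrapping an integer (quantized) weight reproduces it in isolation:

```python
import torch

raised = False
try:
    # nn.Parameter defaults to requires_grad=True, which PyTorch only
    # permits for floating point and complex dtypes.
    torch.nn.Parameter(torch.zeros(4, dtype=torch.int8))
except RuntimeError as e:
    raised = True
    print(e)

# With requires_grad=False, an integer parameter is accepted.
p = torch.nn.Parameter(torch.zeros(4, dtype=torch.int8), requires_grad=False)
assert raised and p.dtype == torch.int8
```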

There's a slight hack in `HunyuanVideoIndividualTokenRefinerBlock`, since with QKV fusion the attention output appears to become a `(tensor, None)` tuple instead of a plain tensor.

Reproducible script

```python
import io
import time

import imageio as iio
import numpy as np
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

torch.manual_seed(42)

def export_to_video_bytes(fps, frames):
    # Encode frames to an in-memory MP4 via imageio's PyAV plugin.
    request = iio.core.Request("<bytes>", mode="w", extension=".mp4")
    pyavobject = iio.plugins.pyav.PyAVPlugin(request)
    if isinstance(frames, np.ndarray):
        frames = (np.array(frames) * 255).astype("uint8")
    else:
        frames = np.array(frames)
    new_bytes = pyavobject.write(frames, codec="libx264", fps=fps)
    return io.BytesIO(new_bytes)

def export_to_video(frames, path, fps):
    video_bytes = export_to_video_bytes(fps, frames)
    video_bytes.seek(0)
    with open(path, "wb") as f:
        f.write(video_bytes.getbuffer())

model_id = "tencent/HunyuanVideo"

print("Loading transformer")
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16, revision="refs/pr/18"
)
transformer.fuse_qkv_projections()

pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16, revision="refs/pr/18"
)
pipe.scheduler._shift = 7.0
pipe.vae.enable_tiling()
# pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()

start_time = time.perf_counter()
output = pipe(
    prompt=(
        "a cat walks along the sidewalk of a city. The camera follows the cat "
        "at knee level. The city has many people and cars moving around, with "
        "advertisement billboards in the background"
    ),
    height=544,
    width=960,
    num_frames=45,
    num_inference_steps=20,
).frames[0]
export_to_video(output, "output.mp4", fps=15)
print("Time:", round(time.perf_counter() - start_time, 2), "seconds")
print("Max vram:", round(torch.cuda.max_memory_allocated(device="cuda") / 1024 ** 3, 3), "GiB")
```

Comparison

QKV fusion:

output_qkv.mp4

No fusion:

output.mp4

Results are different but comparable.


Who can review?

@a-r-r-o-w @DN6

@a-r-r-o-w a-r-r-o-w self-requested a review December 30, 2024 11:38
Member

@a-r-r-o-w a-r-r-o-w left a comment


In my experience, QKV fusion does not really help much with either time or memory requirements, even with quantization. In fact, there are even slowdowns at times, depending on the quantization technique applied.

Not sure if it would be beneficial to add, but since we do support it for some other models, it makes sense to do so in the interest of consistency. Will ask @yiyixuxu to make the final call.
