Add Immiscible Noise algorithm #1395

Open: v0xie wants to merge 5 commits into base: dev


@v0xie v0xie commented Jun 28, 2024

This PR implements the algorithm in "Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment" (2024, Li et al.) https://arxiv.org/abs/2406.12303

  • The algorithm modifies the latents before noise is added to project training images onto only nearby noise. This is supposed to speed up convergence and capture more fine detail in the trained model.

  • There is a noise assignment operation that adds some overhead to training time, but the paper reports it adding only 22.8 ms when training with a batch size of 1024.

  • Enable it by adding the argument --immiscible_noise.


2024/06/27 - Outdated results

Here are some experimental results trained on the "monster_toy" dataset from the Dreambooth repository (https://github.com/google/dreambooth/blob/main/dataset/monster_toy/00.jpg). Keep in mind the dataset is only 5 images, so by Epoch 30 the model is already starting to be overtrained.
  • Training with Huber loss:
    comparison_huberloss

  • Training with no Huber loss:
    comparison_no_huber

  • The loss/epoch graph looks like the FID/Training Steps graphs from the paper:
    loss

Thank you for your consideration!

@araleza

araleza commented Jun 29, 2024

This sounds interesting. I fetched your branch, and ran one of my standard training runs (110 images, mostly high quality/resolution, with decent captions) at these learning rates:

Tenc: 1e-10
Unet: 1e-7
Batch size: 4
Loss: Huber
Format: fp32

Those are very slow learning rates, but the images still became 'wobbly' almost immediately, and even after 1500 iterations, it hadn't recovered:

image

Do other people see something similar?

I made a new build, and there's some new warning popping up:
sd-scripts-immi/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
return F.conv2d(input, weight, bias, self.stride,

Seems to be a torchvision vs. torch vs. xformers version issue. I don't think that warning had any effect though, so I doubt the wobbly renders are caused by that.

Edit: Re-ran the same training run without --immiscible_noise and the images were sharp again, so the low quality images I saw are associated with --immiscible_noise, and not that cudnn warning.

@feffy380
Contributor

@v0xie Your loss graph says these were trained with batch size 1, so there's nothing to assign. The fact that it's still affecting the loss tells me something is wrong with the implementation.
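
A toy illustration of why batch size 1 leaves nothing to assign. This is a minimal plain-Python sketch, not the PR's code: a brute-force search stands in for a real assignment solver such as scipy's linear_sum_assignment, and the helper names (`sq_dist`, `assign_noise`) and sample values are made up for illustration.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_noise(latents, noise):
    """Return the permutation of noise indices with the lowest total distance.

    Brute force over all permutations: fine for a toy batch, far too slow for
    real training, where a Hungarian-style solver would be used instead.
    """
    indices = range(len(noise))
    return min(
        permutations(indices),
        key=lambda perm: sum(sq_dist(latents[i], noise[j]) for i, j in enumerate(perm)),
    )


# Batch size 1: the only possible permutation is the identity.
single = assign_noise([[0.5, -0.2]], [[1.3, 0.7]])

# Batch size 3: the noise samples get reordered to sit closer to the latents.
latents = [[1.0, 1.0], [-1.0, -1.0], [0.0, 2.0]]
noise = [[-0.9, -1.2], [0.1, 1.8], [1.1, 0.8]]
perm = assign_noise(latents, noise)
```

With a single latent and a single noise sample the result is always `(0,)`, so a correct assignment step should leave the training signal unchanged at batch size 1.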

@feffy380
Contributor

feffy380 commented Jun 30, 2024

The immiscible noise is supposed to replace the original random noise, but the code is adding both to the latents.
The result is that the returned noisy_latents has twice as much noise as intended.

Based on the paper, we only need to:

  1. Generate a batch of noise, preferably n >> 1
    • They show you get better matching noise with larger batch sizes. We can't always use the latent batch size because for most users that's quite small, so you will never get a good match. At the very least it must be more than 1 to be able to perform matching at all.
  2. Find similar noise-latent pairs
  3. Use this noise to replace noise = torch.randn_like(latents).

Something like this (I don't know if my distance calculation is efficient but it does work in fp16):

 def get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents):
     # Sample noise that we'll add to the latents
-    noise = torch.randn_like(latents, device=latents.device)
+    if args.immiscible_diffusion:
+        # Immiscible Diffusion https://arxiv.org/abs/2406.12303
+        from scipy.optimize import linear_sum_assignment
+        n = args.immiscible_diffusion # arg is an integer for how many noise tensors to generate
+        size = [n] + list(latents.shape[1:])
+        noise = torch.randn(size, dtype=latents.dtype, layout=latents.layout, device=latents.device)
+        # find similar latent-noise pairs
+        latents_expanded = latents.half().unsqueeze(1).expand(-1, n, *latents.shape[1:])
+        noise_expanded = noise.half().unsqueeze(0).expand(latents.shape[0], *noise.shape)
+        dist = (latents_expanded - noise_expanded)**2
+        dist = dist.mean(list(range(2, dist.dim()))).cpu()
+        noise = noise[linear_sum_assignment(dist)[1]]
+    else:
+        noise = torch.randn_like(latents, device=latents.device)
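
To make the "n >> 1" point concrete, here is a small plain-Python sketch (brute force again standing in for linear_sum_assignment; `best_total_cost` and the sample values are hypothetical). It shows that a larger noise pool can only keep the best achievable latent-noise distance the same or lower it. One caveat worth keeping in mind: with a rectangular cost matrix, scipy's linear_sum_assignment returns min(rows, cols) pairs, so in the snippet above n should be at least the latent batch size, otherwise fewer noise tensors than latents come back.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_total_cost(latents, pool):
    """Lowest total squared distance when each latent gets a distinct pool entry.

    Brute force over ordered selections; a stand-in for a rectangular
    assignment solver. Requires len(pool) >= len(latents).
    """
    if len(pool) < len(latents):
        raise ValueError("noise pool must be at least as large as the latent batch")
    return min(
        sum(sq_dist(lat, pool[j]) for lat, j in zip(latents, picks))
        for picks in permutations(range(len(pool)), len(latents))
    )


latents = [[1.0, 0.0], [0.0, 1.0]]
pool = [[-1.0, -1.0], [2.0, 2.0], [0.9, 0.1], [0.1, 1.2], [-0.5, 0.5]]

# Every selection available in a small pool is still available in a larger
# one, so the best cost is non-increasing as the pool grows.
costs = [best_total_cost(latents, pool[:n]) for n in range(2, len(pool) + 1)]
```

In this toy case the best total cost drops from 10.0 with two candidates to about 0.07 once a close match for each latent is in the pool.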

@araleza

araleza commented Jun 30, 2024

Hey @feffy380, my first impression is that your code seems to be working. I set n = 32 for my first run with it (cause I hadn't read the bit in the paper where they recommend 1024 at that point), and I think I saw quality improvements even at that low level. I'm restarting a new run with n = 1024 now. Maybe make the default just be 1024, so people don't need to know what value to pass in?

One thing I noticed is that even though my training images are all real-world images, the sample renders continue to show cartoon-styled images longer than usual. I saw one even at iteration 550. I don't think that's an issue, it looks like it'll learn to stop doing that, but I found it interesting to note. (I stopped at iteration 650, so I don't know if I'd have gotten any more cartoon-style samples)

@v0xie
Author

v0xie commented Jun 30, 2024

Thank you for testing @araleza, and thank you for the detailed review @feffy380!

I incorporated the suggested changes and I'm running some tests now.

  • --immiscible_noise is now an integer argument that sets the size of the batch of random noise to generate.
    • Ex: --immiscible_noise=1024 for a noise batch size of 1024.

@v0xie
Author

v0xie commented Jun 30, 2024

I've been training with ip_noise_gamma=0.1 this whole time. Ran some tests without it to see what that's like.

trainingimages_huber_no_ip_batch1024

**Dropdown for more images**

comparison_no_huber_ip_noise_batch1024
trainingimages_huberon_batch1024

trainingimages_no_huber_no_ip_batch1024

@araleza

araleza commented Jun 30, 2024

My test run with noise batch size 1024 has reached 11000 iterations now with feffy380's code (I haven't tried the new updated version from v0xie yet), and it's looking good.

My sample images look different in quality (better lighting, and fewer facial distortions on the difficult training images) to how they usually look without the immiscible noise parameter set. I'd like to try more training runs at different learning rates to be more confident, but as far as I can tell, this is a positive change.

@araleza

araleza commented Jul 2, 2024

Hi, so I grabbed the latest code in your branch again, @v0xie. I'm still seeing lots of very noisy, damaged images. When I look at the code, it seems there are two sections: the part that feffy380 wrote, and a second section that looks like this:

def immiscible_diffusion(args, noise_scheduler, latents, noise, timesteps):
    # "Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment" (2024) Li et al. arxiv.org/abs/2406.12303
    batch_size, _, _, _= latents.shape
    alpha_t = noise_scheduler.alphas.to(timesteps.device)
    alpha_t = alpha_t[timesteps]
    alpha_t = alpha_t.view(batch_size, 1, 1, 1)
    sqrt_alpha_t = torch.sqrt(alpha_t)
    sqrt_one_minus_alpha_t = torch.sqrt(1 - alpha_t)
    x_t_b = sqrt_alpha_t * latents + sqrt_one_minus_alpha_t * noise
    return x_t_b

[...]

    if args.immiscible_noise:
        latents = immiscible_diffusion(args, noise_scheduler, latents, noise, timesteps)

If I comment out the call to immiscible_diffusion() - which still leaves the call to immiscible_diffusion_get_noise() in the code - then the noisy corruption on the images goes away.

Looking at the paper you've linked, I can see why you added that second call. But I think there must be a bug in that implementation. :(

@feffy380: I've now done lots of runs with just that section of code you provided in place. These are the BEST runs of sdxl training that I've done to date. The quality gains are amazing - it's like a new model. And thanks go to @v0xie for finding this great paper.

@feffy380
Contributor

feffy380 commented Jul 2, 2024

>     if args.immiscible_noise:
>         latents = immiscible_diffusion(args, noise_scheduler, latents, noise, timesteps)
>
> If I comment out the call to immiscible_diffusion() - which still leaves the call to immiscible_diffusion_get_noise() in the code - then the noisy corruption on the images goes away.

Like I said before, adding noise to the latents like this is wrong because the noise_scheduler already does that a few lines later. You get noisy results because the latents now have 2x noise, but the unet is only removing 1x noise. The extra noise has effectively become part of the ground truth, which completely corrupts the dataset.
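
A scalar sketch of what the double noise add does to the signal/noise mix. The assumptions are loud ones: both applications are taken to use the same noise sample (as the quoted call appears to), the same coefficient is used for both steps for simplicity, and `alpha_bar = 0.5` is a hypothetical value for one timestep. The function names are illustrative, not from the PR.

```python
import math


def forward_coeffs(alpha_bar):
    """Signal and noise weights of x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)


def compose(first, second):
    """Weights after applying the forward formula twice with the same eps.

    x'  = s1 * x_0 + n1 * eps
    x'' = s2 * x'  + n2 * eps = (s2 * s1) * x_0 + (s2 * n1 + n2) * eps
    """
    s1, n1 = first
    s2, n2 = second
    return s2 * s1, s2 * n1 + n2


alpha_bar = 0.5  # hypothetical value for one timestep
intended = forward_coeffs(alpha_bar)   # what the scheduler alone would produce
doubled = compose(intended, intended)  # extra noising step applied first
```

Under these assumptions the signal weight falls from about 0.707 to 0.5 while the noise weight rises from about 0.707 to about 1.207, so the model sees inputs noisier than the timestep claims. The exact factor depends on the timestep and on which coefficients each step uses.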

@araleza

araleza commented Jul 2, 2024

@feffy380, I think that call is there to try to implement step 3 in this part of the paper:

image

Is there some other way of doing that step that might be correct, and better than just picking the closest noise to the current latent?

@v0xie
Author

v0xie commented Jul 2, 2024

You're absolutely correct about the double noise add @feffy380. Removed it and it's much improved. What's funny is that even with the double noise add I was getting pretty good results, which might speak to the effectiveness of this method.

Results after removing the double noise add; also trained a test with immiscible_noise=4096, which didn't add any noticeable delay to training, at least at 512^2.
20240702_trainingimages_huber_no_ip_batch1024

@feffy380
Contributor

feffy380 commented Jul 4, 2024

@araleza Step 3 is adding noise to the latents, which is what noise_scheduler.add_noise(latents, noise, timesteps) already does. That's why we only assign to the noise variable.

@araleza

araleza commented Jul 4, 2024

@feffy380, thanks for helping me understand; I don't have a very strong knowledge of pytorch commands. The bit that confuses me still though is that the code that's now been removed has this section:

    sqrt_alpha_t = torch.sqrt(alpha_t)
    sqrt_one_minus_alpha_t = torch.sqrt(1 - alpha_t)
    x_t_b = sqrt_alpha_t * latents + sqrt_one_minus_alpha_t * noise

And that looks exactly like Step 3 in the paper:

image

But the bit we've kept doesn't have anything that looks like that equation. So how come it still works? Does the code section that's still around (i.e. the immiscible_diffusion_get_noise() function) implement that function with the two square roots in some way that isn't so obviously written out explicitly?

Edit: Or maybe those square roots are inside noise_scheduler.add_noise()?
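
That guess is consistent with the standard DDPM forward step, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, which is what add_noise computes in diffusers' DDPMScheduler (assuming noise_scheduler is one), using the cumulative product alphas_cumprod. The removed function indexed noise_scheduler.alphas instead, so it was not quite the same formula either. Below is a plain-Python stand-in, not the library source; the linear beta schedule values are illustrative and the function names are hypothetical.

```python
import math


def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alphas_cumprod, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def add_noise(x0, eps, t, alphas_cumprod):
    """x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, element-wise."""
    signal = math.sqrt(alphas_cumprod[t])
    sigma = math.sqrt(1.0 - alphas_cumprod[t])
    return [signal * x + sigma * e for x, e in zip(x0, eps)]


alphas_cumprod = make_alphas_cumprod()
x0 = [1.0, -2.0, 0.5]
eps = [0.3, 0.1, -0.7]
early = add_noise(x0, eps, 0, alphas_cumprod)    # almost all signal
late = add_noise(x0, eps, 999, alphas_cumprod)   # almost all noise
```

Since the scheduler already performs this mix, the immiscible step only has to decide which noise sample goes with which latent.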

@78752

78752 commented Jul 5, 2024

After doing some testing I'm actually getting consistently slightly worse results with the latest iteration of this PR compared to 7b487ce. Certain high frequency details that appeared consistently with the original are lost when reusing the same settings and dataset. Not really sure why.

@kohya-ss added the enhancement (New feature or request) label on Jul 8, 2024

@Clybius

Clybius commented Nov 3, 2024

Figure I'll put this out there since there appears to have been an update for immiscible diffusion (v2 on the arxiv?), along with code examples. I've gotten it simplified down for a single-process use-case (I think this works as intended?)

A notable change is the distance calculation, which now uses a scaled L2 norm over the flattened tensors. Either way, it worked rather well on a test run, so I felt the need to share.

# https://github.com/yhli123/Immiscible-Diffusion/blob/main/stable_diffusion/conditional_ft_train_sd.py#L941
import torch
from scipy.optimize import linear_sum_assignment


def immiscible_diffusion_get_noise_v2(latents, n=None):
    """
    Generates noise for immiscible diffusion, simplified for single process.
    """

    with torch.no_grad():
        batch_size = latents.shape[0] if n is None else n
        size = [batch_size] + list(latents.shape[1:])
        noise = torch.randn(size, dtype=latents.dtype, layout=latents.layout, device=latents.device) # [B, C, H, W]

        # Distance calculation
        distance = torch.linalg.vector_norm(
            0.10 * latents.to(torch.float16).flatten(start_dim=1).unsqueeze(1) -
            0.10 * noise.to(torch.float16).flatten(start_dim=1).unsqueeze(0),
            dim=2
        )  # [B, B]

        _, col_ind = linear_sum_assignment(distance.cpu().numpy())
        noise = noise[col_ind].to(latents.device)  # Assign the permuted noise

    return noise

In get_noise_noisy_latents_and_timesteps (or your model-specific noisy-latent function), replace the assignment to the noise variable with a call to immiscible_diffusion_get_noise_v2.
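
On the 0.10 factor in the distance calculation: it multiplies every entry of the cost matrix by the same positive constant, so it should not change which permutation is optimal. A plausible reason for it (an assumption, not confirmed in this thread) is keeping fp16 intermediate sums in range. The plain-Python sketch below checks the invariance on toy data, with a brute-force search standing in for linear_sum_assignment; the names and values are hypothetical.

```python
from itertools import permutations
import math


def l2(a, b, scale=1.0):
    """Euclidean norm of scale * a - scale * b."""
    return math.sqrt(sum((scale * x - scale * y) ** 2 for x, y in zip(a, b)))


def best_permutation(latents, noise, scale=1.0):
    """Brute-force stand-in for linear_sum_assignment on the L2 cost matrix."""
    return min(
        permutations(range(len(noise))),
        key=lambda perm: sum(l2(latents[i], noise[j], scale) for i, j in enumerate(perm)),
    )


latents = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
noise = [[0.2, 1.7], [-1.5, -2.2], [1.9, 0.4]]

unscaled = best_permutation(latents, noise)
scaled = best_permutation(latents, noise, scale=0.10)
```

Both calls pick the same permutation here. Switching between a mean squared distance and an L2 norm is a different matter, since those are different objectives and can in principle lead to different assignments.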
