
In Fig. 3, why does a smaller denoising-step factor μ give a lower FID? #1

Open
donghaotian123 opened this issue Mar 30, 2024 · 5 comments

Comments

@donghaotian123

Hello, I have read the SCP-Diff paper and have a few questions I hope you can answer.

  1. In Fig. 3, why does a smaller step factor μ give a lower FID? As I understand it, a lower FID means the generated images are more realistic, so I would expect denoising from step T to yield the better FID, but that contradicts Fig. 3. I suspect I have misunderstood what Fig. 3 shows.
  2. Why does Sec. 3.2 argue that the idea of "starting denoising from a distribution closer to the standard Gaussian" is unreliable? Fig. 8 does seem to support this view, but judging from your reference [17] and from the train/inference noise-distribution mismatch you point out, having step T follow a standard Gaussian during training seems to be the better choice, so why not denoise from step T?

  I hope you can clarify these points. Thank you!
@GasaiYU
Collaborator

GasaiYU commented Mar 30, 2024

Thanks for your good questions!

On the first question: Fig. 3 is different from Fig. 8. For Fig. 3, we first add $\mu T$ steps of noise to the latent code $x_0$ to get $x_{\mu T}$, then use our pretrained ControlNet model to denoise $x_{\mu T}$ for $\mu T$ steps to obtain the prediction $\hat{x}_0$. We compute the FID between $\hat{x}_0$ and $x_0$. The less noise we add to $x_0$, the lower the FID we get.

The reason we include Fig. 3 is to trace the source of ControlNet's FID: it is not mainly due to score matching in the diffusion training process; rather, the distribution gap between the noised latents and standard Gaussian noise plays a significant role.

On the second question: because we do not want to fine-tune the weights of SD, we do not change the noise schedule during training; we only apply the noise prior at inference time.
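
For concreteness, here is a minimal sketch of the Fig. 3 setup, assuming a standard DDPM linear beta schedule and a hypothetical `eps_model` standing in for the pretrained ControlNet denoiser (neither is taken from the SCP-Diff codebase):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # standard DDPM linear schedule (assumption)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal-retention factors

def noise_to_step(x0, t):
    """Closed-form forward noising q(x_t | x_0)."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * eps

@torch.no_grad()
def denoise_from(x_t, t_start, eps_model):
    """Deterministic DDIM-style reverse pass from step t_start back to 0."""
    x = x_t
    for t in range(t_start, -1, -1):
        a_bar = alphas_bar[t]
        a_bar_prev = alphas_bar[t - 1] if t > 0 else torch.tensor(1.0)
        eps = eps_model(x, t)  # hypothetical noise-prediction model
        x0_hat = (x - (1.0 - a_bar).sqrt() * eps) / a_bar.sqrt()
        x = a_bar_prev.sqrt() * x0_hat + (1.0 - a_bar_prev).sqrt() * eps
    return x

# For a given mu, the Fig. 3 experiment is then:
#   t_start = int(mu * T) - 1
#   x_mu_t  = noise_to_step(x0, t_start)
#   x0_hat  = denoise_from(x_mu_t, t_start, eps_model)
# and FID is computed between the x0_hat and x0 populations.
```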

@donghaotian123
Author

Thank you very much for your answer! It resolves my doubts; indeed, I had misunderstood the experimental setup in Figure 3.

@GasaiYU
Collaborator

GasaiYU commented Mar 31, 2024

In Stable Diffusion, we first use a VQGAN encoder to encode an image into a latent code $x_0$. We then add $\mu T$ steps of noise to this latent code to get $x_{\mu T}$, and use the pretrained denoising model to denoise $x_{\mu T}$ for $\mu T$ steps to predict the original $x_0$. However, there is always some distance between the prediction $\hat{x}_0$ and the ground truth $x_0$, because the pretrained denoising model cannot recover exactly the same $x_0$. We then compute the FID between the predicted $\hat{x}_0$ and the ground truth $x_0$ to measure this difference.
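
As a side note on the metric itself: FID fits a Gaussian to feature vectors of each image set (conventionally Inception-v3 pool features) and evaluates the closed-form Fréchet distance between the two Gaussians. A minimal sketch, assuming `feats_a` and `feats_b` are (N, D) feature arrays extracted beforehand:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```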

In Fig. 3, we see a clear FID gap across different values of $\mu$ between the results of denoising from $x_{\mu T}$ (i.e., $x_0$ noised for $\mu T$ steps). This phenomenon suggests that denoising from a distribution equipped with the dataset's prior may perform better than denoising directly from the normal distribution.
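
To illustrate the "dataset prior" idea in code: one simple instantiation is to fit a per-position Gaussian to the noised training latents and sample the initial latent from it instead of from $\mathcal{N}(0, I)$. This is only a sketch of the general idea, not the paper's exact spatial/categorical priors; `noise_fn` is the hypothetical forward-noising helper from the sketch above:

```python
import torch

def fit_spatial_prior(latents, t, noise_fn):
    """Per-position Gaussian fitted over noised training latents.
    latents: (N, C, H, W) VQGAN latent codes; noise_fn: forward noising q(x_t | x_0)."""
    noised = torch.stack([noise_fn(x, t) for x in latents])
    return noised.mean(dim=0), noised.std(dim=0)

def sample_initial_latent(mean, std):
    """Start inference from the fitted prior rather than from N(0, I)."""
    return mean + std * torch.randn_like(std)
```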

@donghaotian123
Author

Hello, sorry to disturb you again. I still have two questions.

  1. What are the components involved in the calculation of the attention map in Figure 5?

  2. In the last paragraph of Section 4.2 about spatial priors, it mentions that "using spatial priors exhibits a broader receptive field." What does "receptive field" specifically refer to here? In computer vision, the receptive field usually refers to the input region a CNN can see. In semantic segmentation, we want the receptive field to be as large as possible, ideally covering the entire shape; in generative tasks, however, it seems the attended region should instead match the shape. Therefore, I don't quite understand what this part of the experiment aims to prove.

@Hoantrbl
Collaborator

Hoantrbl commented Aug 25, 2024

#1 (comment)

I also have the same question about Fig. 3. Admittedly, I think the empirical conclusion is just common sense about the distribution gap. I wonder whether we can find out how large the distribution gap between $x_{\mu T}$ and standard Gaussian noise actually is. If we could measure it, we might be able to control the guiding condition more accurately.
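
One cheap way to put a number on that gap, assuming we accept a diagonal-Gaussian approximation of the $x_{\mu T}$ samples, is the per-dimension KL divergence to $\mathcal{N}(0, I)$. This is a hypothetical diagnostic I am suggesting, not something from the paper:

```python
import torch

def gaussian_gap_kl(samples):
    """KL( N(mu, diag(var)) || N(0, I) ) summed over dimensions.
    samples: (N, D) flattened draws of x_{mu*T}."""
    mu = samples.mean(dim=0)
    var = samples.var(dim=0)
    kl_per_dim = 0.5 * (var + mu**2 - 1.0 - var.log())
    return kl_per_dim.sum().item()
```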
