The BLIP-2 implementation difference between this repo and HuggingFace #418
Comments
Can I know your transformers version?
transformers==4.30.2
@NielsRogge Could you provide some suggestions? Thanks!
Hi, thanks for reporting. I've re-run the conversion script (to convert the LAVIS checkpoints to the HF format), and I've noticed that setting layer_norm_eps=1e-6 (instead of 1e-5 as it is now) results in logits that match up to a tolerance of 1e-4. Running
results in:
So this only passes if I update layer_norm_eps. @yuanze1024, can you try by updating layer_norm_eps to 1e-6?
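For anyone who wants to poke at this locally, here is a minimal sketch of that kind of check. Overriding layer_norm_eps through the config is only an illustrative workaround (the real fix would live in the conversion script), and pointing it at the vision sub-config is an assumption about where the epsilon in question sits; the image path is a placeholder.

import torch
from PIL import Image
from transformers import Blip2Config, Blip2ForConditionalGeneration, Blip2Processor

name = "Salesforce/blip2-flan-t5-xxl"

# Illustrative override: bump the vision tower's layer norm epsilon to 1e-6.
config = Blip2Config.from_pretrained(name)
config.vision_config.layer_norm_eps = 1e-6

processor = Blip2Processor.from_pretrained(name)
model = Blip2ForConditionalGeneration.from_pretrained(name, config=config, torch_dtype=torch.float32)
model.eval()

image = Image.open("demo.jpg").convert("RGB")  # placeholder test image
inputs = processor(images=image, text="a photo of", return_tensors="pt")

# T5-style decoders need a decoder start token for a plain forward pass.
decoder_input_ids = torch.tensor([[model.config.text_config.decoder_start_token_id]])

with torch.no_grad():
    hf_logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits

# `lavis_logits` would come from running the same image/prompt through the original
# LAVIS model in float32; the claim above is that the two then agree:
# assert torch.allclose(lavis_logits, hf_logits, atol=1e-4)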
Update: I've also tested it using your image and the following prompts:
However, note that all of this is with the conversion script. I will now double-check whether the same can be done with the HF models which are available on the hub.
And by the way, you can't use
Hey @NielsRogge, thank you for your help, and forgive me for not replying in time. To be clear, I'm using the HF checkpoint from https://huggingface.co/Salesforce/blip2-flan-t5-xxl/tree/main and the LAVIS checkpoint https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth, not converting them on my own (I will do that if necessary). I've tried your suggestions. However, things are different when I try other pictures, such as:
What's more, when I use a batch size of 2 and input the screenshot image and the proto image together, the captions come out as:
lavis:
It can be seen that the screenshot image's captions are affected by the proto image, to say nothing of the proto image itself. Overall, I don't think the HF model is the same as the LAVIS one, and this phenomenon confuses me a lot. BTW, I am concerned about consistency because I want to save as much inference time as possible while maintaining consistent or similar inference results. You know, graphics cards are really expensive. So, another question: why does the HF model seem approximately three times faster than the LAVIS model? If the aforementioned differences are resolved, will HF still be faster?
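For reference, a sketch of how such a two-image batch can be fed through the HF processor. The file names and the prompt are placeholders, and padding=True on the text side is an assumption about how the two prompts get aligned; it is not necessarily the exact setup used above.

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.bfloat16
).to("cuda")

# Two images captioned in a single batch; the processor pads the prompts to equal length.
images = [
    Image.open("screenshot.png").convert("RGB"),  # placeholder file names
    Image.open("proto.png").convert("RGB"),
]
prompts = ["a photo of", "a photo of"]

inputs = processor(images=images, text=prompts, padding=True, return_tensors="pt").to("cuda", torch.bfloat16)
out = model.generate(**inputs, num_beams=5, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True))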
Hi, I've also checked sampling (providing the corresponding generate flag). Also note that LAVIS uses different dtypes for the various building blocks of BLIP-2 (it autocasts the vision encoder to a lower-precision dtype).
@NielsRogge Can I ask you how to set the random seed?

import torch
torch.manual_seed(0)
import random
random.seed(0)
import numpy as np
np.random.seed(0)

Is that OK?
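For what it's worth, a fuller seeding sketch that also covers CUDA; transformers ships a set_seed helper that wraps these same calls. Whether any remaining nondeterminism matters for this comparison is a separate question.

import random

import numpy as np
import torch
from transformers import set_seed

def seed_everything(seed: int = 0) -> None:
    # Seed Python, NumPy and PyTorch (CPU and all visible GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(0)
# Or, equivalently:
set_seed(0)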
I'm using:
The script I'm using is here: https://github.com/NielsRogge/transformers/blob/improve_blip2/src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py. So to reproduce, you can do:
Just bumping this, since it seems to affect the correctness of current HF BLIP-2 behavior. @NielsRogge, to clarify: if we're trying to use BLIP-2 from HF (with mixed precision), is the current checkpoint/HF code incorrect (e.g. the layer_norm_eps discussed above)? Furthermore, if LAVIS uses different dtypes for the various building blocks, should HF usage mirror that? In the meantime, if I want to play it safest for replicating BLIP-2 results, what's the recommended workflow for using BLIP-2 checkpoints? Should I just use LAVIS directly (@LiJunnan1992)?
So to clarify:
Also, the first message in this thread does not test equivalent results, because the OP uses different generation settings (beam search for LAVIS and greedy decoding for HF) to get the "object" in the image. Moreover, he uses different dtypes for the two implementations, which further explains the differences. TL;DR: make sure to compare apples to apples (same generation settings + dtypes).
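A sketch of what an apples-to-apples comparison could look like: same image, same prompt, beam search with the same settings on both sides, and float32 weights on both sides. The LAVIS call follows the blip2_t5 generate signature; the exact remaining defaults on either side may still differ, so double-check them before trusting the comparison.

import torch
from PIL import Image

image = Image.open("example.jpg").convert("RGB")  # placeholder image
prompt = "a photo of"
device = "cuda"

# --- HF side: float32 weights, beam search ---
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
hf_model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float32
).to(device)
hf_inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
hf_out = hf_model.generate(**hf_inputs, num_beams=5, max_length=30, min_length=1)
print("HF:", processor.batch_decode(hf_out, skip_special_tokens=True))

# --- LAVIS side: same decoding settings (LAVIS may still autocast parts internally) ---
from lavis.models import load_model_and_preprocess

lavis_model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xxl", is_eval=True, device=device
)
lavis_image = vis_processors["eval"](image).unsqueeze(0).to(device)
lavis_out = lavis_model.generate(
    {"image": lavis_image, "prompt": prompt}, num_beams=5, max_length=30, min_length=1
)
print("LAVIS:", lavis_out)

# Note: both xxl models in float32 are large; load them one at a time if memory is tight.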
Thanks so much @NielsRogge -- this is incredibly helpful. Interesting that T5 supposedly works for both FP16 and BF16... do you know if it's safe to run the vision backbone in BF16 precision (assuming you load in FP32 first, then cast to BF16)? Or is that just a model-specific unknown?
@siddk, per this answer, it looks like it's always fine to go from float32/float16 to bfloat16 (but not the other way around). And bfloat16 seems to be a lot more stable for training compared to float16.
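As a concrete illustration of that loading pattern (load in float32, then downcast only the vision tower): vision_model is the attribute exposed by the HF BLIP-2 class, and whether this is numerically safe for BLIP-2 is exactly the open question above, so treat this as a sketch rather than a recommendation.

import torch
from transformers import Blip2ForConditionalGeneration

# Load everything in float32 first...
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float32
)

# ...then downcast only the vision backbone (Q-Former and the T5 language model stay in float32).
model.vision_model.to(torch.bfloat16)
model.to("cuda")  # moving to a device does not change the per-module dtypes set above

# Because the sub-modules now disagree on dtype, run inference under autocast so the
# boundaries (pixel values -> ViT, ViT output -> Q-Former) are handled automatically:
# with torch.autocast("cuda", dtype=torch.bfloat16):
#     out = model.generate(**inputs, num_beams=5, max_new_tokens=30)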
Hi, thank you for your excellent work. I'm facing a problem using BLIP-2 (inference only) to generate captions, and I think you may have clues about it.
Background
I'm trying Cap3D, which uses BLIP-2 as a component. In their code, they use this LAVIS implementation to generate captions for images rendered from a 3D model one at a time, which in my opinion is very slow. So I tried to generate captions for a batch of images at the same time.
Problems
For the LAVIS implementation, they chose pretrain_flant5xxl. For the HuggingFace implementation, I chose Salesforce/blip2-flan-t5-xxl, which I think should be similar to the former, so I guess they are trained in the same way and should share similar performance. However, I found that the LAVIS implementation is about 3x slower than the HuggingFace released model, while the LAVIS one generates captions of better quality.
Aren't they the same?
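Since the comparison here is about the wall-clock time of generate(), a fair GPU timing pattern (a warm-up run plus torch.cuda.synchronize() before reading the clock) looks roughly like the sketch below; the closures at the end are placeholders, not my exact calls.

import time
import torch

def time_call(fn, warmup: int = 1, repeats: int = 3) -> float:
    """Average the wall-clock time of a zero-argument callable that runs GPU generation."""
    for _ in range(warmup):      # warm-up runs pay one-off costs (CUDA context, caches)
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    torch.cuda.synchronize()     # wait for all queued GPU work before stopping the clock
    return (time.perf_counter() - start) / repeats

# Usage (hypothetical closures around the two models):
# hf_time    = time_call(lambda: hf_model.generate(**hf_inputs, num_beams=5, max_new_tokens=30))
# lavis_time = time_call(lambda: lavis_model.generate({"image": lavis_image, "prompt": prompt}, num_beams=5))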
How to reproduce it
LAVIS:
HuggingFace
LAVIS/lavis/models/blip2_models/blip2_t5.py, lines 159 to 171 (commit f3212e7)
Results
LAVIS
HuggingFace
We can see that the time_generate1 values are about the same, but the time_generate2 values are different.

the example image is:
Environment
Ubuntu-20.04.4
python 3.8.13
torch 1.13.0a0+d321be6
and a single A100 (actually an A800) in my experiment.
I also tried adding
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
before the HF generate() to align HF with LAVIS. It is a bit slower (from 39s to 50s), but not slow enough to explain the difference. Please let me know if any information is missing for debugging.
It would be appreciated if you could help me find out the reason for the difference.