Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add InstructBLIP #23460

Merged
merged 7 commits into from
Jun 26, 2023
Merged

Add InstructBLIP #23460

merged 7 commits into from
Jun 26, 2023

Conversation

NielsRogge
Copy link
Contributor

@NielsRogge NielsRogge commented May 19, 2023

What does this PR do?

This PR adds InstructBLIP, a visual instruction tuned version of BLIP-2.

It's a bit like an open-source multimodal GPT-4, leveraging Flan-T5 and Vicuna pre-trained checkpoints.

Basic usage is as follows:

from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image
import requests

model = InstructBlipForConditionalGeneration.from_pretrained("...")
processor = InstructBlipProcessor.from_pretrained("...")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

url = "https://raw.githubusercontent.com/salesforce/LAVIS/main/docs/_static/Confusing-Pictures.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = "What is unusual about this image?"
inputs = processor(images=image, text=prompt, return_tensors="pt")

outputs = model.generate(
        **inputs,
        do_sample=False,
        num_beams=1,
        max_length=256,
        min_length=1,
        top_p=0.9,
        repetition_penalty=1.5,
        length_penalty=1.0,
        temperature=1,
)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)

To do:

  • discuss whether to integrate the QFormerTokenizer into the processor
  • integration tests
  • figure out the the best way to handle the various dtypes of the vision encoder and language model

Nice to haves:

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 19, 2023

The documentation is not available anymore as the PR was closed or merged.

@Reveyer
Copy link

Reveyer commented May 21, 2023

Thank you for your contribution! I noticed a potential problem with this open PR. It seems that the InstructBLIP processor is missing the QformerTokenizer compared to the BLIP2Processor.

@NielsRogge NielsRogge force-pushed the add_instruct_blip branch from 95c349a to 80065da Compare May 22, 2023 20:07
@NielsRogge NielsRogge requested a review from sgugger May 24, 2023 06:59
Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this new model. It's mostly in good shape apart from the floating dtypes in the forward. Like all models in Transformers, this should run in the default precision and users can decide to use another dtype for parts (or all) of their models, but we can't hard-code this.

@NielsRogge
Copy link
Contributor Author

Thanks for your review. Updates:

  • all autocast logic was removed, turns out the implementation returns the same exact logits as the original implementation when also using float32 for the original implementation. However, we may need to think about supporting various dtypes of building blocks of a model, cause if you'd do from_pretrained("...", dtype=torch.float16"), that would break for the Flan-T5 checkpoints, which require bfloat16. It would be nice to provide the possibility to load the vision encoder in float16 and the language model in bfloat16.
  • The InstructBlipProcessor is a bit different than other processors in the sense that it consists of 1 image processor and 2 tokenizers (one for the language model, one for the Q-Former). I've included logic to save the Q-Former tokenizer files in a separate folder on the hub as can be seen here, and had to overwrite the from_pretrained and save_pretrained methods to make this work. I know that this logic may need to be addressed in a separate PR.

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating! Changes LGTM apart from the modifications in the push to hub mixin which really need to go in their own PR.

src/transformers/utils/hub.py Outdated Show resolved Hide resolved
@yukw777
Copy link

yukw777 commented May 31, 2023

Will the converted weights be hosted on the model hub like blip-2?

@NielsRogge
Copy link
Contributor Author

NielsRogge commented Jun 5, 2023

All checkpoints are transferred: https://huggingface.co/models?other=instructblip.

Feel free to merge the PR.

The only thing left is uploading fast tokenizer files for the Vicuna-based checkpoints, but that can only be done once #23889 is fixed. Currently the fast tokenizer is created on-the-fly based on the slow tokenizer files when loading from the hub.

Update: that's now also done, so it's entirely ready

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NielsRogge Please do not resolve comments without addressing them or explain why you refuse them.

@sgugger
Copy link
Collaborator

sgugger commented Jun 5, 2023

@amyeroberts Could you have a final look and merge if you are happy?

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this model!

There's still a few small things to address before it's ready to merge in. Main comments:

  • Docstrings for some main objects aren't correct - mostly missing inputs needing to be added
  • Tolerance for the logits check between the converted and original models is v. high. Have you dug into this at all? Do you know where these difference are coming from?
  • Possibly missing tests? e.g. There's InstructBlipTextModelTester but no InstructBlipTextModelTest and some tests for InstructBlipModel are skipped because they're run in individual model tests. It would be good to get @ydshieh's insight on this as he's both the composite model and testing king 👑

src/transformers/utils/hub.py Outdated Show resolved Hide resolved
tests/models/instructblip/test_modeling_instructblip.py Outdated Show resolved Hide resolved
tests/models/instructblip/test_modeling_instructblip.py Outdated Show resolved Hide resolved
tests/models/instructblip/test_modeling_instructblip.py Outdated Show resolved Hide resolved
@ydshieh
Copy link
Collaborator

ydshieh commented Jun 6, 2023

There's InstructBlipTextModelTester but no InstructBlipTextModelTest

In general, I would say yes to have 1-1 correspondence. But I don't want to make it strict if it doesn't really bring anything valuable.

The pipeline testing script would be easier if we have such correspondence, but since I was able to manage BLIP2 already, and this test file here is similar to BLIP2, I think it's fine.

and some tests for InstructBlipModel are skipped because they're run in individual model tests.

It's same as CLIP test file, so it's OK :-)

@amyeroberts
Copy link
Collaborator

@ydshieh Thanks for reviewing & info about the tests!

and some tests for InstructBlipModel are skipped because they're run in individual model tests.
It's same as CLIP test file, so it's OK :-)

Ah, sorry, I wasn't clear. What I meant was: if tests are skipped with the reason of being already tested in individual model tests, don't we need the modular tests classes implemented i.e. InstructBlipTextModelTest?

@ydshieh
Copy link
Collaborator

ydshieh commented Jun 6, 2023

Ah, sorry, I wasn't clear. What I meant was: if tests are skipped with the reason of being already tested in individual model tests, don't we need the modular tests classes implemented i.e. InstructBlipTextModelTest?

I agree (was thinking the same but my mind is lost in my reply).

@NielsRogge I will let you to express why there is no text model test class :-), which is same as in BLIP2.

Well, after looking a bit, the text part is not a fixed model class

        if config.use_decoder_only_language_model:
            language_model = AutoModelForCausalLM.from_config(config.text_config)
        else:
            language_model = AutoModelForSeq2SeqLM.from_config(config.text_config)

I think that's the main reason why we don't have the test for that part.

@kfallah
Copy link

kfallah commented Jun 12, 2023

Hi, will this land soon? I would love to try out this model. Thanks!

@NielsRogge
Copy link
Contributor Author

NielsRogge commented Jun 12, 2023

Thanks @amyeroberts for your review, there was a bug with LlamaTokenizerFast that has now been fixed, now the absolute tolerance is much lower (1e-4 and 1e-5).

I've removed InstructBlipModel from this PR as that was copied from Blip2Model using the CookieCutter template. The latter was added in this PR: #21817. However I'm not sure why the latter got approved, cause it's not really in lign with the design of the library, meaning that xxxModel are models not including any head on top and not accepting a labels argument. However Blip2Model seems like an entire copy of Blip2ForConditionalGeneration, which seems odd to me.

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for adding this model!

Just a few small nit comments

README_es.md Outdated Show resolved Hide resolved
tests/models/instructblip/test_modeling_instructblip.py Outdated Show resolved Hide resolved
tests/models/instructblip/test_modeling_instructblip.py Outdated Show resolved Hide resolved
tests/models/instructblip/test_modeling_instructblip.py Outdated Show resolved Hide resolved
Comment on lines +1345 to +1287
def get_input_embeddings(self):
return self.language_model.get_input_embeddings()

def set_input_embeddings(self, value):
self.language_model.set_input_embeddings(value)

def set_output_embeddings(self, new_embeddings):
self.language_model.set_output_embeddings(new_embeddings)

def get_output_embeddings(self) -> nn.Module:
return self.language_model.get_output_embeddings()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue this is a bit confusing, if I call model.get_input_embeddings it's not obvious from which module these relate to

@zdxff
Copy link

zdxff commented Jun 16, 2023

Do the prompt need further packaging when inference? For example, BLIP2 use "Question: {prompt}? Answer: " as prompt. And which type of prompt be used in InstructBLIP? Or we only use question to ask the model?

@amyeroberts
Copy link
Collaborator

@NielsRogge It appears in the current diff that there a some changes unrelated to this PR? Could you rebase to sync up with main? Could you also respond to the questions in the PR review instead of just marking as resolved?

@ydshieh
Copy link
Collaborator

ydshieh commented Jun 24, 2023

Well 💚

@ydshieh
Copy link
Collaborator

ydshieh commented Jun 26, 2023

Merge it now as 🟢 is approved.

@ydshieh ydshieh merged commit 868363a into huggingface:main Jun 26, 2023
@NielsRogge
Copy link
Contributor Author

Hi @zdxff there's no specific prompt being used for InstructBLIP. You can just ask it questions like "What is unusual about this image?"

@younesbelkada
Copy link
Contributor

younesbelkada commented Jun 26, 2023

Will work on the 8bit / 4bit integration ASAP !

EDIT: here you go #24488

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants