Add InstructBLIP #23460
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thank you for your contribution! I noticed a potential problem with this open PR. It seems that the InstructBLIP processor is missing the QformerTokenizer compared to the BLIP2Processor.
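For context, a rough sketch (not the code from this PR) of what bundling the Q-Former tokenizer into the processor ends up looking like; the checkpoint id is assumed and the exact output keys may differ:

```python
from PIL import Image
from transformers import InstructBlipProcessor

# Assumed checkpoint id. The processor is expected to wrap an image processor,
# the language-model tokenizer, and a separate Q-Former tokenizer.
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-flan-t5-xl")

image = Image.new("RGB", (224, 224))  # dummy image, just for illustration
inputs = processor(images=image, text="Describe the image.", return_tensors="pt")

# Besides pixel_values and the language-model input_ids/attention_mask, the output
# should also carry qformer_input_ids / qformer_attention_mask for the Q-Former.
print(sorted(inputs.keys()))
```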
Thanks for working on this new model. It's mostly in good shape apart from the hard-coded floating-point dtypes in the forward pass. Like all models in Transformers, this should run in the default precision, and users can decide to use another dtype for parts (or all) of their models, but we can't hard-code this.
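As a sketch of the point above (checkpoint id assumed), the dtype stays a user-side decision rather than something baked into the modeling code:

```python
import torch
from transformers import InstructBlipForConditionalGeneration

# Default: the model loads and runs in float32, like every other model in Transformers.
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-flan-t5-xl")

# Opt-in: the user chooses half precision for the whole model; nothing is
# hard-coded inside the forward pass.
model_fp16 = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-flan-t5-xl", torch_dtype=torch.float16
)
```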
Thanks for your review. Updates:
Thanks for iterating! Changes LGTM apart from the modifications in the push to hub mixin which really need to go in their own PR.
Will the converted weights be hosted on the model hub like BLIP-2?
All checkpoints have been transferred: https://huggingface.co/models?other=instructblip. Feel free to merge the PR. The only thing left is uploading fast tokenizer files for the Vicuna-based checkpoints, but that can only be done once #23889 is fixed. Currently the fast tokenizer is created on-the-fly from the slow tokenizer files when loading from the hub. Update: that's now also done, so it's entirely ready.
@NielsRogge Please do not resolve comments without addressing them or explaining why you are rejecting them.
@amyeroberts Could you have a final look and merge if you are happy?
Thanks for adding this model!
There are still a few small things to address before it's ready to merge. Main comments:
- Docstrings for some of the main objects aren't correct, mostly missing inputs that need to be added
- Tolerance for the logits check between the converted and original models is very high. Have you dug into this at all? Do you know where these differences are coming from? (A typical closeness check is sketched after this list.)
- Possibly missing tests? e.g. there's an InstructBlipTextModelTester but no InstructBlipTextModelTest, and some tests for InstructBlipModel are skipped because they're run in individual model tests. It would be good to get @ydshieh's insight on this as he's both the composite model and testing king 👑
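For reference, a minimal sketch of the kind of closeness check conversion scripts usually run; the tensor shapes and tolerance here are made up for illustration:

```python
import torch

# Hypothetical logits from the original checkpoint and the converted one.
original_logits = torch.randn(1, 32, 32000)
converted_logits = original_logits + 1e-6 * torch.randn_like(original_logits)

# A tight absolute tolerance should pass if the conversion is faithful; needing a
# very large atol is what the review comment above is flagging.
assert torch.allclose(original_logits, converted_logits, atol=1e-4)
```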
In general, I would say yes to having a 1-1 correspondence, but I don't want to make it strict if it doesn't really bring anything valuable. The pipeline testing script would be easier if we had such a correspondence, but since I was able to manage BLIP2 already, and this test file is similar to BLIP2's, I think it's fine.
It's the same as the CLIP test file, so it's OK :-)
@ydshieh Thanks for reviewing & info about the tests!
Ah, sorry, I wasn't clear. What I meant was: if tests are skipped with the reason that they're already covered in the individual model tests, don't we need the individual model test classes implemented, i.e. InstructBlipTextModelTest?
I agree (I was thinking the same, but it got lost in my reply). @NielsRogge I will let you explain why there is no text model test class :-), which is the same as in BLIP2. Well, after looking a bit, the text part is not a fixed model class;
I think that's the main reason why we don't have a test for that part.
Hi, will this land soon? I would love to try out this model. Thanks!
Thanks @amyeroberts for your review, there was a bug with I've removed
Thanks again for adding this model!
Just a few small nit comments
```python
def get_input_embeddings(self):
    return self.language_model.get_input_embeddings()

def set_input_embeddings(self, value):
    self.language_model.set_input_embeddings(value)

def set_output_embeddings(self, new_embeddings):
    self.language_model.set_output_embeddings(new_embeddings)

def get_output_embeddings(self) -> nn.Module:
    return self.language_model.get_output_embeddings()
```
I'd argue this is a bit confusing: if I call model.get_input_embeddings, it's not obvious which module the returned embeddings relate to.
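To make the concern concrete, a small sketch (checkpoint id assumed) of what the delegation above means for a caller:

```python
from transformers import InstructBlipForConditionalGeneration

model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-flan-t5-xl")

# The returned module is the language model's token embedding, not the vision
# encoder's or the Q-Former's, which the method name alone does not make obvious.
embeddings = model.get_input_embeddings()
print(embeddings is model.language_model.get_input_embeddings())  # True
```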
Does the prompt need further formatting at inference time? For example, BLIP-2 uses "Question: {prompt}? Answer:" as the prompt. Which type of prompt is used in InstructBLIP, or do we just ask the model a question directly?
@NielsRogge It appears in the current diff that there are some changes unrelated to this PR? Could you rebase to sync up with main?
Well 💚
Merging it now as it's 🟢 and approved.
Hi @zdxff, there's no specific prompt being used for InstructBLIP. You can just ask it questions like "What is unusual about this image?"
Will work on the 8-bit / 4-bit integration ASAP! EDIT: here you go #24488
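For what that integration looks like from the user side, a minimal sketch assuming a Vicuna checkpoint id and that bitsandbytes is installed with a CUDA device available:

```python
import torch
from transformers import InstructBlipForConditionalGeneration

# 8-bit loading is opt-in and requires the bitsandbytes package.
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b",  # assumed checkpoint id
    load_in_8bit=True,
    device_map="auto",
    torch_dtype=torch.float16,
)
```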
What does this PR do?
This PR adds InstructBLIP, a visual instruction tuned version of BLIP-2.
It's a bit like an open-source multimodal GPT-4, leveraging Flan-T5 and Vicuna pre-trained checkpoints.
Basic usage is as follows:
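A minimal sketch of typical usage, assuming the Flan-T5-XL checkpoint from the hub collection linked earlier; the checkpoint id and image URL are placeholders:

```python
import requests
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

checkpoint = "Salesforce/instructblip-flan-t5-xl"  # assumed checkpoint id
processor = InstructBlipProcessor.from_pretrained(checkpoint)
model = InstructBlipForConditionalGeneration.from_pretrained(checkpoint)

# Placeholder image URL; any RGB image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Instructions are passed as free-form text; no prompt template is required.
inputs = processor(images=image, text="What is unusual about this image?", return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```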
To do:
- Incorporate the QFormerTokenizer into the processor

Nice to haves: