[CLIP] Captioning Pipeline #1145

Merged
dsikka merged 47 commits into main from captioning
Aug 7, 2023

Conversation

dsikka
Contributor

@dsikka dsikka commented Jul 25, 2023

Note: SparseZoo currently hosts no captioning models.
An open issue with open_clip tracks the problems that have been raised with CoCa models.

Summary

clip_caption

  • Implement the CLIPCaptioning and CLIPDecoder pipelines, which produce a caption for a given image. They build on the CLIPVisual and CLIPText pipelines previously implemented for the zero-shot pipeline, with some modifications to make those more generic
  • The captioning pipeline adds a _generate function, adapted from open_clip, that applies beam search to build the caption: https://github.com/neuralmagic/open_clip/blob/onnx-edit/src/open_clip/coca_model.py
  • One caveat: open_clip's implementation uses a dynamic input sequence length, whereas this pipeline uses padded sequences
  • The exported ONNX models all originate from open_clip
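
To illustrate the beam search and padded-sequence points above, here is a minimal, self-contained sketch. It is not the _generate function from this PR or from open_clip: the token ids, the pad helper, and the toy_model scorer are all hypothetical stand-ins, and the search omits refinements such as length normalization. It only shows how each candidate can be padded to a fixed length before scoring while the search keeps track of the real token count.

```python
import math
from typing import Callable, List, Sequence, Tuple

PAD_ID = 0  # hypothetical padding token id (assumption for this sketch)
EOS_ID = 1  # hypothetical end-of-sequence token id (assumption for this sketch)


def pad(tokens: Sequence[int], seq_len: int) -> List[int]:
    """Right-pad tokens to a fixed length so the model input shape is static."""
    return list(tokens) + [PAD_ID] * (seq_len - len(tokens))


def beam_search(
    next_log_probs: Callable[[List[int], int], List[float]],
    start: Sequence[int],
    seq_len: int,
    num_beams: int = 3,
) -> List[int]:
    """Return the highest-scoring unpadded token sequence found by beam search."""
    beams: List[Tuple[float, List[int]]] = [(0.0, list(start))]
    finished: List[Tuple[float, List[int]]] = []
    while beams:
        candidates: List[Tuple[float, List[int]]] = []
        for score, tokens in beams:
            # The scorer always receives a fixed-length input plus the real length.
            log_probs = next_log_probs(pad(tokens, seq_len), len(tokens))
            for token_id, log_prob in enumerate(log_probs):
                if token_id == PAD_ID:
                    continue  # never extend a hypothesis with padding
                candidates.append((score + log_prob, tokens + [token_id]))
        # Keep only the num_beams best extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:num_beams]:
            if tokens[-1] == EOS_ID or len(tokens) == seq_len:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
    return max(finished, key=lambda c: c[0])[1]


def toy_model(padded: List[int], length: int) -> List[float]:
    """Hypothetical scorer: looks only at the last real token, ignoring padding."""
    last = padded[length - 1]
    probs = {
        2: [0.01, 0.09, 0.1, 0.8],   # after token 2, token 3 is most likely
        3: [0.01, 0.9, 0.05, 0.04],  # after token 3, EOS is most likely
    }[last]
    return [math.log(p) for p in probs]


best = beam_search(toy_model, start=[2], seq_len=6, num_beams=2)
print(best)
```

Because the scorer is given the real length alongside the padded input, the padding positions never influence the next-token scores. A real decoder would need an equivalent mechanism (for example an attention mask) to get the same effect.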

Testing

  • Added tests to the original clip tests
  • Also ran the following script to generate captions for various images:
from deepsparse import BasePipeline
from deepsparse.clip import CLIPCaptionInput, CLIPVisualInput

# Directory holding the three ONNX models exported from open_clip
root = "caption_models"
model_path_visual = f"{root}/clip_visual.onnx"
model_path_text = f"{root}/clip_text.onnx"
model_path_decoder = f"{root}/clip_text_decoder.onnx"

kwargs = {
    "visual_model_path": model_path_visual,
    "text_model_path": model_path_text,
    "decoder_model_path": model_path_decoder,
}
# Build the captioning pipeline from the visual, text, and decoder models
pipeline = BasePipeline.create(task="clip_caption", **kwargs)

# Wrap the image path in the visual input schema and generate a caption
pipeline_input = CLIPCaptionInput(image=CLIPVisualInput(images="mountain.jpg"))
output = pipeline(pipeline_input)

Examples of images and the generated caption:

mountain
Caption: a view of mountains in the background .

thailand
Caption: an adult elephant and a baby elephant .

mug
Caption: a cup of coffee .

@dsikka dsikka changed the base branch from main to clip_zshot July 25, 2023 20:49
@dsikka dsikka marked this pull request as ready for review July 27, 2023 20:59
@dsikka dsikka force-pushed the captioning branch 4 times, most recently from 743bd95 to 402bfc6 Compare July 31, 2023 23:54
@dsikka dsikka requested review from bfineran and dbogunowicz August 1, 2023 00:53
@dsikka dsikka requested a review from Satrat August 1, 2023 00:53
@dsikka dsikka assigned dsikka and unassigned rahul-tuli Aug 1, 2023
@dsikka dsikka requested a review from rahul-tuli August 1, 2023 00:54
dbogunowicz previously approved these changes Aug 1, 2023
bfineran previously approved these changes Aug 1, 2023
Base automatically changed from clip_zshot to main August 2, 2023 18:03
@bfineran bfineran dismissed stale reviews from dbogunowicz and themself August 2, 2023 18:03

The base branch was changed.

@dsikka dsikka merged commit ffeb98f into main Aug 7, 2023
@dsikka dsikka deleted the captioning branch August 7, 2023 15:13