Add image text to text pipeline #34170

Merged

Conversation

yonigozlan
Member

@yonigozlan yonigozlan commented Oct 15, 2024

What does this PR do?

Add image-text-to-text pipeline!

A split of this PR with only the model-specific pre- and post-processing is available here, in order to reduce the LOC count and the number of files changed before merging this PR.

Note: A "legacy" kwarg that modifies the preprocessing of some image-text-to-text models is needed here if we want to integrate those models into this pipeline. However, the way it is handled might not be ideal, so I'm open to suggestions on how to improve this.

The pipeline supports the following inputs:

  • unbatched images and text - images=image, text=text
  • batched images and text - images = [image, image], text= [text, text]
  • several images per prompt (only for models supporting the use of an image token) - images = [[image, image], [image]] or images=[image, image, image], text = ["... <image>...<image>...", "...<image>..."]
  • Chat templates (for models supporting them).
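
The flat form of the "several images per prompt" input can be related to the nested form by counting image tokens in each prompt. A minimal sketch, assuming the image token is the literal string `<image>`; `group_images_by_prompt` is a hypothetical helper written for illustration and is not the pipeline's implementation:

```python
from typing import Any, List


def group_images_by_prompt(
    images: List[Any], prompts: List[str], image_token: str = "<image>"
) -> List[List[Any]]:
    """Split a flat list of images into one sub-list per prompt.

    Hypothetical helper: each prompt receives as many images as it contains
    occurrences of `image_token`, in order.
    """
    counts = [prompt.count(image_token) for prompt in prompts]
    if sum(counts) != len(images):
        raise ValueError(
            f"Found {sum(counts)} image tokens across the prompts, "
            f"but {len(images)} images were provided."
        )
    grouped, start = [], 0
    for count in counts:
        grouped.append(images[start : start + count])
        start += count
    return grouped


# Placeholder strings stand in for real image objects.
flat_images = ["img_a", "img_b", "img_c"]
prompts = ["... <image>...<image>...", "...<image>..."]
nested = group_images_by_prompt(flat_images, prompts)
print(nested)  # [['img_a', 'img_b'], ['img_c']]
```

Raising on a count mismatch is a design choice of this sketch; it gives a readable error where the limitations below mention uncaught ones.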

TODOs:

  • Add pipeline tests in model-specific test files
  • Update tasks documentation?

Known current limitations/bugs:

  • Using prompts without image tokens with models that expect them will throw an error. Should we automatically add image tokens to prompts and display a warning? For now, only a warning is displayed if the model's processor has an image token.
  • Using several images per prompt with models that do not support the use of an image token will raise an uncaught error.
  • Donut doesn't work, as there is a problem identifying the correct model type for it.
  • Idefics3 will raise an uncaught error if correct image tokens are not provided; this is fixed in "Use non nested images and batched text Idefics2/3" (#34222).
  • Pixtral with batched input raises "Pipeline with tokenizer without pad_token cannot do batching. You can try to set it with pipe.tokenizer.pad_token_id = model.config.eos_token_id."

Examples of usage:

>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
>>> text = "<image> What this is? Assistant: This is"
>>> pipe(image, text=text, max_new_tokens=20)
[
    [
        {
            "input_text": "<image> What this is? Assistant: This is",
            "generated_text": "<image> What this is? Assistant: This is a photo of two cats lying on a pink blanket. The cats are sleeping and appear to be comfortable",
        }
    ]
]

>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     }
... ]
>>> outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
>>> print(outputs[0]["generated_text"])
"In the image, a woman is sitting on the sandy beach, her legs crossed in a relaxed manner"
>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     },
...     {
...         "role": "assistant",
...         "content": [
...             {"type": "text", "text": "There is a dog and"},
...         ],
...     },
... ]
>>> outputs = pipe(text=messages, max_new_tokens=20)
>>> print(outputs[0]["generated_text"])
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "There is a dog and a person in the image. The dog is sitting on the sand, and the person is sitting on",
            }
        ],
    },
]
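
When the input is a chat and the full text is returned, the generated text comes back as the whole message list, as in the last example. A small sketch for pulling out only the assistant's final text, assuming the message layout matches the output shown above; `last_assistant_text` is a hypothetical helper, not part of the pipeline:

```python
def last_assistant_text(messages):
    """Return the concatenated text parts of the last assistant message."""
    for message in reversed(messages):
        if message["role"] == "assistant":
            return "".join(
                part["text"] for part in message["content"] if part["type"] == "text"
            )
    raise ValueError("No assistant message found.")


# Same structure as the printed output above.
generated_chat = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "There is a dog and a person in the image. The dog is sitting on the sand, and the person is sitting on",
            }
        ],
    },
]

answer = last_assistant_text(generated_chat)
print(answer)
```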

Who can review?

@Rocketknight1 @molbap @qubvel @NielsRogge

@yonigozlan yonigozlan marked this pull request as ready for review October 15, 2024 09:12
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yonigozlan yonigozlan force-pushed the add-image-text-to-text-pipeline branch from 90f00d4 to 4ac2d1f Compare October 15, 2024 14:11
@knkski

knkski commented Oct 15, 2024

Will it be possible to use this PR for just text generation with an image-capable model? I'm trying to use this PR (at commit 4ac2d1fce81a00d251ae9af75f32b1f821d56296) with meta-llama/Llama-3.2-90B-Vision-Instruct so that I can compare its language capabilities with those of Llama 3.1 70B, and I don't need to use the image support.

I tried calling it like this:

pipe = pipeline(
    "image-text-to-text",
    model="meta-llama/Llama-3.2-90B-Vision-Instruct", 
    device_map="auto",
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is 1+1?"},
        ],
    }
]
outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
print(outputs[0]["generated_text"])

That resulted in this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:393, in ImageTextToTextPipeline.preprocess(self, inputs, truncation, padding, max_length, timeout, continue_final_message)
    392 try:
--> 393     model_inputs = self.processor(images=images, text=text, return_tensors=self.framework, **kwargs).to(
    394         dtype=self.torch_dtype
    395     )
    396 except TypeError:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/processing_mllama.py:285, in MllamaProcessor.__call__(self, images, text, audio, videos, **kwargs)
    284 _ = text_kwargs.pop("padding_side", None)  # hack until padding-side is an accepted kwarg by tokenizers
--> 285 encoding = self.tokenizer(text, **text_kwargs)
    286 data.update(encoding)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3020, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   3019         self._switch_to_input_mode()
-> 3020     encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
   3021 if text_target is not None:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3108, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, split_special_tokens, **kwargs)
   3107     batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
-> 3108     return self.batch_encode_plus(
   3109         batch_text_or_text_pairs=batch_text_or_text_pairs,
   3110         add_special_tokens=add_special_tokens,
   3111         padding=padding,
   3112         truncation=truncation,
   3113         max_length=max_length,
   3114         stride=stride,
   3115         is_split_into_words=is_split_into_words,
   3116         pad_to_multiple_of=pad_to_multiple_of,
   3117         padding_side=padding_side,
   3118         return_tensors=return_tensors,
   3119         return_token_type_ids=return_token_type_ids,
   3120         return_attention_mask=return_attention_mask,
   3121         return_overflowing_tokens=return_overflowing_tokens,
   3122         return_special_tokens_mask=return_special_tokens_mask,
   3123         return_offsets_mapping=return_offsets_mapping,
   3124         return_length=return_length,
   3125         verbose=verbose,
   3126         split_special_tokens=split_special_tokens,
   3127         **kwargs,
   3128     )
   3129 else:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3310, in PreTrainedTokenizerBase.batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, split_special_tokens, **kwargs)
   3301 padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
   3302     padding=padding,
   3303     truncation=truncation,
   (...)
   3307     **kwargs,
   3308 )
-> 3310 return self._batch_encode_plus(
   3311     batch_text_or_text_pairs=batch_text_or_text_pairs,
   3312     add_special_tokens=add_special_tokens,
   3313     padding_strategy=padding_strategy,
   3314     truncation_strategy=truncation_strategy,
   3315     max_length=max_length,
   3316     stride=stride,
   3317     is_split_into_words=is_split_into_words,
   3318     pad_to_multiple_of=pad_to_multiple_of,
   3319     padding_side=padding_side,
   3320     return_tensors=return_tensors,
   3321     return_token_type_ids=return_token_type_ids,
   3322     return_attention_mask=return_attention_mask,
   3323     return_overflowing_tokens=return_overflowing_tokens,
   3324     return_special_tokens_mask=return_special_tokens_mask,
   3325     return_offsets_mapping=return_offsets_mapping,
   3326     return_length=return_length,
   3327     verbose=verbose,
   3328     split_special_tokens=split_special_tokens,
   3329     **kwargs,
   3330 )

TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'legacy'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[5], line 9
      1 messages = [
      2     {
      3         "role": "user",
   (...)
      7     }
      8 ]
----> 9 outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
     10 print(outputs[0]["generated_text"])

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:291, in ImageTextToTextPipeline.__call__(self, images, text, **kwargs)
    285 if isinstance(text, (list, tuple, KeyDataset) if is_torch_available() else (list, tuple)) and isinstance(
    286     text[0], (list, tuple, dict)
    287 ):
    288     # We have one or more prompts in list-of-dicts format, so this is chat mode
    290     if isinstance(text[0], dict):
--> 291         return super().__call__(Chat(text, images), **kwargs)
    292     else:
    293         if images is None:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1302, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1294     return next(
   1295         iter(
   1296             self.get_iterator(
   (...)
   1299         )
   1300     )
   1301 else:
-> 1302     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1308, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
   1307 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
-> 1308     model_inputs = self.preprocess(inputs, **preprocess_params)
   1309     model_outputs = self.forward(model_inputs, **forward_params)
   1310     outputs = self.postprocess(model_outputs, **postprocess_params)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:398, in ImageTextToTextPipeline.preprocess(self, inputs, truncation, padding, max_length, timeout, continue_final_message)
    396 except TypeError:
    397     kwargs.pop("legacy", None)
--> 398     model_inputs = self.processor(images=images, text=text, return_tensors=self.framework, **kwargs).to(
    399         dtype=self.torch_dtype
    400     )
    402 model_inputs["text"] = inputs_text
    404 return model_inputs

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/processing_mllama.py:290, in MllamaProcessor.__call__(self, images, text, audio, videos, **kwargs)
    288 n_images_in_images = [0]
    289 if images is not None:
--> 290     images = make_list_of_images(images)
    291     n_images_in_images = [len(sample) for sample in images]
    293 if text is not None:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/image_processing_mllama.py:543, in make_list_of_images(images)
    541     output_images = images
    542 else:
--> 543     raise ValueError(
    544         "Invalid input type. Must be a single image, a list of images, or a list of batches of images."
    545     )
    546 return output_images

ValueError: Invalid input type. Must be a single image, a list of images, or a list of batches of images.
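
The two chained exceptions follow from a retry pattern visible in the traceback: `preprocess` first calls the processor with a `legacy` kwarg and, on `TypeError`, pops it and calls the processor again. A minimal sketch of that pattern with a stand-in processor (`fake_processor` and `call_with_legacy_fallback` are hypothetical names, not library code), showing that an error raised during the retry is reported as occurring while the first one was being handled:

```python
def fake_processor(images=None, text=None, **kwargs):
    """Stand-in for a processor that rejects `legacy` and requires images."""
    if "legacy" in kwargs:
        raise TypeError("got an unexpected keyword argument 'legacy'")
    if not images:
        raise ValueError("Invalid input type. Must be a single image or a list of images.")
    return {"text": text, "images": images}


def call_with_legacy_fallback(processor, images, text, **kwargs):
    """Call with all kwargs first; on TypeError, retry without `legacy`."""
    try:
        return processor(images=images, text=text, **kwargs)
    except TypeError:
        kwargs.pop("legacy", None)
        # An exception raised here is chained to the TypeError above.
        return processor(images=images, text=text, **kwargs)


# With an image, the retry succeeds and the TypeError is never seen.
ok = call_with_legacy_fallback(fake_processor, images=["img"], text="hi", legacy=False)

# Without images, the retry raises ValueError, and Python records the
# earlier TypeError as its context.
try:
    call_with_legacy_fallback(fake_processor, images=None, text="What is 1+1?", legacy=False)
except ValueError as err:
    chained = err

print(type(chained.__context__).__name__)  # TypeError
```

In this sketch the `TypeError` about `legacy` is incidental; the error that actually stops execution is the `ValueError` from the retry.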

I tried running it just as above as well, with an image input, and that resulted in an OutOfMemoryError, which is confusing because the model size is only 166G on disk, and I'm running this in a 4x80G (i.e. 320G) H100 Lambda Labs environment.

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[6], line 23
      1 # messages = [
      2 #     {
      3 #         "role": "user",
   (...)
      9 # outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
     10 # print(outputs[0]["generated_text"])
     11 messages = [
     12     {
     13         "role": "user",
   (...)
     21     }
     22 ]
---> 23 outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
     24 print(outputs[0]["generated_text"])

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:291, in ImageTextToTextPipeline.__call__(self, images, text, **kwargs)
    285 if isinstance(text, (list, tuple, KeyDataset) if is_torch_available() else (list, tuple)) and isinstance(
    286     text[0], (list, tuple, dict)
    287 ):
    288     # We have one or more prompts in list-of-dicts format, so this is chat mode
    290     if isinstance(text[0], dict):
--> 291         return super().__call__(Chat(text, images), **kwargs)
    292     else:
    293         if images is None:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1302, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1294     return next(
   1295         iter(
   1296             self.get_iterator(
   (...)
   1299         )
   1300     )
   1301 else:
-> 1302     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1309, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
   1307 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
   1308     model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1309     model_outputs = self.forward(model_inputs, **forward_params)
   1310     outputs = self.postprocess(model_outputs, **postprocess_params)
   1311     return outputs

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1209, in Pipeline.forward(self, model_inputs, **forward_params)
   1207     with inference_context():
   1208         model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1209         model_outputs = self._forward(model_inputs, **forward_params)
   1210         model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
   1211 else:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:412, in ImageTextToTextPipeline._forward(self, model_inputs, generate_kwargs)
    408 prompt_text = model_inputs.pop("text")
    409 input_ids = (
    410     model_inputs["input_ids"] if "input_ids" in model_inputs else model_inputs["decoder_input_ids"]
    411 )  # for decoder-only models
--> 412 generated_sequence = self.model.generate(**model_inputs, **generate_kwargs)
    414 return {"generated_sequence": generated_sequence, "prompt_text": prompt_text, "input_ids": input_ids}

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/generation/utils.py:2208, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   2200     input_ids, model_kwargs = self._expand_inputs_for_generation(
   2201         input_ids=input_ids,
   2202         expand_size=generation_config.num_return_sequences,
   2203         is_encoder_decoder=self.config.is_encoder_decoder,
   2204         **model_kwargs,
   2205     )
   2207     # 12. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 2208     result = self._sample(
   2209         input_ids,
   2210         logits_processor=prepared_logits_processor,
   2211         stopping_criteria=prepared_stopping_criteria,
   2212         generation_config=generation_config,
   2213         synced_gpus=synced_gpus,
   2214         streamer=streamer,
   2215         **model_kwargs,
   2216     )
   2218 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
   2219     # 11. prepare beam search scorer
   2220     beam_scorer = BeamSearchScorer(
   2221         batch_size=batch_size,
   2222         num_beams=generation_config.num_beams,
   (...)
   2227         max_length=generation_config.max_length,
   2228     )

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/generation/utils.py:3176, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)
   3173 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
   3175 # forward pass to get next token
-> 3176 outputs = self(**model_inputs, return_dict=True)
   3178 # synced_gpus: don't waste resources running the code we don't need; kwargs must be updated before skipping
   3179 model_kwargs = self._update_model_kwargs_for_generation(
   3180     outputs,
   3181     model_kwargs,
   3182     is_encoder_decoder=self.config.is_encoder_decoder,
   3183 )

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    168         output = module._old_forward(*args, **kwargs)
    169 else:
--> 170     output = module._old_forward(*args, **kwargs)
    171 return module._hf_hook.post_forward(module, output)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/modeling_mllama.py:2138, in MllamaForConditionalGeneration.forward(self, input_ids, pixel_values, aspect_ratio_mask, aspect_ratio_ids, attention_mask, cross_attention_mask, cross_attention_states, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
   2135     cross_attention_mask = cross_attention_mask[:, :, cache_position]
   2136     full_text_row_masked_out_mask = full_text_row_masked_out_mask[:, :, cache_position]
-> 2138 outputs = self.language_model(
   2139     input_ids=input_ids,
   2140     attention_mask=attention_mask,
   2141     position_ids=position_ids,
   2142     cross_attention_states=cross_attention_states,
   2143     cross_attention_mask=cross_attention_mask,
   2144     full_text_row_masked_out_mask=full_text_row_masked_out_mask,
   2145     past_key_values=past_key_values,
   2146     use_cache=use_cache,
   2147     inputs_embeds=inputs_embeds,
   2148     labels=labels,
   2149     output_hidden_states=output_hidden_states,
   2150     output_attentions=output_attentions,
   2151     return_dict=return_dict,
   2152     cache_position=cache_position,
   2153     num_logits_to_keep=num_logits_to_keep,
   2154 )
   2156 return outputs

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/modeling_mllama.py:1948, in MllamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, cross_attention_states, cross_attention_mask, full_text_row_masked_out_mask, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
   1931 outputs = self.model(
   1932     input_ids=input_ids,
   1933     cross_attention_states=cross_attention_states,
   (...)
   1944     cache_position=cache_position,
   1945 )
   1947 hidden_states = outputs[0]
-> 1948 logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()
   1950 loss = None
   1951 if labels is not None:
   1952     # Upcast to float if we need to compute the loss to avoid potential precision issues

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    164 def new_forward(module, *args, **kwargs):
--> 165     args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
    166     if module._hf_hook.no_grad:
    167         with torch.no_grad():

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:355, in AlignDevicesHook.pre_forward(self, module, *args, **kwargs)
    347         if (
    348             value is not None
    349             and self.tied_params_map is not None
    350             and value.data_ptr() in self.tied_params_map
    351             and self.execution_device not in self.tied_params_map[value.data_ptr()]
    352         ):
    353             self.tied_pointers_to_remove.add((value.data_ptr(), self.execution_device))
--> 355         set_module_tensor_to_device(
    356             module,
    357             name,
    358             self.execution_device,
    359             value=value,
    360             fp16_statistics=fp16_statistics,
    361             tied_params_map=self.tied_params_map,
    362         )
    364 return send_to_device(args, self.execution_device), send_to_device(
    365     kwargs, self.execution_device, skip_keys=self.skip_keys
    366 )

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py:329, in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
    327             module._parameters[tensor_name] = param_cls(new_value, requires_grad=old_value.requires_grad)
    328 elif isinstance(value, torch.Tensor):
--> 329     new_value = value.to(device)
    330 else:
    331     new_value = torch.tensor(value, device=device)

OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 0 has a total capacity of 79.10 GiB of which 2.12 GiB is free. Including non-PyTorch memory, this process has 76.97 GiB memory in use. Of the allocated memory 75.56 GiB is allocated by PyTorch, and 761.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
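
One possible explanation for the out-of-memory error, offered as an assumption and not confirmed anywhere in this thread: if the 166G checkpoint stores 2-byte weights and the model is loaded as 4-byte float32 (no `torch_dtype` is passed in the snippet above), the weights alone would need roughly twice the on-disk size. A back-of-the-envelope sketch of that arithmetic:

```python
def estimated_weight_memory_gb(size_on_disk_gb, bytes_per_param_on_disk, bytes_per_param_loaded):
    """Scale the on-disk size by the ratio of loaded to stored precision.

    Covers weights only; activations and the KV cache are ignored.
    """
    return size_on_disk_gb * bytes_per_param_loaded / bytes_per_param_on_disk


total_gpu_gb = 4 * 80  # four 80G GPUs, as reported above

# Assumption: 2 bytes per parameter on disk (e.g. bfloat16).
as_fp32 = estimated_weight_memory_gb(166, 2, 4)
as_bf16 = estimated_weight_memory_gb(166, 2, 2)

print(as_fp32, as_fp32 > total_gpu_gb)  # 332.0 True
print(as_bf16, as_bf16 > total_gpu_gb)  # 166.0 False
```

If that assumption holds, passing a half-precision `torch_dtype` such as `torch.bfloat16` to `pipeline(...)` would be worth trying.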

@yonigozlan
Member Author

Thanks for the feedback @knkski! Although it's not really an objective of this pipeline, I think we can try to add support and at least raise a warning, wdyt @Rocketknight1?
As for the memory problem, that is strange indeed. I will look into it, and if others have an idea of why this is happening, feel free to chime in. Do you manage to use this model on your setup without using the pipeline?

@Rocketknight1
Member

@yonigozlan I think that's okay! It might result in a bit of crossover with text-generation pipelines, but I think it's fine, and we can deprecate it later and officially move that functionality to text-generation if it's a problem.

@yonigozlan
Member Author

@Rocketknight1 @knkski, text-only inference should be supported now :)
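
For illustration, text-only chat input here means a message list whose content entries contain no `image` items, as in the "What is 1+1?" example earlier in the thread. A minimal sketch of how such input could be detected; `collect_images` is a hypothetical helper and not the pipeline's actual code:

```python
def collect_images(messages):
    """Gather the `image` values found in a list of chat messages."""
    images = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            # Plain-string content carries no image entries.
            continue
        for part in content:
            if part.get("type") == "image":
                images.append(part["image"])
    return images


text_only = [
    {"role": "user", "content": [{"type": "text", "text": "What is 1+1?"}]},
]
with_image = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

print(collect_images(text_only))        # []
print(len(collect_images(with_image)))  # 1
```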

@knkski

knkski commented Oct 18, 2024

@yonigozlan Thanks! Works great for me 🚀

I think the extra memory usage is unrelated to this PR, so ignore that 👍

@yonigozlan yonigozlan force-pushed the add-image-text-to-text-pipeline branch 2 times, most recently from 7038c52 to 46d6891 Compare October 22, 2024 13:51
Member

@Rocketknight1 Rocketknight1 left a comment

Overall, this looks good! The tests seem good and the pipeline code looks clean! A lot of the code is familiar from the text-generation pipeline, with modifications for images.

The only question I have is whether it'll be confusing to have e.g. image-text-to-text as well as image-to-text and text-generation pipelines. In particular, it feels like this pipeline is almost a "superset" of text-generation, since it can handle both text completions and chat completions with templates, which means it's basically just text-generation plus image support.

That might mean we should take these changes and fold them into text-generation instead. However, that might add additional inputs that would make it harder to synchronize the pipeline with the inference spec - cc @Wauplin / @LysandreJik, how annoying do you think that would be?

@Wauplin
Contributor

Wauplin commented Oct 23, 2024

That might mean we should take these changes and fold them into text-generation instead. However, that might add additional inputs that would make it harder to synchronize the pipeline with the inference spec - cc @Wauplin / @LysandreJik, how annoying do you think that would be?

X-posting the slack thread (private) about that convo.
IMO better to have both text-generation and image-text-to-text to be consistent with https://huggingface.co/tasks.

@yonigozlan yonigozlan force-pushed the add-image-text-to-text-pipeline branch from 31432b4 to d739c0a Compare October 24, 2024 20:56
@yonigozlan
Member Author

There are still some issues with the pipeline tests:

  • It seems that pipeline model tests are based on "tiny models" available on hf-internal-testing, but those tiny models don't seem to be added anymore for recent VLMs, so those models are not being tested. I'm not sure whether this is (or used to be) an automatic or a manual process, or whether we should start adding those tiny models again.
  • The Kosmos2 tiny model causes some problems: its configuration has hyperparameters that are incompatible with each other. Namely, latent_query_num=3, a model parameter, should equal num_image_tokens=64, but the latter is a processor call argument, so it can't be set via a JSON config file (I think?). An easy fix would be to manually change latent_query_num to 64 in the tiny model's config in hf-internal-testing, but that could make the model not so tiny anymore. Alternatively, we could skip the test altogether.
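
The Kosmos2 mismatch described above can be illustrated with a small consistency check. This is only a minimal sketch, not code from this PR: `check_image_token_consistency` is a hypothetical helper, and the two dicts merely mimic the values quoted in the comment (latent_query_num=3 in the model config, num_image_tokens=64 as the processor call argument).

```python
def check_image_token_consistency(model_config: dict, processor_call_kwargs: dict) -> None:
    """Raise if the model and the processor disagree on the number of image tokens.

    Hypothetical helper for illustration only; it is not part of transformers.
    """
    expected = model_config["latent_query_num"]
    actual = processor_call_kwargs["num_image_tokens"]
    if expected != actual:
        raise ValueError(
            f"latent_query_num={expected} (model config) does not match "
            f"num_image_tokens={actual} (processor call argument)"
        )


# Values quoted in the comment above for the Kosmos2 tiny model.
tiny_model_config = {"latent_query_num": 3}
processor_call_kwargs = {"num_image_tokens": 64}

try:
    check_image_token_consistency(tiny_model_config, processor_call_kwargs)
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)

print(mismatch_message)
```

A check along these lines would surface the incompatibility as an explicit error instead of a failure deeper in the forward pass; where such a check should live (processor, model, or test) is left open here.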

@Rocketknight1
Member

@yonigozlan tiny models aren't automatically generated, those are all manually created. Rather than modifying an existing one (which might break existing tests), I'd suggest just making a new tiny model that fits what you want to test and uploading that to hf-internal-testing. You can ask to be added to the organization if you don't have permissions!

@yonigozlan
Member Author

> @yonigozlan tiny models aren't automatically generated, those are all manually created. Rather than modifying an existing one (which might break existing tests), I'd suggest just making a new tiny model that fits what you want to test and uploading that to hf-internal-testing. You can ask to be added to the organization if you don't have permissions!

I see, thanks for the explanation! As for adding a new tiny model: pipelines use the tiny_model_summary.json file to identify tiny models, but it looks like only one tiny model per model architecture can be present in that file, so I'm not sure how to solve the issue with the Kosmos2 tiny model without modifying the current one.

@Rocketknight1
Member

@yonigozlan probably the easiest thing to do, in that case, is just to manually upload a new model, don't add it to tiny_model_summary, and manually set that model in the image-text-to-text tests. You shouldn't need to worry about whatever's in tiny_model_summary.json either way!

Also, I was wrong - some of the tiny models are automatically created, but in this case I think a manual one just for your pipeline will work a lot better.
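
The suggestion above (upload a separate tiny model and set it manually in the image-text-to-text tests, leaving tiny_model_summary.json alone) amounts to a lookup where a manual override wins over the summary entry. The sketch below only illustrates that idea: `resolve_test_checkpoint` is a hypothetical helper, the dicts are stand-ins, and the repository names are placeholders rather than real checkpoints.

```python
from typing import Dict, Optional, Tuple

# Stand-in for tiny_model_summary.json: at most one tiny checkpoint per architecture.
# The repository name is a placeholder, not a real checkpoint.
summary: Dict[str, str] = {
    "Kosmos2ForConditionalGeneration": "placeholder-org/existing-tiny-kosmos2",
}

# Manually chosen checkpoints, keyed by (task, architecture). Also a placeholder.
manual_overrides: Dict[Tuple[str, str], str] = {
    ("image-text-to-text", "Kosmos2ForConditionalGeneration"): "placeholder-org/new-tiny-kosmos2",
}


def resolve_test_checkpoint(
    task: str,
    architecture: str,
    summary: Dict[str, str],
    manual_overrides: Dict[Tuple[str, str], str],
) -> Optional[str]:
    """Return the manually set checkpoint for this task if there is one,
    otherwise fall back to the per-architecture summary entry."""
    override = manual_overrides.get((task, architecture))
    if override is not None:
        return override
    return summary.get(architecture)


new_pipeline_checkpoint = resolve_test_checkpoint(
    "image-text-to-text", "Kosmos2ForConditionalGeneration", summary, manual_overrides
)
other_pipeline_checkpoint = resolve_test_checkpoint(
    "image-to-text", "Kosmos2ForConditionalGeneration", summary, manual_overrides
)
print(new_pipeline_checkpoint, other_pipeline_checkpoint)
```

With this shape, existing tests for other tasks keep resolving to the current tiny model, so they are not affected by the new one.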

Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks for working on this! I think it's very important, thus we should try to make it a bit more simple. 🤗

@yonigozlan yonigozlan force-pushed the add-image-text-to-text-pipeline branch from c05ceb2 to 61cc576 Compare October 31, 2024 19:25
@yonigozlan
Member Author

yonigozlan commented Oct 31, 2024

Thanks for all of your input! I'll merge this now, as the remaining issues/improvements raised seem a bit out of scope for this PR.
Just to recap some of the points that were raised:

  • VLM processors are not fully consistent in terms of what inputs they accept, and some of them don't catch errors that should be caught. Improvements can be made there that would benefit this pipeline as well. I'll open an issue for this to share it as a known limitation, and I'll start working on it asap :).
  • Donut doesn't work in this pipeline, as processors are not inferred in pipelines if they are not in auto.
  • Could chat templates be applied directly in conversational models' processors, instead of users having to do so manually before making a processor call? Chat inputs could be detected, as they are lists of dicts.
  • Several pipelines have their own way of detecting the input in the generated text and removing or adding it. This could be unified in a util, or in generate with an added "return_input" flag.
  • Most recent models (and VLMs in particular) don't have a "tiny" version uploaded on hf-internal-testing, which means they are not tested by the CI in the different pipelines that support them.
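
Two of the points recapped above, detecting chat inputs and handling the input text inside generated text, can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's implementation: `is_chat_input` and `strip_input_text` are hypothetical names, the check for "role" and "content" keys is an assumption that goes beyond "list of dicts", only a single conversation (not a batch) is handled, and `return_input` simply mirrors the flag floated in the recap.

```python
from typing import Any


def is_chat_input(inputs: Any) -> bool:
    """Return True if `inputs` looks like one chat conversation:
    a non-empty list of dicts that each have "role" and "content" keys.
    (The key check is an assumption made for this sketch.)"""
    return (
        isinstance(inputs, list)
        and len(inputs) > 0
        and all(
            isinstance(message, dict) and "role" in message and "content" in message
            for message in inputs
        )
    )


def strip_input_text(generated: str, prompt: str, return_input: bool = False) -> str:
    """Remove the prompt from the start of the generated text,
    unless `return_input` is True or the text does not start with the prompt."""
    if return_input or not generated.startswith(prompt):
        return generated
    return generated[len(prompt):]


chat = [{"role": "user", "content": "What is in this image?"}]
print(is_chat_input(chat))            # chat-style input
print(is_chat_input("a photo of"))    # plain prompt
print(repr(strip_input_text("a photo of a cat", "a photo of")))
```

Keeping both helpers free of any model-specific logic is what would let them be shared between pipelines, which is the unification the recap hints at.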

@yonigozlan yonigozlan merged commit 203e270 into huggingface:main Oct 31, 2024
26 checks passed
frances720 pushed a commit to Promptless/transformers-test that referenced this pull request Nov 6, 2024
* Standardize image-text-to-text-models-output

add post_process_image_text_to_text to chameleon and cleanup

Fix legacy kwarg behavior and deprecation warning

add post_process_image_text_to_text to qwen2_vl and llava_onevision

Add post_process_image_text_to_text to idefics3, mllama, pixtral processor

* nit var name post_process_image_text_to_text udop

* nit fix deprecation warnings

* Add image-text-to-text pipeline

* add support for image url in chat template for pipeline

* Reformat to be fully compatible with chat templates

* Add tests chat template

* Fix imports and tests

* Add pipeline tag

* change logic handling of single prompt and multiple images

* add pipeline mapping to models

* fix batched inference

* fix tests

* Add manual batching for preprocessing

* Fix outputs with nested images

* Add support for all common processing kwargs

* Add default padding when multiple text inputs (batch size>1)

* nit change version deprecation warning

* Add support for text only inference

* add chat_template warnings

* Add pipeline tests and add copied from post process function

* Fix batched pipeline tests

* nit

* Fix pipeline tests blip2

* remove unnecessary max_new_tokens

* revert processing kosmos2 and remove unnecessary max_new_tokens

* fix pipeline tests idefics

* Force try loading processor if pipeline supports it

* revert load_processor change

* hardcode loading only processor

* remove unnecessary try except

* skip imagetexttotext tests for kosmos2 as tiny model causes problems

* Make code clearer

* Address review comments

* remove preprocessing logic from pipeline

* fix fuyu

* add BC resize fuyu

* Move post_process_image_text_to_text to ProcessorMixin

* add guard in post_process

* fix zero shot object detection pipeline

* add support for generator input in pipeline

* nit

* change default image-text-to-text model to llava onevision

* fix owlv2 size dict

* Change legacy deprecation warning to only show when True
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
10 participants