Add image text to text pipeline #34170
90f00d4 to 4ac2d1f
Will it be possible to use this PR for text-only generation with an image-capable model? I'm trying this PR (at commit 4ac2d1fce81a00d251ae9af75f32b1f821d56296) with the model loaded below. I tried calling it like this:
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    device_map="auto",
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is 1+1?"},
        ],
    }
]
outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
print(outputs[0]["generated_text"])
That resulted in this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:393, in ImageTextToTextPipeline.preprocess(self, inputs, truncation, padding, max_length, timeout, continue_final_message)
392 try:
--> 393 model_inputs = self.processor(images=images, text=text, return_tensors=self.framework, **kwargs).to(
394 dtype=self.torch_dtype
395 )
396 except TypeError:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/processing_mllama.py:285, in MllamaProcessor.__call__(self, images, text, audio, videos, **kwargs)
284 _ = text_kwargs.pop("padding_side", None) # hack until padding-side is an accepted kwarg by tokenizers
--> 285 encoding = self.tokenizer(text, **text_kwargs)
286 data.update(encoding)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3020, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
3019 self._switch_to_input_mode()
-> 3020 encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
3021 if text_target is not None:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3108, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, split_special_tokens, **kwargs)
3107 batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
-> 3108 return self.batch_encode_plus(
3109 batch_text_or_text_pairs=batch_text_or_text_pairs,
3110 add_special_tokens=add_special_tokens,
3111 padding=padding,
3112 truncation=truncation,
3113 max_length=max_length,
3114 stride=stride,
3115 is_split_into_words=is_split_into_words,
3116 pad_to_multiple_of=pad_to_multiple_of,
3117 padding_side=padding_side,
3118 return_tensors=return_tensors,
3119 return_token_type_ids=return_token_type_ids,
3120 return_attention_mask=return_attention_mask,
3121 return_overflowing_tokens=return_overflowing_tokens,
3122 return_special_tokens_mask=return_special_tokens_mask,
3123 return_offsets_mapping=return_offsets_mapping,
3124 return_length=return_length,
3125 verbose=verbose,
3126 split_special_tokens=split_special_tokens,
3127 **kwargs,
3128 )
3129 else:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3310, in PreTrainedTokenizerBase.batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, split_special_tokens, **kwargs)
3301 padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
3302 padding=padding,
3303 truncation=truncation,
(...)
3307 **kwargs,
3308 )
-> 3310 return self._batch_encode_plus(
3311 batch_text_or_text_pairs=batch_text_or_text_pairs,
3312 add_special_tokens=add_special_tokens,
3313 padding_strategy=padding_strategy,
3314 truncation_strategy=truncation_strategy,
3315 max_length=max_length,
3316 stride=stride,
3317 is_split_into_words=is_split_into_words,
3318 pad_to_multiple_of=pad_to_multiple_of,
3319 padding_side=padding_side,
3320 return_tensors=return_tensors,
3321 return_token_type_ids=return_token_type_ids,
3322 return_attention_mask=return_attention_mask,
3323 return_overflowing_tokens=return_overflowing_tokens,
3324 return_special_tokens_mask=return_special_tokens_mask,
3325 return_offsets_mapping=return_offsets_mapping,
3326 return_length=return_length,
3327 verbose=verbose,
3328 split_special_tokens=split_special_tokens,
3329 **kwargs,
3330 )
TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'legacy'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[5], line 9
1 messages = [
2 {
3 "role": "user",
(...)
7 }
8 ]
----> 9 outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
10 print(outputs[0]["generated_text"])
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:291, in ImageTextToTextPipeline.__call__(self, images, text, **kwargs)
285 if isinstance(text, (list, tuple, KeyDataset) if is_torch_available() else (list, tuple)) and isinstance(
286 text[0], (list, tuple, dict)
287 ):
288 # We have one or more prompts in list-of-dicts format, so this is chat mode
290 if isinstance(text[0], dict):
--> 291 return super().__call__(Chat(text, images), **kwargs)
292 else:
293 if images is None:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1302, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1294 return next(
1295 iter(
1296 self.get_iterator(
(...)
1299 )
1300 )
1301 else:
-> 1302 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1308, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
1307 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
-> 1308 model_inputs = self.preprocess(inputs, **preprocess_params)
1309 model_outputs = self.forward(model_inputs, **forward_params)
1310 outputs = self.postprocess(model_outputs, **postprocess_params)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:398, in ImageTextToTextPipeline.preprocess(self, inputs, truncation, padding, max_length, timeout, continue_final_message)
396 except TypeError:
397 kwargs.pop("legacy", None)
--> 398 model_inputs = self.processor(images=images, text=text, return_tensors=self.framework, **kwargs).to(
399 dtype=self.torch_dtype
400 )
402 model_inputs["text"] = inputs_text
404 return model_inputs
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/processing_mllama.py:290, in MllamaProcessor.__call__(self, images, text, audio, videos, **kwargs)
288 n_images_in_images = [0]
289 if images is not None:
--> 290 images = make_list_of_images(images)
291 n_images_in_images = [len(sample) for sample in images]
293 if text is not None:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/image_processing_mllama.py:543, in make_list_of_images(images)
541 output_images = images
542 else:
--> 543 raise ValueError(
544 "Invalid input type. Must be a single image, a list of images, or a list of batches of images."
545 )
546 return output_images
ValueError: Invalid input type. Must be a single image, a list of images, or a list of batches of images.
I also tried running it as above with an image input, and that resulted in an out-of-memory error:
---------------------------------------------------------------------------
OutOfMemoryError Traceback (most recent call last)
Cell In[6], line 23
1 # messages = [
2 # {
3 # "role": "user",
(...)
9 # outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
10 # print(outputs[0]["generated_text"])
11 messages = [
12 {
13 "role": "user",
(...)
21 }
22 ]
---> 23 outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
24 print(outputs[0]["generated_text"])
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:291, in ImageTextToTextPipeline.__call__(self, images, text, **kwargs)
285 if isinstance(text, (list, tuple, KeyDataset) if is_torch_available() else (list, tuple)) and isinstance(
286 text[0], (list, tuple, dict)
287 ):
288 # We have one or more prompts in list-of-dicts format, so this is chat mode
290 if isinstance(text[0], dict):
--> 291 return super().__call__(Chat(text, images), **kwargs)
292 else:
293 if images is None:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1302, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1294 return next(
1295 iter(
1296 self.get_iterator(
(...)
1299 )
1300 )
1301 else:
-> 1302 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1309, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
1307 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
1308 model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1309 model_outputs = self.forward(model_inputs, **forward_params)
1310 outputs = self.postprocess(model_outputs, **postprocess_params)
1311 return outputs
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1209, in Pipeline.forward(self, model_inputs, **forward_params)
1207 with inference_context():
1208 model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1209 model_outputs = self._forward(model_inputs, **forward_params)
1210 model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
1211 else:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:412, in ImageTextToTextPipeline._forward(self, model_inputs, generate_kwargs)
408 prompt_text = model_inputs.pop("text")
409 input_ids = (
410 model_inputs["input_ids"] if "input_ids" in model_inputs else model_inputs["decoder_input_ids"]
411 ) # for decoder-only models
--> 412 generated_sequence = self.model.generate(**model_inputs, **generate_kwargs)
414 return {"generated_sequence": generated_sequence, "prompt_text": prompt_text, "input_ids": input_ids}
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/generation/utils.py:2208, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
2200 input_ids, model_kwargs = self._expand_inputs_for_generation(
2201 input_ids=input_ids,
2202 expand_size=generation_config.num_return_sequences,
2203 is_encoder_decoder=self.config.is_encoder_decoder,
2204 **model_kwargs,
2205 )
2207 # 12. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 2208 result = self._sample(
2209 input_ids,
2210 logits_processor=prepared_logits_processor,
2211 stopping_criteria=prepared_stopping_criteria,
2212 generation_config=generation_config,
2213 synced_gpus=synced_gpus,
2214 streamer=streamer,
2215 **model_kwargs,
2216 )
2218 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
2219 # 11. prepare beam search scorer
2220 beam_scorer = BeamSearchScorer(
2221 batch_size=batch_size,
2222 num_beams=generation_config.num_beams,
(...)
2227 max_length=generation_config.max_length,
2228 )
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/generation/utils.py:3176, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)
3173 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
3175 # forward pass to get next token
-> 3176 outputs = self(**model_inputs, return_dict=True)
3178 # synced_gpus: don't waste resources running the code we don't need; kwargs must be updated before skipping
3179 model_kwargs = self._update_model_kwargs_for_generation(
3180 outputs,
3181 model_kwargs,
3182 is_encoder_decoder=self.config.is_encoder_decoder,
3183 )
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
168 output = module._old_forward(*args, **kwargs)
169 else:
--> 170 output = module._old_forward(*args, **kwargs)
171 return module._hf_hook.post_forward(module, output)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/modeling_mllama.py:2138, in MllamaForConditionalGeneration.forward(self, input_ids, pixel_values, aspect_ratio_mask, aspect_ratio_ids, attention_mask, cross_attention_mask, cross_attention_states, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
2135 cross_attention_mask = cross_attention_mask[:, :, cache_position]
2136 full_text_row_masked_out_mask = full_text_row_masked_out_mask[:, :, cache_position]
-> 2138 outputs = self.language_model(
2139 input_ids=input_ids,
2140 attention_mask=attention_mask,
2141 position_ids=position_ids,
2142 cross_attention_states=cross_attention_states,
2143 cross_attention_mask=cross_attention_mask,
2144 full_text_row_masked_out_mask=full_text_row_masked_out_mask,
2145 past_key_values=past_key_values,
2146 use_cache=use_cache,
2147 inputs_embeds=inputs_embeds,
2148 labels=labels,
2149 output_hidden_states=output_hidden_states,
2150 output_attentions=output_attentions,
2151 return_dict=return_dict,
2152 cache_position=cache_position,
2153 num_logits_to_keep=num_logits_to_keep,
2154 )
2156 return outputs
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/modeling_mllama.py:1948, in MllamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, cross_attention_states, cross_attention_mask, full_text_row_masked_out_mask, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
1931 outputs = self.model(
1932 input_ids=input_ids,
1933 cross_attention_states=cross_attention_states,
(...)
1944 cache_position=cache_position,
1945 )
1947 hidden_states = outputs[0]
-> 1948 logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()
1950 loss = None
1951 if labels is not None:
1952 # Upcast to float if we need to compute the loss to avoid potential precision issues
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
164 def new_forward(module, *args, **kwargs):
--> 165 args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
166 if module._hf_hook.no_grad:
167 with torch.no_grad():
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:355, in AlignDevicesHook.pre_forward(self, module, *args, **kwargs)
347 if (
348 value is not None
349 and self.tied_params_map is not None
350 and value.data_ptr() in self.tied_params_map
351 and self.execution_device not in self.tied_params_map[value.data_ptr()]
352 ):
353 self.tied_pointers_to_remove.add((value.data_ptr(), self.execution_device))
--> 355 set_module_tensor_to_device(
356 module,
357 name,
358 self.execution_device,
359 value=value,
360 fp16_statistics=fp16_statistics,
361 tied_params_map=self.tied_params_map,
362 )
364 return send_to_device(args, self.execution_device), send_to_device(
365 kwargs, self.execution_device, skip_keys=self.skip_keys
366 )
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py:329, in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
327 module._parameters[tensor_name] = param_cls(new_value, requires_grad=old_value.requires_grad)
328 elif isinstance(value, torch.Tensor):
--> 329 new_value = value.to(device)
330 else:
331 new_value = torch.tensor(value, device=device)
OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 0 has a total capacity of 79.10 GiB of which 2.12 GiB is free. Including non-PyTorch memory, this process has 76.97 GiB memory in use. Of the allocated memory 75.56 GiB is allocated by PyTorch, and 761.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Thanks for the feedback @knkski! Although it's not really an objective of this pipeline, I think we can try to add support and raise a warning at least, wdyt @Rocketknight1?
@yonigozlan I think that's okay! It might result in a bit of crossover with
@Rocketknight1 @knkski, text-only inference should be supported now :)
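As a rough illustration of the text-only case discussed above, the helper below builds a chat-format prompt with the same message structure as the snippet earlier in this thread. `build_chat_messages` is a hypothetical name, not part of the PR, and the bare `{"type": "image"}` placeholder entry is an assumption about how images would be referenced; only the `{"type": "text", ...}` entries appear in the thread.

```python
from typing import Any


def build_chat_messages(prompt: str, num_images: int = 0) -> list[dict[str, Any]]:
    """Build a single-turn chat prompt in the list-of-dicts format.

    Hypothetical helper for illustration only; it is not part of the PR.
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # Assumption: each image is referenced by a bare {"type": "image"} entry
    # placed before the text entry.
    content: list[dict[str, Any]] = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Text-only prompt: no image entries at all.
text_only = build_chat_messages("What is 1+1?")
# Prompt with one image placeholder ahead of the text.
with_image = build_chat_messages("What is shown here?", num_images=1)
```

With the pipeline object from the earlier snippet, `text_only` would be passed as `pipe(text=text_only, max_new_tokens=20, return_full_text=False)`.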
035d953 to 17903d1
@yonigozlan Thanks! Works great for me 🚀 I think the extra memory usage is unrelated to this PR, so ignore that 👍
7038c52 to 46d6891
Overall, this looks good! The tests seem good and the pipeline code looks clean! A lot of the code is familiar from the text-generation pipeline, with modifications for images.
The only question I have is whether it'll be confusing to have e.g. image-text-to-text as well as image-to-text and text-generation pipelines. In particular, it feels like this pipeline is almost a "superset" of text-generation, since it can handle both text completions and chat completions with templates, which means it's basically just text-generation plus image support.
That might mean we should take these changes and fold them into text-generation instead. However, that might add additional inputs that would make it harder to synchronize the pipeline with the inference spec - cc @Wauplin / @LysandreJik, how annoying do you think that would be?
X-posting the slack thread (private) about that convo.
31432b4 to d739c0a
There are still some issues with the pipeline tests:
@yonigozlan tiny models aren't automatically generated, those are all manually created. Rather than modifying an existing one (which might break existing tests), I'd suggest just making a new tiny model that fits what you want to test and uploading that to
I see, thanks for the explanation! As for adding a new tiny model, pipelines use the
@yonigozlan probably the easiest thing to do, in that case, is just to manually upload a new model, don't add it to
Also, I was wrong - some of the
Thanks for working on this! I think it's very important, thus we should try to make it a bit more simple. 🤗
c05ceb2 to 61cc576
Thanks for all of your input! I'll merge this now, as the remaining issues/improvements raised seem a bit out of scope for this PR.
* Standardize image-text-to-text-models-output
  add post_process_image_text_to_text to chameleon and cleanup
  Fix legacy kwarg behavior and deprecation warning
  add post_process_image_text_to_text to qwen2_vl and llava_onevision
  Add post_process_image_text_to_text to idefics3, mllama, pixtral processor
* nit var name post_process_image_text_to_text udop
* nit fix deprecation warnings
* Add image-text-to-text pipeline
* add support for image url in chat template for pipeline
* Reformat to be fully compatible with chat templates
* Add tests chat template
* Fix imports and tests
* Add pipeline tag
* change logic handling of single prompt and multiple images
* add pipeline mapping to models
* fix batched inference
* fix tests
* Add manual batching for preprocessing
* Fix outputs with nested images
* Add support for all common processing kwargs
* Add default padding when multiple text inputs (batch size>1)
* nit change version deprecation warning
* Add support for text only inference
* add chat_template warnings
* Add pipeline tests and add copied from post process function
* Fix batched pipeline tests
* nit
* Fix pipeline tests blip2
* remove unnecessary max_new_tokens
* revert processing kosmos2 and remove unnecessary max_new_tokens
* fix pipeline tests idefics
* Force try loading processor if pipeline supports it
* revert load_processor change
* hardcode loading only processor
* remove unnecessary try except
* skip imagetexttotext tests for kosmos2 as tiny model causes problems
* Make code clearer
* Address review comments
* remove preprocessing logic from pipeline
* fix fuyu
* add BC resize fuyu
* Move post_process_image_text_to_text to ProcessorMixin
* add guard in post_process
* fix zero shot object detection pipeline
* add support for generator input in pipeline
* nit
* change default image-text-to-text model to llava onevision
* fix owlv2 size dict
* Change legacy deprecation warning to only show when True
What does this PR do?
Add image-text-to-text pipeline!
A split of this PR with only model-specific pre- and post-processing is available here, to reduce the LOC count and the number of files changed before merging this PR.
Note: The use of a "legacy" kwarg to modify the preprocessing of some image-text-to-text models is needed here if we want to integrate those models into this pipeline. However, the way it is handled might not be ideal, so I'm open to suggestions on how to improve this.
The pipeline supports the following inputs:
images=image, text=text
images=[image, image], text=[text, text]
images=[[image, image], [image]] or images=[image, image, image], text=["... <image>...<image>...", "...<image>..."]
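The third input form suggests that a flat list of images is matched to prompts through their `<image>` placeholders. The sketch below shows one way such a grouping could work, assuming each prompt consumes as many images as it has placeholders, in order. `group_images_by_prompt` is a hypothetical helper written for illustration; the pipeline's actual logic may differ.

```python
from typing import Any, Sequence


def group_images_by_prompt(
    images: Sequence[Any], texts: Sequence[str], image_token: str = "<image>"
) -> list[list[Any]]:
    """Split a flat list of images into one sub-list per prompt.

    Hypothetical helper: each prompt takes as many images as it has
    `image_token` placeholders, in order.
    """
    counts = [text.count(image_token) for text in texts]
    if sum(counts) != len(images):
        raise ValueError(
            f"Found {sum(counts)} placeholders but {len(images)} images."
        )
    grouped: list[list[Any]] = []
    start = 0
    for count in counts:
        # Take the next `count` images for this prompt.
        grouped.append(list(images[start : start + count]))
        start += count
    return grouped


texts = ["Compare <image> with <image>.", "Describe <image>."]
flat_images = ["img_a", "img_b", "img_c"]  # stand-ins for real image objects
nested_images = group_images_by_prompt(flat_images, texts)
```

Here `nested_images` has the same shape as the nested form `[[image, image], [image]]` listed above, and a mismatch between placeholders and images raises a `ValueError` instead of silently misaligning them.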
TODOs:
Known current limitations/bugs:
A pipeline whose tokenizer has no pad_token cannot do batching. You can try setting it with
pipe.tokenizer.pad_token_id = model.config.eos_token_id.
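The workaround above can be wrapped in a small guard so that an existing pad token is never overwritten. This is a minimal sketch: `ensure_pad_token_id` is a hypothetical helper, and `SimpleNamespace` objects stand in for the real tokenizer and model config so that the example runs without downloading a model.

```python
from types import SimpleNamespace


def ensure_pad_token_id(tokenizer, eos_token_id) -> bool:
    """Fall back to `eos_token_id` when the tokenizer has no pad token.

    Hypothetical helper; returns True if the pad token id was set here.
    """
    if getattr(tokenizer, "pad_token_id", None) is None:
        tokenizer.pad_token_id = eos_token_id
        return True
    return False


# Stand-ins for `pipe.tokenizer` and `model.config` (assumed attribute names
# taken from the workaround above).
tokenizer = SimpleNamespace(pad_token_id=None)
config = SimpleNamespace(eos_token_id=2)

changed = ensure_pad_token_id(tokenizer, config.eos_token_id)
```

With a real pipeline, the call would be `ensure_pad_token_id(pipe.tokenizer, model.config.eos_token_id)`. If `eos_token_id` is a list in a given config, a single id would have to be chosen first.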
Examples of usage:
Who can review?
@Rocketknight1 @molbap @qubvel @NielsRogge