ValueError: Image features and image tokens do not match #710

Open
MoyusiteruIori opened this issue Jan 4, 2025 · 0 comments

I'm using transformers==4.47.1.

Command:

python run.py --data MUIRBench --model llava_next_interleave_7b --verbose

Output:

[2025-01-04 17:24:42,096] WARNING - RUN - run.py: main - 165: --reuse is not set, will not reuse previous (before one day) temporary files
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.97it/s]
/home/mengyu/VLMEvalKit/vlmeval/vlm/llava/llava.py:294: UserWarning: Following kwargs received: {'do_sample': False, 'temperature': 0, 'max_new_tokens': 512, 'top_p': None, 'num_beams': 1}, will use as generation config. 
  warnings.warn(
  0%|          | 0/2600 [00:00<?, ?it/s]
You may have used the wrong order for inputs. `images` should be passed before `text`. The `images` and `text` inputs will be swapped. This behavior will be deprecated in transformers v4.47.
/home/mengyu/miniconda3/envs/vlmevalkit/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:628: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
  0%|          | 0/2600 [00:00<?, ?it/s]
[2025-01-04 17:24:52,047] ERROR - RUN - run.py: main - 411: Model llava_next_interleave_7b x Dataset MUIRBench combination failed: Image features and image tokens do not match: tokens: 5832, features 2916, skipping this combination.
Traceback (most recent call last):
  File "/home/mengyu/VLMEvalKit/run.py", line 299, in main
    model = infer_data_job(
  File "/home/mengyu/VLMEvalKit/vlmeval/inference.py", line 165, in infer_data_job
    model = infer_data(
  File "/home/mengyu/VLMEvalKit/vlmeval/inference.py", line 130, in infer_data
    response = model.generate(message=struct, dataset=dataset_name)
  File "/home/mengyu/VLMEvalKit/vlmeval/vlm/base.py", line 115, in generate
    return self.generate_inner(message, dataset)
  File "/home/mengyu/VLMEvalKit/vlmeval/vlm/llava/llava.py", line 402, in generate_inner
    output = self.model.generate(**inputs, **self.kwargs)
  File "/home/mengyu/miniconda3/envs/vlmevalkit/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/mengyu/miniconda3/envs/vlmevalkit/lib/python3.10/site-packages/transformers/generation/utils.py", line 2252, in generate
    result = self._sample(
  File "/home/mengyu/miniconda3/envs/vlmevalkit/lib/python3.10/site-packages/transformers/generation/utils.py", line 3251, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/home/mengyu/miniconda3/envs/vlmevalkit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mengyu/miniconda3/envs/vlmevalkit/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mengyu/miniconda3/envs/vlmevalkit/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 534, in forward
    raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 5832, features 2916
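
As an aside, the two warnings at the top of the log come from how VLMEvalKit loads the model. The deprecation message suggests loading it roughly like this; a minimal sketch, assuming the checkpoint behind the `llava_next_interleave_7b` alias is `llava-hf/llava-interleave-qwen-7b-hf`:

```python
import torch
from transformers import LlavaForConditionalGeneration

# Assumed checkpoint id; substitute whatever VLMEvalKit actually resolves
# for the llava_next_interleave_7b alias.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-interleave-qwen-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # replaces deprecated use_flash_attention_2=True
)
model = model.to("cuda")  # move to GPU after CPU init, as the second warning asks
```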

The model seems to work fine with

vlmutil check llava_next_interleave_7b
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.01it/s]
/home/mengyu/VLMEvalKit/vlmeval/vlm/llava/llava.py:294: UserWarning: Following kwargs received: {'do_sample': False, 'temperature': 0, 'max_new_tokens': 512, 'top_p': None, 'num_beams': 1}, will use as generation config. 
  warnings.warn(
Model: llava_next_interleave_7b
You may have used the wrong order for inputs. `images` should be passed before `text`. The `images` and `text` inputs will be swapped. This behavior will be deprecated in transformers v4.47.
/home/mengyu/miniconda3/envs/vlmevalkit/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:628: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Expanding inputs for image tokens in LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.50.
Test 1: The image shows a red apple with a green leaf attached to its stem. The apple appears to be fresh and shiny, suggesting it is ripe. The background is plain white, which highlights the apple as the main subject of the image.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Test 2: The image shows a red apple with a green leaf attached to its stem. The apple appears to be fresh and shiny, suggesting it is ripe. The background is plain white, which highlights the apple as the main subject of the image.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Test 3: There is only one apple in each image.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
Test 4: There is only one apple in each image.

but it does not work on MUIRBench (which the program downloads automatically).
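
One more observation: the failing token count (5832) is exactly twice the feature count (2916), as if the `<image>` placeholders were expanded twice, and the `vlmutil check` log above shows the v4.50 deprecation warning about expanding image tokens in processing. A minimal sketch of the workaround that warning itself suggests; the attribute values here are assumptions about the interleave checkpoint, not verified:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-interleave-qwen-7b-hf")

# Both values are assumptions: check the checkpoint's config.json
# (vision_config.patch_size and vision_feature_select_strategy).
processor.patch_size = 14
processor.vision_feature_select_strategy = "full"
```

With these set, the processor expands the image tokens itself, so the model's legacy expansion path should not run a second time.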
