
Which Vision Model? #65

Open
alensiljak opened this issue Feb 14, 2025 · 4 comments

alensiljak commented Feb 14, 2025

Hi!
Thanks for adding new features so quickly!

I'd like to review the options for the Vision Model, and perhaps get a recommendation and/or update the list. Of the three options offered:

  1. OpenAI tells me I'm over the quota. I assume they are a paid service and I have no credit with them.
  2. Google models produce all kinds of errors:

     Error: No parseable tool calls provided to GoogleGenerativeAIToolsOutputParser.

  3. Ollama
     3a. with DeepSeek-R1:

         error: 'registry.ollama.ai/library/deepseek-r1:latest does not support tools'

     3b. with Llava-llama3, which should have vision:

         registry.ollama.ai/library/llava-llama3:latest does not support tools

I guess this is all due to the "Only models with vision capabilities can be used" instruction.

Any suggestions on which model provides good results? It would be good to hear from other users, as well.
Perhaps this can be moved to discussions if there are no other actions that would make this configuration easier.
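
In case it helps others debugging this, here is a minimal probe I would use to check whether a given Ollama model accepts tool calls at all. This is only a sketch, assuming the default local Ollama endpoint on port 11434; the `noop` tool is made up purely for the test.

```python
# Probe an Ollama model for tool-calling support by sending a chat request
# with a dummy tool definition. Assumes a local Ollama server on the default
# port; the "noop" tool exists only for this test.
import requests

def supports_tools(model: str) -> bool:
    payload = {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": "ping"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "noop",
                "description": "Dummy tool used only to probe tool support.",
                "parameters": {"type": "object", "properties": {}},
            },
        }],
    }
    resp = requests.post("http://localhost:11434/api/chat", json=payload)
    if resp.ok:
        return True
    error = resp.json().get("error", "")
    # Ollama rejects the request with a body like
    # {"error": "registry.ollama.ai/library/llava-llama3:latest does not support tools"}
    if "does not support tools" in error:
        return False
    raise RuntimeError(error)  # some other problem, e.g. the model is not pulled

print(supports_tools("llava-llama3"))  # False, matching the error quoted above
```

Both deepseek-r1 and llava-llama3 fail this check, which matches the errors above.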


alensiljak commented Feb 14, 2025

Some AI-generated recommendations are listed below, but I have (so far) not tried any of them:

  • Llama 3.2 Vision: Developed by Meta, this model is available in 11B and 90B parameter sizes and can handle both image and text inputs simultaneously (see the sketch after this list).
  • NVLM (NVIDIA Vision Language Model): This family of models includes three distinct architectures for different use cases, offering powerful image reasoning capabilities.
  • Molmo: Developed by the Allen Institute for AI, Molmo models are available in 1B, 7B, and 72B parameter sizes and can perform on par with some proprietary models.
  • CLIP: This model provides joint image-text embeddings and can be used for tasks like zero-shot image classification.
  • FLAVA: Trained with both unimodal and multi-modal pre-training objectives, FLAVA can be used for vision, language, and multi-modal tasks.
  • OWL-ViT: This model enables zero-shot/text-guided and one-shot/image-guided object detection.
  • CLIPSeg and GroupViT: These models allow for text and image-guided image segmentation.
  • VisualBERT, GIT, and ViLT: These models enable visual question answering and various other tasks.
  • X-CLIP: A multi-modal model trained with video and text modalities, enabling zero-shot video classification.
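
Of these, Llama 3.2 Vision is also published in the Ollama library (as llama3.2-vision, if I'm not mistaken). A minimal sketch of sending an image to it through Ollama's chat API, assuming a local server and that the model has already been pulled; the model tag and file name are just illustrative, not tested recommendations for this project:

```python
# Send an image to a vision-capable Ollama model and return its description.
# Assumes a local Ollama server; "llama3.2-vision" and "receipt.jpg" are
# example names only.
import base64
import requests

def describe_image(model: str, image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": model,
        "stream": False,
        "messages": [{
            "role": "user",
            "content": "Describe this image.",
            "images": [image_b64],  # Ollama accepts base64-encoded images here
        }],
    }
    resp = requests.post("http://localhost:11434/api/chat", json=payload)
    resp.raise_for_status()
    return resp.json()["message"]["content"]

print(describe_image("llama3.2-vision", "receipt.jpg"))
```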


alensiljak commented Feb 14, 2025

There is also llava available on Ollama.

Edit: it gives the same error, though.
registry.ollama.ai/library/llava-llama3:latest does not support tools

@alensiljak

Interesting suggestions on the topic:
https://news.ycombinator.com/item?id=43048698


FedAnt commented Feb 25, 2025

Which model does photo and PDF recognition work with?
