
Which Vision Model? #65

Open
alensiljak opened this issue Feb 14, 2025 · 4 comments

alensiljak commented Feb 14, 2025

Hi!
Thanks for adding new features so quickly!

I'd like to review the options for the Vision Model, and perhaps get a recommendation and/or update the list. Of the three options offered:

  1. OpenAI tells me I'm over the quota. I assume they are a paid service and I have no credit with them.
  2. Google models produce all kinds of errors:

     Error: No parseable tool calls provided to GoogleGenerativeAIToolsOutputParser.

  3. Ollama
     3a. with DeepSeek-R1:

         error: 'registry.ollama.ai/library/deepseek-r1:latest does not support tools'

     3b. with Llava-llama3, which should have vision:

         registry.ollama.ai/library/llava-llama3:latest does not support tools

I guess this is all due to the "Only models with vision capabilities can be used" instruction.

Any suggestions on which model provides good results? It would be good to hear from other users, as well.
Perhaps this can be moved to discussions if there are no other actions that would make this configuration easier.
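
In case it helps others debugging this, here is a minimal probe I would use to check whether a given Ollama model accepts tool calls at all. This is only a sketch, assuming the default local Ollama endpoint on port 11434; the `noop` tool is made up purely for the test.

```python
# Probe an Ollama model for tool-calling support by sending a chat request
# with a dummy tool definition. Assumes a local Ollama server on the default
# port; the "noop" tool exists only for this test.
import requests

def supports_tools(model: str) -> bool:
    payload = {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": "ping"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "noop",
                "description": "Dummy tool used only to probe tool support.",
                "parameters": {"type": "object", "properties": {}},
            },
        }],
    }
    resp = requests.post("http://localhost:11434/api/chat", json=payload)
    if resp.ok:
        return True
    error = resp.json().get("error", "")
    # Ollama rejects the request with a body like
    # {"error": "registry.ollama.ai/library/llava-llama3:latest does not support tools"}
    if "does not support tools" in error:
        return False
    raise RuntimeError(error)  # some other problem, e.g. the model is not pulled

print(supports_tools("llava-llama3"))  # False, matching the error quoted above
```

Both deepseek-r1 and llava-llama3 fail this check, which matches the errors above.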


alensiljak commented Feb 14, 2025

Some AI-generated recommendations are listed below, but I have (so far) not tried any of them:

  • Llama 3.2 Vision: Developed by Meta, this model is available in 11B and 90B parameter sizes and can handle both image and text inputs simultaneously (see the sketch after this list).
  • NVLM (NVIDIA Vision Language Model): This family of models includes three distinct architectures for different use cases, offering powerful image reasoning capabilities.
  • Molmo: Developed by the Allen Institute for AI, Molmo models are available in 1B, 7B, and 72B parameter sizes and can perform on par with some proprietary models.
  • CLIP: This model provides joint image-text embeddings and can be used for tasks like zero-shot image classification.
  • FLAVA: Trained with both unimodal and multi-modal pre-training objectives, FLAVA can be used for vision, language, and multi-modal tasks.
  • OWL-ViT: This model enables zero-shot/text-guided and one-shot/image-guided object detection.
  • CLIPSeg and GroupViT: These models allow for text and image-guided image segmentation.
  • VisualBERT, GIT, and ViLT: These models enable visual question answering and various other tasks.
  • X-CLIP: A multi-modal model trained with video and text modalities, enabling zero-shot video classification.
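
Of these, Llama 3.2 Vision is also published in the Ollama library (as llama3.2-vision, if I'm not mistaken). A minimal sketch of sending an image to it through Ollama's chat API, assuming a local server and that the model has already been pulled; the model tag and file name are just illustrative, not tested recommendations for this project:

```python
# Send an image to a vision-capable Ollama model and return its description.
# Assumes a local Ollama server; "llama3.2-vision" and "receipt.jpg" are
# example names only.
import base64
import requests

def describe_image(model: str, image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": model,
        "stream": False,
        "messages": [{
            "role": "user",
            "content": "Describe this image.",
            "images": [image_b64],  # Ollama accepts base64-encoded images here
        }],
    }
    resp = requests.post("http://localhost:11434/api/chat", json=payload)
    resp.raise_for_status()
    return resp.json()["message"]["content"]

print(describe_image("llama3.2-vision", "receipt.jpg"))
```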


alensiljak commented Feb 14, 2025

There is also llava available on Ollama.

Edit: it gives the same error, though.
registry.ollama.ai/library/llava-llama3:latest does not support tools

@alensiljak

Interesting suggestions on the topic:
https://news.ycombinator.com/item?id=43048698


FedAnt commented Feb 25, 2025

Which model does photo and PDF recognition work with?
