FR: allow multimodal input / vision / images #429

Closed
thiswillbeyourgithub opened this issue Apr 23, 2024 · 1 comment · May be fixed by #430

@thiswillbeyourgithub (Contributor)

It would be simple to make it so that paths/URLs to images in the prompt text are replaced by an image call.
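
Roughly what I have in mind, as a minimal Python sketch: scan the prompt for tokens that look like image paths/URLs and split it into text and image parts. The regex and the OpenAI-style content layout below are assumptions for illustration, not this project's actual API.

```python
import re

# Tokens ending in a common image extension are treated as image references.
IMAGE_TOKEN = re.compile(r"\S+\.(?:png|jpe?g|gif|webp)\b", re.IGNORECASE)

def prompt_to_content(prompt: str) -> list[dict]:
    """Split a prompt string into OpenAI-style text/image_url content parts."""
    parts, last = [], 0
    for match in IMAGE_TOKEN.finditer(prompt):
        text = prompt[last:match.start()].strip()
        if text:
            parts.append({"type": "text", "text": text})
        parts.append({"type": "image_url", "image_url": {"url": match.group(0)}})
        last = match.end()
    tail = prompt[last:].strip()
    if tail:
        parts.append({"type": "text", "text": tail})
    return parts

# prompt_to_content("What's in this image? https://example.com/smile.png")
# -> [{"type": "text", "text": "What's in this image?"},
#     {"type": "image_url", "image_url": {"url": "https://example.com/smile.png"}}]
```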

I could then, for example, add a shortcut so that an image in my clipboard is saved to /tmp and its path added to the prompt automatically.
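
For the clipboard part, something like this would do (a sketch assuming Pillow is available; the helper name is made up):

```python
import tempfile
from PIL import ImageGrab  # Pillow; clipboard grabbing works on Windows/macOS and recent Linux builds

def clipboard_image_to_tmp() -> str | None:
    """Save the clipboard image (if any) under /tmp and return its path."""
    img = ImageGrab.grabclipboard()
    if img is None or isinstance(img, list):  # a list means file paths were copied, not pixel data
        return None
    tmp = tempfile.NamedTemporaryFile(suffix=".png", delete=False, dir="/tmp")
    img.save(tmp.name, format="PNG")
    return tmp.name
```

The returned path could then be inserted into the prompt by whatever keybinding triggers the shortcut.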

See the kind of workflow implemented in ollama:

What's in this image? /Users/jmorgan/Desktop/smile.png
The image features a yellow smiley face, which is likely the central focus of the picture.

Somewhat related to:

Edit:
Oh, I see that there's already partial support for this: #332

It should be:

  • enabled for the other GPT-4 models that support it
  • mentioned in the docs
  • extended to support local files (see the sketch below)

I'll see about making a PR.
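
For local files, the OpenAI vision endpoint accepts base64 data URLs, so a local path found in the prompt could be converted before sending. A minimal sketch (the helper name is hypothetical):

```python
import base64
import mimetypes

def local_image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL usable in an image_url part."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'image/png'};base64,{encoded}"
```

http(s) URLs would be passed through unchanged; only local paths would go through this conversion.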
@thiswillbeyourgithub (Contributor, Author)

For anyone interested, I added a patch file and a demo showcasing the vision feature in this PR.
