-
Notifications
You must be signed in to change notification settings - Fork 364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Llava Initial approach to clear images #664
Conversation
This reverts commit f29f61e.
SignalRT, 3 short remarks.
|
This feels a bit like a hack to workaround not having properly KV cache management in the executors tbh. But without a redesign of the way executors work with sequence IDs etc (which is what I'm working on, slowly, with |
@martindevans Recently I'm also working on such things (experimental), inspired by vllm. There are a lot of works to do so I think I'll add it to |
|
@martindevans, @AsakusaRinne I will integrate this PR "as is". We will probably need to change llava support in the near future (ggerganov/llama.cpp#4216 (comment)). |
Thank you for your answer SignalRT, After reading the reference you have provided above and taking into account that the models are trained with one image only per input record, I think that it does not have any sense to use multiple images as input for one prompt. I also do not see immediately a good use case for such a functionality. I think that the model will be confused and in the best case it will describe multiple images as it would be one image, for example as a collage or it will just describe one image. I have noticed the higher GPU usage because I have an example which does not run anymore on the GPU (out of memory), but was fitting into GPU memory before. I do not know what has happened and where, I just see the result of it. Maybe it has to do with GPU memory allocation type (automatic thread pool with overhead). I think that the GPU memory allocation type has changed in the C# code or in the C++ code. I hope that we will have some kind of option to optimize GPU memory usage because it is very important for the speed (everything is much faster with GPU). I have tried saving/loading the state also and it did not work. After loading the context state the model produces garbage in my case. Probably a bug in state management somewhere. The kv cache clearing works. @martindevans, @AsakusaRinne, @SignalRT, does state saving/loading work with the general llama code base? This is important because I was counting on this to work (using it to reset the context without recreating it). Or do we need to use kv cache clearing there also? |
I'm not aware of any issues with state load/save at the moment. It's something I'm going to be actively working on soon (load/save conversations in the |
This changes the Interactive executor to be able to continue the conversation including new images. The example (outside the executor) clears the KV_CACHE to reset the previous images.