I’ve played around with this a bit but wanted to write the idea up here in case someone can implement it sooner. The idea is neither novel nor particularly interesting, but I think it’s needed: dllama-api should support the models endpoint (`/v1/models`) and allow inference against multiple models from a pre-specified list.
Yes, we’re resource constrained on local machines, so the first completion against a model might take tens of seconds while the weights load, and you wouldn’t want multiple tools pulling from different models in an interleaved way, because you’d thrash between models in memory.
But if dllama-api could be started in a “here is the list of valid models” mode (perhaps derived from a models directory, or specified via command-line arguments or a config file), then the models endpoint could be queried correctly and a completion against any listed model could succeed. That would let us leave a single dllama-api server running all the time and use whichever model we want, rather than shutting dllama-api down and starting a new instance each time. A rough sketch of the client-side flow is below.
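For illustration only, here is a minimal sketch of what an OpenAI-compatible client would do against such a server. The base URL, port, and model id are made-up placeholders, and the `/v1/models` route is the proposed addition, not something dllama-api exposes today:

```python
# Hypothetical client flow against a dllama-api server that advertises
# multiple models. Host, port, and model names are assumptions.
import requests

BASE_URL = "http://localhost:9990/v1"  # hypothetical dllama-api address

# 1. A client such as open-webui first lists the models the server advertises
#    via the proposed /v1/models endpoint.
models = requests.get(f"{BASE_URL}/models").json()
print([m["id"] for m in models["data"]])  # e.g. ["llama3_1_8b_instruct_q40"]

# 2. It then requests a completion against one entry from that list; the
#    server would load (or swap in) the corresponding weights before serving.
resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "llama3_1_8b_instruct_q40",  # must match an id from /v1/models
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```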
Right now I can use LM Studio’s server for this with models that fit in memory, but we should be able to do the same thing distributed as well.
A good test of this being complete would be getting open-webui to work with dllama-api. I’ve tried adding dummy model responses, but I can’t get open-webui to talk to distributed-llama successfully.