📜 Description
I am using the docsgpt-7b-mistral.Q8_0.gguf model on a 1080 Ti. The model fits in VRAM with about 300 MB to spare, and I can query it successfully. However, querying the model a second time causes an out-of-memory error. After investigation, the cause is that DocsGPT creates a new instance of the model for every query instead of reusing the previous one, or at least freeing the first instance before creating a new one. Even with enough memory, creating a new instance adds an unnecessary delay to every query.
👟 Reproduction steps
I am running DocsGPT 0.9.0 locally (./setup.sh with option 2). The environment is configured to use llama-cpp with huggingface_sentence-transformers/all-mpnet-base-v2 embeddings. To reproduce the bug, monitor VRAM usage with nvidia-smi and ask at least two questions. Notice how VRAM usage either doubles (if you have enough VRAM) or DocsGPT runs out of memory.
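A minimal sketch of one way to watch the usage from a second terminal while reproducing this (the watcher script is not part of DocsGPT; it only assumes nvidia-smi is on the PATH):

```python
# Poll nvidia-smi once per second and print used/total VRAM.
# Run this in a separate terminal, then ask DocsGPT two questions.
import subprocess
import time

while True:
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip())
    time.sleep(1)
```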
👍 Expected behavior
The LLM instance should be cached for future queries.
👎 Actual Behavior with Screenshots
DocsGPT runs out of memory on the second query.
💻 Operating system
Windows
What browsers are you seeing the problem on?
Firefox, Chrome
🤖 What development environment are you experiencing this bug on?
Docker
🔒 Did you set the correct environment variables in the right path? List the environment variable names (not values please!)
CELERY_BROKER_URL
CELERY_RESULT_BACKEND
EMBEDDINGS_NAME
FLASK_APP
FLASK_DEBUG
LLM_NAME
VITE_API_STREAMING
📃 Provide any additional context for the Bug.
I have fixed this issue on my side by modifying the llama_cpp.py script to cache and reuse LLM instances. This requires the flask-caching package and some modifications to llms, retrievers and the answer API.
Here is the implementation in llama_cpp.py:
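```python
# Class-level cache shared across requests. (LlamaCpp is assumed to be
# imported in this module, e.g. from application.llm.llama_cpp import LlamaCpp.)
singleton_llm = {
    'type': None,
    'llm': None
}

def create_llm(self, type, api_key, user_api_key, *args, **kwargs):
    llm_class = self.llms.get(type.lower())
    if not llm_class:
        raise ValueError(f"No LLM class found for type {type}")
    # do not create a new LLM (and allocate memory again) for each
    # request for local models
    if self.singleton_llm['type'] != llm_class or self.singleton_llm['type'] != LlamaCpp:
        llm = llm_class(api_key, user_api_key, *args, **kwargs)
        self.singleton_llm['type'] = llm_class
        self.singleton_llm['llm'] = llm
    return self.singleton_llm['llm']
```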
📖 Relevant log output
👀 Have you spent some time to check if this bug has been raised before?
🔗 Are you willing to submit PR?
Yes, I am willing to submit a PR!
🧑‍⚖️ Code of Conduct