📜 Description
I am using the docsgpt-7b-mistral.Q8_0.gguf model on a 1080 Ti. The model fits in VRAM with about 300 MB to spare, and I can query it successfully. However, querying the model a second time causes an out-of-memory error. After investigation, the cause is that DocsGPT creates a new instance of the model for every query instead of reusing the previous one, or at least freeing the first instance before creating a new one. Even with enough memory, creating a new instance adds an unnecessary delay to every query.
👟 Reproduction steps
I am running DocsGPT 0.9.0 locally (./setup.sh with option 2). The environment is configured to use llama-cpp with huggingface_sentence-transformers/all-mpnet-base-v2 embeddings. To reproduce the bug, monitor VRAM usage with nvidia-smi and ask at least two questions. Notice how VRAM usage either doubles (if you have enough VRAM) or DocsGPT runs out of memory.
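A minimal sketch of one way to watch the usage from a second terminal while reproducing this (the watcher script is not part of DocsGPT; it only assumes nvidia-smi is on the PATH):

```python
# Poll nvidia-smi once per second and print used/total VRAM.
# Run this in a separate terminal, then ask DocsGPT two questions.
import subprocess
import time

while True:
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip())
    time.sleep(1)
```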
👍 Expected behavior
The LLM instance should be cached for future queries.
👎 Actual Behavior with Screenshots
DocsGPT runs out of memory on the second query.
💻 Operating system
Windows
What browsers are you seeing the problem on?
Firefox, Chrome
🤖 What development environment are you experiencing this bug on?
Docker
🔒 Did you set the correct environment variables in the right path? List the environment variable names (not values please!)
CELERY_BROKER_URL
CELERY_RESULT_BACKEND
EMBEDDINGS_NAME
FLASK_APP
FLASK_DEBUG
LLM_NAME
VITE_API_STREAMING
📃 Provide any additional context for the Bug.
I have fixed this issue on my side by modifying the llama_cpp.py script to cache and reuse LLM instances. This requires the flask-caching package and some modifications to llms, retrievers and the answer API.
Here is the implementation in llama_cpp.py:
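```python
# Class-level cache shared across requests. (LlamaCpp is assumed to be
# imported in this module, e.g. from application.llm.llama_cpp import LlamaCpp.)
singleton_llm = {
    'type': None,
    'llm': None
}

def create_llm(self, type, api_key, user_api_key, *args, **kwargs):
    llm_class = self.llms.get(type.lower())
    if not llm_class:
        raise ValueError(f"No LLM class found for type {type}")
    # do not create a new LLM (and allocate memory again) for each
    # request for local models
    if self.singleton_llm['type'] != llm_class or self.singleton_llm['type'] != LlamaCpp:
        llm = llm_class(api_key, user_api_key, *args, **kwargs)
        self.singleton_llm['type'] = llm_class
        self.singleton_llm['llm'] = llm
    return self.singleton_llm['llm']
```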
📖 Relevant log output
👀 Have you spent some time to check if this bug has been raised before?
🔗 Are you willing to submit PR?
Yes, I am willing to submit a PR!
🧑‍⚖️ Code of Conduct