Unable to run llama.cpp or GPT4All demos #2404
Comments
Try changing the …
I tried the following on Colab, but the last line never finishes. Does anyone have a clue?
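The Colab snippet itself didn't come through. As a hedged reconstruction only, the GPT4All demo from the LangChain docs of that period looked roughly like the following; the model path is a placeholder and the exact parameters are assumptions, not taken from this thread:

```python
from langchain import PromptTemplate, LLMChain
from langchain.llms import GPT4All

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Placeholder path; n_ctx/n_threads values are illustrative.
llm = GPT4All(model="./models/gpt4all-lora-quantized-ggml.bin", n_ctx=512, n_threads=8)
llm_chain = LLMChain(prompt=prompt, llm=llm)

# This final call is the "last line" reported as never finishing.
print(llm_chain.run("What NFL team won the Super Bowl in the year Justin Bieber was born?"))
```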
That got me a little further; now I'm getting the following output from the GPT4All model:
Hi @Zetaphor, are you referring to this Llama demo? I'm the author of the llama-cpp-python bindings that LangChain uses here. Can you give me an idea of what kind of processor you're running and the length of your prompt? Because llama.cpp runs inference on the CPU, it can take a while to process the initial prompt, and there are still some performance issues with certain CPU architectures. To rule that out, can you try running the same prompt through the examples in llama.cpp?
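The comment above suggests the llama.cpp example programs; an equivalent check from Python, calling the llama-cpp-python bindings directly and bypassing LangChain entirely, could look like this sketch (model path and prompt are placeholders):

```python
from llama_cpp import Llama

# Load the same GGML model that the LangChain wrapper would use.
llm = Llama(model_path="./models/ggml-model-q4_0.bin", n_ctx=512, n_threads=4)

# If this first call is also very slow, the bottleneck is CPU prompt
# processing in llama.cpp itself rather than anything in LangChain.
output = llm("Question: What is the meaning of life? Answer:", max_tokens=64)
print(output["choices"][0]["text"])
```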
Hey @abetlen, I'm able to run the 7B model on both my laptop and my server without issues. Here are the specs for both:
I'm trying to run the prompt provided in the demo code:
I have not yet run this exact prompt through llama.cpp, but I've been able to successfully run the chat-with-bob prompt on both my laptop and server. On the server, in addition to running the GPT4All model, I've also used the Vicuna 13B model.
On the suggestion of someone in Discord I'm able to get output using the llama.cpp model; it looks like the thing we needed to do here was to increase the token context to a much higher value. However, I am definitely seeing reduced performance compared to what I experience when just running inference through llama.cpp:

import os
from langchain.memory import ConversationTokenBufferMemory
from langchain.agents.tools import Tool
from langchain.chat_models import ChatOpenAI
from langchain.llms.base import LLM
from langchain.llms.llamacpp import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.agents import load_tools, initialize_agent, AgentExecutor, BaseSingleActionAgent, AgentType
# A much larger context window (n_ctx) than the default was needed before any output appeared
custom_llm = LlamaCpp(model_path="models/ggml-vicuna-13b-4bit.bin", verbose=True,
                      n_threads=4, n_ctx=5000, temperature=0.05, repeat_penalty=1.22)
tools = []
memories = {}
question = "What is the meaning of life?"
unique_id = os.urandom(16).hex()
if unique_id not in memories:
    memories[unique_id] = ConversationTokenBufferMemory(
        memory_key="chat_history", llm=custom_llm, return_messages=True)
memory = memories[unique_id]
agent = initialize_agent(tools, llm=custom_llm, agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
                         verbose=True, memory=memory, max_retries=1)
response = agent.run(input=question)
print(response)
Thanks for the reply; based on that I think it's related to this issue. I've opened a PR (#2411) to give you control over the batch size from LangChain (this got missed in the initial merge). For now, you can also enable verbose logging and update the batch size parameter. The verbose logs should also give you an idea of the per-token performance compared to running llama.cpp directly. EDIT: You'll need to …
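As a hedged sketch of what that configuration looks like once the batch-size parameter from #2411 is available (values are illustrative, not from this thread):

```python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="models/ggml-vicuna-13b-4bit.bin",
    n_ctx=2048,
    n_batch=512,   # batch size for prompt processing, exposed by the PR above
    verbose=True,  # surfaces llama.cpp's per-token timing output
)
```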
@hershkoy I've not seen that floating point exception before, but this is using a different library, so I suspect it might be a bug there.
Just FYI, the slowdown in performance is a bug; it's being investigated in ggml-org/llama.cpp#603. Inference should NOT slow down with increased context.
I'm successfully running a chain loaded with the LlamaCpp class (e.g. in https://gist.github.com/psychemedia/51f45fbfe160f78605bdd0c1b404e499), but not the GPT4All one. (MacBook Pro mid-2015, Intel.)
@abetlen I am new to this. I can try. Can you explain how to do it? |
@hershkoy absolutely, all you have to do is change the following two lines. First, update the LLM import near the top of your file:

from langchain.llms import LlamaCpp

and then, where you instantiate the class:

llm = LlamaCpp(model_path="gpt4all-converted.bin", n_ctx=2048)

Let me know if that still gives you an error on your system.
First, it seems that I was missing a print() around the final chain call, and now I get an output. When running the LlamaCpp class there is an output, and the program quits with no error. (GPT4ALL and LLAMACPP output logs omitted.)
@hershkoy Upgrading my dependencies to …
I had the same problem. Upgrading my version of Pydantic fixed it:
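Presumably that amounts to something like `pip install --upgrade pydantic`; the exact version requirement isn't stated in the comment.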
Same problem, but I've never gotten any output even if I used print(). Tried setting the context and everything, but still couldn't find a solution. Things I have tried: …
Yeah, I had this same problem yesterday with the llama.cpp model. It was stuck and never produced any output, but I did get an output if I used llama.cpp directly (i.e. not through LangChain). Strangely enough, when I tried running the same code today with LangChain, it worked just fine.
Leaving some traceback logs here from when I pressed Ctrl + C while stuck, in case it's any help.
Specifically for llama.cpp, I think #2404 (comment) points to the issue being in the callback manager. It's likely that, due to the async nature of the callback manager, the "main" program exits before the chain returns. To test this I put in a sleep loop, but it also seems that perhaps the callback manager isn't being used with … This shows that streaming should be used (the streaming property is True by default) and that the CallbackManager should also be in play. It would be worth stepping through this with a debugger.
You don't happen to have a minimal example at hand of using LlamaCpp and streaming the output, word for word?
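No minimal example survives in the thread, but a sketch of the streaming setup described above, assuming the stdout streaming callback handler from LangChain and a placeholder model path, might look like:

```python
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

# Tokens are pushed to stdout by the callback as soon as they are generated.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llm = LlamaCpp(
    model_path="gpt4all-converted.bin",  # placeholder path
    n_ctx=2048,
    callback_manager=callback_manager,
    streaming=True,   # default, shown here for clarity
    verbose=True,
)

llm("Question: What is the meaning of life? Answer:")
```

With this setup each token should be printed as it is generated, rather than only after the full completion returns.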
Hi, @Zetaphor! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale. From what I understand, you were experiencing issues running the llama.cpp and GPT4All demos. You mentioned that you tried changing the token context. Could you please let us know if this issue is still relevant to the latest version of the LangChain repository? If it is, please comment on this issue to let us know. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you for your understanding and cooperation!
I'm attempting to run both demos linked today but am running into issues. I've already migrated my GPT4All model.
When I run the llama.cpp demo, all of my CPU cores are pegged at 100% for a minute or so, and then it just exits without an error code or output.
When I run the GPT4All demo I get the following error: …