Unable to run llama.cpp or GPT4All demos #2404
Comments
Try changing the …
I tried the following on Colab, but the last line never finishes. Does anyone have a clue?
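The Colab snippet itself didn't come through. As a hedged reconstruction only, the GPT4All demo from the LangChain docs of that period looked roughly like the following; the model path is a placeholder and the exact parameters are assumptions, not taken from this thread:

```python
from langchain import PromptTemplate, LLMChain
from langchain.llms import GPT4All

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Placeholder path; n_ctx/n_threads values are illustrative.
llm = GPT4All(model="./models/gpt4all-lora-quantized-ggml.bin", n_ctx=512, n_threads=8)
llm_chain = LLMChain(prompt=prompt, llm=llm)

# This final call is the "last line" reported as never finishing.
print(llm_chain.run("What NFL team won the Super Bowl in the year Justin Bieber was born?"))
```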
That got me a little further; now I'm getting the following output from the GPT4All model:
Hi @Zetaphor, are you referring to this Llama demo? I'm the author of the llama-cpp-python bindings that LangChain uses here. Can you give me an idea of what kind of processor you're running and the length of your prompt? Because llama.cpp runs inference on the CPU, it can take a while to process the initial prompt, and there are still some performance issues with certain CPU architectures. To rule that out, can you try running the same prompt through the examples in llama.cpp?
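The comment above suggests the llama.cpp example programs; an equivalent check from Python, calling the llama-cpp-python bindings directly and bypassing LangChain entirely, could look like this sketch (model path and prompt are placeholders):

```python
from llama_cpp import Llama

# Load the same GGML model that the LangChain wrapper would use.
llm = Llama(model_path="./models/ggml-model-q4_0.bin", n_ctx=512, n_threads=4)

# If this first call is also very slow, the bottleneck is CPU prompt
# processing in llama.cpp itself rather than anything in LangChain.
output = llm("Question: What is the meaning of life? Answer:", max_tokens=64)
print(output["choices"][0]["text"])
```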
Hey @abetlen, I'm able to run the 7B model on both my laptop and my server without issues. Here are the specs for both:
I'm trying to run the prompt provided in the demo code:
I have not yet run this exact prompt through llama.cpp, but I've been able to successfully run the chat-with-bob prompt on both my laptop and server. On the server, in addition to running the GPT4All model, I've also used the Vicuna 13B model.
On the suggestion of someone in Discord I'm able to get output using the llama.cpp model; it looks like the thing we needed to do here was to increase the token context to a much higher value. However, I am definitely seeing reduced performance compared to what I experience when just running inference through llama.cpp:

import os
from langchain.memory import ConversationTokenBufferMemory
from langchain.agents.tools import Tool
from langchain.chat_models import ChatOpenAI
from langchain.llms.base import LLM
from langchain.llms.llamacpp import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.agents import load_tools, initialize_agent, AgentExecutor, BaseSingleActionAgent, AgentType
# A much larger context window (n_ctx) than the default was needed before any output appeared
custom_llm = LlamaCpp(model_path="models/ggml-vicuna-13b-4bit.bin", verbose=True,
                      n_threads=4, n_ctx=5000, temperature=0.05, repeat_penalty=1.22)
tools = []
memories = {}
question = "What is the meaning of life?"
unique_id = os.urandom(16).hex()
if unique_id not in memories:
    memories[unique_id] = ConversationTokenBufferMemory(
        memory_key="chat_history", llm=custom_llm, return_messages=True)
memory = memories[unique_id]
agent = initialize_agent(tools, llm=custom_llm, agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
                         verbose=True, memory=memory, max_retries=1)
response = agent.run(input=question)
print(response)
Thanks for the reply; based on that I think it's related to this issue. I've opened a PR (#2411) to give you control over the batch size from LangChain (this got missed in the initial merge). For now, you can also enable verbose logging and update the batch size parameter. The verbose logs should also give you an idea of the per-token performance compared to running llama.cpp directly. EDIT: You'll need to …
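As a hedged sketch of what that configuration looks like once the batch-size parameter from #2411 is available (values are illustrative, not from this thread):

```python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="models/ggml-vicuna-13b-4bit.bin",
    n_ctx=2048,
    n_batch=512,   # batch size for prompt processing, exposed by the PR above
    verbose=True,  # surfaces llama.cpp's per-token timing output
)
```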
@hershkoy I've not seen that floating point exception before, but this is using a different library, so I suspect it might be a bug there.
Just FYI, the slowdown in performance is a bug; it's being investigated in ggml-org/llama.cpp#603. Inference should NOT slow down with increased context.
I'm successfully running a chain loaded with the LlamaCpp class (e.g. in https://gist.github.com/psychemedia/51f45fbfe160f78605bdd0c1b404e499), but not the GPT4All one. (MacBook Pro mid-2015, Intel.)
@abetlen I am new to this. I can try. Can you explain how to do it? |
@hershkoy absolutely, all you have to do is change the following two lines. First, update the LLM import near the top of your file:

from langchain.llms import LlamaCpp

and then, where you instantiate the class:

llm = LlamaCpp(model_path="gpt4all-converted.bin", n_ctx=2048)

Let me know if that still gives you an error on your system.
First, it seems that I was missing a print() around the final chain call, and now I get an output. When running the LlamaCpp class there is an output, and the program quits with no error. (GPT4ALL and LLAMACPP output logs omitted.)
@hershkoy Upgrading my dependencies to …
I had the same problem. Upgrading my version of Pydantic fixed it:
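Presumably that amounts to something like `pip install --upgrade pydantic`; the exact version requirement isn't stated in the comment.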
Same problem, but I've never gotten any output even if I used print(). Tried setting the context and everything, but still couldn't find a solution. Things I have tried: …
Yeah, I had this same problem yesterday with the llama.cpp model. It was stuck and never produced any output, but I did get an output if I used llama.cpp directly (i.e. not through LangChain). Strangely enough, when I tried running the same code today with LangChain, it worked just fine.
Leaving some traceback logs here from when I pressed Ctrl + C while stuck, in case it's any help.
Specifically for llama.cpp, I think #2404 (comment) points to the issue being in the callback manager. It's likely that, due to the async nature of the callback manager, the "main" program exits before the chain returns. To test this I put in a sleep loop, but it also seems that perhaps the callback manager isn't being used with … This shows that streaming should be used (the streaming property is True by default) and that the CallbackManager should also be in play. It would be worth stepping through this with a debugger.
You don't happen to have a minimal example at hand of using LlamaCpp and streaming the output, word for word?
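No minimal example survives in the thread, but a sketch of the streaming setup described above, assuming the stdout streaming callback handler from LangChain and a placeholder model path, might look like:

```python
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

# Tokens are pushed to stdout by the callback as soon as they are generated.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llm = LlamaCpp(
    model_path="gpt4all-converted.bin",  # placeholder path
    n_ctx=2048,
    callback_manager=callback_manager,
    streaming=True,   # default, shown here for clarity
    verbose=True,
)

llm("Question: What is the meaning of life? Answer:")
```

With this setup each token should be printed as it is generated, rather than only after the full completion returns.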
Hi, @Zetaphor! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale. From what I understand, you were experiencing issues running the llama.cpp and GPT4All demos. You mentioned that you tried changing the token context. Could you please let us know if this issue is still relevant to the latest version of the LangChain repository? If it is, please comment on this issue to let us know. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you for your understanding and cooperation!
I'm attempting to run both demos linked today but am running into issues. I've already migrated my GPT4All model.
When I run the llama.cpp demo, all of my CPU cores are pegged at 100% for a minute or so, and then it just exits without an error code or output.
When I run the GPT4All demo I get the following error: …