How can I switch to local LLM engine #43

Open
oppokui opened this issue Sep 19, 2023 · 3 comments
oppokui commented Sep 19, 2023

In the Chat UI there is a long list of LLM models. The default is GPT-3.5 Turbo, which I guess is OpenAI.
I configured the OpenAI API key in .env, so it should be the one in use, since the answers come back very fast.

When I try to switch to Llama 7B, it reports:

An error occurred while generating text: Model llama-7b-GGML is currently booting.

I set up another LLM engine, vllm, based on the llama-2-7b-chat model and exposed it on port 3000; it is compatible with the OpenAI API.
How can I configure the app to use this new engine?

c0sogi (Owner) commented Sep 19, 2023

The model name facebook/opt-125m below is just for example purposes.

In ./app/models/llms.py, find the LLMModels class.
Then try adding this to the class members and reboot.

     my_model = OpenAIModel(
         name="facebook/opt-125m",
         max_total_tokens=4096,
         max_tokens_per_request=4096,
         token_margin=10,
         tokenizer=OpenAITokenizer("gpt-3.5-turbo"),
         api_url="http://localhost:3000/v1/chat/completions",
     )

As for the tokenizer: since tiktoken (OpenAITokenizer) is used, token counting will not be as accurate as on the vllm side, which uses the Llama tokenizer. However, if the vllm server handles token-limit-exceeded errors gracefully, you should be able to use it without any problems.
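To make the token_margin field concrete: a minimal sketch of the budgeting this kind of config implies, assuming the app computes the space left for a completion as the total context minus the prompt's counted tokens minus the margin (the function name remaining_tokens is hypothetical, chosen to mirror the debug log below).

```python
# Hypothetical sketch of token budgeting with a safety margin.
# remaining_tokens is an illustrative helper, not the app's actual code.
def remaining_tokens(max_total_tokens: int, used_tokens: int, token_margin: int) -> int:
    """Tokens left for the model's reply after reserving a safety margin."""
    return max(0, max_total_tokens - used_tokens - token_margin)

# For instance, with a 2048-token context and a margin of 8,
# a 512-token prompt leaves 1528 tokens for the completion.
print(remaining_tokens(2048, 512, 8))  # → 1528
```

Because tiktoken counts tokens differently from the Llama tokenizer, this computed remainder can overestimate the real capacity, which is why the server-side limit handling matters.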

oppokui (Author) commented Sep 19, 2023

It works! But I hit one problem when uploading a txt file for similarity search.

I use this script to start the vllm engine remotely:

pip install git+https://github.com/vllm-project/vllm.git
pip install fschat
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.5 \
  --dtype half \
  --host 0.0.0.0 \
  --port 3000

Then I extended llms.py (the max tokens can't be 4096 here):

    llama_2_7b_vllm = OpenAIModel(
        name="meta-llama/Llama-2-7b-chat-hf",
        max_total_tokens=2048,
        max_tokens_per_request=2048,
        token_margin=8,
        tokenizer=OpenAITokenizer("gpt-4"),
        api_url="http://ec2-18-211-48-230.compute-1.amazonaws.com:3000/v1/chat/completions",
        api_key=OPENAI_API_KEY,
    )
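Since vllm exposes an OpenAI-compatible API, the endpoint can also be smoke-tested outside the chat app. A minimal sketch, assuming the standard OpenAI chat completions request shape; the localhost URL is a placeholder for your own vllm host, and build_chat_payload / ask are hypothetical helper names.

```python
# Sketch: call a vLLM OpenAI-compatible endpoint directly (stdlib only).
# API_URL is a placeholder; point it at your own vllm server.
import json
from urllib import request

API_URL = "http://localhost:3000/v1/chat/completions"

def build_chat_payload(model: str, user_message: str, max_tokens: int = 256) -> dict:
    """Build a request body in the OpenAI chat completions format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

def ask(payload: dict) -> str:
    """POST the payload and return the assistant's reply text."""
    req = request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

payload = build_chat_payload("meta-llama/Llama-2-7b-chat-hf", "who are you?")
```

If ask(payload) returns a Llama-style self-description, the server side is healthy and any remaining issue is in the app's configuration.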

I can ask "who are you?" to the remote llama engine, and it responds as Llama.

[Screenshot from 2023-09-19 17-13-20]

But when I upload a txt file, it hangs. It works if I use gpt-3.5 or gpt-4.
The api container prints logs like:

api_1          | [2023-09-19 09:00:28,729] ApiLogger:CRITICAL - 🦙 Llama.cpp server is running
api_1          | INFO:     ('172.16.0.1', 59542) - "WebSocket /ws/chat/daad0289-fc57-4e88-ada1-82052b94-d334-485d-a975-d386a605efd8" [accepted]
api_1          | INFO:     connection open
api_1          | - DEBUG: Calling command: retry with 0 args and ['buffer'] kwargs
api_1          | - DEBUG: remaining_tokens: 1528
api_1          | - DEBUG: Sending messages: 
api_1          | [
api_1          |   {
api_1          |     "role": "user",
api_1          |     "content": "who are you?"
api_1          |   }
api_1          | ]
api_1          | - DEBUG: Sending functions: None
api_1          | - DEBUG: Sending function_call: None
api_1          | Loading tokenizer:  gpt-4
api_1          | INFO:     172.16.0.1:48092 - "GET /assets/assets/lotties/file-upload.json HTTP/1.1" 200 OK
api_1          | Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Max retries exceeded with url: /v1/engines/text-embedding-ada-002/embeddings (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fde88725850>, 'Connection to api.openai.com timed out. (connect timeout=600)')).

oppokui (Author) commented Sep 19, 2023

Oh, I realized the error is related to access to openai.com. I can't reach it from my local machine; let me retry on the AWS instance.
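The timeout in the log above comes from the embeddings step: even when chat goes to a local model, document upload still calls OpenAI's text-embedding-ada-002 endpoint, so api.openai.com must be reachable. A quick connectivity check can confirm this before retrying (can_reach is a hypothetical helper, not part of the project):

```python
# Sketch: verify the embeddings dependency (api.openai.com:443) is reachable
# before uploading documents. can_reach is an illustrative helper.
import socket

def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(can_reach("api.openai.com"))
```

A False here reproduces the langchain embed_with_retry timeout, independent of the chat model configuration.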
