Template: aphrodite
Goal: deploy a quantized Large Language Model (GGUF, AWQ, etc.) on your local machine and expose it via an OpenAI-compatible API.
- Install the kalavai CLI
- Set up a kalavai cluster with 2 machines.
- Your machine should have:
- 1 NVIDIA GPU with at least 4GB vRAM
- 8 GB of RAM (configurable below)
- 4 CPUs (configurable below)
We wish to deploy Qwen/Qwen1.5-0.5B-Chat-AWQ on a single GPU. We will also request a KoboldAI GUI to be deployed alongside it so we can test the model in the browser.
- Create a values.yaml file and paste the following:
- name: deployment_name
  value: qwen-awq-1
  default: aphrodite-1
  description: "Name of the deployment job"
- name: storage
  value: "pool-cache"
  default: "pool-cache"
  description: "Pool storage to use to cache model weights"
- name: replicas
  value: "1"
  default: "1"
  description: "How many replicas to deploy for the model"
- name: num_workers
  value: "1"
  default: "1"
  description: "Workers per deployment (for tensor parallelism)"
- name: repo_id
  value: Qwen/Qwen1.5-0.5B-Chat-AWQ
  default: null
  description: "Hugging Face model id to load"
- name: model_filename
  value: "None"
  default: "None"
  description: "Specific model file to use (handy for quantized models such as GGUF)"
- name: hf_token
  value: <your token>
  default: null
  description: "Hugging Face token, required to load model weights"
- name: cpus
  value: "4"
  default: "4"
  description: "CPUs per single worker (total = cpus * num_workers)"
- name: gpus
  value: "1"
  default: "1"
  description: "GPUs per single worker (total = gpus * num_workers)"
- name: memory
  value: "8"
  default: "8"
  description: "RAM per single worker, in GB (total = memory * num_workers)"
- name: tensor_parallel_size
  value: "1"
  default: "1"
  description: "Tensor parallelism (use the number of GPUs per node)"
- name: pipeline_parallel_size
  value: "1"
  default: "1"
  description: "Pipeline parallelism (use the number of nodes)"
- name: shmem_size
  value: "4000000000"
  default: "4000000000"
  description: "Size of the shared memory volume"
- name: extra
  value: "--dtype float16 --enforce-eager --launch-kobold-api"
  default: ""
  description: "Extra parameters to pass to the Aphrodite engine server. See https://aphrodite.pygmalion.chat/"
- Deploy your aphrodite template:
kalavai job run aphrodite --values values.yaml
- Wait until it is ready; it may take a few minutes depending on your internet connection. Monitor the deployment until the status is Available:
$ kalavai job list
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Deployment ┃ Status ┃ Endpoint ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ qwen-awq-1 │ Available: All replicas are ready │ http://100.8.0.2:31947 │
└───────────────────┴───────────────────────────────────┴────────────────────────┘
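Once the status is Available, a quick sanity check is to list the models being served at the endpoint. Here is a minimal Python sketch, assuming the standard OpenAI-compatible /v1/models route (exposed by vLLM-based servers such as Aphrodite); the repo_id you configured should appear in the output:

from openai import OpenAI

# Point the client at the endpoint reported by `kalavai job list`.
client = OpenAI(api_key="EMPTY", base_url="http://100.8.0.2:31947/v1")

# The model id set as repo_id in values.yaml should be listed here.
for model in client.models.list():
    print(model.id)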
- Now you are ready to do inference with the model! Substitute the URL below with the endpoint indicated above:
curl http://100.8.0.2:31947/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen1.5-0.5B-Chat-AWQ",
        "prompt": "I would walk 500",
        "max_tokens": 50,
        "temperature": 0
    }'
- Alternatively, you can do inference in Python:
from openai import OpenAI

# Point the OpenAI client at the deployed OpenAI-compatible server.
openai_api_key = "EMPTY"
openai_api_base = "http://100.8.0.2:31947/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="Qwen/Qwen1.5-0.5B-Chat-AWQ",
    prompt="I would walk 500",
)
print("Completion result:", completion)
- A KoboldAI GUI should be available in your browser at http://100.8.0.2:31947.
If you want to inspect what's going on with the OpenAI-compatible server, you can access the full logs of the job (on each node) with:
kalavai job logs qwen-awq-1
Once you are done with your model, you can delete the deployment with:
kalavai job delete qwen-awq-1