This repository contains a set of Helm charts for deploying Large Language Models (LLMs) on Kubernetes. It is developed primarily for use as a set of pre-packaged applications within Azimuth but is structured such that the charts can, in principle, be deployed on any Kubernetes cluster with at least 1 GPU node.
The primary LLM chat app is provided as part of a standard Azimuth deployment, so no specific steps are required to use this app other than access to an up-to-date Azimuth deployment.
Alternatively, to set up the Helm repository and manually install the LLM chat interface chart on an existing Kubernetes cluster, run
helm repo add <chosen-repo-name> https://stackhpc.github.io/azimuth-llm/
helm repo update
helm install <installation-name> <chosen-repo-name>/azimuth-llm-chat
This will install the latest stable release of the application.
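To list the available chart versions, or to pin a particular one, the standard Helm options can be used (the repo, release and version names below are placeholders):

helm search repo <chosen-repo-name> --versions
helm install <installation-name> <chosen-repo-name>/azimuth-llm-chat --version <chart-version>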
Under the charts directory, there is a base azimuth-llm Helm chart which uses vLLM to deploy models from Huggingface. The azimuth-chat and azimuth-image-analysis are wrapper charts which add different Gradio web interfaces for interacting with the deployed LLM.
The charts/azimuth-llm/values.yaml file documents the various customisation options which are available. To access the LLM from outside the Kubernetes cluster, the API and/or UI service types may be changed to
api:
  service:
    type: LoadBalancer
    zenith:
      enabled: false
ui:
  service:
    type: LoadBalancer
    zenith:
      enabled: false
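Once the service type is changed, the assigned external IPs can be found with standard kubectl commands and the vLLM server's OpenAI-compatible endpoints (such as /v1/models) can then be queried directly; the namespace, IP and port below are placeholders:

kubectl -n <helm-release-namespace> get services
curl http://<api-external-ip>:<api-port>/v1/models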
Warning
Exposing the services in this way provides no authentication mechanism and anyone with access to the load balancer IPs will be able to query the language model. It is up to you to secure the running service as appropriate for your use case. In contrast, when deployed via Azimuth, authentication is provided via the standard Azimuth Identity Provider mechanisms and the authenticated services are exposed via Zenith.
Both the web-based interface and the backend OpenAI-compatible vLLM API server can also optionally be exposed using Kubernetes Ingress. See the ingress section in values.yaml for available config options.
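As an illustrative sketch only (the key names below are assumptions and should be checked against the ingress section of the values file), an Ingress-based setup might look something like:

ingress:
  host: llm.example.com
  api:
    enabled: true
  ui:
    enabled: true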
When deploying the chat or image-analysis wrapper charts, all configuration options must be nested under the azimuth-llm
heading (example) due to the way that Helm passes values between parent charts and sub-charts.
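For example, to apply the LoadBalancer override from above when installing the chat wrapper chart, the values would be written as (a minimal sketch, with all options moved under the azimuth-llm key):

azimuth-llm:
  api:
    service:
      type: LoadBalancer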
The application uses vLLM for model serving, so any of the vLLM supported models should work. Since vLLM pulls the model files directly from HuggingFace, some models which are not explicitly listed as supported are also likely to work, but mileage may vary between models and model architectures. If a model is incompatible with vLLM then the API pod will likely enter a CrashLoopBackOff state and any relevant error information will be found in the API pod logs. These logs can be viewed with
kubectl (-n <helm-release-namespace>) logs deploy/<helm-release-name>-api
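If the pod is stuck in this state, its status and recent events can also be inspected with (namespace and pod names are placeholders):

kubectl -n <helm-release-namespace> get pods
kubectl -n <helm-release-namespace> describe pod <api-pod-name>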
If you suspect that a given error is not caused by the upstream vLLM version and is instead a problem with this Helm chart then please open an issue.
The LLM chart integrates with kube-prometheus-stack by creating a ServiceMonitor resource and installing two custom Grafana dashboards as Kubernetes ConfigMaps. If the target cluster has an existing kube-prometheus-stack deployment which is appropriately configured to watch all namespaces for new Grafana dashboards, the LLM dashboards will automatically appear in Grafana's dashboard list.
To disable the monitoring integrations, set the api.monitoring.enabled value to false.
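For example, the following values snippet disables the monitoring integration for the base chart (when using the chat or image-analysis wrapper charts, remember to nest this under the azimuth-llm key as described above):

api:
  monitoring:
    enabled: false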
The Helm chart consists of the following components:
- A backend web API which runs vLLM's OpenAI compatible web server.
- A choice of frontend web-apps built using Gradio (see web-apps). Each web interface is available as a pre-built container image hosted on ghcr.io and can be configured for each Helm release by changing the ui.image section of the chart values (see the sketch below).
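As a rough sketch, assuming the conventional Helm image fields (the exact keys under ui.image are an assumption and should be checked against the chart's values file), the frontend image could be overridden with something like:

ui:
  image:
    repository: ghcr.io/<org>/<custom-web-app-image>
    tag: <image-tag>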