A FastAPI-based REST API for serving GGUF language models, with built-in rate limiting and GPU acceleration. While developed and tested with the Mistral Dolphin 2.0 model, this API is designed to work with any GGUF model.
The main idea behind creating this API is to make our game project easier to work with in the end.
- 🚀 GPU-accelerated inference using llama.cpp
- ⚡ Fast response times with optimized model loading
- 🔄 Rate limiting for API protection
- 🛡️ CORS support for frontend integration
- 📝 Clean JSON responses for easy frontend consumption
- 🔧 Configurable model parameters
- 🎯 Modular design for easy model swapping
- Python 3.10
- CUDA-capable GPU (required)
- CUDA 11.8+ and appropriate drivers
- 8GB+ RAM (16GB+ recommended)
- Clone the repository:

```bash
git clone https://github.com/FelixSoderstrom/mistral-api.git
cd mistral-api
```
- Install dependencies:

```bash
pip install -r requirements.txt
```
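Note: for GPU offload to work, llama-cpp-python generally has to be built with CUDA support (for example by setting `CMAKE_ARGS="-DGGML_CUDA=on"` before installing it; older releases used `-DLLAMA_CUBLAS=on`). Check the llama-cpp-python documentation for the flag that matches the version pinned in `requirements.txt`.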
- Place your GGUF model in the `models` directory:

```bash
mkdir -p models
# Place your .gguf model file here
```

Make sure you download the correct model and put it in the `models` folder. Model used: TheBloke's Mistral Dolphin 2.0 7B, `Q5_K_S.gguf` quantization.
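If you prefer to fetch the model programmatically, the snippet below is one way to do it with the `huggingface_hub` package. The repository and file names are assumptions based on TheBloke's GGUF uploads, so verify them against the model card before running.

```python
# Example only (not part of this repo): download the quantized model into ./models.
# The repo_id and filename below are assumptions -- check the Hugging Face model card.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="TheBloke/dolphin-2.0-mistral-7B-GGUF",  # assumed repository id
    filename="dolphin-2.0-mistral-7b.Q5_K_S.gguf",   # assumed file name
    local_dir="models",
)
```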
The API can be configured through environment variables or the `config.py` file. Key configurations include:
```
# Model Configuration
MODEL_PATH: Path to your GGUF model
MAX_NEW_TOKENS: Maximum tokens to generate (default: 2048)
DEFAULT_MAX_TOKENS: Default tokens if not specified (default: 512)
TEMPERATURE: Generation temperature (default: 0.7)
TOP_P: Top-p sampling (default: 0.95)

# GPU Configuration
N_GPU_LAYERS: Number of layers to offload to GPU (default: 35)
N_BATCH: Batch size for prompt processing (default: 512)
N_THREADS: Number of CPU threads (default: 8)

# API Configuration
RATE_LIMIT_CALLS: Number of allowed calls per time window (default: 10)
RATE_LIMIT_SECONDS: Time window for rate limiting in seconds (default: 60)
```
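To make these settings concrete, here is a minimal sketch of how they might be read from the environment and passed to llama-cpp-python's `Llama` class. Only the setting names come from this project; the actual `config.py` and model-loading code may be organized differently, so treat this as illustrative.

```python
# Illustrative sketch only -- the real config.py / main.py may differ.
import os

from llama_cpp import Llama

MODEL_PATH = os.getenv("MODEL_PATH", "models/model.gguf")
N_GPU_LAYERS = int(os.getenv("N_GPU_LAYERS", "35"))
N_BATCH = int(os.getenv("N_BATCH", "512"))
N_THREADS = int(os.getenv("N_THREADS", "8"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
TOP_P = float(os.getenv("TOP_P", "0.95"))
DEFAULT_MAX_TOKENS = int(os.getenv("DEFAULT_MAX_TOKENS", "512"))

# Load the model once at startup; N_GPU_LAYERS layers are offloaded to the GPU.
llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=N_GPU_LAYERS,
    n_batch=N_BATCH,
    n_threads=N_THREADS,
)

def generate(prompt: str, max_tokens: int = DEFAULT_MAX_TOKENS) -> str:
    """Run one completion and return only the generated text."""
    result = llm(prompt, max_tokens=max_tokens, temperature=TEMPERATURE, top_p=TOP_P)
    return result["choices"][0]["text"]
```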
- Start the API:

```bash
python main.py
```
- The API will be available at `http://localhost:8000`
`GET /health`

Health check endpoint.

`POST /generate`

Generates text from a prompt. Send a JSON body (`Content-Type: application/json`):

```json
{
  "prompt": "Once upon a time",
  "max_tokens": 200
}
```

Response:

```json
{
  "text": "Once upon a time in a small village..."
}
```
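For a quick test from Python, a client call might look like this (assuming the `requests` package is installed and the API is running on the default port):

```python
# Minimal client example; adjust the base URL if you changed host or port.
import requests

BASE_URL = "http://localhost:8000"

# Health check
print(requests.get(f"{BASE_URL}/health").status_code)

# Text generation
resp = requests.post(
    f"{BASE_URL}/generate",
    json={"prompt": "Once upon a time", "max_tokens": 200},
    timeout=120,  # generation can take a while for long outputs
)
resp.raise_for_status()
print(resp.json()["text"])
```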
Build and run with Docker:

```bash
docker build -t mistral-api .
docker run -p 8000:8000 --gpus all mistral-api
```
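Note that `--gpus all` requires the NVIDIA Container Toolkit on the host; without it the container will not see the GPU.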
While this API has been tested primarily with the Mistral Dolphin 2.0 model, it is designed to work with any GGUF-format model. Note that the prompting is currently tuned for the model we use, so other models may need prompt adjustments. To use a different model:
- Place your .gguf model in the `models` directory
- Update `MODEL_PATH` in `config.py` or set it via environment variable
- Adjust model parameters as needed for your specific model
- GPU acceleration is required in this version (NO CPU MODE)
- Model loading time depends on the model size and GPU memory
- Response generation speed depends on the requested token count and model configuration
Free to use and modify.
Contributions are welcome! Feel free to submit a pull request or send a message to Bjorn or Felix.