A FastAPI-based REST API for serving GGUF language models, with built-in rate limiting and GPU acceleration. While developed and tested with the Mistral Dolphin 2.0 model, this API is designed to work with any GGUF model.
The main idea behind creating this API is to make our game project easier to work with in the end.
- 🚀 GPU-accelerated inference using llama.cpp
- ⚡ Fast response times with optimized model loading
- 🔄 Rate limiting for API protection
- 🛡️ CORS support for frontend integration
- 📝 Clean JSON responses for easy frontend consumption
- 🔧 Configurable model parameters
- 🎯 Modular design for easy model swapping
- Python 3.10
- CUDA-capable GPU (required)
- CUDA 11.8+ and appropriate drivers
- 8GB+ RAM (16GB+ recommended)
- Clone the repository:

```bash
git clone https://github.com/FelixSoderstrom/mistral-api.git
cd mistral-api
```
- Install dependencies:

```bash
pip install -r requirements.txt
```
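Note: for GPU offload to work, llama-cpp-python generally has to be built with CUDA support (for example by setting `CMAKE_ARGS="-DGGML_CUDA=on"` before installing it; older releases used `-DLLAMA_CUBLAS=on`). Check the llama-cpp-python documentation for the flag that matches the version pinned in `requirements.txt`.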
- Place your GGUF model in the `models` directory:

```bash
mkdir -p models
# Place your .gguf model file here
```

Make sure you download the correct model and put it in the `models` folder. Model used: TheBloke's Mistral Dolphin 2.0 7B, `Q5_K_S.gguf` quantization.
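If you prefer to fetch the model programmatically, the snippet below is one way to do it with the `huggingface_hub` package. The repository and file names are assumptions based on TheBloke's GGUF uploads, so verify them against the model card before running.

```python
# Example only (not part of this repo): download the quantized model into ./models.
# The repo_id and filename below are assumptions -- check the Hugging Face model card.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="TheBloke/dolphin-2.0-mistral-7B-GGUF",  # assumed repository id
    filename="dolphin-2.0-mistral-7b.Q5_K_S.gguf",   # assumed file name
    local_dir="models",
)
```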
The API can be configured through environment variables or the `config.py` file. Key configurations include:
```
# Model Configuration
MODEL_PATH: Path to your GGUF model
MAX_NEW_TOKENS: Maximum tokens to generate (default: 2048)
DEFAULT_MAX_TOKENS: Default tokens if not specified (default: 512)
TEMPERATURE: Generation temperature (default: 0.7)
TOP_P: Top-p sampling (default: 0.95)

# GPU Configuration
N_GPU_LAYERS: Number of layers to offload to GPU (default: 35)
N_BATCH: Batch size for prompt processing (default: 512)
N_THREADS: Number of CPU threads (default: 8)

# API Configuration
RATE_LIMIT_CALLS: Number of allowed calls per time window (default: 10)
RATE_LIMIT_SECONDS: Time window for rate limiting in seconds (default: 60)
```
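To make these settings concrete, here is a minimal sketch of how they might be read from the environment and passed to llama-cpp-python's `Llama` class. Only the setting names come from this project; the actual `config.py` and model-loading code may be organized differently, so treat this as illustrative.

```python
# Illustrative sketch only -- the real config.py / main.py may differ.
import os

from llama_cpp import Llama

MODEL_PATH = os.getenv("MODEL_PATH", "models/model.gguf")
N_GPU_LAYERS = int(os.getenv("N_GPU_LAYERS", "35"))
N_BATCH = int(os.getenv("N_BATCH", "512"))
N_THREADS = int(os.getenv("N_THREADS", "8"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
TOP_P = float(os.getenv("TOP_P", "0.95"))
DEFAULT_MAX_TOKENS = int(os.getenv("DEFAULT_MAX_TOKENS", "512"))

# Load the model once at startup; N_GPU_LAYERS layers are offloaded to the GPU.
llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=N_GPU_LAYERS,
    n_batch=N_BATCH,
    n_threads=N_THREADS,
)

def generate(prompt: str, max_tokens: int = DEFAULT_MAX_TOKENS) -> str:
    """Run one completion and return only the generated text."""
    result = llm(prompt, max_tokens=max_tokens, temperature=TEMPERATURE, top_p=TOP_P)
    return result["choices"][0]["text"]
```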
- Start the API:

```bash
python main.py
```
- The API will be available at `http://localhost:8000`
`GET /health`

Health check endpoint.

`POST /generate`

Generates text from a prompt. Send a JSON body (`Content-Type: application/json`):

```json
{
  "prompt": "Once upon a time",
  "max_tokens": 200
}
```

Response:

```json
{
  "text": "Once upon a time in a small village..."
}
```
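For a quick test from Python, a client call might look like this (assuming the `requests` package is installed and the API is running on the default port):

```python
# Minimal client example; adjust the base URL if you changed host or port.
import requests

BASE_URL = "http://localhost:8000"

# Health check
print(requests.get(f"{BASE_URL}/health").status_code)

# Text generation
resp = requests.post(
    f"{BASE_URL}/generate",
    json={"prompt": "Once upon a time", "max_tokens": 200},
    timeout=120,  # generation can take a while for long outputs
)
resp.raise_for_status()
print(resp.json()["text"])
```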
Build and run with Docker:

```bash
docker build -t mistral-api .
docker run -p 8000:8000 --gpus all mistral-api
```
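Note that `--gpus all` requires the NVIDIA Container Toolkit on the host; without it the container will not see the GPU.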
While this API has been tested primarily with the Mistral Dolphin 2.0 model, it is designed to work with any GGUF-format model. Note that the prompting is currently tuned for the model we use, so other models may need prompt adjustments. To use a different model:
- Place your .gguf model in the `models` directory
- Update `MODEL_PATH` in `config.py` or set it via environment variable
- Adjust model parameters as needed for your specific model
- GPU acceleration is required in this version (NO CPU MODE)
- Model loading time depends on the model size and GPU memory
- Response generation speed depends on the requested token count and model configuration
Free to use and modify.
Contributions are welcome! Feel free to submit a pull request or send a message to Bjorn or Felix.