[MODULE] Add module on inference #150
base: main

@@ -0,0 +1,36 @@

# Inference

Inference is the process of using a trained language model to generate predictions or responses. While inference might seem straightforward, deploying models efficiently at scale requires careful consideration of factors like performance, cost, and reliability. Large Language Models (LLMs) present unique challenges due to their size and computational requirements.

We'll explore both simple and production-ready approaches using the [`transformers`](https://huggingface.co/docs/transformers/index) library and [`text-generation-inference`](https://github.com/huggingface/text-generation-inference), two popular frameworks for LLM inference. For production deployments, we'll focus on Text Generation Inference (TGI), which provides optimized serving capabilities.

## Module Overview

LLM inference can be categorized into two main approaches: simple pipeline-based inference for development and testing, and optimized serving solutions for production deployments. We'll cover both, starting with the simpler pipeline approach and moving to production-ready solutions.

## Contents

### 1. [Basic Pipeline Inference](./pipeline_inference.md)

Learn how to use the Hugging Face Transformers pipeline for basic inference. We'll cover setting up pipelines, configuring generation parameters, and best practices for local development. The pipeline approach is ideal for prototyping and small-scale applications. [Start learning](./pipeline_inference.md).

### 2. [Production Inference with TGI](./tgi_inference.md)

Learn how to deploy models for production using Text Generation Inference. We'll explore optimized serving techniques, batching strategies, and monitoring solutions. TGI provides production-ready features like health checks, metrics, and Docker deployment options. [Start learning](./tgi_inference.md).

### Exercise Notebooks

| Title | Description | Exercise | Link | Colab |
|-------|-------------|----------|------|-------|
| Pipeline Inference | Basic inference with transformers pipeline | 🐢 Set up a basic pipeline <br> 🐕 Configure generation parameters <br> 🦁 Create a simple web server | [Link](./notebooks/basic_pipeline_inference.ipynb) | [Colab](https://githubtocolab.com/huggingface/smol-course/tree/main/7_inference/notebooks/basic_pipeline_inference.ipynb) |
| TGI Deployment | Production deployment with TGI | 🐢 Deploy a model with TGI <br> 🐕 Configure performance optimizations <br> 🦁 Set up monitoring and scaling | [Link](./notebooks/tgi_deployment.ipynb) | [Colab](https://githubtocolab.com/huggingface/smol-course/tree/main/7_inference/notebooks/tgi_deployment.ipynb) |

## Resources

- [Hugging Face Pipeline Tutorial](https://huggingface.co/docs/transformers/en/pipeline_tutorial)
- [Text Generation Inference Documentation](https://huggingface.co/docs/text-generation-inference/en/index)
- [Pipeline WebServer Guide](https://huggingface.co/docs/transformers/en/pipeline_tutorial#using-pipelines-for-a-webserver)
- [TGI GitHub Repository](https://github.com/huggingface/text-generation-inference)
- [Hugging Face Model Deployment Documentation](https://huggingface.co/docs/inference-endpoints/index)
- [vLLM: High-throughput LLM Serving](https://github.com/vllm-project/vllm)
- [Optimizing Transformer Inference](https://huggingface.co/blog/optimize-transformer-inference)

@@ -0,0 +1,174 @@

# Basic Inference with Transformers Pipeline

The `pipeline` abstraction in 🤗 Transformers provides a simple way to run inference with any model from the Hugging Face Hub. It handles all the preprocessing and postprocessing steps, making it easy to use models without deep knowledge of their architecture or requirements.

> **Review comment:** perhaps briefly mentioning something like llama-cpp or vllm would be fair too?
>
> **Review comment:** This reference goes over open source alternatives for model serving and inference including llama-cpp, llamafile and ollama: https://www.tamingllms.com/notebooks/local.html#tools-for-local-llm-deployment

## How Pipelines Work

Hugging Face pipelines streamline the machine learning workflow by automating three critical stages between raw input and human-readable output:

**Preprocessing Stage**
The pipeline first prepares your raw inputs for the model. This varies by input type:
- Text inputs undergo tokenization to convert words into model-friendly token IDs
- Images are resized and normalized to match model requirements
- Audio is processed through feature extraction to create spectrograms or other representations

**Model Inference**
During the forward pass, the pipeline:
- Handles batching of inputs automatically for efficient processing
- Places computation on the optimal device (CPU/GPU)
- Applies performance optimizations like half-precision (FP16) inference where supported

**Postprocessing Stage**
Finally, the pipeline converts raw model outputs into useful results:
- Decodes token IDs back into readable text
- Transforms logits into probability scores
- Formats outputs according to the specific task (e.g., classification labels, generated text)

This abstraction lets you focus on your application logic while the pipeline handles the technical complexity of model inference.
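
For text generation, those three stages correspond roughly to the manual workflow below. This is a simplified sketch of what the pipeline automates (the model choice and generation settings are illustrative), not its exact internal implementation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Preprocessing: text -> token IDs
inputs = tokenizer("Write a short poem about coding:", return_tensors="pt").to(model.device)

# Model inference: the forward passes that produce new token IDs
output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)

# Postprocessing: token IDs -> readable text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```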

## Basic Usage

Here's how to use a pipeline for text generation:

```python
from transformers import pipeline

# Create a pipeline with a specific model
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Generate text
response = generator(
    "Write a short poem about coding:",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7
)

print(response[0]['generated_text'])
```

> **Review comment** (on the model choice): perhaps quantized models would be more usable?
>
> **Review comment** (on lines +45 to +50, the generation call): isn't 100 tokens a bit small?
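
Since SmolLM2-1.7B-Instruct is an instruction-tuned model, recent versions of 🤗 Transformers also let the text-generation pipeline accept chat-style messages and apply the model's chat template automatically. A minimal sketch, assuming a `transformers` release with chat support in the pipeline and reusing the `generator` created above:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short poem about coding."},
]

response = generator(messages, max_new_tokens=100)

# For chat-style input, 'generated_text' holds the whole conversation;
# the last message is the assistant's reply
print(response[0]["generated_text"][-1]["content"])
```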

## Key Configuration Options

### Model Loading
```python
# CPU inference
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct", device="cpu")

# GPU inference (device 0)
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct", device=0)

# Automatic device placement
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    device_map="auto",
    torch_dtype="auto"
)
```

### Generation Parameters
```python
response = generator(
    "Translate this to French:",
    max_new_tokens=100,       # Maximum number of new tokens to generate
    do_sample=True,           # Use sampling instead of greedy decoding
    temperature=0.7,          # Control randomness (higher = more random)
    top_k=50,                 # Limit to top k tokens
    top_p=0.95,               # Nucleus sampling threshold
    num_return_sequences=1    # Number of different generations
)
```

## Processing Multiple Inputs

Pipelines can efficiently handle multiple inputs through batching:

```python
# Prepare multiple prompts
prompts = [
    "Write a haiku about programming:",
    "Explain what an API is:",
    "Write a short story about a robot:"
]

# Process all prompts efficiently
responses = generator(
    prompts,
    batch_size=4,           # Number of prompts to process together
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7
)

# Print results
for prompt, response in zip(prompts, responses):
    print(f"Prompt: {prompt}")
    print(f"Response: {response[0]['generated_text']}\n")
```

## Web Server Integration

Here's how to integrate a pipeline into a Flask application:

> **Review comment** (on the choice of Flask): why are you using Flask and not something more modern like FastAPI?

```python
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)

# Initialize pipeline globally
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    device_map="auto"
)

@app.route("/generate", methods=["POST"])
def generate_text():
    try:
        data = request.json
        prompt = data.get("prompt", "")

        # Input validation
        if not prompt:
            return jsonify({"error": "No prompt provided"}), 400

        # Generate response
        response = generator(
            prompt,
            max_new_tokens=100,
            do_sample=True,
            temperature=0.7
        )

        return jsonify({
            "generated_text": response[0]['generated_text']
        })

    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```
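
The same endpoint can also be written with FastAPI, as the review comment above suggests. This is a minimal sketch, assuming `fastapi` and `uvicorn` are installed; the route and request schema mirror the Flask example:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Initialize pipeline globally
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    device_map="auto"
)

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate_text(request: GenerateRequest):
    # Input validation
    if not request.prompt:
        raise HTTPException(status_code=400, detail="No prompt provided")

    response = generator(
        request.prompt,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7
    )
    return {"generated_text": response[0]["generated_text"]}

# Run with: uvicorn app:app --host 0.0.0.0 --port 5000
```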

## Limitations

While pipelines are great for prototyping and small-scale deployments, they have some limitations:

- Limited optimization options compared to dedicated serving solutions
- No built-in support for advanced features like dynamic batching
- May not be suitable for high-throughput production workloads

For production deployments with high throughput requirements, consider using Text Generation Inference (TGI) or other specialized serving solutions.

## Resources

- [Hugging Face Pipeline Tutorial](https://huggingface.co/docs/transformers/en/pipeline_tutorial)
- [Pipeline API Reference](https://huggingface.co/docs/transformers/en/main_classes/pipelines)
- [Text Generation Parameters](https://huggingface.co/docs/transformers/en/main_classes/text_generation)
- [Model Quantization Guide](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one)

@@ -0,0 +1,137 @@

# Text Generation Inference (TGI)

Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving Large Language Models (LLMs). It's designed to enable high-performance text generation for popular open-source LLMs. TGI is used in production by Hugging Chat, an open-source interface for open-access models.

> **Review comment:** I think it would be fair to mention other providers like vLLM and Ollama too, as being the same but different?

## Why Use Text Generation Inference?

Text Generation Inference addresses the key challenges of deploying large language models in production. While many frameworks excel at model development, TGI specifically optimizes for production deployment and scaling. Some key features include:

- **Tensor Parallelism**: TGI can split models across multiple GPUs through tensor parallelism, which is essential for serving larger models efficiently.
- **Continuous Batching**: The continuous batching system maximizes GPU utilization by dynamically processing requests, while optimizations like Flash Attention and Paged Attention significantly reduce memory usage and increase speed.
- **Token Streaming**: Real-time applications benefit from token streaming via Server-Sent Events, delivering responses with minimal latency.

## How to Use Text Generation Inference

TGI integrates with applications through a simple yet powerful REST API.

### Using the REST API

TGI exposes a RESTful API that accepts JSON payloads. This makes it accessible from any programming language or tool that can make HTTP requests. Here's a basic example using curl, assuming a TGI server is running locally on port 8080:

```bash
# Basic generation request
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```

### Using the `huggingface_hub` Python Client

The `huggingface_hub` Python client handles connection management, request formatting, and response parsing. Here's how to get started:

> **Review comment** (on the `base_url`): Perhaps it would be helpful to use working API URLs such that sample code runs. Plus explain that in HuggingFace one can run inference on Serverless Inference API, dedicated APIs etc.

```python
from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="http://localhost:8080/v1/",
)

output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)
```
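
If no local TGI endpoint is available, the same client can also target Hugging Face's hosted options, such as the Serverless Inference API or a dedicated Inference Endpoint. A minimal sketch, assuming the model is served there and a valid Hugging Face token is configured (for example via the `HF_TOKEN` environment variable):

```python
from huggingface_hub import InferenceClient

# Point the client at a model on the Hub instead of a local TGI server
client = InferenceClient(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

output = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    max_tokens=100,
)

print(output.choices[0].message.content)
```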

### Using the OpenAI API

TGI also exposes an OpenAI-compatible API, so you can use the OpenAI client (or any library that supports it) to interact with TGI.

```python
from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:8080/v1/",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

# iterate over the stream and print each token as it arrives
for message in chat_completion:
    print(message.choices[0].delta.content or "", end="")
```

## Preparing Models for TGI

To serve a model with TGI, ensure it meets these requirements:

1. **Supported Architecture**: Verify that your model architecture is supported (Llama, BLOOM, T5, etc.)

2. **Model Format**: Convert the weights to the safetensors format for faster loading:

```python
from safetensors.torch import save_file
from transformers import AutoModelForCausalLM

# Load the model and save its weights as a single safetensors file
model = AutoModelForCausalLM.from_pretrained("your-model")
state_dict = model.state_dict()
save_file(state_dict, "model.safetensors")
```

3. **Quantization** (optional): Quantize your model to reduce memory usage:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16"
)

model = AutoModelForCausalLM.from_pretrained(
    "your-model",
    quantization_config=quantization_config
)
```

## References

- [Text Generation Inference Documentation](https://huggingface.co/docs/text-generation-inference)
- [TGI GitHub Repository](https://github.com/huggingface/text-generation-inference)
- [Hugging Face Model Hub](https://huggingface.co/models)
- [TGI API Reference](https://huggingface.co/docs/text-generation-inference/api_reference)

> **Review comment:** should we consider showing and mentioning inference options again like adapters and structured generation?
>
> **Review comment:** I agree. It would be very helpful to include controlled inference generation, which is particularly crucial when integrating LLMs with downstream systems (e.g. grammars with llama.cpp, FSM with outlines, or logit processing techniques in general). Here's an example on how to control inference using a Smol model + the `LogitsProcessor` class from the Transformers library: https://www.tamingllms.com/notebooks/structured_output.html#logit-post-processing
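
For reference, the following is a minimal sketch of the logit post-processing idea mentioned above, using the `LogitsProcessor` API from 🤗 Transformers. The digit-only constraint and the model choice are illustrative assumptions, not taken from the linked example:

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class AllowedTokensProcessor(LogitsProcessor):
    """Push the logits of every token outside an allowed set to -inf."""
    def __init__(self, allowed_token_ids):
        self.allowed = torch.tensor(sorted(set(allowed_token_ids)))

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed.to(scores.device)] = 0.0
        return scores + mask

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Only allow digit tokens, plus EOS so generation can stop
allowed = [i for tok, i in tokenizer.get_vocab().items() if tok.lstrip("Ġ▁ ").isdigit()]
allowed.append(tokenizer.eos_token_id)

inputs = tokenizer("2 + 2 =", return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=5,
    logits_processor=LogitsProcessorList([AllowedTokensProcessor(allowed)]),
)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```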