[MODULE] Add module on inference #150
base: main
Conversation
# Create a pipeline with a specific model
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
perhaps quantized models would be more usable?
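For context, a minimal sketch of what that swap might look like, assuming `bitsandbytes` is installed and a CUDA GPU is available (the 4-bit settings below are illustrative, not from the PR):

```python
import torch
from transformers import BitsAndBytesConfig, pipeline

# Illustrative 4-bit quantization config; the exact settings are assumptions.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# pipeline() forwards model_kwargs to from_pretrained, so the model is
# loaded with quantized weights transparently.
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    model_kwargs={"quantization_config": quant_config},
)
```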
response = generator(
    "Write a short poem about coding:",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7
)
isn't 100 tokens a bit small?
Here's how to integrate a pipeline into a Flask application:

```python
from flask import Flask, request, jsonify
```
why are you using Flask and not something more modern like FastAPI?
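For comparison, a rough FastAPI equivalent of the Flask example might look like this (the endpoint path and request schema are illustrative assumptions, not taken from the PR):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

class GenerationRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 100

@app.post("/generate")
def generate(request: GenerationRequest):
    # Run the pipeline and return only the generated text.
    output = generator(request.prompt, max_new_tokens=request.max_new_tokens)
    return {"generated_text": output[0]["generated_text"]}
```

FastAPI also brings request validation and automatic OpenAPI docs, which is part of the argument for preferring it over Flask here.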
@@ -0,0 +1,137 @@
# Text Generation Inference (TGI)

Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving Large Language Models (LLMs). It is designed to enable high-performance text generation for popular open-source LLMs, and it is used in production by Hugging Chat, an open-source interface for open-access models.
I think it would be fair to mention other providers like vLLM and Ollama too, as being the same but different?
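One way to show they are "the same but different": both TGI and vLLM can expose an OpenAI-compatible API, so the same client code can target either server. A sketch, assuming you have started one of the servers locally yourself (the port is vLLM's default; adjust it for however TGI was launched):

```python
from openai import OpenAI

# Point the client at a locally running server: vLLM's OpenAI-compatible
# server defaults to port 8000, and TGI serves /v1/chat/completions on
# whatever port it was started with. The api_key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    messages=[{"role": "user", "content": "Write a short poem about coding."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```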
@@ -0,0 +1,174 @@
# Basic Inference with Transformers Pipeline

The `pipeline` abstraction in 🤗 Transformers provides a simple way to run inference with any model from the Hugging Face Hub. It handles all the preprocessing and postprocessing steps, making it easy to use models without deep knowledge of their architecture or requirements.
perhaps briefly mentioning something like llama-cpp or vllm would be fair too?
This reference goes over open source alternatives for model serving and inference including llama-cpp, llamafile and ollama:
https://www.tamingllms.com/notebooks/local.html#tools-for-local-llm-deployment
LLM inference can be categorized into two main approaches: simple pipeline-based inference for development and testing, and optimized serving solutions for production deployments. We'll cover both approaches, starting with the simpler pipeline approach and moving to production-ready solutions.

## Contents
should we consider showing and mentioning inference options again like adapters and structured generation?
I agree. It would be very helpful to include controlled generation, which is particularly crucial when integrating LLMs with downstream systems (e.g. grammars with llama.cpp, FSM with outlines, or logit processing techniques in general).
Here's an example of how to control inference using a Smol model + the LogitsProcessor class from the Transformers library:
https://www.tamingllms.com/notebooks/structured_output.html#logit-post-processing
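As a self-contained illustration of the logit post-processing idea (not code from the linked reference), here is a custom `LogitsProcessor` that masks out a set of token ids during generation; the banned-word list is purely illustrative:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)

class BlockTokensProcessor(LogitsProcessor):
    """Sets the scores of banned token ids to -inf so they are never sampled."""

    def __init__(self, banned_token_ids):
        self.banned_token_ids = banned_token_ids

    def __call__(self, input_ids, scores):
        scores[:, self.banned_token_ids] = float("-inf")
        return scores

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Illustrative: ban whatever ids these strings map to in this tokenizer.
banned_ids = tokenizer.convert_tokens_to_ids(["bad", "awful"])
inputs = tokenizer("Write a short review of this course:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    logits_processor=LogitsProcessorList([BlockTokensProcessor(banned_ids)]),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Grammar-constrained decoding with llama.cpp or FSM-based libraries like outlines push the same idea further by masking everything that would break a target output format.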
from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="http://localhost:8080/v1/",
Perhaps it would be helpful to use working API URLs so that the sample code runs. Plus, explain that on Hugging Face one can run inference via the Serverless Inference API, dedicated endpoints, etc.
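For instance, a variant that runs without any local server by calling the Serverless Inference API (this assumes a Hugging Face token is available via the HF_TOKEN environment variable or the `token=` argument):

```python
from huggingface_hub import InferenceClient

# No base_url: the client resolves the model on the Serverless Inference API.
client = InferenceClient(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

response = client.chat_completion(
    messages=[{"role": "user", "content": "Write a short poem about coding."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

The same client also works against a dedicated endpoint or a local TGI server by passing `base_url` instead of `model`.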
This is a module on inference techniques like pipeline and TGI.