
[MODULE] Add module on inference #150

Draft · wants to merge 1 commit into main

Conversation

burtenshaw
Collaborator

This is a module on inference techniques like pipeline and TGI.

@burtenshaw marked this pull request as draft on December 30, 2024 at 06:28
```python
from transformers import pipeline

# Create a pipeline with a specific model
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
)
```

perhaps quantized models would be more usable?
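For what that suggestion might look like in practice, here is a minimal sketch (not part of the PR) that loads the same model in 4-bit via bitsandbytes before wrapping it in a pipeline; the model id is taken from the excerpt above, everything else is an assumption:

```python
# Illustrative only: 4-bit quantized loading with bitsandbytes, then reuse via pipeline.
# Requires the `bitsandbytes` and `accelerate` packages and a CUDA-capable GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
quant_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
```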

Comment on lines +45 to +50
```python
response = generator(
    "Write a short poem about coding:",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7
)
```


isn't 100 tokens a bit small?

```

### Generation Parameters
```python


Suggested change
```python
```python

Here's how to integrate a pipeline into a Flask application:

```python
from flask import Flask, request, jsonify
```
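The excerpt above is truncated to the import line; a minimal sketch of what the rest of such a Flask integration could look like (route name, payload shape, and generation settings are assumptions, not taken from the PR):

```python
# Hypothetical completion of the truncated Flask excerpt above.
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.json.get("prompt", "")
    outputs = generator(prompt, max_new_tokens=100, do_sample=True, temperature=0.7)
    return jsonify({"generated_text": outputs[0]["generated_text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```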


why are you using Flask and not something more modern like FastAPI?
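For comparison, a rough FastAPI counterpart to the hypothetical Flask sketch above, in the spirit of this comment (again an illustration, not code from the PR):

```python
# A rough FastAPI equivalent of the Flask sketch; run with `uvicorn app:app` if saved as app.py.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: GenerateRequest):
    outputs = generator(req.prompt, max_new_tokens=100, do_sample=True, temperature=0.7)
    return {"generated_text": outputs[0]["generated_text"]}
```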

@@ -0,0 +1,137 @@
# Text Generation Inference (TGI)

Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving Large Language Models (LLMs). It's designed to enable high-performance text generation for popular open-source LLMs. TGI is used in production by Hugging Chat, an open-source interface for open-access models.


I think it would be fair to mention other providers like vLLM and Ollama too, as being the same but different?

@@ -0,0 +1,174 @@
# Basic Inference with Transformers Pipeline

The `pipeline` abstraction in 🤗 Transformers provides a simple way to run inference with any model from the Hugging Face Hub. It handles all the preprocessing and postprocessing steps, making it easy to use models without deep knowledge of their architecture or requirements.
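As a minimal illustration of that idea (not from the PR): if no model is specified, the task's default model is downloaded automatically, and tokenization and decoding happen behind the scenes.

```python
# Smallest possible pipeline usage: no model specified, so the task's default is used.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("The pipeline API makes inference straightforward."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```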


perhaps briefly mentioning something like llama-cpp or vllm would be fair too?

Contributor


This reference goes over open-source alternatives for model serving and inference, including llama-cpp, llamafile, and ollama:

https://www.tamingllms.com/notebooks/local.html#tools-for-local-llm-deployment


LLM inference can be categorized into two main approaches: simple pipeline-based inference for development and testing, and optimized serving solutions for production deployments. We'll cover both approaches, starting with the simpler pipeline approach and moving to production-ready solutions.

## Contents


should we consider showing and mentioning inference options again like adapters and structured generation?

Contributor

@souzatharsis, Jan 5, 2025


I agree. It would be very helpful to include controlled inference generation, which is particularly crucial when integrating LLMs with downstream systems (e.g. grammars with llama.cpp, FSMs with outlines, or logit-processing techniques in general).

Here's an example of how to control inference using a Smol model and the LogitsProcessor class from the Transformers library:

https://www.tamingllms.com/notebooks/structured_output.html#logit-post-processing
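For readers who don't follow the link, a very small sketch of the logit post-processing idea: a custom LogitsProcessor that masks one token (here EOS, so generation cannot stop early). The model id matches the one used elsewhere in this PR; everything else is assumed, and the linked notebook covers the full structured-output approach.

```python
# Illustrative logit post-processing with a custom LogitsProcessor (not code from the PR).
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessor, LogitsProcessorList

class BlockTokenProcessor(LogitsProcessor):
    """Mask one token id so it can never be sampled."""
    def __init__(self, blocked_token_id: int):
        self.blocked_token_id = blocked_token_id

    def __call__(self, input_ids, scores):
        scores[:, self.blocked_token_id] = float("-inf")
        return scores

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Write a short poem about coding:", return_tensors="pt")
processors = LogitsProcessorList([BlockTokenProcessor(tokenizer.eos_token_id)])
outputs = model.generate(**inputs, max_new_tokens=50, logits_processor=processors)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```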

```python
from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="http://localhost:8080/v1/",
)
```
Contributor


Perhaps it would be helpful to use working API URLs so that the sample code runs. It would also be worth explaining that on Hugging Face one can run inference via the Serverless Inference API, dedicated endpoints, etc.
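One way to address this (a sketch, assuming the serverless Inference API and the model id used elsewhere in the PR; it requires being logged in via `huggingface-cli login` or having `HF_TOKEN` set):

```python
# Pointing InferenceClient at the Hugging Face serverless Inference API instead of a
# local TGI server; authentication comes from a saved token or the HF_TOKEN env var.
from huggingface_hub import InferenceClient

client = InferenceClient(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
completion = client.chat_completion(
    messages=[{"role": "user", "content": "Write a short poem about coding."}],
    max_tokens=100,
)
print(completion.choices[0].message.content)
```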
