Llm docs 2 #1313

Merged
merged 21 commits into from
Oct 11, 2023
56 changes: 33 additions & 23 deletions docs/llms/text-generation-pipeline.md
@@ -16,14 +16,14 @@ limitations under the License.

# **Text Generation Pipelines**

This user guide describes how to run inference of text generation models with DeepSparse.
This user guide explains how to run inference of text generation models with DeepSparse.

## **Installation**

DeepSparse support for LLMs is currently available on DeepSparse's nightly build on PyPI:
DeepSparse support for LLMs is available on DeepSparse's nightly build on PyPI:

```bash
pip install -U deepsparse-nightly==1.6.0.20231007[transformers]
pip install -U deepsparse-nightly[transformers]==1.6.0.20231007
```
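
To confirm which build is installed after running the command above, a minimal sketch using only the Python standard library follows. The helper name `installed_version` and the constant `EXPECTED` are illustrative assumptions, not part of DeepSparse.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# The guide pins this nightly build; adjust if you install a different one.
EXPECTED = "1.6.0.20231007"
found = installed_version("deepsparse-nightly")
if found is None:
    print("deepsparse-nightly is not installed")
elif found != EXPECTED:
    print(f"found {found}, but this guide was written against {EXPECTED}")
else:
    print(f"deepsparse-nightly {found} is installed")
```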

#### **System Requirements**
@@ -41,8 +41,8 @@ DeepSparse exposes a Pipeline interface called `TextGeneration`, which is used t
from deepsparse import TextGeneration

# construct a pipeline
MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
pipeline = TextGeneration(model_path=MODEL_PATH)
model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
pipeline = TextGeneration(model=model_path)

# generate text
prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: What is Kubernetes? ### Response:"
@@ -52,27 +52,29 @@ print(output.generations[0].text)
# >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
```

> **Note:** The 7B model takes about 2 minutes to compile. Set `MODEL_PATH` to `hf:mgoin/TinyStories-33M-quant-deepsparse` to use a small TinyStories model for quick compilation if you are just experimenting.
> **Note:** The 7B model takes about 2 minutes to compile. Set `model_path = "hf:mgoin/TinyStories-33M-quant-deepsparse"` to use a small TinyStories model for quick compilation if you are just experimenting.

## **Model Format**

DeepSparse accepts models in ONNX format, passed either as SparseZoo stubs or local directories.

> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***Currently, we suggest only using LLM ONNX graphs from SparseZoo.***
> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***Currently, we suggest only using LLM ONNX graphs created by Neural Magic.***
>
### **SparseZoo Stubs**

SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none` identifies a 50% pruned-quantized MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifies a 50% pruned-quantized pretrained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.

```python
model_path = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
pipeline = TextGeneration(model_path=model_path)
model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
pipeline = TextGeneration(model=model_path)
```

### **Local Deployment Directory**

Additionally, we can pass a local path to a deployment directory. Use the SparseZoo API to download an example deployment directory:
```python
import sparsezoo
sz_model = sparsezoo.Model("zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none", "./local-model")
from sparsezoo import Model
sz_model = Model("zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized", "./local-model")
sz_model.deployment.download()
```

@@ -84,8 +86,16 @@ ls ./local-model/deployment

We can pass the local directory path to `TextGeneration`:
```python
model_path = "./local-model/deployment"
pipeline = TextGeneration(model_path=model_path)
from deepsparse import TextGeneration
pipeline = TextGeneration(model="./local-model/deployment")
```

### **Hugging Face Models**
Hugging Face models which conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to a model id. The following runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant).

```python
from deepsparse import TextGeneration
pipeline = TextGeneration(model="hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant")
```
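
The three kinds of model identifier described in this section (SparseZoo stub, local deployment directory, Hugging Face id) can be told apart by their prefix. The sketch below illustrates that convention only; `describe_model_source` is a hypothetical helper written for this example and does not reflect how DeepSparse resolves models internally.

```python
def describe_model_source(model: str) -> str:
    """Classify a model identifier by the prefixes used in this guide.

    Hypothetical helper for illustration only; it is not a DeepSparse API.
    """
    if model.startswith("zoo:"):
        return "sparsezoo_stub"
    if model.startswith("hf:"):
        return "hugging_face_id"
    # Anything without a known prefix is treated as a local path here.
    return "local_directory"


for identifier in (
    "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized",
    "hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant",
    "./local-model/deployment",
):
    print(f"{identifier} -> {describe_model_source(identifier)}")
```

In each case the same string is what gets passed as the `model` argument of `TextGeneration`.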

## **Input and Output Formats**
@@ -96,8 +106,7 @@ The following examples use a quantized 33M parameter TinyStories model for quick
```python
from deepsparse import TextGeneration

MODEL_PATH = "hf:mgoin/TinyStories-33M-quant-deepsparse"
pipeline = TextGeneration(model_path=MODEL_PATH)
pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
```

### Input Format
@@ -112,13 +121,14 @@ for prompt_i, generation_i in zip(output.prompts, output.generations):
print(f"{prompt_i}{generation_i.text}")

# >> Princess Peach jumped from the balcony and landed on the ground. She was so happy that she had found her treasure. She thanked the old

# >> Mario ran into the castle and started to explore. He ran around the castle and climbed on the throne. He even tried to climb
```

- `streaming`: Boolean determining whether to stream response. If True, then the results are returned as a generator object which yields the results as they are generated.

```python
prompt = "Princess peach jumped from the balcony"
prompt = "Princess Peach jumped from the balcony"
output_iterator = pipeline(prompt=prompt, streaming=True, max_new_tokens=20)

print(prompt, end="")
@@ -172,8 +182,8 @@ The following examples use a quantized 33M parameter TinyStories model for quick
```python
from deepsparse import TextGeneration

MODEL_PATH = "hf:mgoin/TinyStories-33M-quant-deepsparse"
pipeline = TextGeneration(model_path=MODEL_PATH)
model_id = "hf:mgoin/TinyStories-33M-quant-deepsparse"
pipeline = TextGeneration(model=model_id)
```

### **Creating A `GenerationConfig`**
@@ -213,7 +223,7 @@ We can pass a `GenerationConfig` to `TextGeneration.__init__` or `TextGeneration

```python
# set generation_config during __init__
pipeline_w_gen_config = TextGeneration(model_path=MODEL_PATH, generation_config={"max_new_tokens": 10})
pipeline_w_gen_config = TextGeneration(model=model_id, generation_config={"max_new_tokens": 10})

# generation_config is the default during __call__
output = pipeline_w_gen_config(prompt=prompt)
@@ -225,7 +235,7 @@ print(f"{prompt}{output.generations[0].text}")

```python
# no generation_config set during __init__
pipeline_w_no_gen_config = TextGeneration(model_path=MODEL_PATH)
pipeline_w_no_gen_config = TextGeneration(model=model_id)

# generation_config is passed during __call__
output = pipeline_w_no_gen_config(prompt=prompt, generation_config={"max_new_tokens": 10})
@@ -295,7 +305,7 @@ import numpy
# only 20 logits are not set to -inf == only 20 logits used to sample token
output = pipeline(prompt=prompt, do_sample=True, top_k=20, max_new_tokens=15, output_scores=True)
print(numpy.isfinite(output.generations[0].score).sum(axis=1))
# >> array([20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20])
# >> [20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
```

- `top_p`: Float to define the tokens that are considered with nucleus sampling. If `0.0`, `top_p` is turned off. Default is `0.0`
@@ -306,7 +316,7 @@ import numpy
output = pipeline(prompt=prompt, do_sample=True, top_p=0.9, max_new_tokens=15, output_scores=True)
print(numpy.isfinite(output.generations[0].score).sum(axis=1))

# >> array([20, 15, 10, 5, 25, 3, 10, 7, 6, 6, 15, 12, 11, 3, 4, 4])
# >> [ 5 119 18 14 204 6 7 367 191 20 12 7 46 6 2 35]
```
- `repetition_penalty`: The more often a token has already been generated, the more it is penalized in subsequent generation passes. If `0.0`, `repetition_penalty` is turned off. Default is `0.0`
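
The `top_k` and `top_p` examples above count how many scores remain finite at each step: always 20 for `top_k=20`, but a varying number for `top_p=0.9`. The following pure-Python sketch shows why, under the common definitions of the two filters. It is an illustration, not DeepSparse's implementation; the function names are hypothetical, and real libraries may differ in how they treat the token that crosses the `top_p` boundary.

```python
import math


def top_k_filter(logits, k):
    """Keep the k largest logits and set all others to -inf."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


def top_p_filter(logits, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    if p <= 0.0 or p >= 1.0:
        return list(logits)  # 0.0 means the filter is turned off
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


def count_finite(scores):
    """Number of tokens that can still be sampled."""
    return sum(1 for x in scores if math.isfinite(x))


peaked = [10.0, 0.0, 0.0, 0.0]  # one token dominates
flat = [0.0, 0.0, 0.0, 0.0]     # all tokens equally likely

print(count_finite(top_k_filter(peaked, 2)), count_finite(top_k_filter(flat, 2)))
# >> 2 2
print(count_finite(top_p_filter(peaked, 0.9)), count_finite(top_p_filter(flat, 0.9)))
# >> 1 4
```

With `top_k` the number of candidates is fixed, while with `top_p` it depends on how confident the model is at each step.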

77 changes: 40 additions & 37 deletions research/mpt/README.md
@@ -1,32 +1,37 @@
# **Sparse Finetuned LLMs with DeepSparse**

DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.
*LAST UPDATED: 10/11/2023*

In this overview, we will discuss:
1. [Current status of our sparse fine-tuning research](#sparse-fine-tuning-research)
2. [How to try text generation with DeepSparse](#try-it-now)
# **Sparse Finetuned LLMs with DeepSparse**

For detailed usage instructions, [see the text generation user guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md).
DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.

![deepsparse_mpt_gsm_speedup](https://github.com/neuralmagic/deepsparse/assets/3195154/8687401c-f479-4999-ba6b-e01c747dace9)
In this research overview, we will discuss:
1. [Our Sparse Finetuning Research](#sparse-finetuning-research)
2. [How to try Text Generation with DeepSparse](#try-it-now)

## **Sparse Finetuning Research**

Sparsity is a powerful model compression technique, where weights are removed from the network with limited accuracy drop.
We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.

We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization, without loss, using a technique called **Sparse Finetuning**, where we prune the network during the fine-tuning process.
When running the pruned network with DeepSparse, we can accelerate inference by ~7x over the dense-FP32 baseline!
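
To make the sparsity percentages concrete, here is a minimal sketch that prunes a toy weight list to 60% sparsity. It uses simple magnitude pruning, which is a stand-in chosen for brevity and is not the SparseGPT-based method used in this research; the weights and function names are made up for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of the weights.

    Simplified stand-in for illustration; not the method used in the paper.
    """
    n_prune = round(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


def sparsity_of(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, -0.9, 0.1, 0.04, -0.2]
pruned = magnitude_prune(weights, 0.6)
print(pruned)
# >> [0.8, 0.0, 0.3, 0.0, -0.6, 0.0, -0.9, 0.0, 0.0, 0.0]
print(sparsity_of(pruned))
# >> 0.6
```

A sparsity-aware runtime can skip the zeroed weights, which is where the inference speedup comes from.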

### **Sparse Finetuning on Grade-School Math (GSM)**

Open-source LLMs are typically fine-tuned onto downstream datasets for two reasons:
* **Instruction Tuning**: show the LLM examples of how to respond to human input or prompts properly
* **Domain Adaptation**: show the LLM examples with information it does not currently understand
Training LLMs consists of two steps. First, the model is pre-trained on a very large corpus of text (typically >1T tokens). Then, the model is adapted for downstream use by continuing training on a much smaller, high-quality, curated dataset. This second step is called finetuning.

Fine-tuning is useful for two main reasons:
1. It can teach the model *how to respond* to input (often called **instruction tuning**).
2. It can teach the model *new information* (often called **domain adaptation**).


An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B-base. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.
An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.

The key insight from our paper is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with limited accuracy drop on GSM8k that runs 6.7x faster than the dense baseline with DeepSparse!
The key insight from our paper is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with no accuracy drop on GSM8k that runs 7x faster than the dense baseline with DeepSparse!

Paper: (link to paper)
<div align="center">
<img src="https://github.com/neuralmagic/deepsparse/assets/3195154/8687401c-f479-4999-ba6b-e01c747dace9" width="60%"/>
</div>

- [See the paper on Arxiv]() << UPDATE >>

### **How Is This Useful For Real World Use?**

@@ -37,18 +42,20 @@ While GSM is a "toy" math dataset, it serves as an example of how LLMs can be ad
Install the DeepSparse Nightly build (requires Linux):

```bash
pip install deepsparse-nightly[transformers]
pip install deepsparse-nightly[transformers]==1.6.0.20231007
```

The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d).

### MPT-7B on GSM

We can run inference on the 60% sparse-quantized MPT-7B GSM model using DeepSparse's `TextGeneration` Pipeline:
We can run inference on the models using DeepSparse's `TextGeneration` Pipeline:

```python
from deepsparse import TextGeneration

MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/gsm8k/pruned60_quant-none"
pipeline = TextGeneration(model_path=MODEL_PATH)
model = "zoo:mpt-7b-gsm8k_mpt_pretrain-pruned60_quantized"
pipeline = TextGeneration(model=model)

prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May"
output = pipeline(prompt=prompt)
@@ -59,13 +66,13 @@ print(output.generations[0].text)
### >> #### 72
```

It is also possible to run models directly from Hugging Face by prepending `"hf:"` to a model id, such as:
It is also possible to run the models directly from Hugging Face by prepending `"hf:"` to a model id, such as:

```python
from deepsparse import TextGeneration

MODEL_PATH = "hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant"
pipeline = TextGeneration(model_path=MODEL_PATH)
hf_model_id = "hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant"
pipeline = TextGeneration(model=hf_model_id)

prompt = "Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?"
output = pipeline(prompt=prompt)
@@ -76,26 +83,22 @@ print(output.generations[0].text)
### >> #### 5
```
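
Both example outputs end with a `#### <answer>` line, the GSM8k convention for marking the final answer. A minimal sketch for pulling that number out of a generation follows. `extract_final_answer` is a hypothetical helper, and the sample strings are hand-written for illustration rather than real model output.

```python
import re


def extract_final_answer(generation_text):
    """Return the number after the last '####' marker, or None if absent."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", generation_text)
    if not matches:
        return None
    # Strip thousands separators before converting.
    return float(matches[-1].replace(",", ""))


# Hand-written sample texts in the GSM8k answer format (not real model output).
print(extract_final_answer("48 / 2 = 24 clips in May. 48 + 24 = 72.\n#### 72"))
# >> 72.0
print(extract_final_answer("100 / 4 = 25. 25 / 5 = 5.\n#### 5"))
# >> 5.0
print(extract_final_answer("No marker in this text"))
# >> None
```

With the pipeline from the examples above, the input would be `output.generations[0].text`.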

> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***Currently, we suggest only using LLM ONNX graphs created by Neural Magic's team.***


#### Other Resources
- [Check out all the MPT GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
- [Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d)
- [Check out the detailed `TextGeneration` Pipeline documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md)

### **MPT-7B on Dolly-HHRLHF**
## **Roadmap**

We have also made a 50% sparse-quantized MPT-7B fine-tuned on [Dolly-hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) available on SparseZoo. We can run inference with the following:
Following these initial results, we are rapidly expanding our support for LLMs across the Neural Magic stack, including:

```python
from deepsparse import TextGeneration

MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
pipeline = TextGeneration(model_path=MODEL_PATH)

prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is Kubernetes? ### Response:"
output = pipeline(prompt=prompt)
print(output.generations[0].text)

### >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
```
- **Productizing Sparse Fine-Tuning**: Enable external users to apply sparse fine-tuning to business datasets
- **Expanding Model Support**: Apply sparse fine-tuning to Llama2 and Mistral models
- **Pushing to Higher Sparsity**: Improve our pruning algorithms to reach higher sparsity
- **Building General Sparse Models**: Create sparse models that perform well on general tasks, such as those on the OpenLLM leaderboard

## **Feedback / Roadmap Requests**
