diff --git a/docs/llms/text-generation-pipeline.md b/docs/llms/text-generation-pipeline.md
index 976393cf46..b308728a90 100644
--- a/docs/llms/text-generation-pipeline.md
+++ b/docs/llms/text-generation-pipeline.md
@@ -16,14 +16,14 @@ limitations under the License.
# **Text Generation Pipelines**
-This user guide describes how to run inference of text generation models with DeepSparse.
+This user guide explains how to run inference of text generation models with DeepSparse.
## **Installation**
-DeepSparse support for LLMs is currently available on DeepSparse's nightly build on PyPi:
+DeepSparse support for LLMs is available on DeepSparse's nightly build on PyPI:
```bash
-pip install -U deepsparse-nightly==1.6.0.20231007[transformers]
+pip install -U deepsparse-nightly[transformers]==1.6.0.20231007
```
#### **System Requirements**
@@ -41,8 +41,8 @@ DeepSparse exposes a Pipeline interface called `TextGeneration`, which is used t
from deepsparse import TextGeneration
# construct a pipeline
-MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
+pipeline = TextGeneration(model=model_path)
# generate text
prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: What is Kubernetes? ### Response:"
@@ -52,27 +52,29 @@ print(output.generations[0].text)
# >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
```
-> **Note:** The 7B model takes about 2 minutes to compile. Set `MODEL_PATH` to `hf:mgoin/TinyStories-33M-quant-deepsparse` to use a small TinyStories model for quick compilation if you are just experimenting.
+> **Note:** The 7B model takes about 2 minutes to compile. Set `model_path = "hf:mgoin/TinyStories-33M-quant-deepsparse"` to use a small TinyStories model for quick compilation if you are just experimenting.
+
## **Model Format**
DeepSparse accepts models in ONNX format, passed either as SparseZoo stubs or local directories.
-> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At current, we suggest only using LLM ONNX graphs from SparseZoo.***
+> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At present, we suggest using only LLM ONNX graphs created by Neural Magic.***
+
### **SparseZoo Stubs**
-SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none` identifes a 50% pruned-quantized MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
+SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifies a 50% pruned-quantized pretrained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
```python -model_path = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none" -pipeline = TextGeneration(model_path=model_path) +model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized" +pipeline = TextGeneration(model=model_path) ``` ### **Local Deployment Directory** Additionally, we can pass a local path to a deployment directory. Use the SparseZoo API to download an example deployment directory: ```python -import sparsezoo -sz_model = sparsezoo.Model("zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none", "./local-model") +from sparsezoo import Model +sz_model = Model("zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized", "./local-model") sz_model.deployment.download() ``` @@ -84,8 +86,16 @@ ls ./local-model/deployment We can pass the local directory path to `TextGeneration`: ```python -model_path = "./local-model/deployment" -pipeline = TextGeneration(model_path=model_path) +from deepsparse import TextGeneration +pipeline = TextGeneration(model="./local-model/deployment") +``` + +### **Hugging Face Models** +Hugging Face models which conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to a model id. The following runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant). + +```python +from deepsparse import TextGeneration +pipeline = TextGeneration(model="hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant") ``` ## **Input and Output Formats** @@ -96,8 +106,7 @@ The following examples use a quantized 33M parameter TinyStories model for quick ```python from deepsparse import TextGeneration -MODEL_PATH = "hf:mgoin/TinyStories-33M-quant-deepsparse" -pipeline = TextGeneration(model_path=MODEL_PATH) +pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse") ``` ### Input Format @@ -112,13 +121,14 @@ for prompt_i, generation_i in zip(output.prompts, output.generations): print(f"{prompt_i}{generation_i.text}") # >> Princess Peach jumped from the balcony and landed on the ground. She was so happy that she had found her treasure. She thanked the old + # >> Mario ran into the castle and started to explore. He ran around the castle and climbed on the throne. He even tried to climb ``` - `streaming`: Boolean determining whether to stream response. If True, then the results are returned as a generator object which yields the results as they are generated. 
```python -prompt = "Princess peach jumped from the balcony" +prompt = "Princess Peach jumped from the balcony" output_iterator = pipeline(prompt=prompt, streaming=True, max_new_tokens=20) print(prompt, end="") @@ -172,8 +182,8 @@ The following examples use a quantized 33M parameter TinyStories model for quick ```python from deepsparse import TextGeneration -MODEL_PATH = "hf:mgoin/TinyStories-33M-quant-deepsparse" -pipeline = TextGeneration(model_path=MODEL_PATH) +model_id = "hf:mgoin/TinyStories-33M-quant-deepsparse" +pipeline = TextGeneration(model=model_id) ``` ### **Creating A `GenerationConfig`** @@ -213,7 +223,7 @@ We can pass a `GenerationConfig` to `TextGeneration.__init__` or `TextGeneration ```python # set generation_config during __init__ -pipeline_w_gen_config = TextGeneration(model_path=MODEL_PATH, generation_config={"max_new_tokens": 10}) +pipeline_w_gen_config = TextGeneration(model=model_id, generation_config={"max_new_tokens": 10}) # generation_config is the default during __call__ output = pipeline_w_gen_config(prompt=prompt) @@ -225,7 +235,7 @@ print(f"{prompt}{output.generations[0].text}") ```python # no generation_config set during __init__ -pipeline_w_no_gen_config = TextGeneration(model_path=MODEL_PATH) +pipeline_w_no_gen_config = TextGeneration(model=model_id) # generation_config is the passed during __call__ output = pipeline_w_no_gen_config(prompt=prompt, generation_config= {"max_new_tokens": 10}) @@ -295,7 +305,7 @@ import numpy # only 20 logits are not set to -inf == only 20 logits used to sample token output = pipeline(prompt=prompt, do_sample=True, top_k=20, max_new_tokens=15, output_scores=True) print(numpy.isfinite(output.generations[0].score).sum(axis=1)) -# >> array([20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]) +# >> [20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20] ``` - `top_p`: Float to define the tokens that are considered with nucleus sampling. If `0.0`, `top_p` is turned off. Default is `0.0` @@ -306,7 +316,7 @@ import numpy output = pipeline(prompt=prompt, do_sample=True, top_p=0.9, max_new_tokens=15, output_scores=True) print(numpy.isfinite(output.generations[0].score).sum(axis=1)) -# >> array([20, 15, 10, 5, 25, 3, 10, 7, 6, 6, 15, 12, 11, 3, 4, 4]) +# >> [ 5 119 18 14 204 6 7 367 191 20 12 7 46 6 2 35] ``` - `repetition_penalty`: The more a token is used within generation the more it is penalized to not be picked in successive generation passes. If `0.0`, `repetation_penalty` is turned off. Default is `0.0` diff --git a/research/mpt/README.md b/research/mpt/README.md index 15edc6f126..ecb233c884 100644 --- a/research/mpt/README.md +++ b/research/mpt/README.md @@ -1,32 +1,37 @@ -# **Sparse Finetuned LLMs with DeepSparse** - -DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT. +*LAST UPDATED: 10/11/2023* -In this overview, we will discuss: -1. [Current status of our sparse fine-tuning research](#sparse-fine-tuning-research) -2. [How to try text generation with DeepSparse](#try-it-now) +# **Sparse Finetuned LLMs with DeepSparse** -For detailed usage instructions, [see the text generation user guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md). +DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT. 
-![deepsparse_mpt_gsm_speedup](https://github.com/neuralmagic/deepsparse/assets/3195154/8687401c-f479-4999-ba6b-e01c747dace9)
+In this research overview, we will discuss:
+1. [Our Sparse Finetuning Research](#sparse-finetuning-research)
+2. [How to try Text Generation with DeepSparse](#try-it-now)
## **Sparse Finetuning Research**
-Sparsity is a powerful model compression technique, where weights are removed from the network with limited accuracy drop.
+We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.
-We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization, without loss, using a technique called **Sparse Finetuning**, where we prune the network during the fine-tuning process.
+When running the pruned network with DeepSparse, we can accelerate inference by ~7x over the dense-FP32 baseline!
### **Sparse Finetuning on Grade-School Math (GSM)**
-Open-source LLMs are typically fine-tuned onto downstream datasets for two reasons:
-* **Instruction Tuning**: show the LLM examples of how to respond to human input or prompts properly
-* **Domain Adaptation**: show the LLM examples with information it does not currently understand
+Training LLMs consists of two steps. First, the model is pre-trained on a very large corpus of text (typically >1T tokens). Then, the model is adapted for downstream use by continuing training with a much smaller, high-quality curated dataset. This second step is called finetuning.
+
+Fine-tuning is useful for two main reasons:
+1. It can teach the model *how to respond* to input (often called **instruction tuning**).
+2. It can teach the model *new information* (often called **domain adaptation**).
+
-An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B-base. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.
+An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.
-The key insight from our paper is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with limited accuracy drop on GSM8k runs 6.7x faster than the dense baseline with DeepSparse!
+The key insight from our paper is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with no accuracy drop on GSM8k that runs 7x faster than the dense baseline with DeepSparse!
-Paper: (link to paper)
+
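+As a rough illustration of the speedup claim (this is not the benchmark setup behind the paper's numbers), you can time generation on your own hardware using DeepSparse's `TextGeneration` pipeline, shown in more detail below, plus the standard library; the prompt and the word-count heuristic here are only illustrative:
+
+```python
+import time
+
+from deepsparse import TextGeneration
+
+# downloads and compiles the 60% sparse-quantized GSM model (compilation takes a few minutes)
+pipeline = TextGeneration(model="hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant")
+
+prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May"
+
+start = time.perf_counter()
+output = pipeline(prompt=prompt, max_new_tokens=128)
+elapsed = time.perf_counter() - start
+
+# crude throughput estimate from whitespace-split words, not the tokenizer's token count
+generated = output.generations[0].text
+print(f"generated {len(generated.split())} words in {elapsed:.1f} seconds")
+```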
+
+
+
+- [See the paper on arXiv]() << UPDATE >>
### **How Is This Useful For Real World Use?**
@@ -37,18 +42,20 @@ While GSM is a "toy" math dataset, it serves as an example of how LLMs can be ad
Install the DeepSparse Nightly build (requires Linux):
```bash
-pip install deepsparse-nightly[transformers]
+pip install deepsparse-nightly[transformers]==1.6.0.20231007
```
+The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d).
+
### MPT-7B on GSM
-We can run inference on the 60% sparse-quantized MPT-7B GSM model using DeepSparse's `TextGeneration` Pipeline:
+We can run inference on the models using DeepSparse's `TextGeneration` Pipeline:
```python
from deepsparse import TextGeneration
-MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/gsm8k/pruned60_quant-none"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+model = "zoo:mpt-7b-gsm8k_mpt_pretrain-pruned60_quantized"
+pipeline = TextGeneration(model=model)
prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May"
output = pipeline(prompt=prompt)
@@ -59,13 +66,13 @@ print(output.generations[0].text)
### >> #### 72
```
-It is also possible to run models directly from Hugging Face by prepending `"hf:"` to a model id, such as:
+It is also possible to run the models directly from Hugging Face by prepending `"hf:"` to a model id, such as:
```python
from deepsparse import TextGeneration
-MODEL_PATH = "hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+hf_model_id = "hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant"
+pipeline = TextGeneration(model=hf_model_id)
prompt = "Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?"
output = pipeline(prompt=prompt)
@@ -76,26 +83,22 @@ print(output.generations[0].text)
### >> #### 5
```
+> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At present, we suggest using only LLM ONNX graphs created by Neural Magic's team.***
+
+
#### Other Resources
- [Check out all the MPT GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
- [Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d)
+- [Check out the detailed `TextGeneration` Pipeline documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md)
-### **MPT-7B on Dolly-HHRLHF**
+## **Roadmap**
-We have also made a 50% sparse-quantized MPT-7B fine-tuned on [Dolly-hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) available on SparseZoo. We can run inference with the following:
+Following these initial results, we are rapidly expanding our support for LLMs across the Neural Magic stack, including:
-```python
-from deepsparse import TextGeneration
-
-MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
-pipeline = TextGeneration(model_path=MODEL_PATH)
-
-prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is Kubernetes? ### Response:"
-output = pipeline(prompt=prompt)
-print(output.generations[0].text)
-
-### >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
-```
+- **Productizing Sparse Fine-Tuning**: Enable external users to apply sparse fine-tuning to business datasets
+- **Expanding Model Support**: Apply sparse fine-tuning results to Llama2 and Mistral models
+- **Pushing to Higher Sparsity**: Improve our pruning algorithms to reach higher sparsity
+- **Building a General Sparse Model**: Create a sparse model that can perform well on general tasks like the OpenLLM leaderboard
## **Feedback / Roadmap Requests**