From 8bf45b6cc89a1bfa81ca56c82e882a52d7a5d0d6 Mon Sep 17 00:00:00 2001
From: rsnm2
Date: Wed, 11 Oct 2023 15:06:04 +0000
Subject: [PATCH 01/21] 1) edited text generation pipeline

---
 docs/llms/text-generation-pipeline.md | 42 ++++++++++++++-------------
 1 file changed, 22 insertions(+), 20 deletions(-)

diff --git a/docs/llms/text-generation-pipeline.md b/docs/llms/text-generation-pipeline.md
index 976393cf46..4e6f52782c 100644
--- a/docs/llms/text-generation-pipeline.md
+++ b/docs/llms/text-generation-pipeline.md
@@ -20,10 +20,10 @@ This user guide describes how to run inference of text generation models with De

## **Installation**

-DeepSparse support for LLMs is currently available on DeepSparse's nightly build on PyPi:
+DeepSparse support for LLMs is available on DeepSparse's nightly build on PyPI:

```bash
-pip install -U deepsparse-nightly==1.6.0.20231007[transformers]
+pip install -U deepsparse-nightly[transformers]==1.6.0.20231007
```

#### **System Requirements**

DeepSparse exposes a Pipeline interface called `TextGeneration`, which is used to generate text given a prompt:

```python
from deepsparse import TextGeneration

# construct a pipeline
-MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
+pipeline = TextGeneration(model=model_path)

# generate text
prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: What is Kubernetes? ### Response:"
output = pipeline(prompt=prompt)
print(output.generations[0].text)

# >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications.
```

> **Note:** The 7B model takes about 2 minutes to compile. Set `model_path = "hf:mgoin/TinyStories-33M-quant-deepsparse"` to use a small TinyStories model for quick compilation if you are just experimenting.
+

## **Model Format**
DeepSparse accepts models in ONNX format, passed either as SparseZoo stubs or local directories.

> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At current, we suggest only using LLM ONNX graphs from SparseZoo.***

### **SparseZoo Stubs**
SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifies a 50% pruned-quantized pretrained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.
```python
-model_path = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none"
-pipeline = TextGeneration(model_path=model_path)
+model_path = "zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized"
+pipeline = TextGeneration(model=model_path)
```

### **Local Deployment Directory**
Additionally, we can pass a local path to a deployment directory. Use the SparseZoo API to download an example deployment directory:
```python
-import sparsezoo
-sz_model = sparsezoo.Model("zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none", "./local-model")
+from sparsezoo import Model
+sz_model = Model("zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized", "./local-model")
sz_model.deployment.download()
```

```bash
ls ./local-model/deployment
```

We can pass the local directory path to `TextGeneration`:
```python
model_path = "./local-model/deployment"
-pipeline = TextGeneration(model_path=model_path)
+pipeline = TextGeneration(model=model_path)
```

## **Input and Output Formats**

The following examples use a quantized 33M parameter TinyStories model for quick compilation:

```python
from deepsparse import TextGeneration

-MODEL_PATH = "hf:mgoin/TinyStories-33M-quant-deepsparse"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+model_id = "hf:mgoin/TinyStories-33M-quant-deepsparse"
+pipeline = TextGeneration(model=model_id)
```

### Input Format

for prompt_i, generation_i in zip(output.prompts, output.generations):
    print(f"{prompt_i}{generation_i.text}")

# >> Princess Peach jumped from the balcony and landed on the ground. She was so happy that she had found her treasure. She thanked the old
+
# >> Mario ran into the castle and started to explore. He ran around the castle and climbed on the throne. He even tried to climb
```

- `streaming`: Boolean determining whether to stream the response. If `True`, the results are returned as a generator object that yields results as they are generated.
```python
-prompt = "Princess peach jumped from the balcony"
+prompt = "Princess Peach jumped from the balcony"
output_iterator = pipeline(prompt=prompt, streaming=True, max_new_tokens=20)

print(prompt, end="")

The following examples use a quantized 33M parameter TinyStories model for quick compilation:

```python
from deepsparse import TextGeneration

-MODEL_PATH = "hf:mgoin/TinyStories-33M-quant-deepsparse"
-pipeline = TextGeneration(model_path=MODEL_PATH)
+model_id = "hf:mgoin/TinyStories-33M-quant-deepsparse"
+pipeline = TextGeneration(model=model_id)
```

### **Creating A `GenerationConfig`**

We can pass a `GenerationConfig` to `TextGeneration.__init__` or `TextGeneration.__call__`:

```python
# set generation_config during __init__
-pipeline_w_gen_config = TextGeneration(model_path=MODEL_PATH, generation_config={"max_new_tokens": 10})
+pipeline_w_gen_config = TextGeneration(model=model_id, generation_config={"max_new_tokens": 10})

# generation_config is the default during __call__
output = pipeline_w_gen_config(prompt=prompt)
print(f"{prompt}{output.generations[0].text}")
```

```python
# no generation_config set during __init__
-pipeline_w_no_gen_config = TextGeneration(model_path=MODEL_PATH)
+pipeline_w_no_gen_config = TextGeneration(model=model_id)

# generation_config is passed during __call__
output = pipeline_w_no_gen_config(prompt=prompt, generation_config={"max_new_tokens": 10})
print(f"{prompt}{output.generations[0].text}")
```

```python
import numpy

# only 20 logits are not set to -inf == only 20 logits used to sample token
output = pipeline(prompt=prompt, do_sample=True, top_k=20, max_new_tokens=15, output_scores=True)
print(numpy.isfinite(output.generations[0].score).sum(axis=1))
-# >> array([20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20])
+# >> [20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
```

- `top_p`: Float to define the tokens that are considered with nucleus sampling. If `0.0`, `top_p` is turned off. Default is `0.0`

```python
import numpy

output = pipeline(prompt=prompt, do_sample=True, top_p=0.9, max_new_tokens=15, output_scores=True)
print(numpy.isfinite(output.generations[0].score).sum(axis=1))
-# >> array([20, 15, 10, 5, 25, 3, 10, 7, 6, 6, 15, 12, 11, 3, 4, 4])
+# >> [ 5 119 18 14 204 6 7 367 191 20 12 7 46 6 2 35]
```

- `repetition_penalty`: The more a token is used within the generation, the more it is penalized and the less likely it is to be picked in successive generation passes. If `0.0`, `repetition_penalty` is turned off. Default is `0.0`

From b2c1b12f26c52642a2d2c55ffd921a89c479cdc5 Mon Sep 17 00:00:00 2001
From: rsnm2
Date: Wed, 11 Oct 2023 19:49:26 +0000
Subject: [PATCH 02/21] fixed up pages

---
 research/mpt/README.md | 50 +++++++++++++++++++++++++-----------------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/research/mpt/README.md b/research/mpt/README.md
index 15edc6f126..c9040479f5 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -1,32 +1,36 @@
# **Sparse Finetuned LLMs with DeepSparse**

-DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.
+DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.

-In this overview, we will discuss:
-1. [Current status of our sparse fine-tuning research](#sparse-fine-tuning-research)
+In this research overview, we will discuss:
+1. [Our Sparse Finetuning Research](#sparse-finetuning-research)
2. 
[How to try text generation with DeepSparse](#try-it-now) -For detailed usage instructions, [see the text generation user guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md). - -![deepsparse_mpt_gsm_speedup](https://github.com/neuralmagic/deepsparse/assets/3195154/8687401c-f479-4999-ba6b-e01c747dace9) +[See the text generation user guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md) for detailed usage documentation. +
+ +
## **Sparse Finetuning Research**

Sparsity is a powerful model compression technique, where weights are removed from the network with limited accuracy drop.

We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization, without accuracy loss, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.

### **Sparse Finetuning on Grade-School Math (GSM)**

-Open-source LLMs are typically fine-tuned onto downstream datasets for two reasons:
-* **Instruction Tuning**: show the LLM examples of how to respond to human input or prompts properly
-* **Domain Adaptation**: show the LLM examples with information it does not currently understand
+Training LLMs consists of two steps. First, the model is pre-trained on a very large corpus of text (typically >1T tokens). Then, the model is adapted for downstream use by continuing training with a much smaller, high-quality curated dataset. This second step is called finetuning.
+
+Fine-tuning is useful for two main reasons:
+1. It can teach the model *how* to respond* to input (often called **instruction tuning**).
+2. It can teach the model *new information* (often called **domain adaptation**).
+

-An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B-base. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.
+An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.

-The key insight from our paper is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with limited accuracy drop on GSM8k that runs 6.7x faster than the dense baseline with DeepSparse!
+The key insight from our paper is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with limited accuracy drop on GSM8k that runs 7x faster than the dense baseline with DeepSparse!
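To build some intuition for what pruning to ~60% sparsity means, here is a toy sketch that zeroes out the lowest-magnitude 60% of a random weight matrix. This is simple magnitude pruning on fake data, purely for illustration; it is not the SparseGPT algorithm or the training setup used in the paper:

```python
import numpy as np

# Toy illustration of unstructured pruning: zero the 60% of weights
# with the smallest magnitude. SparseGPT uses a more sophisticated,
# loss-aware criterion, so treat this only as a mental model.
weights = np.random.randn(64, 64)
threshold = np.quantile(np.abs(weights), 0.60)
sparse_weights = np.where(np.abs(weights) <= threshold, 0.0, weights)

print(f"sparsity: {(sparse_weights == 0).mean():.2f}")  # ~0.60
```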
-Paper: (link to paper) +- [See the paper on Arxiv]() << UPDATE >> ### **How Is This Useful For Real World Use?** @@ -37,18 +41,21 @@ While GSM is a "toy" math dataset, it serves as an example of how LLMs can be ad Install the DeepSparse Nightly build (requires Linux): ```bash -pip install deepsparse-nightly[transformers] +pip install deepsparse-nightly[transformers]==1.6.0.20231007 ``` +The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d). + +We can run inference on the 60% sparse-quantized MPT-7B GSM model using DeepSparse's `TextGeneration` Pipeline. + ### MPT-7B on GSM -We can run inference on the 60% sparse-quantized MPT-7B GSM model using DeepSparse's `TextGeneration` Pipeline: ```python from deepsparse import TextGeneration -MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/gsm8k/pruned60_quant-none" -pipeline = TextGeneration(model_path=MODEL_PATH) +model = "zoo:mpt-7b-gsm8k_mpt_pretrain-pruned60_quantized" +pipeline = TextGeneration(model_path=model) prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May" output = pipeline(prompt=prompt) @@ -59,13 +66,13 @@ print(output.generations[0].text) ### >> #### 72 ``` -It is also possible to run models directly from Hugging Face by prepending `"hf:"` to a model id, such as: +It is also possible to run the models directly from Hugging Face by prepending `"hf:"` to a model id, such as: ```python from deepsparse import TextGeneration -MODEL_PATH = "hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant" -pipeline = TextGeneration(model_path=MODEL_PATH) +hf_model_id = "hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant" +pipeline = TextGeneration(model=hf_model_id) prompt = "Question: Marty has 100 centimeters of ribbon that he must cut into 4 equal parts. Each of the cut parts must be divided into 5 equal parts. How long will each final cut be?" output = pipeline(prompt=prompt) @@ -76,6 +83,9 @@ print(output.generations[0].text) ### >> #### 5 ``` +> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. 
***For now, we suggest only using LLM ONNX graphs created by Neural Magic's team.***
+
+
#### Other Resources
- [Check out all the MPT GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true)
- [Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d)

From a2fefd3f486d9f23ff1829ddaf0a2dc4f42ca809 Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:07:38 -0400
Subject: [PATCH 03/21] Update text-generation-pipeline.md

---
 docs/llms/text-generation-pipeline.md | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/docs/llms/text-generation-pipeline.md b/docs/llms/text-generation-pipeline.md
index 4e6f52782c..0d420fb7b9 100644
--- a/docs/llms/text-generation-pipeline.md
+++ b/docs/llms/text-generation-pipeline.md
@@ -58,7 +58,8 @@

DeepSparse accepts models in ONNX format, passed either as SparseZoo stubs or local directories.

-> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***At current, we suggest only using LLM ONNX graphs from SparseZoo.***
+> **Note:** DeepSparse uses ONNX graphs modified for KV-caching. We will publish specs to enable external users to create LLM ONNX graphs for DeepSparse over the next few weeks. ***For now, we suggest only using LLM ONNX graphs created by Neural Magic.***
+>

### **SparseZoo Stubs**
SparseZoo stubs identify a model in SparseZoo. For instance, `zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized` identifies a 50% pruned-quantized pretrained MPT-7b model fine-tuned on the Dolly dataset. We can pass the stub to `TextGeneration`, which downloads and caches the ONNX file.

We can pass the local directory path to `TextGeneration`:
```python
model_path = "./local-model/deployment"
pipeline = TextGeneration(model=model_path)
```

+### **Hugging Face Models**
+Hugging Face models which conform to the directory structure listed above can also be run with DeepSparse by prepending "hf:" to a model id, such as:
+
+```python
+from deepsparse import TextGeneration
+pipeline = TextGeneration(model="hf:neuralmagic/mpt-7b-gsm8k-pruned60-quant")
+```
+
## **Input and Output Formats**

`TextGeneration` accepts [`TextGenerationInput`](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/transformers/pipelines/text_generation.py#L83) as input and returns [`TextGenerationOutput`](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/transformers/pipelines/text_generation.py#L170) as output.
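To make those input and output types concrete, here is a minimal sketch of reading fields off the output object, assuming the nightly install from the patches above. It uses the small TinyStories model recommended earlier for quick compilation, and only the fields that appear in this guide's examples (`prompts` and `generations[...].text`):

```python
from deepsparse import TextGeneration

# small model so the pipeline compiles quickly
pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse")

output = pipeline(prompt="Princess Peach jumped from the balcony", max_new_tokens=20)

print(output.prompts)              # the prompt(s) that were passed in
print(output.generations[0].text)  # the generated continuation
```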
From be0d2c03f340871a4e8ee20405cc9d4707d82f83 Mon Sep 17 00:00:00 2001 From: Robert Shaw <114415538+rsnm2@users.noreply.github.com> Date: Wed, 11 Oct 2023 18:09:21 -0400 Subject: [PATCH 04/21] Update text-generation-pipeline.md --- docs/llms/text-generation-pipeline.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/llms/text-generation-pipeline.md b/docs/llms/text-generation-pipeline.md index 0d420fb7b9..1cdd3ca52e 100644 --- a/docs/llms/text-generation-pipeline.md +++ b/docs/llms/text-generation-pipeline.md @@ -91,7 +91,7 @@ pipeline = TextGeneration(model=model_path) ``` ### **Hugging Face Models** -Hugging Face models which conform to the directory structure listed above can also be run with DeepSparse by prepending "hf:" to a model id, such as: +Hugging Face models which conform to the directory structure listed above can also be run with DeepSparse by prepending `hf:` to a model id. The following runs a [60% pruned-quantized MPT-7b model trained on GSM](https://huggingface.co/neuralmagic/mpt-7b-gsm8k-pruned60-quant). ```python from deepsparse import TextGeneration From ed17aed443515d077cb9264d4e8b0062a8efd949 Mon Sep 17 00:00:00 2001 From: Robert Shaw <114415538+rsnm2@users.noreply.github.com> Date: Wed, 11 Oct 2023 18:10:00 -0400 Subject: [PATCH 05/21] Update text-generation-pipeline.md --- docs/llms/text-generation-pipeline.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/llms/text-generation-pipeline.md b/docs/llms/text-generation-pipeline.md index 1cdd3ca52e..f42b6780aa 100644 --- a/docs/llms/text-generation-pipeline.md +++ b/docs/llms/text-generation-pipeline.md @@ -16,7 +16,7 @@ limitations under the License. # **Text Generation Pipelines** -This user guide describes how to run inference of text generation models with DeepSparse. +This user guide explains how to run inference of text generation models with DeepSparse. 
## **Installation**

From c3d77168a2be2b393679aaab59feebbf091f2870 Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:11:59 -0400
Subject: [PATCH 06/21] Update text-generation-pipeline.md

---
 docs/llms/text-generation-pipeline.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/llms/text-generation-pipeline.md b/docs/llms/text-generation-pipeline.md
index f42b6780aa..b48d2dbe75 100644
--- a/docs/llms/text-generation-pipeline.md
+++ b/docs/llms/text-generation-pipeline.md
@@ -86,8 +86,8 @@ ls ./local-model/deployment

We can pass the local directory path to `TextGeneration`:

```python
-model_path = "./local-model/deployment"
-pipeline = TextGeneration(model=model_path)
+from deepsparse import TextGeneration
+pipeline = TextGeneration(model="./local-model/deployment")
```

### **Hugging Face Models**

From c89228c762061932d1163d8ad38585757f3c891f Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:12:40 -0400
Subject: [PATCH 07/21] Update text-generation-pipeline.md

---
 docs/llms/text-generation-pipeline.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/llms/text-generation-pipeline.md b/docs/llms/text-generation-pipeline.md
index b48d2dbe75..b308728a90 100644
--- a/docs/llms/text-generation-pipeline.md
+++ b/docs/llms/text-generation-pipeline.md
@@ -106,8 +106,7 @@ The following examples use a quantized 33M parameter TinyStories model for quick

```python
from deepsparse import TextGeneration

-model_id = "hf:mgoin/TinyStories-33M-quant-deepsparse"
-pipeline = TextGeneration(model=model_id)
+pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
```

### Input Format

From f152f4cfb31ea2d81802b7b530a7101403b83621 Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:15:31 -0400
Subject: [PATCH 08/21] Update README.md

---
 research/mpt/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/research/mpt/README.md b/research/mpt/README.md
index c9040479f5..73159da83a 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -4,7 +4,7 @@ DeepSparse has support for performant inference of sparse large language models,

In this research overview, we will discuss:
1. [Our Sparse Finetuning Research](#sparse-finetuning-research)
-2. [How to try text generation with DeepSparse](#try-it-now)
+2. [How to try Text Generation with DeepSparse](#try-it-now)

[See the text generation user guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md) for detailed usage documentation.
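For quick reference, the model-loading variants established in the pipeline docs above (SparseZoo stub, local deployment directory, and `hf:`-prefixed Hugging Face id) can be collected into one sketch. All three identifiers are taken from earlier examples; none of them is new:

```python
from deepsparse import TextGeneration

# 1) SparseZoo stub, downloaded and cached automatically
zoo_pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

# 2) local deployment directory (e.g., downloaded via the SparseZoo API)
local_pipeline = TextGeneration(model="./local-model/deployment")

# 3) Hugging Face model id, prefixed with "hf:"
hf_pipeline = TextGeneration(model="hf:mgoin/TinyStories-33M-quant-deepsparse")
```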

From 3a11ff1fcea634892b0cd5db94a1bf4f16c6f433 Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:16:14 -0400
Subject: [PATCH 09/21] Update README.md

---
 research/mpt/README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/research/mpt/README.md b/research/mpt/README.md
index 73159da83a..daabaca113 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -6,7 +6,8 @@ In this research overview, we will discuss:
1. [Our Sparse Finetuning Research](#sparse-finetuning-research)
2. [How to try Text Generation with DeepSparse](#try-it-now)

-[See the text generation user guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md) for detailed usage documentation.
+[See the Text Generation User Guide for detailed DeepSparse documentation](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md)
+

From f870b9940bdb8cacf47d9ee05d610bbec8384563 Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:17:06 -0400
Subject: [PATCH 10/21] Update README.md

---
 research/mpt/README.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/research/mpt/README.md b/research/mpt/README.md
index daabaca113..33f98d7e2a 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -14,9 +14,7 @@ In this research overview, we will discuss:

## **Sparse Finetuning Research**

-Sparsity is a powerful model compression technique, where weights are removed from the network with limited accuracy drop.
-
-We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization, without accuracy loss, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.
+Sparsity is a powerful model compression technique, where weights are removed from the network with limited accuracy drop. We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.

From cf07f3a60aa3c453f3ca07717a015dc9d6f12fc7 Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:18:11 -0400
Subject: [PATCH 11/21] Update README.md

---
 research/mpt/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/research/mpt/README.md b/research/mpt/README.md
index 33f98d7e2a..6fa1c6812b 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -6,7 +6,7 @@ In this research overview, we will discuss:
1. [Our Sparse Finetuning Research](#sparse-finetuning-research)
2. [How to try Text Generation with DeepSparse](#try-it-now)

-[See the Text Generation User Guide for detailed DeepSparse documentation](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md)
+Check out the detailed [`TextGeneration` documentation](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md) for more usage instructions.

From 9f515356f38a7d16e3897f4341b2857f887d516e Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:18:26 -0400
Subject: [PATCH 12/21] Update README.md

---
 research/mpt/README.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/research/mpt/README.md b/research/mpt/README.md
index 6fa1c6812b..45005bae9c 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -6,8 +6,6 @@ In this research overview, we will discuss:
1. [Our Sparse Finetuning Research](#sparse-finetuning-research)
2. [How to try Text Generation with DeepSparse](#try-it-now)

-Check out the detailed [`TextGeneration` documentation](https://github.com/neuralmagic/deepsparse/tree/main/docs/llms/text-generation-pipeline.md) for more usage instructions.
-

From 5abc30d907144d9f4398e745c23984548ff3244f Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:18:47 -0400
Subject: [PATCH 13/21] Update README.md

---
 research/mpt/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/research/mpt/README.md b/research/mpt/README.md
index 45005bae9c..601ebe8c19 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -12,7 +12,7 @@ In this research overview, we will discuss:

## **Sparse Finetuning Research**

-Sparsity is a powerful model compression technique, where weights are removed from the network with limited accuracy drop. We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.
+We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.

From 747804973a147f5524ea3fee72eb63f02b3cf945 Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:19:25 -0400
Subject: [PATCH 14/21] Update README.md

---
 research/mpt/README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/research/mpt/README.md b/research/mpt/README.md
index 601ebe8c19..f947f5dd00 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -14,6 +14,8 @@ In this research overview, we will discuss:

We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.

+When running the pruned network with DeepSparse, we can accelerate inference by ~7x over the dense-FP32 baseline!
+
### **Sparse Finetuning on Grade-School Math (GSM)**

From 9d5947fdecdb1a02a6dd4f5ed040bb9790594e3e Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:21:42 -0400
Subject: [PATCH 15/21] Update README.md

---
 research/mpt/README.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/research/mpt/README.md b/research/mpt/README.md
index f947f5dd00..3b74882791 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -6,10 +6,6 @@ In this research overview, we will discuss:
1. [Our Sparse Finetuning Research](#sparse-finetuning-research)
2. [How to try Text Generation with DeepSparse](#try-it-now)

-
- -
## **Sparse Finetuning Research**

We show that MPT-7B can be pruned to ~60% sparsity with INT8 quantization (and 70% sparsity without quantization), with no accuracy drop, using a technique called **Sparse Finetuning**, where we prune the network during the finetuning process.

@@ -21,13 +17,17 @@ When running the pruned network with DeepSparse, we can accelerate inference by

Training LLMs consists of two steps. First, the model is pre-trained on a very large corpus of text (typically >1T tokens). Then, the model is adapted for downstream use by continuing training with a much smaller, high-quality curated dataset. This second step is called finetuning.

Fine-tuning is useful for two main reasons:
-1. It can teach the model *how* to respond* to input (often called **instruction tuning**).
+1. It can teach the model *how to respond* to input (often called **instruction tuning**).
2. It can teach the model *new information* (often called **domain adaptation**).

An example of how domain adaptation is helpful is solving the [Grade-school math (GSM) dataset](https://huggingface.co/datasets/gsm8k). GSM is a set of grade school word problems and a notoriously difficult task for LLMs, as evidenced by the 0% zero-shot accuracy of MPT-7B. By fine-tuning with a very small set of ~7k training examples, however, we can boost the model's accuracy on the test set to 28.2%.

-The key insight from our paper is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with limited accuracy drop on GSM8k that runs 7x faster than the dense baseline with DeepSparse!
+The key insight from our paper is that we can prune the network during the finetuning process. We apply [SparseGPT](https://arxiv.org/pdf/2301.00774.pdf) to prune the network after dense finetuning and retrain for 2 epochs with L2 distillation. The result is a 60% sparse-quantized model with no accuracy drop on GSM8k that runs 7x faster than the dense baseline with DeepSparse!
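Since "L2 distillation" is doing real work in that sentence, here is a toy sketch of the idea: the pruned student is trained to match the dense teacher's outputs under a squared-error (L2) objective. The shapes and data are made up for illustration, and this is not the paper's exact training recipe:

```python
import numpy as np

# toy logits for a batch of 4 tokens over a 32-token vocabulary
teacher_logits = np.random.randn(4, 32)  # dense finetuned teacher
student_logits = np.random.randn(4, 32)  # pruned student

# L2 (squared-error) distillation term: push the student's outputs
# toward the teacher's during the retraining epochs
l2_distillation_loss = ((student_logits - teacher_logits) ** 2).mean()
print(l2_distillation_loss)
```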
+ +
- [See the paper on Arxiv]() << UPDATE >> From 8857ad180996d533a3a0eb7de606a2bcf1c4865b Mon Sep 17 00:00:00 2001 From: Robert Shaw <114415538+rsnm2@users.noreply.github.com> Date: Wed, 11 Oct 2023 18:22:29 -0400 Subject: [PATCH 16/21] Update README.md --- research/mpt/README.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/research/mpt/README.md b/research/mpt/README.md index 3b74882791..4096e6d2fa 100644 --- a/research/mpt/README.md +++ b/research/mpt/README.md @@ -43,9 +43,7 @@ Install the DeepSparse Nightly build (requires Linux): pip install deepsparse-nightly[transformers]==1.6.0.20231007 ``` -The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d). - -We can run inference on the 60% sparse-quantized MPT-7B GSM model using DeepSparse's `TextGeneration` Pipeline. +The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d). We can run them using DeepSparse's `TextGeneration` Pipeline. ### MPT-7B on GSM From bceb6f0741b875bc76eee285ec63e3342609cd21 Mon Sep 17 00:00:00 2001 From: Robert Shaw <114415538+rsnm2@users.noreply.github.com> Date: Wed, 11 Oct 2023 18:24:22 -0400 Subject: [PATCH 17/21] Update README.md --- research/mpt/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/research/mpt/README.md b/research/mpt/README.md index 4096e6d2fa..c2fdafcd89 100644 --- a/research/mpt/README.md +++ b/research/mpt/README.md @@ -43,10 +43,11 @@ Install the DeepSparse Nightly build (requires Linux): pip install deepsparse-nightly[transformers]==1.6.0.20231007 ``` -The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d). We can run them using DeepSparse's `TextGeneration` Pipeline. +The models generated in the paper are hosted on [SparseZoo](https://sparsezoo.neuralmagic.com/?ungrouped=true&sort=null&datasets=gsm8k&architectures=mpt) and [Hugging Face](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d). 
### MPT-7B on GSM +We can run inference on the models using DeepSparse's `TextGeneration` Pipeline: ```python from deepsparse import TextGeneration @@ -86,6 +87,7 @@ print(output.generations[0].text) #### Other Resources - [Check out all the MPT GSM models on SparseZoo](https://sparsezoo.neuralmagic.com/?datasets=gsm8k&ungrouped=true) - [Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d) +- [Check out the detailed `TextGeneration` Pipeline documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md) ### **MPT-7B on Dolly-HHRLHF** From 90f9bff278b3895d6892a7d7b0b290c970743235 Mon Sep 17 00:00:00 2001 From: Robert Shaw <114415538+rsnm2@users.noreply.github.com> Date: Wed, 11 Oct 2023 18:28:30 -0400 Subject: [PATCH 18/21] Update README.md --- research/mpt/README.md | 20 ++++++-------------- 1 file changed, 6 insertions(+), 14 deletions(-) diff --git a/research/mpt/README.md b/research/mpt/README.md index c2fdafcd89..cd3ea38317 100644 --- a/research/mpt/README.md +++ b/research/mpt/README.md @@ -89,22 +89,14 @@ print(output.generations[0].text) - [Try out the live demo on Hugging Face Spaces](https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k) and view the [collection of paper, demos, and models](https://huggingface.co/collections/neuralmagic/sparse-finetuning-mpt-65241d875b29204d6d42697d) - [Check out the detailed `TextGeneration` Pipeline documentation](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md) -### **MPT-7B on Dolly-HHRLHF** +## **Roadmap** -We have also made a 50% sparse-quantized MPT-7B fine-tuned on [Dolly-hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) available on SparseZoo. We can run inference with the following: +Following these initial results, we are rapidly expanding our support for LLMs across the Neural Magic stack, including: -```python -from deepsparse import TextGeneration - -MODEL_PATH = "zoo:nlg/text_generation/mpt-7b/pytorch/huggingface/dolly/pruned50_quant-none" -pipeline = TextGeneration(model_path=MODEL_PATH) - -prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is Kubernetes? ### Response:" -output = pipeline(prompt=prompt) -print(output.generations[0].text) - -### >> Kubernetes is an open-source container orchestration system for automating deployment, scaling, and management of containerized applications. 
-```
+- **Productizing Sparse Fine-Tuning**: Enable external users to apply sparse fine-tuning to their business datasets
+- **Expanding Model Support**: Apply sparse fine-tuning results to Llama2 and Mistral models
+- **Pushing to Higher Sparsity**: Improve our pruning algorithms to reach higher sparsity
+- **Building a General Sparse Model**: Create a sparse model that can perform well on general tasks like the OpenLLM leaderboard

## **Feedback / Roadmap Requests**

From b23333292d855632b0db34cb13a3a075a8ae6957 Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:28:45 -0400
Subject: [PATCH 19/21] Update README.md

---
 research/mpt/README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/research/mpt/README.md b/research/mpt/README.md
index cd3ea38317..78de9aebb9 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -1,5 +1,7 @@
# **Sparse Finetuned LLMs with DeepSparse**

+LAST UPDATED: 10/11/2023
+
DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.

From 4138ce8af2226cb9ebf88bedee8061c889782e54 Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:28:55 -0400
Subject: [PATCH 20/21] Update README.md

---
 research/mpt/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/research/mpt/README.md b/research/mpt/README.md
index 78de9aebb9..ad3ad5777b 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -1,6 +1,6 @@
# **Sparse Finetuned LLMs with DeepSparse**

-LAST UPDATED: 10/11/2023
+*LAST UPDATED: 10/11/2023*

DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.

From 7e0d6a844ebdab23b0d5417ba0a231da66a4e10a Mon Sep 17 00:00:00 2001
From: Robert Shaw <114415538+rsnm2@users.noreply.github.com>
Date: Wed, 11 Oct 2023 18:29:05 -0400
Subject: [PATCH 21/21] Update README.md

---
 research/mpt/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/research/mpt/README.md b/research/mpt/README.md
index ad3ad5777b..ecb233c884 100644
--- a/research/mpt/README.md
+++ b/research/mpt/README.md
@@ -1,7 +1,7 @@
-# **Sparse Finetuned LLMs with DeepSparse**
-
*LAST UPDATED: 10/11/2023*

+# **Sparse Finetuned LLMs with DeepSparse**
+
DeepSparse has support for performant inference of sparse large language models, starting with Mosaic's MPT.

In this research overview, we will discuss: