Add ORT inference #113

Merged Apr 28, 2022 (39 commits)

Changes from 16 commits

Commits
7b05a01
added gpu extras and added > transformers for token-classification pi…
philschmid Mar 24, 2022
81f0bff
added numpy and huggingface hub to required packages
philschmid Mar 24, 2022
a77a9a9
added modeling_* classes
philschmid Mar 24, 2022
128dae1
adding tests and pipelines
philschmid Mar 24, 2022
015fdb5
remove vs code folder
philschmid Mar 24, 2022
57106b5
added test model and adjusted gitignore
philschmid Mar 24, 2022
2d6af6d
add readme for tests
philschmid Mar 24, 2022
7e4f32d
working tests
philschmid Mar 24, 2022
38a0344
added some documentation
philschmid Mar 24, 2022
a9c3fbe
will ci run?
philschmid Mar 24, 2022
909e559
added real model checkpoints
philschmid Mar 24, 2022
035296c
test ci
philschmid Mar 24, 2022
06bbb44
fix styling
philschmid Mar 24, 2022
bcb0cb5
fix some documentation
philschmid Mar 24, 2022
36a63c6
more doc fixes
philschmid Mar 24, 2022
106f3c1
added some feedback and wording from michael and lewis
philschmid Apr 5, 2022
4a2e524
renamed model class to ORTModelForXX
philschmid Apr 8, 2022
bb2504b
Merge branch 'main' into add-ort-inference
philschmid Apr 8, 2022
42de0e9
moved from_transformers to from_pretrained
philschmid Apr 8, 2022
fd4580d
applied ellas feedback
philschmid Apr 8, 2022
a09bae7
make style
philschmid Apr 8, 2022
5d7b4b0
first version of ORTModelForCausalLM without past-keys
philschmid Apr 12, 2022
c32345a
added first draft of new .optimize method
philschmid Apr 12, 2022
b87f06c
added better quantize method
philschmid Apr 13, 2022
e673d9c
Merge branch 'main' into add-ort-inference
philschmid Apr 13, 2022
82452f9
fix import
philschmid Apr 13, 2022
8b3a576
remove optimize and quantize
philschmid Apr 22, 2022
89710b6
Merge branch 'main' into add-ort-inference
philschmid Apr 22, 2022
ba00ccf
added lewis feedback
philschmid Apr 26, 2022
3c5b694
added style for test
philschmid Apr 26, 2022
c20d9ff
added >>> to code snippets
philschmid Apr 26, 2022
d2d5bd2
style
philschmid Apr 26, 2022
87c9ce7
added condition for staging tests
philschmid Apr 27, 2022
a6c936d
feedback morgan & michael
philschmid Apr 27, 2022
1d1c9e9
added action
philschmid Apr 27, 2022
226565a
forgot to install pytest
philschmid Apr 27, 2022
d98d7f3
forgot sentence piece
philschmid Apr 27, 2022
660220e
made sure we won't have import conflicts
philschmid Apr 28, 2022
7f1e7b8
make style happy
philschmid Apr 28, 2022
4 changes: 4 additions & 0 deletions .gitignore
@@ -131,3 +131,7 @@ dmypy.json

# Models
*.onnx
# include small test model for tests
!tests/assets/onnx/model.onnx
Reviewer comment (Member): Would it make sense to have a small model on the Hub, e.g. under the optimum org?


.vscode
4 changes: 4 additions & 0 deletions docs/source/_toctree.yml
@@ -3,8 +3,12 @@
title: 🤗 Optimum
- local: quickstart
title: Quickstart
- local: pipelines
title: Pipelines for inference
title: Get started
- sections:
- local: onnxruntime/modeling_ort
title: Inference
- local: onnxruntime/configuration
title: Configuration
- local: onnxruntime/optimization
112 changes: 112 additions & 0 deletions docs/source/onnxruntime/modeling_ort.mdx
@@ -0,0 +1,112 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Optimum Inference with ONNX Runtime

Optimum Inference is a utility package for building and running inference with accelerated runtimes like ONNX Runtime.
Optimum Inference can be used to load optimized models from the [Hugging Face Hub](https://hf.co/models) and create pipelines
to run accelerated inference without rewriting your APIs.

## Switching from Transformers to Optimum Inference

The Optimum Inference models are API compatible with Hugging Face Transformers models. This means you can just replace your `AutoModelForXxx` class with the corresponding `OnnxForXxx` class in `optimum`. For example, this is how you can use a question answering model in `optimum`:

```diff
from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import OnnxForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
+model = OnnxForQuestionAnswering.from_transformers("deepset/roberta-base-squad2")
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

onnx_qa = pipeline("question-answering",model=model,tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)
```

Optimum Inference also includes methods to convert vanilla Transformers models to optimized ones via the `from_transformers()` method.
After you have converted a model, you can even `optimize` or `quantize` it if the runtime you use supports it.

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import OnnxForSequenceClassification

# load model from hub and convert
model = OnnxForSequenceClassification.from_transformers("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# optimize model
model.optimize()
# quantize model
model.quantize()

# create pipeline
onnx_clx = pipeline("text-classification",model=model,tokenizer=tokenizer)

result = onnx_clx("This is a great model")
```

You can find a complete walkthrough of Optimum Inference for ONNX Runtime in this [notebook](xx).

### Working with the [Hugging Face Model Hub](https://hf.co/models)

The Optimum model classes, e.g. [`OnnxModel`], are directly integrated with the [Hugging Face Model Hub](https://hf.co/models), meaning you can not only
load models from the Hub but also push your models to the Hub with the `push_to_hub()` method. Below you will find an example that pulls a vanilla Transformers model
from the Hub, converts it to an Optimum model, and pushes it back into a new repository.

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import OnnxForSequenceClassification

# load model from hub and convert
model = OnnxForSequenceClassification.from_transformers("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# optimize model
model.optimize()
# quantize model
model.quantize()

# save converted model
model.save_pretrained("a_local_path_for_convert_onnx_model")
tokenizer.save_pretrained("a_local_path_for_convert_onnx_model")

# push the converted ONNX model to the HF Hub
model.push_to_hub("a_local_path_for_convert_onnx_model",
repository_id="my-onnx-repo",
use_auth_token=True
)
```
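
Once the model has been pushed, it can presumably be loaded back directly from the Hub with `from_pretrained()`. A minimal sketch, assuming the `my-onnx-repo` repository created above ends up under your user namespace (the `username/` prefix below is a placeholder):

```python
from optimum.onnxruntime import OnnxForSequenceClassification

# "username" is a placeholder for the namespace the repository was pushed to
model = OnnxForSequenceClassification.from_pretrained("username/my-onnx-repo")
```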

## OnnxModel

[[autodoc]] onnxruntime.modeling_ort.OnnxModel

## OnnxForFeatureExtraction

[[autodoc]] onnxruntime.modeling_ort.OnnxForFeatureExtraction

## OnnxForQuestionAnswering

[[autodoc]] onnxruntime.modeling_ort.OnnxForQuestionAnswering

## OnnxForSequenceClassification

[[autodoc]] onnxruntime.modeling_ort.OnnxForSequenceClassification

## OnnxForTokenClassification

[[autodoc]] onnxruntime.modeling_ort.OnnxForTokenClassification
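
As a quick reference for the classes listed above, here is a minimal sketch of [`OnnxForFeatureExtraction`] used with a Transformers pipeline. The `distilbert-base-uncased` checkpoint is an assumption for illustration; the loading pattern mirrors the `from_transformers()` examples earlier on this page:

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import OnnxForFeatureExtraction

# assumed checkpoint for illustration; any exportable encoder model should follow the same pattern
model = OnnxForFeatureExtraction.from_transformers("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# returns one embedding vector per input token
onnx_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)
features = onnx_extractor("My name is Philipp and I live in Nuremberg.")
```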

150 changes: 150 additions & 0 deletions docs/source/pipelines.mdx
@@ -0,0 +1,150 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Optimum pipelines for inference

The [`optimum_pipeline`] makes it simple to use models from the [Model Hub](https://huggingface.co/models) for accelerated inference on a variety of tasks such as text classification.
Even if you don't have experience with a specific modality or understand the code powering the models, you can still use them with the [`optimum_pipeline`]! This tutorial will teach you to:

<Tip>

You can also use the `transformers.pipeline` function and provide your `OptimumModel`.

</Tip>

Reviewer comment (Member): Should this tip be moved below, i.e. just after the example in Optimum pipeline usage?

Currently supported tasks are listed below (a short token-classification sketch follows the list):

**ONNX Runtime**

* `feature-extraction`
* `text-classification`
* `token-classification`
* `question-answering`
* `zero-shot-classification`
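
For example, a token-classification pipeline can presumably be built the same way as the question-answering examples below — a minimal sketch, assuming [`OnnxForTokenClassification`] exposes the same `from_transformers()` method as the other classes; the `dslim/bert-base-NER` checkpoint is only a placeholder:

```py
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OnnxForTokenClassification
>>> from optimum import optimum_pipeline

>>> # placeholder checkpoint; substitute any token-classification model from the Hub
>>> tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
>>> model = OnnxForTokenClassification.from_transformers("dslim/bert-base-NER")

>>> onnx_ner = optimum_pipeline("token-classification", model=model, tokenizer=tokenizer)
>>> entities = onnx_ner("My name is Philipp and I live in Nuremberg.")
```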

## Optimum pipeline usage

While each task has an associated [`optimum_pipeline`], it is simpler to use the general [`pipeline`] abstraction which contains all the specific task pipelines.
Reviewer comment (Member): I'm not sure I understand what you mean here with the difference between optimum_pipeline and pipeline. You say one should use the pipeline abstraction but the example following this uses optimum_pipeline. Perhaps we can clarify this?

The [`optimum_pipeline`] automatically loads a default model and tokenizer capable of inference for your task.

1. Start by creating an [`optimum_pipeline`] and specifying an inference task:

```py
>>> from optimum import optimum_pipeline

>>> classifier = optimum_pipeline(task="text-classification", accelerator="onnx")

```

2. Pass your input text to the [`optimum_pipeline`]:

```py
>>> classifier("I like you. I love you.")
```
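
The output format follows the standard `transformers` text-classification pipeline, i.e. a list of dictionaries with a predicted label and score (a hedged sketch; the exact label set depends on the default model that gets loaded):

```py
>>> result = classifier("I like you. I love you.")
>>> # each prediction carries a "label" and a "score" key in the standard transformers format
>>> print(result[0]["label"], result[0]["score"])
```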

### Use a vanilla Transformers model and convert it

The [`optimum_pipeline`] accepts any supported model from the [Model Hub](https://huggingface.co/models).
There are tags on the Model Hub that allow you to filter for a model you'd like to use for your task.
Once you've picked an appropriate model, load it with the `from_transformers()` method of the corresponding `OnnxFor*`
class and the [`AutoTokenizer`] class. For example, load the [`OnnxForQuestionAnswering`] class for a question answering task:

```py
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OnnxForQuestionAnswering
>>> from optimum import optimum_pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
>>> model = OnnxForQuestionAnswering.from_transformers("deepset/roberta-base-squad2")

>>> onnx_qa = optimum_pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> question = "What's my name?"

>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```
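
Since the [`optimum_pipeline`] is described below as a light wrapper around `transformers.pipeline`, the question-answering result should be the usual dictionary with `score`, `start`, `end`, and `answer` keys — a small sketch under that assumption:

```py
>>> # assuming the standard transformers question-answering output format
>>> print(pred["answer"], pred["score"])
```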

### Use an Optimum model

The [`optimum_pipeline`] is tightly integrated with the [Model Hub](https://huggingface.co/models) and can load optimized models directly, e.g. ONNX Runtime models.
There are tags on the Model Hub that allow you to filter for a model you'd like to use for your task.
Once you've picked an appropriate model, load it with the `.from_pretrained()` method of the corresponding `OnnxFor*`
class and the [`AutoTokenizer`] class. For example, load the [`OnnxForQuestionAnswering`] class for a question answering task:

```py
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OnnxForQuestionAnswering
>>> from optimum import optimum_pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = OnnxForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

>>> onnx_qa = optimum_pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```


### Optimizing and Quantizing in Pipelines

The [`optimum_pipeline`] not only runs inference, it also provides arguments to quantize and optimize your model on the fly.
Once you've picked an appropriate model, load it with the `.from_transformers()` or `.from_pretrained()` method of the corresponding `OnnxFor*`
class and the [`AutoTokenizer`] class. For example, load the [`OnnxForQuestionAnswering`] class for a question answering task and provide
the `do_optimization=True` and/or `do_quantization=True` arguments:

```py
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OnnxForQuestionAnswering
>>> from optimum import optimum_pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = OnnxForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

>>> model.optimize()
>>> model.quantize()

>>> onnx_qa = optimum_pipeline("question-answering",
...                            model=model,
...                            tokenizer=tokenizer,
...                            do_optimization=True,
...                            do_quantization=True
... )
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```


## Transformers pipeline usage

The [`optimum_pipeline`] is just a light wrapper around the `transformers.pipeline` function to enable checks for supported tasks and additional features,
like quantization and optimization. That being said, you can use `transformers.pipeline` and just replace your `AutoModelFor*` class with the corresponding Optimum
`OnnxFor*` class.

```diff
from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import OnnxForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
+model = OnnxForQuestionAnswering.from_transformers("deepset/roberta-base-squad2")
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

onnx_qa = pipeline("question-answering",model=model,tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)
```