Add ORT inference #113
@@ -131,3 +131,7 @@ dmypy.json

# Models
*.onnx
# include small test model for tests
!tests/assets/onnx/model.onnx

.vscode
@@ -0,0 +1,112 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Optimum Inference with ONNX Runtime

Optimum Inference is a utility package for building and running inference with accelerated runtimes such as ONNX Runtime.

Optimum Inference can be used to load optimized models from the [Hugging Face Hub](https://hf.co/models) and create pipelines
to run accelerated inference without rewriting your APIs.

## Switching from Transformers to Optimum Inference

The Optimum Inference models are API compatible with Hugging Face Transformers models. This means you can just replace your `AutoModelForXxx` class with the corresponding `OnnxForXxx` class in `optimum`. For example, this is how you can use a question answering model in `optimum`:

```diff
from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import OnnxForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
+model = OnnxForQuestionAnswering.from_transformers("deepset/roberta-base-squad2")
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question=question, context=context)
```

Optimum Inference also includes methods to convert vanilla Transformers models to optimized ones via the `from_transformers()` method.
After you have converted a model, you can even `optimize()` or `quantize()` it if that is supported by the runtime you use.
```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import OnnxForSequenceClassification

# load model from hub and convert
model = OnnxForSequenceClassification.from_transformers("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# optimize model
model.optimize()
# quantize model
model.quantize()

# create pipeline
onnx_clx = pipeline("text-classification", model=model, tokenizer=tokenizer)

result = onnx_clx("This is a great model")
```

You can find a complete walkthrough of Optimum Inference for ONNX Runtime in this [notebook](xx).

### Working with the [Hugging Face Model Hub](https://hf.co/models)

The Optimum model classes, e.g. [`OnnxModel`], are directly integrated with the [Hugging Face Model Hub](https://hf.co/models), meaning you can not only
load models from the Hub but also push your models to the Hub with the `push_to_hub()` method. Below you find an example which pulls a vanilla Transformers model
from the Hub, converts it to an optimum model, and pushes it back into a new repository.

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import OnnxForSequenceClassification

# load model from hub and convert
model = OnnxForSequenceClassification.from_transformers("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# optimize model
model.optimize()
# quantize model
model.quantize()

# save converted model
model.save_pretrained("a_local_path_for_convert_onnx_model")
tokenizer.save_pretrained("a_local_path_for_convert_onnx_model")

# push the ONNX model to the HF Hub
model.push_to_hub("a_local_path_for_convert_onnx_model",
                  repository_id="my-onnx-repo",
                  use_auth_token=True)
```
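
Once pushed, the model can be loaded back from the Hub like any other optimum model. A minimal sketch, assuming the push above created the repository `my-onnx-repo` under your user namespace and that the tokenizer files saved alongside the model were pushed with it (replace `your-username` accordingly):

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import OnnxForSequenceClassification

# load the converted model and tokenizer back from the Hub repository created above
model = OnnxForSequenceClassification.from_pretrained("your-username/my-onnx-repo")
tokenizer = AutoTokenizer.from_pretrained("your-username/my-onnx-repo")
```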

## OnnxModel

[[autodoc]] onnxruntime.modeling_ort.OnnxModel

## OnnxForFeatureExtraction

[[autodoc]] onnxruntime.modeling_ort.OnnxForFeatureExtraction

## OnnxForQuestionAnswering

[[autodoc]] onnxruntime.modeling_ort.OnnxForQuestionAnswering

## OnnxForSequenceClassification

[[autodoc]] onnxruntime.modeling_ort.OnnxForSequenceClassification

## OnnxForTokenClassification

[[autodoc]] onnxruntime.modeling_ort.OnnxForTokenClassification
@@ -0,0 +1,150 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Optimum pipelines for inference

The [`optimum_pipeline`] makes it simple to use models from the [Model Hub](https://huggingface.co/models) for accelerated inference on a variety of tasks such as text classification.
Even if you don't have experience with a specific modality or understand the code powering the models, you can still use them with the [`optimum_pipeline`]! This tutorial will teach you to:

* use an [`optimum_pipeline`] for accelerated inference,
* use a vanilla Transformers model and convert it on the fly,
* use an optimized model from the Hub, and
* apply optimization and quantization in pipelines.

<Tip>

You can also use `transformers.pipeline` and provide your `OptimumModel`; see the [Transformers pipeline usage](#transformers-pipeline-usage) section below.

</Tip>

Currently supported tasks are:

**ONNX Runtime**

* `feature-extraction`
* `text-classification`
* `token-classification`
* `question-answering`
* `zero-shot-classification`

## Optimum pipeline usage

While each task has an associated pipeline class, it is simpler to use the general [`optimum_pipeline`] abstraction which contains all the task-specific pipelines.
The [`optimum_pipeline`] automatically loads a default model and tokenizer capable of inference for your task.

1. Start by creating an [`optimum_pipeline`] and specify an inference task:

```py
>>> from optimum import optimum_pipeline

>>> classifier = optimum_pipeline(task="text-classification", accelerator="onnx")
```

2. Pass your input text to the [`optimum_pipeline`]:

```python
>>> classifier("I like you. I love you.")
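>>> # the call returns one dict per input sequence, e.g. (the score is
>>> # illustrative, not from a real run):
>>> # [{'label': 'POSITIVE', 'score': 0.99}]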
```
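
The same pattern works for the other supported tasks. For example, a minimal sketch for `zero-shot-classification` (the input text and candidate labels are made up for illustration, and we assume the pipeline forwards call arguments such as `candidate_labels` to the underlying Transformers task pipeline):

```py
>>> from optimum import optimum_pipeline

>>> classifier = optimum_pipeline(task="zero-shot-classification", accelerator="onnx")
>>> classifier(
...     "I have a problem with my iPhone and need help as soon as possible!",
...     candidate_labels=["urgent", "not urgent", "billing", "hardware"],
... )
```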

### Use a vanilla Transformers model and convert it

The [`optimum_pipeline`] accepts any supported model from the [Model Hub](https://huggingface.co/models).
There are tags on the Model Hub that allow you to filter for a model you'd like to use for your task.
Once you've picked an appropriate model, load it with the `from_transformers()` method of the corresponding `OnnxFor*` class
and the [`AutoTokenizer`] class. For example, load the [`OnnxForQuestionAnswering`] class for a question answering task:

```py
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OnnxForQuestionAnswering
>>> from optimum import optimum_pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
>>> model = OnnxForQuestionAnswering.from_transformers("deepset/roberta-base-squad2")

>>> onnx_qa = optimum_pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```
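
Because `from_transformers()` runs the conversion each time it is called, you may want to save the converted model locally and reload it later with `from_pretrained()`. A minimal sketch reusing the `save_pretrained()` pattern from the modeling documentation above (the local path is arbitrary):

```py
>>> # save the converted model and tokenizer so the conversion only runs once
>>> model.save_pretrained("local_onnx_roberta_squad2")
>>> tokenizer.save_pretrained("local_onnx_roberta_squad2")

>>> # later: reload without converting again
>>> model = OnnxForQuestionAnswering.from_pretrained("local_onnx_roberta_squad2")
```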

### Use an Optimum model

The [`optimum_pipeline`] is tightly integrated with the [Model Hub](https://huggingface.co/models) and can load optimized models directly, e.g. for ONNX Runtime.
There are tags on the Model Hub that allow you to filter for a model you'd like to use for your task.
Once you've picked an appropriate model, load it with the `from_pretrained()` method of the corresponding `OnnxFor*` class
and the [`AutoTokenizer`] class. For example, load the [`OnnxForQuestionAnswering`] class for a question answering task:

```py
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OnnxForQuestionAnswering
>>> from optimum import optimum_pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = OnnxForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

>>> onnx_qa = optimum_pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```

### Optimizing and Quantizing in Pipelines

The [`optimum_pipeline`] can not only run inference, it also provides arguments to quantize and optimize your model on the fly.
Once you've picked an appropriate model, load it with the `from_transformers()` or `from_pretrained()` method of the corresponding `OnnxFor*` class
and the [`AutoTokenizer`] class. For example, load the [`OnnxForQuestionAnswering`] class for a question answering task and provide
the `do_optimization=True` and/or `do_quantization=True` arguments:

```py
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OnnxForQuestionAnswering
>>> from optimum import optimum_pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = OnnxForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

>>> onnx_qa = optimum_pipeline(
...     "question-answering",
...     model=model,
...     tokenizer=tokenizer,
...     do_optimization=True,
...     do_quantization=True,
... )
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```

## Transformers pipeline usage

The [`optimum_pipeline`] is just a light wrapper around `transformers.pipeline` that adds checks for supported tasks and additional features,
like quantization and optimization. That being said, you can use `transformers.pipeline` directly and just replace your `AutoModelFor*` class with the corresponding optimum
`OnnxFor*` class.

```diff
from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import OnnxForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
+model = OnnxForQuestionAnswering.from_transformers("deepset/roberta-base-squad2")
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question=question, context=context)
```

> Would it make sense to have a small model on the Hub, e.g. under the `optimum` org?