Add ORT inference #113

Merged Apr 28, 2022 (39 commits)

Changes from 16 commits

Commits
7b05a01
added gpu extras and added > transformers for token-classification pi…
philschmid Mar 24, 2022
81f0bff
added numpy and huggingface hub to required packages
philschmid Mar 24, 2022
a77a9a9
added modeling_* classes
philschmid Mar 24, 2022
128dae1
adding tests and pipelines
philschmid Mar 24, 2022
015fdb5
remove vs code folder
philschmid Mar 24, 2022
57106b5
added test model and adjusted gitignore
philschmid Mar 24, 2022
2d6af6d
add readme for tests
philschmid Mar 24, 2022
7e4f32d
working tests
philschmid Mar 24, 2022
38a0344
added some documentation
philschmid Mar 24, 2022
a9c3fbe
will ci run?
philschmid Mar 24, 2022
909e559
added real model checkpoints
philschmid Mar 24, 2022
035296c
test ci
philschmid Mar 24, 2022
06bbb44
fix styling
philschmid Mar 24, 2022
bcb0cb5
fix some documentation
philschmid Mar 24, 2022
36a63c6
more doc fixes
philschmid Mar 24, 2022
106f3c1
added some feedback and wording from michael and lewis
philschmid Apr 5, 2022
4a2e524
renamed model class to ORTModelForXX
philschmid Apr 8, 2022
bb2504b
Merge branch 'main' into add-ort-inference
philschmid Apr 8, 2022
42de0e9
moved from_transformers to from_pretrained
philschmid Apr 8, 2022
fd4580d
applied ellas feedback
philschmid Apr 8, 2022
a09bae7
make style
philschmid Apr 8, 2022
5d7b4b0
first version of ORTModelForCausalLM without past-keys
philschmid Apr 12, 2022
c32345a
added first draft of new .optimize method
philschmid Apr 12, 2022
b87f06c
added better quantize method
philschmid Apr 13, 2022
e673d9c
Merge branch 'main' into add-ort-inference
philschmid Apr 13, 2022
82452f9
fix import
philschmid Apr 13, 2022
8b3a576
remove optimize and quantize
philschmid Apr 22, 2022
89710b6
Merge branch 'main' into add-ort-inference
philschmid Apr 22, 2022
ba00ccf
added lewis feedback
philschmid Apr 26, 2022
3c5b694
added style for test
philschmid Apr 26, 2022
c20d9ff
added >>> to code snippets
philschmid Apr 26, 2022
d2d5bd2
style
philschmid Apr 26, 2022
87c9ce7
added condition for staging tests
philschmid Apr 27, 2022
a6c936d
feedback morgan & michael
philschmid Apr 27, 2022
1d1c9e9
added action
philschmid Apr 27, 2022
226565a
forgot to install pytest
philschmid Apr 27, 2022
d98d7f3
forgot sentence piece
philschmid Apr 27, 2022
660220e
made sure we won't have import conflicts
philschmid Apr 28, 2022
7f1e7b8
make style happy
philschmid Apr 28, 2022
4 changes: 4 additions & 0 deletions .gitignore
@@ -131,3 +131,7 @@ dmypy.json

# Models
*.onnx
# include small test model for tests
!tests/assets/onnx/model.onnx
Reviewer comment (Member): Would it make sense to have a small model on the Hub, e.g. under the optimum org?


.vscode
4 changes: 4 additions & 0 deletions docs/source/_toctree.yml
@@ -3,8 +3,12 @@
title: 🤗 Optimum
- local: quickstart
title: Quickstart
- local: pipelines
title: Pipelines for inference
title: Get started
- sections:
- local: onnxruntime/modeling_ort
title: Inference
- local: onnxruntime/configuration
title: Configuration
- local: onnxruntime/optimization
112 changes: 112 additions & 0 deletions docs/source/onnxruntime/modeling_ort.mdx
@@ -0,0 +1,112 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Optimum Inference with ONNX Runtime

Optimum Inference is a utility package for building and running inference with accelerated runtimes like ONNX Runtime.
Optimum Inference can be used to load optimized models from the [Hugging Face Hub](https://hf.co/models) and create pipelines
to run accelerated inference without rewriting your APIs.

## Switching from Transformers to Optimum Inference

The Optimum Inference models are API compatible with Hugging Face Transformers models. This means you can just replace your `AutoModelForXxx` class with the corresponding `OnnxForXxx` class in `optimum`. For example, this is how you can use a question answering model in `optimum`:

```diff
from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import OnnxForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
+model = OnnxForQuestionAnswering.from_transformers("deepset/roberta-base-squad2")
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

onnx_qa = pipeline("question-answering",model=model,tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)
```

Optimum Inference also includes methods to convert vanilla Transformers models to optimized ones via the `from_transformers()` method.
After you have converted a model, you can even `optimize` or `quantize` it if the runtime you use supports it.

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import OnnxForSequenceClassification

# load model from hub and convert
model = OnnxForSequenceClassification.from_transformers("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# optimize model
model.optimize()
# quantize model
model.quantize()

# create pipeline
onnx_clx = pipeline("text-classification",model=model,tokenizer=tokenizer)

result = onnx_clx("This is a great model")
```

You can find a complete walkthrough of Optimum Inference for ONNX Runtime in this [notebook](xx).

### Working with the [Hugging Face Model Hub](https://hf.co/models)

The Optimum model classes, e.g. [`OnnxModel`], are directly integrated with the [Hugging Face Model Hub](https://hf.co/models), meaning you can not only
load models from the Hub but also push your models to the Hub with the `push_to_hub()` method. Below you will find an example that pulls a vanilla Transformers model
from the Hub, converts it to an Optimum model, and pushes it back into a new repository.

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import OnnxForSequenceClassification

# load model from hub and convert
model = OnnxForSequenceClassification.from_transformers("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# optimize model
model.optimize()
# quantize model
model.quantize()

# save converted model
model.save_pretrained("a_local_path_for_convert_onnx_model")
tokenizer.save_pretrained("a_local_path_for_convert_onnx_model")

# push the converted ONNX model to the HF Hub
model.push_to_hub("a_local_path_for_convert_onnx_model",
repository_id="my-onnx-repo",
use_auth_token=True
)
```
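
Once the model has been pushed, it can presumably be loaded back directly from the Hub with `from_pretrained()`. A minimal sketch, assuming the `my-onnx-repo` repository created above ends up under your user namespace (the `username/` prefix below is a placeholder):

```python
from optimum.onnxruntime import OnnxForSequenceClassification

# "username" is a placeholder for the namespace the repository was pushed to
model = OnnxForSequenceClassification.from_pretrained("username/my-onnx-repo")
```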

## OnnxModel

[[autodoc]] onnxruntime.modeling_ort.OnnxModel

## OnnxForFeatureExtraction

[[autodoc]] onnxruntime.modeling_ort.OnnxForFeatureExtraction

## OnnxForQuestionAnswering

[[autodoc]] onnxruntime.modeling_ort.OnnxForQuestionAnswering

## OnnxForSequenceClassification

[[autodoc]] onnxruntime.modeling_ort.OnnxForSequenceClassification

## OnnxForTokenClassification

[[autodoc]] onnxruntime.modeling_ort.OnnxForTokenClassification
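
As a quick reference for the classes listed above, here is a minimal sketch of [`OnnxForFeatureExtraction`] used with a Transformers pipeline. The `distilbert-base-uncased` checkpoint is an assumption for illustration; the loading pattern mirrors the `from_transformers()` examples earlier on this page:

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import OnnxForFeatureExtraction

# assumed checkpoint for illustration; any exportable encoder model should follow the same pattern
model = OnnxForFeatureExtraction.from_transformers("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# returns one embedding vector per input token
onnx_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)
features = onnx_extractor("My name is Philipp and I live in Nuremberg.")
```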

150 changes: 150 additions & 0 deletions docs/source/pipelines.mdx
@@ -0,0 +1,150 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Optimum pipelines for inference

The [`optimum_pipeline`] makes it simple to use models from the [Model Hub](https://huggingface.co/models) for accelerated inference on a variety of tasks such as text classification.
Even if you don't have experience with a specific modality or understand the code powering the models, you can still use them with the [`optimum_pipeline`]! This tutorial will teach you to:

<Tip>

You can also use the `transformers.pipeline` function and provide your `OptimumModel`.

</Tip>

Reviewer comment (Member): Should this tip be moved below, i.e. just after the example in Optimum pipeline usage?

Currently supported tasks are listed below (a short token-classification sketch follows the list):

**ONNX Runtime**

* `feature-extraction`
* `text-classification`
* `token-classification`
* `question-answering`
* `zero-shot-classification`
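
For example, a token-classification pipeline can presumably be built the same way as the question-answering examples below — a minimal sketch, assuming [`OnnxForTokenClassification`] exposes the same `from_transformers()` method as the other classes; the `dslim/bert-base-NER` checkpoint is only a placeholder:

```py
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OnnxForTokenClassification
>>> from optimum import optimum_pipeline

>>> # placeholder checkpoint; substitute any token-classification model from the Hub
>>> tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
>>> model = OnnxForTokenClassification.from_transformers("dslim/bert-base-NER")

>>> onnx_ner = optimum_pipeline("token-classification", model=model, tokenizer=tokenizer)
>>> entities = onnx_ner("My name is Philipp and I live in Nuremberg.")
```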

## Optimum pipeline usage

While each task has an associated [`optimum_pipeline`], it is simpler to use the general [`pipeline`] abstraction which contains all the specific task pipelines.
Reviewer comment (Member): I'm not sure I understand what you mean here with the difference between optimum_pipeline and pipeline. You say one should use the pipeline abstraction but the example following this uses optimum_pipeline. Perhaps we can clarify this?

The [`optimum_pipeline`] automatically loads a default model and tokenizer capable of inference for your task.

1. Start by creating an [`optimum_pipeline`] and specifying an inference task:

```py
>>> from optimum import optimum_pipeline

>>> classifier = optimum_pipeline(task="text-classification", accelerator="onnx")

```

2. Pass your input text to the [`optimum_pipeline`]:

```py
>>> classifier("I like you. I love you.")
```
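
The output format follows the standard `transformers` text-classification pipeline, i.e. a list of dictionaries with a predicted label and score (a hedged sketch; the exact label set depends on the default model that gets loaded):

```py
>>> result = classifier("I like you. I love you.")
>>> # each prediction carries a "label" and a "score" key in the standard transformers format
>>> print(result[0]["label"], result[0]["score"])
```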

### Use a vanilla Transformers model and convert it

The [`optimum_pipeline`] accepts any supported model from the [Model Hub](https://huggingface.co/models).
There are tags on the Model Hub that allow you to filter for a model you'd like to use for your task.
Once you've picked an appropriate model, load it with the `from_transformers()` method of the corresponding `OnnxFor*`
class and the [`AutoTokenizer`] class. For example, load the [`OnnxForQuestionAnswering`] class for a question answering task:

```py
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OnnxForQuestionAnswering
>>> from optimum import optimum_pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
>>> model = OnnxForQuestionAnswering.from_transformers("deepset/roberta-base-squad2")

>>> onnx_qa = optimum_pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> question = "What's my name?"

>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```
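
Since the [`optimum_pipeline`] is described below as a light wrapper around `transformers.pipeline`, the question-answering result should be the usual dictionary with `score`, `start`, `end`, and `answer` keys — a small sketch under that assumption:

```py
>>> # assuming the standard transformers question-answering output format
>>> print(pred["answer"], pred["score"])
```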

### Use an Optimum model

The [`optimum_pipeline`] is tightly integrated with the [Model Hub](https://huggingface.co/models) and can load optimized models directly, e.g. ONNX Runtime models.
There are tags on the Model Hub that allow you to filter for a model you'd like to use for your task.
Once you've picked an appropriate model, load it with the `.from_pretrained()` method of the corresponding `OnnxFor*`
class and the [`AutoTokenizer`] class. For example, load the [`OnnxForQuestionAnswering`] class for a question answering task:

```py
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OnnxForQuestionAnswering
>>> from optimum import optimum_pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = OnnxForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

>>> onnx_qa = optimum_pipeline("question-answering", model=model, tokenizer=tokenizer)
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```


### Optimizing and Quantizing in Pipelines

The [`optimum_pipeline`] not only runs inference, it also provides arguments to quantize and optimize your model on the fly.
Once you've picked an appropriate model, load it with the `.from_transformers()` or `.from_pretrained()` method of the corresponding `OnnxFor*`
class and the [`AutoTokenizer`] class. For example, load the [`OnnxForQuestionAnswering`] class for a question answering task and provide
the `do_optimization=True` and/or `do_quantization=True` arguments:

```py
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OnnxForQuestionAnswering
>>> from optimum import optimum_pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")
>>> model = OnnxForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

>>> model.optimize()
>>> model.quantize()

>>> onnx_qa = optimum_pipeline("question-answering",
...                            model=model,
...                            tokenizer=tokenizer,
...                            do_optimization=True,
...                            do_quantization=True
... )
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```


## Transformers pipeline usage

The [`optimum_pipeline`] is just a light wrapper around the `transformers.pipeline` function to enable checks for supported tasks and additional features,
like quantization and optimization. That being said, you can use `transformers.pipeline` and just replace your `AutoModelFor*` class with the corresponding Optimum
`OnnxFor*` class.

```diff
from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import OnnxForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
+model = OnnxForQuestionAnswering.from_transformers("deepset/roberta-base-squad2")
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

onnx_qa = pipeline("question-answering",model=model,tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)
```