Optimum Inference is a utility package for building and running inference with accelerated runtimes like ONNX Runtime. Optimum Inference can be used to load optimized models from the Hugging Face Hub and create pipelines to run accelerated inference without rewriting your APIs.
The Optimum Inference models are API compatible with Transformers models, meaning you can simply replace your `AutoModelFor*` class with the corresponding `OnnxFor*` class.
```diff
from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import OnnxForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("philschmid/distilbert-onnx")
+model = OnnxForQuestionAnswering.from_pretrained("philschmid/distilbert-onnx")
tokenizer = AutoTokenizer.from_pretrained("philschmid/distilbert-onnx")

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)
```
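This drop-in replacement works because `pipeline` relies only on the interface the model classes share. A toy sketch of that duck typing (the classes and the `run_pipeline` function below are invented for illustration and are not part of any library):

```python
# Toy illustration only: pipeline-style code depends on a shared interface
# (from_pretrained + __call__), so any class providing it is a drop-in
# replacement. These classes are invented for this sketch.
class AutoModelLike:
    @classmethod
    def from_pretrained(cls, name):
        return cls()

    def __call__(self, inputs):
        return {"backend": "pytorch", "inputs": inputs}


class OnnxModelLike:
    @classmethod
    def from_pretrained(cls, name):
        return cls()

    def __call__(self, inputs):
        return {"backend": "onnxruntime", "inputs": inputs}


def run_pipeline(model_cls, name, inputs):
    # the calling code never changes when the model class is swapped
    model = model_cls.from_pretrained(name)
    return model(inputs)
```

Swapping `AutoModelLike` for `OnnxModelLike` changes the backend without touching `run_pipeline`.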
Optimum Inference also includes methods to convert vanilla Transformers models to optimized models (`from_transformers`). After you have converted a model, you can even optimize or quantize it, if the runtime you use supports this.
```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import OnnxForSequenceClassification

# load model from hub and convert
model = OnnxForSequenceClassification.from_transformers("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# optimize model
model.optimize()
# quantize model
model.quantize()

# create pipeline
onnx_clx = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = onnx_clx(text="This is a great model")
```
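For intuition on the `quantize()` step: 8-bit quantization maps floating-point weights to small integers via a scale and a zero point, shrinking the model and speeding up inference at a small cost in precision. The sketch below shows the general affine scheme only; it is not ONNX Runtime's actual quantizer:

```python
# Minimal sketch of 8-bit affine quantization (illustrative only; ONNX
# Runtime's quantizer is more sophisticated).
def quantize_uint8(values):
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0      # avoid a zero scale for constant inputs
    zero_point = round(-lo / scale)     # the integer that represents 0.0
    q = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]


weights = [-0.51, 0.0, 0.27, 1.02]
q, scale, zp = quantize_uint8(weights)
restored = dequantize(q, scale, zp)
# each restored value is within one quantization step (scale) of the original
```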
You can find a complete walkthrough of Optimum Inference for ONNX Runtime in this notebook.
Working with the Hugging Face Model Hub
The optimum model classes, e.g. [OnnxModel], are directly integrated with the Hugging Face Model Hub, meaning you can not only load models from the Hub but also push your models to the Hub with the `push_to_hub` method. Below you find an example which pulls a vanilla Transformers model from the Hub, converts it to an optimum model, and pushes it back into a new repository.
```python
from transformers import AutoTokenizer
from optimum.onnxruntime import OnnxForSequenceClassification

# load model from hub and convert
model = OnnxForSequenceClassification.from_transformers("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# optimize model
model.optimize()
# quantize model
model.quantize()

# save converted model
model.save_pretrained("a_local_path_for_convert_onnx_model")
tokenizer.save_pretrained("a_local_path_for_convert_onnx_model")

# push converted onnx model to HF Hub
model.push_to_hub("a_local_path_for_convert_onnx_model",
                  repository_id="my-onnx-repo",
                  use_auth_token=True)
```
[[autodoc]] onnxruntime.modeling_ort.OnnxModel
[[autodoc]] onnxruntime.modeling_ort.OnnxForFeatureExtraction
[[autodoc]] onnxruntime.modeling_ort.OnnxForQuestionAnswering
[[autodoc]] onnxruntime.modeling_ort.OnnxForSequenceClassification
[[autodoc]] onnxruntime.modeling_ort.OnnxForTokenClassification