Optimization
🤗 Optimum provides an optimum.onnxruntime package that enables you to apply graph optimization to many models hosted on the 🤗 Hub using the ONNX Runtime model optimization tool.
Optimizing a model during the ONNX export
The ONNX model can be optimized directly during the ONNX export with the Optimum CLI by passing the argument --optimize {O1,O2,O3,O4}, for example:
optimum-cli export onnx --model gpt2 --optimize O3 gpt2_onnx/
The optimization levels are:
- O1: basic general optimizations.
- O2: basic and extended general optimizations, transformers-specific fusions.
- O3: same as O2 with GELU approximation.
- O4: same as O3 with mixed precision (fp16, GPU-only, requires --device cuda).
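For scripted exports, the command above can be assembled from Python. The following is a minimal sketch and not part of Optimum: build_export_command is a hypothetical helper that only builds the argument list from the flags shown in this section, and it adds --device cuda for O4 because that level is GPU-only.

```python
import shlex

VALID_LEVELS = ("O1", "O2", "O3", "O4")


def build_export_command(model_id, output_dir, level):
    """Return the optimum-cli argument list for an optimized ONNX export."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    cmd = ["optimum-cli", "export", "onnx", "--model", model_id, "--optimize", level]
    if level == "O4":
        # O4 uses fp16 mixed precision, which is GPU-only.
        cmd += ["--device", "cuda"]
    cmd.append(output_dir)
    return cmd


command = build_export_command("gpt2", "gpt2_onnx/", "O3")
# Show the command as it would be typed in a shell.
print(shlex.join(command))
```

Passing the returned list to subprocess.run(command, check=True) would run the export, assuming optimum-cli is installed and on the PATH.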
Optimizing a model programmatically with ORTOptimizer
ONNX models can be optimized with the ORTOptimizer. The class can be initialized using the from_pretrained() method, which supports different checkpoint formats.
- Using an already initialized ORTModel class.
>>> from optimum.onnxruntime import ORTOptimizer, ORTModelForSequenceClassification
# Loading ONNX Model from the Hub
>>> model = ORTModelForSequenceClassification.from_pretrained(
... "optimum/distilbert-base-uncased-finetuned-sst-2-english"
... )
# Create an optimizer from an ORTModelForXXX
>>> optimizer = ORTOptimizer.from_pretrained(model)
- Using a local ONNX model from a directory.
>>> from optimum.onnxruntime import ORTOptimizer
# This assumes a model.onnx exists in path/to/model
>>> optimizer = ORTOptimizer.from_pretrained("path/to/model")
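Because loading from a directory assumes a model.onnx file is present, it can help to check for it first. The sketch below uses only the standard library; has_default_onnx_model and find_onnx_files are hypothetical helper names, and a temporary directory stands in for path/to/model.

```python
import tempfile
from pathlib import Path


def find_onnx_files(model_dir):
    """Return the sorted names of the .onnx files directly inside model_dir."""
    directory = Path(model_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("*.onnx"))


def has_default_onnx_model(model_dir):
    """Return True if model_dir contains the model.onnx file assumed above."""
    return (Path(model_dir) / "model.onnx").is_file()


# Demonstration on a throwaway directory standing in for path/to/model.
with tempfile.TemporaryDirectory() as tmp_dir:
    missing_before = has_default_onnx_model(tmp_dir)
    (Path(tmp_dir) / "model.onnx").touch()  # empty placeholder, not a real model
    present_after = has_default_onnx_model(tmp_dir)
    found = find_onnx_files(tmp_dir)

print(missing_before, present_after, found)
```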
Optimization Configuration
The OptimizationConfig class lets you specify how the ORTOptimizer should perform the optimization.
In the optimization configuration, there are 4 possible optimization levels:
- optimization_level=0: to disable all optimizations
- optimization_level=1: to enable basic optimizations such as constant folding or redundant node eliminations
- optimization_level=2: to enable extended graph optimizations such as node fusions
- optimization_level=99: to enable data layout optimizations
Choosing a level enables the optimizations of that level, as well as the optimizations of all preceding levels. More information is available in the ONNX Runtime graph optimization documentation.
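The cumulative rule can be pictured as a simple lookup. This is an illustrative sketch only and not how ONNX Runtime implements it: LEVEL_CATEGORIES and enabled_categories are hypothetical names, and the category labels are shorthand for the four levels described in this section.

```python
# Category of optimizations introduced at each level (shorthand labels).
LEVEL_CATEGORIES = {
    0: [],            # all optimizations disabled
    1: ["basic"],     # e.g. constant folding, redundant node elimination
    2: ["extended"],  # e.g. node fusions
    99: ["layout"],   # data layout optimizations
}


def enabled_categories(optimization_level):
    """Return the categories enabled at a level, including all preceding levels."""
    if optimization_level not in LEVEL_CATEGORIES:
        raise ValueError(f"unknown optimization_level: {optimization_level!r}")
    enabled = []
    for level in sorted(LEVEL_CATEGORIES):
        if level <= optimization_level:
            enabled.extend(LEVEL_CATEGORIES[level])
    return enabled


for level in (0, 1, 2, 99):
    print(level, enabled_categories(level))
```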
enable_transformers_specific_optimizations=True means that transformers-specific graph fusion and approximation are performed in addition to the ONNX Runtime optimizations described above.
Here is a list of the possible optimizations you can enable:
- Gelu fusion with disable_gelu_fusion=False,
- Layer Normalization fusion with disable_layer_norm_fusion=False,
- Attention fusion with disable_attention_fusion=False,
- SkipLayerNormalization fusion with disable_skip_layer_norm_fusion=False,
- Add Bias and SkipLayerNormalization fusion with disable_bias_skip_layer_norm_fusion=False,
- Add Bias and Gelu / FastGelu fusion with disable_bias_gelu_fusion=False,
- Gelu approximation with enable_gelu_approximation=True.
Attention fusion is designed for right-side padding for BERT-like architectures (e.g. BERT, RoBERTa, ViT) and for left-side padding for generative models (GPT-like). If your inputs do not follow this convention, set use_raw_attention_mask=True to avoid potential accuracy issues, at the cost of some performance.
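As an illustration, the arguments named in this section can be collected and passed to OptimizationConfig. This is a sketch under assumptions: the chosen values are only an example for inputs that do not follow the padding convention, build_optimization_config is a hypothetical helper, and the import is done lazily so the snippet can be loaded without Optimum installed.

```python
# Keyword arguments named in this section; the values are illustrative.
CUSTOM_PADDING_KWARGS = {
    "optimization_level": 2,
    "enable_transformers_specific_optimizations": True,
    "disable_attention_fusion": False,  # keep attention fusion enabled
    "use_raw_attention_mask": True,     # inputs do not follow the padding convention
}


def build_optimization_config(**overrides):
    """Create an OptimizationConfig from the defaults above plus any overrides."""
    # Lazy import: Optimum is only required when the function is called.
    from optimum.onnxruntime import OptimizationConfig

    return OptimizationConfig(**{**CUSTOM_PADDING_KWARGS, **overrides})
```

Calling build_optimization_config() returns a configuration that can be passed to ORTOptimizer.optimize() through its optimization_config argument.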
While OptimizationConfig gives you full control over how the optimization is done, it can be hard to know what to enable or disable. Instead, you can use AutoOptimizationConfig, which provides four common optimization levels:
- O1: basic general optimizations.
- O2: basic and extended general optimizations, transformers-specific fusions.
- O3: same as O2 with GELU approximation.
- O4: same as O3 with mixed precision (fp16, GPU-only).
Example: Loading an O2 OptimizationConfig
>>> from optimum.onnxruntime import AutoOptimizationConfig
>>> optimization_config = AutoOptimizationConfig.O2()
You can also specify custom arguments that were not defined in the O2 configuration, for instance:
>>> from optimum.onnxruntime import AutoOptimizationConfig
>>> optimization_config = AutoOptimizationConfig.O2(disable_embed_layer_norm_fusion=False)
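When the level comes from user input or a settings file, it can be validated before the configuration is created. The sketch below is an assumption-laden example: check_level and load_auto_config are hypothetical helpers, and it assumes O1, O3 and O4 are exposed on AutoOptimizationConfig in the same way as the O2() call shown above.

```python
AUTO_LEVELS = ("O1", "O2", "O3", "O4")


def check_level(level):
    """Normalize a level name and reject anything outside O1 to O4."""
    normalized = level.upper()
    if normalized not in AUTO_LEVELS:
        raise ValueError(f"level must be one of {AUTO_LEVELS}, got {level!r}")
    return normalized


def load_auto_config(level, **overrides):
    """Return the configuration for a level, with optional custom arguments."""
    # Lazy import: Optimum is only required when the function is called.
    from optimum.onnxruntime import AutoOptimizationConfig

    # Assumption: each level is available as a method named after it, like O2().
    factory = getattr(AutoOptimizationConfig, check_level(level))
    return factory(**overrides)
```

For example, load_auto_config("O2", disable_embed_layer_norm_fusion=False) would mirror the call shown above.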
Optimization examples
Below is an end-to-end example of how to optimize distilbert-base-uncased-finetuned-sst-2-english.
>>> from optimum.onnxruntime import (
... AutoOptimizationConfig, ORTOptimizer, ORTModelForSequenceClassification
... )
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_dir = "distilbert_optimized"
>>> # Load a PyTorch model and export it to the ONNX format
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
>>> # Create the optimizer
>>> optimizer = ORTOptimizer.from_pretrained(model)
>>> # Define the optimization strategy by creating the appropriate configuration
>>> optimization_config = AutoOptimizationConfig.O2()
>>> # Optimize the model
>>> optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)
Below is an end-to-end example of how to optimize a Seq2Seq model, sshleifer/distilbart-cnn-12-6.
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OptimizationConfig, ORTOptimizer, ORTModelForSeq2SeqLM
>>> model_id = "sshleifer/distilbart-cnn-12-6"
>>> save_dir = "distilbart_optimized"
>>> # Load a PyTorch model and export it to the ONNX format
>>> model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)
>>> # Create the optimizer
>>> optimizer = ORTOptimizer.from_pretrained(model)
>>> # Define the optimization strategy by creating the appropriate configuration
>>> optimization_config = OptimizationConfig(
... optimization_level=2,
... enable_transformers_specific_optimizations=True,
... optimize_for_gpu=False,
... )
>>> # Optimize the model
>>> optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> optimized_model = ORTModelForSeq2SeqLM.from_pretrained(save_dir)
>>> tokens = tokenizer("This is a sample input", return_tensors="pt")
>>> outputs = optimized_model.generate(**tokens)
Optimizing a model with Optimum CLI
The Optimum ONNX Runtime optimization tools can be used directly through the Optimum command-line interface:
optimum-cli onnxruntime optimize --help
usage: optimum-cli <command> [<args>] onnxruntime optimize [-h] --onnx_model ONNX_MODEL -o OUTPUT (-O1 | -O2 | -O3 | -O4 | -c CONFIG)
options:
-h, --help show this help message and exit
-O1 Basic general optimizations (see: https://huggingface.co./docs/optimum/onnxruntime/usage_guides/optimization for more details).
-O2 Basic and extended general optimizations, transformers-specific fusions (see: https://huggingface.co./docs/optimum/onnxruntime/usage_guides/optimization for more
details).
-O3 Same as O2 with Gelu approximation (see: https://huggingface.co./docs/optimum/onnxruntime/usage_guides/optimization for more details).
-O4 Same as O3 with mixed precision (see: https://huggingface.co./docs/optimum/onnxruntime/usage_guides/optimization for more details).
-c CONFIG, --config CONFIG
`ORTConfig` file to use to optimize the model.
Required arguments:
--onnx_model ONNX_MODEL
Path to the repository where the ONNX models to optimize are located.
-o OUTPUT, --output OUTPUT
Path to the directory where to store generated ONNX model.
Optimizing an ONNX model can be done as follows:
optimum-cli onnxruntime optimize --onnx_model onnx_model_location/ -O1 -o optimized_model/
This optimizes all the ONNX files in onnx_model_location with the basic general optimizations.
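The usage line above requires exactly one of -O1, -O2, -O3, -O4 or -c CONFIG. A small wrapper can enforce that before the command is run. This is a sketch only: build_optimize_command is a hypothetical helper that builds the argument list from the options listed in the help output and does not call Optimum itself.

```python
def build_optimize_command(onnx_model, output, level=None, config=None):
    """Return the argument list for `optimum-cli onnxruntime optimize`."""
    # Exactly one of a predefined level or an ORTConfig file must be given.
    if (level is None) == (config is None):
        raise ValueError("pass exactly one of `level` or `config`")
    cmd = ["optimum-cli", "onnxruntime", "optimize", "--onnx_model", onnx_model]
    if level is not None:
        if level not in ("O1", "O2", "O3", "O4"):
            raise ValueError(f"unknown level: {level!r}")
        cmd.append(f"-{level}")
    else:
        cmd += ["-c", config]
    cmd += ["-o", output]
    return cmd


command = build_optimize_command("onnx_model_location/", "optimized_model/", level="O1")
print(" ".join(command))
```

The resulting list can be handed to subprocess.run(command, check=True), assuming optimum-cli is installed and on the PATH.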