Neuron Model Inference
The APIs presented in the following documentation are relevant for the inference on inf2, trn1 and inf1.
NeuronModelForXXX
classes help to load models from the Hugging Face Hub and compile them to a serialized format optimized for
neuron devices. You will then be able to load the model and run inference with the acceleration powered by AWS Neuron devices.
Switching from Transformers to Optimum
The optimum.neuron.NeuronModelForXXX
model classes are APIs compatible with Hugging Face Transformers models. This means seamless integration
with Hugging Face’s ecosystem. You can just replace your AutoModelForXXX
class with the corresponding NeuronModelForXXX
class in optimum.neuron
.
If you already use Transformers, you will be able to reuse your code just by replacing model classes:
from transformers import AutoTokenizer
-from transformers import AutoModelForSequenceClassification
+from optimum.neuron import NeuronModelForSequenceClassification
# PyTorch checkpoint
-model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
+model = NeuronModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english",
+ export=True, **neuron_kwargs)
As shown above, when you use NeuronModelForXXX
for the first time, you will need to set export=True
to compile your model from PyTorch to a neuron-compatible format.
You will also need to pass Neuron specific parameters to configure the export. Each model architecture has its own set of parameters, as detailed in the next paragraphs.
Once your model has been exported, you can save it either on your local or in the Hugging Face Model Hub:
# Save the neuron model
>>> model.save_pretrained("a_local_path_for_compiled_neuron_model")
# Push the neuron model to HF Hub
>>> model.push_to_hub(
... "a_local_path_for_compiled_neuron_model", repository_id="my-neuron-repo", use_auth_token=True
... )
And the next time when you want to run inference, just load your compiled model which will save you the compilation time:
>>> from optimum.neuron import NeuronModelForSequenceClassification
>>> model = NeuronModelForSequenceClassification.from_pretrained("my-neuron-repo")
As you see, there is no need to pass the neuron arguments used during the export as they are
saved in a config.json
file, and will be restored automatically by NeuronModelForXXX
class.
When running inference for the first time, there is a warmup phase when you run the pipeline for the first time. This run would take 3x-4x higher latency than a regular run.
Discriminative NLP models
As explained in the previous section, you will need only few modifications to your Transformers code to export and run NLP models:
from transformers import AutoTokenizer
-from transformers import AutoModelForSequenceClassification
+from optimum.neuron import NeuronModelForSequenceClassification
# PyTorch checkpoint
-model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
# Compile your model during the first time
+compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
+input_shapes = {"batch_size": 1, "sequence_length": 64}
+model = NeuronModelForSequenceClassification.from_pretrained(
+ "distilbert-base-uncased-finetuned-sst-2-english", export=True, **compiler_args, **input_shapes,
+)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
inputs = tokenizer("Hamilton is considered to be the best musical of human history.", return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax().item()])
# 'POSITIVE'
compiler_args
are optional arguments for the compiler, these arguments usually control how the compiler makes tradeoff between the inference performance (latency and throughput) and the accuracy. Here we cast FP32 operations to BF16 using the Neuron matrix-multiplication engine.
input_shapes
are mandatory static shape information that you need to send to the neuron compiler. Wondering what shapes are mandatory for your model? Check it out
with the following code:
>>> from transformers import AutoModelForSequenceClassification
>>> from optimum.exporters import TasksManager
>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
# Infer the task name if you don't know
>>> task = TasksManager.infer_task_from_model(model) # 'text-classification'
>>> neuron_config_constructor = TasksManager.get_exporter_config_constructor(
... model=model, exporter="neuron", task='text-classification'
... )
>>> print(neuron_config_constructor.func.get_mandatory_axes_for_task(task))
# ('batch_size', 'sequence_length')
Be careful, the input shapes used for compilation should be inferior than the size of inputs that you will feed into the model during the inference.
- What if input sizes are smaller than compilation input shapes?
No worries, NeuronModelForXXX
class will pad your inputs to an eligible shape. Besides you can set dynamic_batch_size=True
in the from_pretrained
method to enable dynamic batching, which means that your inputs can have variable batch size.
(Just keep in mind: dynamicity and padding comes with not only flexibility but also performance drop. Fair enough!)
Generative NLP models
As explained before, you will need only a few modifications to your Transformers code to export and run NLP models:
Configuring the export of a generative model
As for non-generative models, two sets of parameters can be passed to the from_pretrained()
method to configure how a transformers checkpoint is exported to
a neuron optimized model:
compiler_args = { num_cores, auto_cast_type }
are optional arguments for the compiler, these arguments usually control how the compiler makes tradeoff between the inference latency and throughput and the accuracy.input_shapes = { batch_size, sequence_length }
correspond to the static shape of the model input and the KV-cache (attention keys and values for past tokens).num_cores
is the number of neuron cores used when instantiating the model. Each neuron core has 16 Gb of memory, which means that bigger models need to be split on multiple cores. Defaults to 1,auto_cast_type
specifies the format to encode the weights. It can be one offp32
(float32
),fp16
(float16
) orbf16
(bfloat16
). Defaults tofp32
.batch_size
is the number of input sequences that the model will accept. Defaults to 1,sequence_length
is the maximum number of tokens in an input sequence. Defaults tomax_position_embeddings
(n_positions
for older models).
from transformers import AutoTokenizer
-from transformers import AutoModelForCausalLM
+from optimum.neuron import NeuronModelForCausalLM
# Instantiate and convert to Neuron a PyTorch checkpoint
+compiler_args = {"num_cores": 1, "auto_cast_type": 'fp32'}
+input_shapes = {"batch_size": 1, "sequence_length": 512}
-model = AutoModelForCausalLM.from_pretrained("gpt2")
+model = NeuronModelForCausalLM.from_pretrained("gpt2", export=True, **compiler_args, **input_shapes)
As explained before, these parameters can only be configured during export. This means in particular that during inference:
- the
batch_size
of the inputs should be equal to thebatch_size
used during export, - the
length
of the input sequences should be lower than thesequence_length
used during export, - the maximum number of tokens (input + generated) cannot exceed the
sequence_length
used during export.
Text generation inference
As with the original transformers models, use generate()
instead of forward()
to generate text sequences.
from transformers import AutoTokenizer
-from transformers import AutoModelForCausalLM
+from optimum.neuron import NeuronModelForCausalLM
# Instantiate and convert to Neuron a PyTorch checkpoint
-model = AutoModelForCausalLM.from_pretrained("gpt2")
+model = NeuronModelForCausalLM.from_pretrained("gpt2", export=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token_id = tokenizer.eos_token_id
tokens = tokenizer("I really wish ", return_tensors="pt")
with torch.inference_mode():
sample_output = model.generate(
**tokens,
do_sample=True,
min_length=128,
max_length=256,
temperature=0.7,
)
outputs = [tokenizer.decode(tok) for tok in sample_output]
print(outputs)
The generation is highly configurable. Please refer to https://huggingface.co./docs/transformers/generation_strategies for details.
Please be aware that:
- for each model architecture, default values are provided for all parameters, but values passed to the
generate
method will take precedence, - the generation parameters can be stored in a
generation_config.json
file. When such a file is present in model directory, it will be parsed to set the default parameters (the values passed to thegenerate
method still take precedence).
Happy inference with Neuron! 🚀