ONNX format

#3 opened by lordofthejars

I wonder why the Granite models are not offered as ONNX files, too. ONNX is an open format supported by most platforms (for example, OpenShift AI) and by Java libraries for model inference (like DJL or LangChain4j).

I can convert them using any tool, but why not offer them by default in the repo to increase the adoption of the Granite models?

Hi @lordofthejars! This is a great feature request. There are a number of additional formats beyond transformers that we're actively investigating (GGUF being the main other one). We're working on getting the story right so that there's strong lineage between the source models and the converted formats, all without cluttering this central ibm-granite organization. We hope to have the GGUF model conversions centrally available soon, and we'll certainly consider ONNX as a strong candidate for one of the next targets.

In the meantime, support for the granite architectures has been added to optimum, so you can do the ONNX conversions yourself as follows:

# Install optimum and dependencies for conversions
pip install "optimum[exporters]"

# Update optimum to the tip of main for granite support
pip install "git+https://github.com/huggingface/optimum"

# Download granite
huggingface-cli download ibm-granite/granite-3.1-2b-instruct --local-dir /path/to/granite-3.1-2b-instruct

# Export to ONNX
optimum-cli export onnx \
  --model /path/to/granite-3.1-2b-instruct \
  --task text-generation-with-past \
  /path/to/granite-3.1-2b-instruct-onnx

Once converted, you can validate the model in a simple generation use case:

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Load with onnxruntime
model_path = "/path/to/granite-3.1-2b-instruct-onnx"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = ORTModelForCausalLM.from_pretrained(model_path)

# Make a generation call
prompt = "This is a story about a developer and their dog: "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(max_new_tokens=200, **inputs)
res = tokenizer.decode(outputs[0])
print(res)
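
As a follow-up, the exported model also works as a drop-in replacement in a standard transformers pipeline. Here's a minimal sketch, assuming the same placeholder output directory as above:

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Placeholder path: point this at your ONNX export directory
model_path = "/path/to/granite-3.1-2b-instruct-onnx"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = ORTModelForCausalLM.from_pretrained(model_path)

# ORTModelForCausalLM instances can be passed directly to transformers pipelines
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = generator("This is a story about a developer and their dog: ", max_new_tokens=200)
print(result[0]["generated_text"])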

Important note: we're still working on support for the granitemoe architecture in ONNX.
