Common AI Model Formats

Community Article Published February 27, 2025


This article is mirrored from https://blog.ngxson.com/common-ai-model-formats

Here is an AI podcast about this blog post, check it out!

This podcast is generated via ngxson/kokoro-podcast-generator, using DeepSeek-R1 and Kokoro-TTS.

For the past two years, the open-source AI community has been buzzing with excitement over the development of new AI models. An increasing number of models are released daily on Hugging Face, and many are being used in production applications. However, one challenge developers encounter when working with these models is the variety of formats they are available in.

In this article, we will explore some common AI model formats used today, including GGUF, PyTorch, Safetensors, and ONNX. We will discuss the advantages and disadvantages of each format and offer guidance on when to use each one.

GGUF

GGUF was initially developed for the llama.cpp project. It is a binary format designed for fast loading and saving of models and for ease of reading. Models are typically developed using PyTorch or another framework and then converted to GGUF for use with GGML.

Over time, GGUF has become one of the most popular formats for sharing AI models within the open-source community. It is supported by numerous well-known inference runtimes, including llama.cpp, ollama, and vLLM.

Currently, GGUF is primarily used for language models. While it is possible to use it for other types of models, such as diffusion models via stable-diffusion.cpp, it is not as common as its application in language models.

A GGUF file comprises:

  • A metadata section organized in key-value pairs. This section contains information about the model, such as its architecture, version, and hyperparameters.
  • A section for tensor metadata. This section includes details about the tensors in the model, such as their shape, data type, and name.
  • Finally, a section containing the tensor data itself.

Diagram of the GGUF format structure, by @mishig25 (GGUF v3)
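
To make this layout concrete, here is a minimal sketch that reads only the fixed-size header of a GGUF v3 file using Python's standard library. It assumes a local file named model.gguf (a placeholder) and stops before the metadata and tensor-info sections, which a full reader such as llama.cpp's gguf-py package would parse next.

import struct

with open("model.gguf", "rb") as f:
    # A GGUF file starts with the 4-byte magic "GGUF".
    magic = f.read(4)
    assert magic == b"GGUF", "not a GGUF file"
    # Then come the format version (uint32), the tensor count (uint64)
    # and the metadata key-value count (uint64), all little-endian.
    version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))

print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata key-value pairs")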

The GGUF format and the GGML library also offer flexible quantization schemes, enabling efficient model storage while maintaining good accuracy. Some of the most common quantization schemes are:

  • Q4_K_M: Most tensors are quantized to 4 bits, with some quantized to 6 bits. This is the most frequently used quantization scheme.
  • IQ4_XS: Almost all tensors are quantized to 4 bits, but with the aid of an importance matrix. This matrix is used to calibrate the quantization of each tensor, potentially leading to better accuracy while maintaining storage efficiency.
  • IQ2_M: Similar to IQ4_XS, but with 2-bit quantization. This is the most aggressive quantization scheme, yet it can still achieve good accuracy on certain models. It is suitable for hardware with very limited memory.
  • Q8_0: All tensors are quantized to 8 bits. This is the least aggressive quantization scheme and provides almost the same accuracy as the original model.

Example of a Llama-3.1 8B model in GGUF format, link here
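
To build some intuition for these block-wise schemes, here is a simplified sketch of Q8_0-style quantization (blocks of 32 values sharing a single scale). It is a conceptual illustration only, not the exact GGML implementation.

import numpy as np

def quantize_q8_0_block(block: np.ndarray):
    # One scale per block of 32 values; each value is stored as an int8.
    assert block.shape == (32,)
    amax = np.abs(block).max()
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.round(block / scale).astype(np.int8)
    return scale, q

def dequantize_q8_0_block(scale, q):
    return scale * q.astype(np.float32)

weights = np.random.randn(32).astype(np.float32)
scale, q = quantize_q8_0_block(weights)
restored = dequantize_q8_0_block(scale, q)
print("max rounding error:", np.abs(weights - restored).max())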

Let's recap the advantages and disadvantages of GGUF:

  • Advantages:
    • Simple: The single-file format is easy to share and distribute.
    • Fast: Fast loading and saving of models is achieved through compatibility with mmap().
    • Efficient: Offers flexible quantization schemes.
    • Portable: The binary layout is simple enough to be read in any language without requiring a dedicated library.
  • Disadvantages:
    • Most models need to be converted from other formats (PyTorch, Safetensors) to GGUF.
    • Not all models are convertible; some architectures are not supported by llama.cpp.
    • Modifying or fine-tuning a model after it has been saved in GGUF is not straightforward.

GGUF is primarily used for serving models in production environments, where fast loading times are crucial. It is also used for sharing models within the open-source community, as the format's simplicity facilitates easy distribution.

Useful resources:

  • llama.cpp project, which provides scripts for converting HF models to GGUF.
  • gguf-my-repo space on HF allows converting models to GGUF format without downloading them locally.
  • ollama and HF-ollama integration enable running any GGUF model from the HF Hub via the ollama run command.

PyTorch (.pt/.pth)

The .pt/.pth extensions denote PyTorch's default serialization format, typically used to store a model's state dictionary (its learned weights and biases) and, in training checkpoints, optimizer states and other training metadata.

PyTorch models are usually saved in one of two ways; the .pt and .pth extensions are interchangeable conventions and do not by themselves determine the content:

  • Saving the entire model object, including its architecture and learned parameters.
  • Saving only the model's state dictionary (state_dict), which contains the learned parameters; this is the approach recommended by the PyTorch documentation.
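
As a quick, hedged illustration of both approaches with torch.save() and torch.load() (the tiny linear model and the file names are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # placeholder model

# Recommended: save only the state dict (the learned weights and biases).
torch.save(model.state_dict(), "model.pth")
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load("model.pth"))

# Also possible: pickle the entire model object, architecture included.
torch.save(model, "model_full.pt")
# Loading a full object requires unrestricted unpickling; recent PyTorch
# versions ask for weights_only=False to allow it.
full_model = torch.load("model_full.pt", weights_only=False)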

The PyTorch format is based on Python's pickle module, which serializes Python objects. To understand how pickle works, let's examine the following example:

import pickle

model_state_dict = {"layer1": "hello", "layer2": "world"}
with open("model.pkl", "wb") as f:
    pickle.dump(model_state_dict, f)  # serialize the dictionary to disk

The pickle.dump() function serializes the model_state_dict dictionary and saves it to a file named model.pkl. The output file now contains a binary representation of the dictionary:

Hex view of model.pkl

To load the serialized dictionary back into Python, we can use the pickle.load() function:

import pickle

with open("model.pkl", "rb") as f:
    model_state_dict = pickle.load(f)  # deserialize the dictionary from disk

print(model_state_dict)
# Output: {'layer1': 'hello', 'layer2': 'world'}

As you can see, the pickle module provides an easy way to serialize Python objects. However, it has some limitations:

  • Security: Anything can be pickled, including malicious code, which can lead to security vulnerabilities if serialized data is not properly validated. For example, this article from Snyk explains how pickle files can be backdoored (a minimal illustration follows this list).
  • Efficiency: It is not designed for lazy-loading or partial data loading. This can result in slow loading times and high memory usage when working with large models.
  • Portability: It is specific to Python, which can make sharing models with other languages challenging.
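
To see why loading untrusted pickle files is dangerous, here is a minimal, harmless sketch: any object can define __reduce__ so that pickle calls an arbitrary function at load time. A real attack would substitute something like os.system for print.

import pickle

class Malicious:
    def __reduce__(self):
        # pickle stores "call this function with these arguments" in the file
        return (print, ("arbitrary code executed during unpickling!",))

payload = pickle.dumps(Malicious())
pickle.loads(payload)  # prints the message; we never called print ourselves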

The PyTorch format can be a suitable choice if you are working exclusively within a Python and PyTorch environment. However, in recent years, the AI community has been shifting towards more efficient and secure serialization formats, such as GGUF and Safetensors.

Useful resources:

  • PyTorch documentation on saving and loading models.
  • executorch project, which offers a way to convert PyTorch models to .pte files that run on mobile and edge devices.

Safetensors

Developed by Hugging Face, safetensors addresses security and efficiency limitations present in traditional Python serialization approaches like pickle, used by PyTorch. The format uses a restricted deserialization process to prevent code execution vulnerabilities.

A safetensors file contains:

  • A metadata section saved in JSON format. This section contains information about all tensors in the model, such as their shape, data type, and name. It can optionally also contain custom metadata.
  • A section for the tensor data.

Diagram of the Safetensors format structure
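
As a small sketch of how this looks with Hugging Face's safetensors library, the example below saves a dictionary of tensors (with optional metadata) and then lazily loads a single tensor back; the tensor names and the file name are placeholders.

import torch
from safetensors import safe_open
from safetensors.torch import save_file

tensors = {
    "layer1.weight": torch.randn(16, 16),
    "layer1.bias": torch.zeros(16),
}
# Optional custom metadata must be a string-to-string mapping.
save_file(tensors, "model.safetensors", metadata={"format": "pt"})

# Lazy loading: only the requested tensor is read from disk.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    print(f.keys())
    bias = f.get_tensor("layer1.bias")
    print(bias.shape)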

  • Advantages:
    • Secure: Safetensors employs a restricted deserialization process to prevent code execution vulnerabilities.
    • Fast: It is designed for lazy-loading and partial data loading, which can lead to faster loading times and lower memory usage. This is similar to GGUF, where you can mmap() the file.
    • Efficient: Supports quantized tensors.
    • Portable: It is designed to be portable across different programming languages, making it easy to share models with other languages.
  • Disadvantages:
    • The quantization scheme is not as flexible as GGUF's, mainly because it relies on the quantization support provided by PyTorch.
    • A JSON parser is required to read the metadata section. This can be problematic when working with low-level languages like C++, which do not have built-in JSON support.

Note: While metadata can in theory be saved inside the file, in practice model metadata is often stored in a separate JSON file. This can be both advantageous and disadvantageous, depending on the use case.

The safetensors format is the default serialization format used by Hugging Face's transformers library. It is widely used in the open-source community for sharing, training, fine-tuning, and serving AI models. Most new models released on Hugging Face are stored in safetensors format, including Llama, Gemma, Phi, Stable Diffusion, Flux, and many others.
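
For example, saving a transformers model with save_pretrained() writes a model.safetensors file by default; below is a minimal sketch (gpt2 is just a small example checkpoint, and the output directory is a placeholder).

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # any model id on the Hub
model.save_pretrained("./gpt2-local", safe_serialization=True)  # writes model.safetensors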

Useful resources:

  • transformers library documentation on saving and loading models.
  • bitsandbytes guide on how to quantize models and save them in safetensors format.
  • mlx-community organization on HF that provides models compatible with the MLX framework (Apple silicon).

ONNX

The Open Neural Network Exchange (ONNX) format offers a vendor-neutral representation of machine learning models. It is part of the ONNX ecosystem, which includes tools and libraries for interoperability between different frameworks such as PyTorch, TensorFlow, and MXNet.

ONNX models are saved in a single file with the .onnx extension. Unlike GGUF or Safetensors, ONNX contains not only the model's tensors and metadata, but also the model's computation graph.

Including the computation graph in the model file allows for greater flexibility when working with the model. For instance, when a new model is released, you can readily convert it to ONNX format without needing to be concerned about the model's architecture or inference code, because the computation graph is already saved within the file.

Example of a computation graph in ONNX format, generated by Netron
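
As a hedged, minimal sketch of the typical workflow, the example below exports a tiny PyTorch model to ONNX and runs it with onnxruntime; the model and the input/output names are placeholders.

import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
dummy_input = torch.randn(1, 4)

# Export the weights together with the computation graph.
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["output"],
)

# Run the exported graph; only onnxruntime is needed at inference time.
session = ort.InferenceSession("model.onnx")
outputs = session.run(None, {"input": np.random.randn(1, 4).astype(np.float32)})
print(outputs[0].shape)  # (1, 2)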

  • Advantages:
    • Flexibility: The inclusion of the computation graph in the model file provides more flexibility when converting models between different frameworks.
    • Portability: Thanks to the ONNX ecosystem, the ONNX format can be easily deployed on various platforms and devices, including mobile and edge devices.
  • Disadvantages:
    • Limited support for quantized tensors. ONNX does not natively support quantized tensors, but instead decomposes them into an integer tensor and a scale factor tensor. This can lead to reduced quality when working with quantized models.
    • Complex architectures may necessitate operator fallbacks or custom implementations for unsupported layers. This can potentially result in performance loss when converting models to ONNX format.

Overall, ONNX is a good choice if you are targeting mobile devices or in-browser inference.

Useful resources:

  • onnx-community organization on HF that provides models in ONNX format, as well as conversion guides.
  • transformers.js project that allows running ONNX models in the browser, using WebGPU or WebAssembly.
  • onnxruntime project that provides a high-performance inference engine on various platforms and hardware.
  • netron project that allows visualizing ONNX models in the browser.

Hardware Support

When choosing a model format, it is important to consider the hardware on which the model will be deployed. The table below shows hardware support recommendations for each format:

| Hardware | GGUF | PyTorch | Safetensors | ONNX |
| --- | --- | --- | --- | --- |
| CPU | ✅ (best) | 🟡 | 🟡 | ✅ |
| GPU | ✅ | ✅ | ✅ | ✅ |
| Mobile deployment | ✅ | 🟡 (via executorch) | ❌ | ✅ |
| Apple silicon | ✅ | 🟡 | ✅ (via MLX framework) | ✅ |

Explanation:

  • ✅: Fully supported
  • 🟡: Partially supported or low performance
  • ❌: Not supported

Conclusion

In this article, we have explored some of the common AI model formats used today, including GGUF, PyTorch, Safetensors, and ONNX. Each format possesses its own advantages and disadvantages, making it crucial to choose the right format based on your specific use case and hardware requirements.

Footnotes

mmap: Memory-mapped files are an operating system feature that allows files to be mapped into memory. This can be beneficial for reading and writing large files without needing to load the entire file into memory.
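
For illustration, here is a minimal sketch of memory-mapping a file with Python's standard mmap module, the same mechanism GGUF and safetensors loaders rely on (the file name is a placeholder):

import mmap

with open("model.gguf", "rb") as f:
    # Map the file read-only; the OS loads pages on demand instead of copying everything into RAM.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(mm[:4])  # touches only the first 4 bytes (the GGUF magic)
    mm.close()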

lazy-loading: Lazy-loading is a technique that defers the loading of data until it is actually required. This can help reduce memory usage and improve performance when working with large models.

computation graph: In the context of machine learning, a computation graph is a flowchart that illustrates how data flows through a model and how different calculations (such as addition, multiplication, or activation function application) are performed at each step.

Community

Hi, thanks for sharing! Curious if you could say more about why it's hard to fine-tune or modify a model saved using GGUF?

Article author

For fine-tuning, there were some community efforts to add this functionality to llama.cpp a while ago, but it proved too complicated to maintain. GPU fine-tuning in llama.cpp is also non-existent at the time of writing.

As for modifying, all metadata and tensors are baked into one single file, which makes it difficult to edit. For example, with a safetensors model, modifying the chat template is simple: it is stored in tokenizer_config.json. But for GGUF, you need to use a special script that does not modify the file in place, but instead creates a new GGUF with the modified metadata.
