title: Optimum-Nvidia - TensorRT-LLM optimized inference engines
emoji: π
colorFrom: green
colorTo: yellow
sdk: static
pinned: false
[Optimum-Nvidia](https://github.com/huggingface/optimum-nvidia) lets you easily leverage Nvidia's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) inference tool
through a seamless integration that follows the huggingface/transformers API.
This organisation holds prebuilt TensorRT-LLM compatible engines for various foundational models that you can use, fork and deploy to get started as fast as possible and benefit from
out-of-the-box peak performance on Nvidia hardware.
Prebuilt engines are built, as far as possible, with the best options available, and updated models will be pushed as new features are added to the TensorRT-LLM repository.
This can include (but is not limited to):
- Leveraging `float8` quantization on supported hardware (H100/L4/L40/RTX 40xx)
- Enabling `float8` or `int8` KV cache
- Enabling in-flight batching for dynamic batching when used in combination with Nvidia Triton Inference Server
- Enabling xQA attention kernels
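
As a rough illustration of the hardware condition in the list above, the sketch below checks whether a GPU's CUDA compute capability is high enough for `float8`. It assumes that the listed GPUs correspond to compute capability 8.9 (L4, L40, RTX 40xx) and 9.0 (H100); the helper name `supports_float8` is hypothetical and is not part of Optimum-Nvidia or TensorRT-LLM.

```python
from typing import Tuple

# Assumption: float8 is usable from compute capability 8.9 upward,
# which covers L4 / L40 / RTX 40xx (8.9) and H100 (9.0).
MIN_FLOAT8_CAPABILITY: Tuple[int, int] = (8, 9)


def supports_float8(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability can use float8.

    Hypothetical helper for illustration only.
    """
    # Tuples compare element by element, so (9, 0) >= (8, 9) holds.
    return tuple(capability) >= MIN_FLOAT8_CAPABILITY
```

On a machine with PyTorch installed, the `(major, minor)` tuple can be obtained with `torch.cuda.get_device_capability()`.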
Current engines target the following Nvidia Tensor Core GPUs and can be found on the branch matching the targeted GPU in each repo:
- [4090 (sm_89)](https://huggingface.co./collections/optimum-nvidia/rtx-4090-optimized-tensorrt-llm-models-65e5ebc1240c11001a3e666b)
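
Fetching an engine from a GPU-specific branch could look roughly like the sketch below. It assumes that branches are named after the `sm_XX` architecture tag shown above (only `sm_89` is confirmed by this page), and the helper names `branch_for_capability` and `download_engine` are hypothetical. The download itself uses `snapshot_download` from `huggingface_hub` with its `revision` argument.

```python
from typing import Tuple


def branch_for_capability(capability: Tuple[int, int]) -> str:
    """Map a (major, minor) compute capability to an assumed branch name.

    Assumption: branches follow the `sm_<major><minor>` pattern.
    """
    major, minor = capability
    return f"sm_{major}{minor}"


def download_engine(repo_id: str, capability: Tuple[int, int]) -> str:
    """Download the engine files for the given GPU and return the local path.

    `repo_id` must be a repository from this organisation that actually
    has a branch for the requested GPU.
    """
    # Imported lazily so the name-mapping helper works without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        revision=branch_for_capability(capability),
    )
```

For an RTX 4090, `branch_for_capability((8, 9))` yields `sm_89`, so `download_engine(repo_id, (8, 9))` would request that branch of the chosen repository.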
Feel free to open discussions and request support for additional models through the community tab.
- The Optimum-Nvidia team at 🤗