Supported models and hardware

We are continually expanding support for additional model types and plan to include them in future updates.

Supported embeddings models

Text Embeddings Inference currently supports Nomic, BERT, CamemBERT, and XLM-RoBERTa models with absolute positions; the JinaBERT model with ALiBi positions; and Mistral, Alibaba GTE, and Qwen2 models with RoPE positions.

Below are some examples of the currently supported models:

MTEB Rank	Model Size	Model Type	Model ID
1	7B (Very Slow)	Mistral	Salesforce/SFR-Embedding-2_R
15	0.4B	Alibaba GTE	Alibaba-NLP/gte-large-en-v1.5
20	0.3B	BERT	WhereIsAI/UAE-Large-V1
24	0.5B	XLM-RoBERTa	intfloat/multilingual-e5-large-instruct
N/A	0.1B	NomicBert	nomic-ai/nomic-embed-text-v1
N/A	0.1B	NomicBert	nomic-ai/nomic-embed-text-v1.5
N/A	0.1B	JinaBERT	jinaai/jina-embeddings-v2-base-en
N/A	0.1B	JinaBERT	jinaai/jina-embeddings-v2-base-code

To explore the list of best-performing text embedding models, visit the Massive Text Embedding Benchmark (MTEB) Leaderboard.

Supported re-rankers and sequence classification models

Text Embeddings Inference currently supports CamemBERT and XLM-RoBERTa sequence classification models with absolute positions.

Below are some examples of the currently supported models:

Task	Model Type	Model ID	Revision
Re-Ranking	XLM-RoBERTa	BAAI/bge-reranker-large	`refs/pr/4`
Re-Ranking	XLM-RoBERTa	BAAI/bge-reranker-base	`refs/pr/5`
Sentiment Analysis	RoBERTa	SamLowe/roberta-base-go_emotions
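
The Revision column above matters when launching a model: the two BAAI re-rankers are pinned to a pull-request ref rather than the default branch. The following is a minimal sketch, not part of the library, of turning a table row into launcher arguments. The helper name `router_args` and the dictionary are hypothetical; the sketch assumes the launcher accepts `--model-id` and `--revision` arguments.

```python
from typing import Dict, List, Optional

# Model ID -> revision, copied from the table above.
# None means the table lists no revision for that model.
RERANKER_REVISIONS: Dict[str, Optional[str]] = {
    "BAAI/bge-reranker-large": "refs/pr/4",
    "BAAI/bge-reranker-base": "refs/pr/5",
    "SamLowe/roberta-base-go_emotions": None,
}


def router_args(model_id: str) -> List[str]:
    """Build launcher arguments for a model (hypothetical helper).

    Assumes `--model-id` and `--revision` are the accepted argument names.
    """
    args = ["--model-id", model_id]
    revision = RERANKER_REVISIONS.get(model_id)
    if revision is not None:
        # Only pin a revision when the table lists one.
        args += ["--revision", revision]
    return args


print(router_args("BAAI/bge-reranker-large"))
```

Models that are not in the dictionary, or that have no listed revision, get only the model ID, so the default revision is used.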

Supported hardware

Text Embeddings Inference can be used on CPU, Turing (T4, RTX 2000 series, …), Ampere 80 (A100, A30), Ampere 86 (A10, A40, …), Ada Lovelace (RTX 4000 series, …), and Hopper (H100) architectures.

The library does not support CUDA compute capabilities below 7.5, so V100, Titan V, GTX 1000 series, and similar GPUs are not supported. To use your GPUs, install the NVIDIA Container Toolkit and use NVIDIA drivers with CUDA version 12.2 or higher.

Find the appropriate Docker image for your hardware in the following table:

Architecture	Image
CPU	ghcr.io/huggingface/text-embeddings-inference:cpu-1.6
Volta	NOT SUPPORTED
Turing (T4, RTX 2000 series, …)	ghcr.io/huggingface/text-embeddings-inference:turing-1.6 (experimental)
Ampere 80 (A100, A30)	ghcr.io/huggingface/text-embeddings-inference:1.6
Ampere 86 (A10, A40, …)	ghcr.io/huggingface/text-embeddings-inference:86-1.6
Ada Lovelace (RTX 4000 series, …)	ghcr.io/huggingface/text-embeddings-inference:89-1.6
Hopper (H100)	ghcr.io/huggingface/text-embeddings-inference:hopper-1.6 (experimental)
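
The table above can be read as a lookup from GPU compute capability to image tag. The sketch below is illustrative only: the function name `select_image` is hypothetical, and the mapping from architecture to compute capability (Turing 7.5, Ampere 8.0 and 8.6, Ada Lovelace 8.9, Hopper 9.0) follows NVIDIA's numbering rather than anything stated in the table itself.

```python
from typing import Dict, Optional, Tuple

REGISTRY = "ghcr.io/huggingface/text-embeddings-inference"

# Compute capability (major, minor) -> image tag from the table above.
IMAGE_TAGS: Dict[Tuple[int, int], str] = {
    (7, 5): "turing-1.6",  # Turing (experimental)
    (8, 0): "1.6",         # Ampere 80
    (8, 6): "86-1.6",      # Ampere 86
    (8, 9): "89-1.6",      # Ada Lovelace
    (9, 0): "hopper-1.6",  # Hopper (experimental)
}


def select_image(compute_capability: Optional[Tuple[int, int]]) -> str:
    """Return the image for a compute capability, or the CPU image for None."""
    if compute_capability is None:
        return f"{REGISTRY}:cpu-1.6"
    if compute_capability < (7, 5):
        # Volta and older GPUs are not supported.
        raise ValueError(f"compute capability {compute_capability} is below 7.5")
    tag = IMAGE_TAGS.get(compute_capability)
    if tag is None:
        raise ValueError(f"no image listed for compute capability {compute_capability}")
    return f"{REGISTRY}:{tag}"


print(select_image((8, 6)))
```

Passing `None` selects the CPU image; a capability that is supported in principle but absent from the table raises an error instead of guessing a tag.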

Warning: Flash Attention is turned off by default for the Turing image because it suffers from precision issues. You can turn Flash Attention v1 on by setting the USE_FLASH_ATTENTION=True environment variable.
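
As a rough illustration of the warning above, the following sketch assembles a `docker run` command that passes `USE_FLASH_ATTENTION=True` into the container with Docker's `-e` flag. The helper `docker_run_command` is hypothetical, and the container port (80), the `--gpus all` option, and the `--model-id` argument are assumptions about a typical invocation, not details confirmed on this page.

```python
import shlex
from typing import List


def docker_run_command(
    image: str,
    model_id: str,
    use_flash_attention: bool = False,
    host_port: int = 8080,
) -> List[str]:
    """Assemble a `docker run` argument list (hypothetical helper)."""
    # Assumption: the server listens on port 80 inside the container.
    cmd = ["docker", "run", "--gpus", "all", "-p", f"{host_port}:80"]
    if use_flash_attention:
        # Opt in to Flash Attention v1, which is off by default on Turing.
        cmd += ["-e", "USE_FLASH_ATTENTION=True"]
    # Assumption: the model is chosen with a `--model-id` argument.
    cmd += [image, "--model-id", model_id]
    return cmd


command = docker_run_command(
    "ghcr.io/huggingface/text-embeddings-inference:turing-1.6",
    "WhereIsAI/UAE-Large-V1",
    use_flash_attention=True,
)
print(shlex.join(command))
```

Building the command as a list rather than a single string keeps each argument separate, so it can be passed directly to `subprocess.run` without shell quoting issues.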
