HERE IS HOW YOU USE THIS WITH TEI OR INFERENCE ENDPOINTS
testing to see if this works for TEI
I can confirm this PR does work for TEI.
If using Inference Endpoints:
- go to https://ui.endpoints.huggingface.co/new
- put Alibaba-NLP/gte-multilingual-base as the model repository
- (optional) set the endpoint name
- choose cloud provider, device (CPU/GPU)
- select "Advanced Configuration"
- select "sentence embeddings" in the "Task" dropdown
- put refs/pr/7 in the "Revision" box
- In "Environment Variables", set MODEL_ID=/repository
Or, if you'd like to use the Python client to create the endpoint, you can use the following:
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-endpoint-name",
    repository="Alibaba-NLP/gte-multilingual-base",
    revision="refs/pr/7",
    framework="pytorch",
    task="sentence-embeddings",
    custom_image={
        "health_route": "/health",
        "env": {"MODEL_ID": "/repository"},
        "url": "ghcr.io/huggingface/text-embeddings-inference:1.5",
    },
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-l4",
    token="hf_token_with_write_permissions",
)
If launching from a command line, then you can use
model=Alibaba-NLP/gte-multilingual-base
volume=$PWD/data
revision=refs/pr/7
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id $model --revision=$revision
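Once the container is up, a quick way to check it is to send a request to it. The following is a minimal sketch, assuming the server is reachable at http://localhost:8080 (the host port mapped in the command above) and that it exposes TEI's /embed route, which accepts a JSON body with an "inputs" field. The helper names build_embed_request and embed are made up for illustration.

```python
import json
import urllib.request


def build_embed_request(texts, base_url="http://localhost:8080"):
    """Build a POST request for the /embed route (base_url is an assumption)."""
    payload = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, base_url="http://localhost:8080", timeout=30):
    """Send the texts to a running server and return the decoded JSON response."""
    request = build_embed_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, embed(["Hello world"]) should return one embedding vector per input text.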
In short, I had to:
- remove NewModelForTokenClassification from architectures in config.json
- rename the keys in the safetensors file so that they do not start with "new" (compare the new keys with the old keys)
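The two changes above can be sketched roughly as follows. This is only an illustration, not the script used for the PR: the exact prefix ("new." with a trailing dot) is an assumption, and both helper names are hypothetical. The functions work on plain dictionaries so they can be tried without any model files.

```python
def strip_key_prefix(state_dict, prefix="new."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it.

    The default prefix "new." is an assumption; check the actual key names first.
    """
    renamed = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in renamed:
            # Refuse to silently overwrite a tensor if two keys collide.
            raise ValueError(f"duplicate key after renaming: {new_key}")
        renamed[new_key] = value
    return renamed


def drop_architecture(config, name="NewModelForTokenClassification"):
    """Return a copy of config with `name` removed from its architectures list."""
    updated = dict(config)
    updated["architectures"] = [
        arch for arch in config.get("architectures", []) if arch != name
    ]
    return updated
```

For real weights, the dictionary could be loaded and written back with load_file and save_file from safetensors.torch, and the config with the json module.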
Huge thanks!
But we would prefer to keep the ForTokenClassification architecture in config.json for sparse weight prediction, in case it is needed when loading the model with AutoModelForTokenClassification.
I will try to make the existing structure work with TEI, if that is possible.
I will get back to you.
You don’t need to merge this. People can use this branch for TEI or inference endpoints
@nbroad @izhx I want to run https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base/tree/refs%2Fpr%2F3 and am trying to assign id2label and label2id; I am still working on it. Have you tried it?
And do you mean that using this branch will NOT let the model infer with sparse weights?
(I am running some experiments here, but there are no fruitful results yet: https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base/discussions/3)
I'm really sorry to bother you. I’ve tried running TEI using Docker and Cargo, but in Docker it keeps saying that the ONNX file is missing.
docker run -p 8080:80 -v $volume:/data ${local-image} --model-id $model --revision=$revision
2024-08-20T09:32:14.646243Z INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "Ali****-***/***-************-*ase", revision: Some("refs/pr/7"), tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "127b8c571d1b", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2024-08-20T09:32:14.646508Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-08-20T09:32:14.678359Z INFO download_pool_config: text_embeddings_core::download: core/src/download.rs:38: Downloading `1_Pooling/config.json`
2024-08-20T09:32:17.348216Z INFO download_new_st_config: text_embeddings_core::download: core/src/download.rs:62: Downloading `config_sentence_transformers.json`
2024-08-20T09:32:17.704659Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:21: Starting download
2024-08-20T09:32:17.704698Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:23: Downloading `config.json`
2024-08-20T09:32:18.507387Z INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:26: Downloading `tokenizer.json`
2024-08-20T09:32:22.272394Z INFO download_artifacts: text_embeddings_backend: backends/src/lib.rs:368: Downloading `model.onnx`
2024-08-20T09:32:22.635404Z WARN download_artifacts: text_embeddings_backend: backends/src/lib.rs:372: Could not download `model.onnx`: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co./Alibaba-NLP/gte-multilingual-base/resolve/refs%2Fpr%2F7/model.onnx)
2024-08-20T09:32:22.635437Z INFO download_artifacts: text_embeddings_backend: backends/src/lib.rs:373: Downloading `onnx/model.onnx`
thread 'main' panicked at /usr/src/backends/src/lib.rs:316:17:
failed to download `model.onnx` or `model.onnx_data`. Check the onnx file exists in the repository. request error: HTTP status client error (404 Not Found) for url (https://huggingface.co./Alibaba-NLP/gte-multilingual-base/resolve/refs%2Fpr%2F7/onnx/model.onnx)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
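The log shows the image looking for model.onnx and then onnx/model.onnx and panicking when both return 404. A small check like the one below can tell in advance whether a given revision ships an ONNX file. It is a sketch under assumptions: the two candidate paths are taken from the log above, the helper names are hypothetical, and check_revision relies on list_repo_files from huggingface_hub and needs network access.

```python
# Paths the log above shows being tried, in order.
ONNX_CANDIDATES = ("model.onnx", "onnx/model.onnx")


def find_onnx_file(repo_files):
    """Return the first candidate ONNX path present in repo_files, else None."""
    files = set(repo_files)
    for candidate in ONNX_CANDIDATES:
        if candidate in files:
            return candidate
    return None


def check_revision(repo_id, revision):
    """List the files of a repo revision on the Hub and look for an ONNX export."""
    # Imported here so find_onnx_file can be used without huggingface_hub installed.
    from huggingface_hub import list_repo_files

    return find_onnx_file(list_repo_files(repo_id, revision=revision))
```

For example, check_revision("Alibaba-NLP/gte-multilingual-base", "refs/pr/7") returning None would be consistent with the 404 errors in the log.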
@Maku319,
I'm not sure if there is a solution that works on Mac chips yet. The simplest option to get embeddings quickly would probably be to create an endpoint using Inference Endpoints. You can use the UI here or use the following code to create an endpoint.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-endpoint-name",
    repository="Alibaba-NLP/gte-multilingual-base",
    revision="refs/pr/7",
    framework="pytorch",
    task="sentence-embeddings",
    custom_image={
        "health_route": "/health",
        "env": {"MODEL_ID": "/repository"},
        "url": "ghcr.io/huggingface/text-embeddings-inference:1.5",
    },
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-l4",
    token="hf_token_with_write_permissions",
)
@nbroad
Thank you for your patient guidance. The images for both the GPU and CPU versions have been deployed successfully and are accepting requests. However, I have a question: my ONNX model was converted based on the configuration from the main branch, so why is it able to run with the configuration from the pr/7 version you provided?
The command I ran was:
docker run -p 9090:80 -v ${PWD}:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.5-grpc --model-id /data/gte-multilingual-base
The converted ONNX model is located at gte-multilingual-base\onnx\.
Also, why does it fail when I use the repository from the main branch instead?
The error message is as follows:
INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "/dat*/***-************-*ase", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "bbf17dcff344", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
Error: `config.json` does not contain `id2label`
I think it's because of the architectures listed in the config file
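To see whether a local config.json is likely to trigger this error, something like the following could be used. It is a rough diagnostic only: the rule that an architecture name ending in "Classification" makes TEI require id2label is an assumption about TEI's behavior that is not confirmed in this thread, and diagnose_config is a hypothetical helper.

```python
def diagnose_config(config):
    """Report whether a config dict looks like it would need `id2label`.

    Assumption (not confirmed here): any architecture whose name ends with
    "Classification" is treated as a classifier that requires `id2label`.
    """
    architectures = config.get("architectures", [])
    classification = [a for a in architectures if a.endswith("Classification")]
    has_id2label = "id2label" in config
    return {
        "classification_architectures": classification,
        "has_id2label": has_id2label,
        "likely_id2label_error": bool(classification) and not has_id2label,
    }
```

To use it, load the file with json.load and pass the resulting dictionary to diagnose_config. Under the assumption above, a config that still lists NewModelForTokenClassification but has no id2label would be flagged, while one with that entry removed would not.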
Hi @Maku319, as a small follow-up: TEI 1.6.0 re-introduced the Intel backend for CPU inference, meaning that if the ONNX weights are not there, it will fall back to the safetensors weights. So you should be able to run Alibaba-NLP/gte-multilingual-base on CPU with no issues as:
docker run -p 8080:80 --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 --model-id Alibaba-NLP/gte-multilingual-base --revision refs/pr/7
If you happen to be using an MPS device or don't want to run it over Docker, you can also clone https://github.com/huggingface/text-embeddings-inference and run e.g.
cargo install --path router --features metal
for MPS support, and then just run
text-embeddings-router --model-id Alibaba-NLP/gte-multilingual-base --revision refs/pr/7 --port 8080
(more information on the latter at https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#local-install).