HERE IS HOW YOU USE THIS WITH TEI OR INFERENCE ENDPOINTS

#7
by nbroad HF staff - opened

testing to see if this works for TEI

I can confirm this PR does work for TEI.

If using Inference Endpoints:

  1. go to https://ui.endpoints.huggingface.co/new
  2. put Alibaba-NLP/gte-multilingual-base as the model repository
  3. (optional) set the endpoint name
  4. choose cloud provider, device (CPU/GPU)
  5. select "Advanced Configuration"
  6. select "sentence embeddings" in the "Task" dropdown
  7. put refs/pr/7 in the "Revision" box
  8. In "Environment Variables", set MODEL_ID=/repository

Or if you'd like to use the python client to create the endpoint, you can use the following:

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-endpoint-name",
    repository="Alibaba-NLP/gte-multilingual-base",
    revision="refs/pr/7",
    framework="pytorch",
    task="sentence-embeddings",
    custom_image={
        "health_route": "/health",
        "env": {"MODEL_ID": "/repository",},
        "url": "ghcr.io/huggingface/text-embeddings-inference:1.5",
    },
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-l4",
    token="hf_token_with_write_permissions"
)

If launching from the command line, you can use:

model=Alibaba-NLP/gte-multilingual-base
volume=$PWD/data
revision=refs/pr/7

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id $model --revision=$revision
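
Once the container (or an Inference Endpoint) is up, embeddings can be requested over HTTP. The sketch below is not from the thread. It assumes TEI's `/embed` route with a JSON body of the form `{"inputs": [...]}` and the host port 8080 from the command above. `build_embed_request` and `embed` are hypothetical helper names.

```python
import json
import urllib.request


def build_embed_request(base_url, texts, token=None):
    """Build a POST request for TEI's /embed route (assumed request shape)."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Protected Inference Endpoints expect a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"inputs": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/embed", data=body, headers=headers, method="POST"
    )


def embed(base_url, texts, token=None):
    """Send the request and return the parsed JSON (one embedding per input)."""
    request = build_embed_request(base_url, texts, token)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

For the local Docker container this could be called as `embed("http://localhost:8080", ["What is Deep Learning?"])`. For a protected endpoint, pass the endpoint URL and a Hugging Face token instead.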
nbroad changed pull request title from remove token cls architecture to HERE IS HOW YOU USE THIS WITH TEI OR INFERENCE ENDPOINTS

In short, I had to:

  1. remove NewModelForTokenClassification from architectures in config.json
  2. rename the keys in the safetensors file so that they no longer start with "new" (compare the new keys with the old keys)
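
The two changes above can be sketched roughly as follows. This is an illustration, not the script used for the PR. It assumes the unwanted prefix is literally `new.` (the thread only says the keys start with "new"), and `strip_key_prefix` and `drop_architecture` are hypothetical helper names that work on plain dictionaries.

```python
def strip_key_prefix(tensors, prefix="new."):
    """Return a copy of `tensors` with `prefix` removed from keys that start with it."""
    renamed = {}
    for key, value in tensors.items():
        # Assumption: the prefix to drop is "new."; adjust it to the real key names.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in renamed:
            raise ValueError(f"duplicate key after renaming: {new_key}")
        renamed[new_key] = value
    return renamed


def drop_architecture(config, name="NewModelForTokenClassification"):
    """Return a copy of the config dict without `name` in its `architectures` list."""
    updated = dict(config)
    updated["architectures"] = [
        arch for arch in config.get("architectures", []) if arch != name
    ]
    return updated
```

With the `safetensors` package installed, the dictionary returned by `safetensors.torch.load_file("model.safetensors")` could be passed through `strip_key_prefix` and written back with `safetensors.torch.save_file`. The config dictionary would come from `json.load` on `config.json`.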
Alibaba-NLP org

Huge thanks!
But we would prefer to keep the ForTokenClassification architecture in config.json for sparse weight prediction, in case it is needed for loading through AutoModelForTokenClassification.
I will try to make the existing structure work with TEI, if that is possible.

Will get back to you.

You don’t need to merge this. People can use this branch for TEI or inference endpoints

@nbroad @izhx I want to run https://huggingface.co./Alibaba-NLP/gte-multilingual-reranker-base/tree/refs%2Fpr%2F3 and am trying to set id2label and label2id; I'm still working on it. Have you tried it?

And do you mean that with this branch the model will NOT run inference with sparse weights?

(I am running some experiments here, but there are no useful results yet: https://huggingface.co./Alibaba-NLP/gte-multilingual-reranker-base/discussions/3)

izhx pinned discussion

I'm really sorry to bother you. I've tried running TEI using Docker and Cargo, but in Docker it keeps saying that the ONNX model is missing.

docker run -p 8080:80 -v $volume:/data ${local-image} --model-id $model --revision=$revision
2024-08-20T09:32:14.646243Z  INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "Ali****-***/***-************-*ase", revision: Some("refs/pr/7"), tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "127b8c571d1b", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2024-08-20T09:32:14.646508Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-08-20T09:32:14.678359Z  INFO download_pool_config: text_embeddings_core::download: core/src/download.rs:38: Downloading `1_Pooling/config.json`
2024-08-20T09:32:17.348216Z  INFO download_new_st_config: text_embeddings_core::download: core/src/download.rs:62: Downloading `config_sentence_transformers.json`
2024-08-20T09:32:17.704659Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:21: Starting download
2024-08-20T09:32:17.704698Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:23: Downloading `config.json`
2024-08-20T09:32:18.507387Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:26: Downloading `tokenizer.json`
2024-08-20T09:32:22.272394Z  INFO download_artifacts: text_embeddings_backend: backends/src/lib.rs:368: Downloading `model.onnx`
2024-08-20T09:32:22.635404Z  WARN download_artifacts: text_embeddings_backend: backends/src/lib.rs:372: Could not download `model.onnx`: request error: HTTP status client error (404 Not Found) for url (https://huggingface.co./Alibaba-NLP/gte-multilingual-base/resolve/refs%2Fpr%2F7/model.onnx)
2024-08-20T09:32:22.635437Z  INFO download_artifacts: text_embeddings_backend: backends/src/lib.rs:373: Downloading `onnx/model.onnx`
thread 'main' panicked at /usr/src/backends/src/lib.rs:316:17:
failed to download `model.onnx` or `model.onnx_data`. Check the onnx file exists in the repository. request error: HTTP status client error (404 Not Found) for url (https://huggingface.co./Alibaba-NLP/gte-multilingual-base/resolve/refs%2Fpr%2F7/onnx/model.onnx)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
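
The log above shows two paths being requested in turn, `model.onnx` and then `onnx/model.onnx`, before the panic. A small check along these lines could tell in advance whether a revision carries ONNX weights. It is only a sketch: `find_onnx_weights` is a hypothetical helper, and the candidate list is taken from the log, not from TEI's source.

```python
# Candidate paths, in the order the log above shows them being requested.
ONNX_CANDIDATES = ("model.onnx", "onnx/model.onnx")


def find_onnx_weights(repo_files):
    """Return the first candidate ONNX path present in `repo_files`, or None."""
    available = set(repo_files)
    for candidate in ONNX_CANDIDATES:
        if candidate in available:
            return candidate
    return None
```

The file list could come from `huggingface_hub.list_repo_files("Alibaba-NLP/gte-multilingual-base", revision="refs/pr/7")`. A result of `None` would match the 404 errors seen here.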

The way I made it work

model=Alibaba-NLP/gte-multilingual-base
revision=refs/pr/7
volume=/tmp

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.5 --model-id $model --revision $revision

@Maku319 , what is ${local-image}?

@nbroad Thank you for your reply! Since TEI doesn't provide an image version for the M series chip Macs, I built the image locally using the official TEI repository, and that's the local-image.

@Maku319 ,

I'm not sure if there is a solution that works on Mac chips yet. The simplest way to get embeddings quickly is probably to create an endpoint with Inference Endpoints, either through the UI or with the create_inference_endpoint code, both described in the first post of this thread.

@nbroad Thank you so much for your reply. I think the ONNX model is only necessary when running on the CPU. When I switched to the GPU, everything seemed to work fine, but now I need to figure out the issue with the container not recognizing CUDA after it starts up. Thanks again!

@nbroad Thank you for your patient guidance. The images for both GPU and CPU versions have been successfully deployed and are accepting requests. However, I have a question: my ONNX model was converted based on the configuration from the main branch, so why is it able to run with the configuration from the pr/7 version you provided?
The command I ran was: docker run -p 9090:80 -v ${PWD}:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.5-grpc --model-id /data/gte-multilingual-base. The converted ONNX model is located at: gte-multilingual-base\onnx\.

Also, why does it fail when I use the repository from the main branch instead?
The error message is as follows:

INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "/dat*/***-************-*ase", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "bbf17dcff344", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
Error: `config.json` does not contain `id2label`

I think it's because of the architectures listed in the config file
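
One way to test that guess on a local copy of the model is to look for classification architectures in a config that has no `id2label`. The sketch below rests on an assumption: that TEI treats architectures whose names end in `Classification` as classifier heads and then requires `id2label`. That mirrors the error message above but is not taken from TEI's code. `classification_archs_without_labels` is a hypothetical helper.

```python
def classification_archs_without_labels(config):
    """List architectures ending in 'Classification' when `id2label` is absent.

    Assumption: such architectures make TEI expect `id2label` in config.json.
    """
    if "id2label" in config:
        return []
    return [
        arch
        for arch in config.get("architectures", [])
        if arch.endswith("Classification")
    ]
```

Calling it on the dictionary loaded from `config.json` (for example with `json.load`) returns the architectures that would need either an `id2label` mapping or removal, as was done in this PR.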

Hi @Maku319, as a small follow-up: TEI 1.6.0 re-introduced the Intel backend for CPU inference, so if the ONNX weights are missing it falls back to the safetensors weights. You should therefore be able to run Alibaba-NLP/gte-multilingual-base on CPU with no issues:

docker run -p 8080:80 --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 --model-id Alibaba-NLP/gte-multilingual-base --revision refs/pr/7

If you happen to be using an MPS device, or don't want to run it through Docker, you can also clone https://github.com/huggingface/text-embeddings-inference and install the router with Metal support:

cargo install --path router --features metal

and then run:

text-embeddings-router --model-id Alibaba-NLP/gte-multilingual-base --revision refs/pr/7 --port 8080

More information on the latter is at https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#local-install.
