Update README with an Infinity usage example

Added a unit test in Infinity to assert identical behaviour: https://github.com/michaelfeil/infinity/blob/774530bd54bfb98e3db70d3248140bac99baa938/libs/infinity_emb/tests/unit_test/transformer/vision/test_torch_vision.py#L50

```
docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" michaelf34/infinity:0.0.69 v2 --model-id vidore/colpali-v1.2-merged --revision "cd80ee4200c591b788a9c4e21bb5d549d4a04637" --dtype bfloat16 --batch-size 8 --device cuda --engine torch --port 7997
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO 2024-11-15 19:49:10,249 infinity_emb INFO: infinity_server.py:89
Creating 1engines:
engines=['vidore/colpali-v1.2-merged']
INFO 2024-11-15 19:49:10,260 infinity_emb INFO: select_model.py:64
model=`vidore/colpali-v1.2-merged` selected, using
engine=`torch` and device=`cuda`
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
INFO 2024-11-15 19:49:23,588 infinity_emb INFO: Getting select_model.py:97
timings for batch_size=4 and avg tokens per
sentence=1028
22.27 ms tokenization
186.14 ms inference
600.23 ms post-processing
808.64 ms total
embeddings/sec: 4.95
INFO 2024-11-15 19:49:25,783 infinity_emb INFO: Getting select_model.py:103
timings for batch_size=4 and avg tokens per
sentence=1044
17.65 ms tokenization
455.30 ms inference
590.60 ms post-processing
1063.54 ms total
embeddings/sec: 3.76
INFO 2024-11-15 19:49:25,785 infinity_emb INFO: model select_model.py:104
warmed up, between 3.76-4.95 embeddings/sec at
batch_size=4
INFO 2024-11-15 19:49:25,786 infinity_emb INFO: batch_handler.py:386
creating batching engine
INFO 2024-11-15 19:49:25,788 infinity_emb INFO: ready batch_handler.py:453
to batch requests.
INFO 2024-11-15 19:49:25,789 infinity_emb INFO: infinity_server.py:104

♾️ Infinity - Embedding Inference Server
MIT License; Copyright (c) 2023-now Michael Feil
Version 0.0.69

Open the Docs via Swagger UI:
http://0.0.0.0:7997/docs

Access all deployed models via 'GET':
curl http://0.0.0.0:7997/models

Visit the docs for more information:
https://michaelfeil.github.io/infinity

INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
```
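
Once the log reaches "Application startup complete", the banner suggests checking the deployment with `GET /models`. A minimal Python sketch of that check, assuming the container is reachable on `localhost` at the mapped port 7997; the helper names `models_url` and `list_models` are hypothetical, and the shape of the JSON response is not assumed here:

```python
import json
import urllib.request


def models_url(host: str = "localhost", port: int = 7997) -> str:
    """Build the URL of the model-listing endpoint named in the startup banner."""
    return f"http://{host}:{port}/models"


def list_models(host: str = "localhost", port: int = 7997, timeout: float = 10.0):
    """Send GET /models and return the decoded JSON body unchanged.

    Raises urllib.error.URLError if the server is not reachable.
    """
    with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires the container from the command above to be running):
# print(json.dumps(list_models(), indent=2))
```

If the server started as in the log, the response would be expected to mention `vidore/colpali-v1.2-merged`, though the exact fields depend on the Infinity version.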

Files changed (1)

README.md +11 -0

@@ -103,6 +103,17 @@ with torch.no_grad():
 scores = processor.score_multi_vector(querry_embeddings, image_embeddings)
 ```
+## Infinity
+Usage with docker and [Infinity](https://github.com/michaelfeil/infinity).
+Infinity only works with the `-merged` weight variants of ColPali and ColQwen.
+```bash
+docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" \
+michaelf34/infinity:0.0.69 \
+v2 --model-id vidore/colpali-v1.2-merged --revision "cd80ee4200c591b788a9c4e21bb5d549d4a04637" --dtype bfloat16 --batch-size 8 --device cuda --engine torch --port 7997
+```
 ## Limitations
  - **Focus**: The model primarily focuses on PDF-type documents and high-ressources languages, potentially limiting its generalization to other document types or less represented languages.