Update readme with infinity

#17
by michaelfeil - opened

Should be faster than SentenceTransformers, as it uses the nested-tensor backend of PyTorch.

docker run --gpus all -p "7997":"7997" michaelf34/infinity:0.0.70 v2 --model-id Snowflake/snowflake-arctic-embed-m --dtype float16 --batch-size 32 --engine torch --port 7997
Status: Downloaded newer image for michaelf34/infinity:0.0.70
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO     2024-11-26 17:33:29,015 infinity_emb INFO:        infinity_server.py:92
         Creating 1engines:                                                     
         engines=['Snowflake/snowflake-arctic-embed-m']                         
INFO     2024-11-26 17:33:29,026 infinity_emb INFO:           select_model.py:64
         model=`Snowflake/snowflake-arctic-embed-m` selected,                   
         using engine=`torch` and device=`None`                                 
INFO     2024-11-26 17:33:29,354                      SentenceTransformer.py:216
         sentence_transformers.SentenceTransformer                              
         INFO: Load pretrained SentenceTransformer:                             
         Snowflake/snowflake-arctic-embed-m                                     
INFO     2024-11-26 17:33:40,276                      SentenceTransformer.py:355
         sentence_transformers.SentenceTransformer                              
         INFO: 1 prompts are loaded, with the keys:                             
         ['query']                                                              
INFO     2024-11-26 17:33:40,293 infinity_emb INFO: Adding    acceleration.py:56
         optimizations via Huggingface optimum.                                 
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co./docs/optimum/bettertransformer/overview for more details.
/app/.venv/lib/python3.10/site-packages/optimum/bettertransformer/models/encoder_models.py:301: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
  hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mask)
INFO     2024-11-26 17:33:40,716 infinity_emb INFO: Getting   select_model.py:97
         timings for batch_size=32 and avg tokens per                           
         sentence=1                                                             
                 2.66     ms tokenization                                       
                 5.58     ms inference                                          
                 0.12     ms post-processing                                    
                 8.36     ms total                                              
         embeddings/sec: 3828.62                                                
INFO     2024-11-26 17:33:41,058 infinity_emb INFO: Getting  select_model.py:103
         timings for batch_size=32 and avg tokens per                           
         sentence=512                                                           
                 14.66    ms tokenization                                       
                 122.97   ms inference                                          
                 0.14     ms post-processing                                    
                 137.78   ms total                                              
         embeddings/sec: 232.26                                                 
INFO     2024-11-26 17:33:41,060 infinity_emb INFO: model    select_model.py:104
         warmed up, between 232.26-3828.62 embeddings/sec at                    
         batch_size=32                                                          
INFO     2024-11-26 17:33:41,062 infinity_emb INFO:         batch_handler.py:443
         creating batching engine                                               
INFO     2024-11-26 17:33:41,063 infinity_emb INFO: ready   batch_handler.py:512
         to batch requests.                                                     
INFO     2024-11-26 17:33:41,066 infinity_emb INFO:       infinity_server.py:106
                                                                                
         ♾️  Infinity - Embedding Inference Server                               
         MIT License; Copyright (c) 2023-now Michael Feil                       
         Version 0.0.70                                                         
                                                                                
         Open the Docs via Swagger UI:                                          
         http://0.0.0.0:7997/docs                                               
                                                                                
         Access all deployed models via 'GET':                                  
         curl http://0.0.0.0:7997/models                                        
                                                                                
         Visit the docs for more information:                                   
         https://michaelfeil.github.io/infinity                                 
                                                                                
                                                                                
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
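
Once the server reports that it is running, a client can request embeddings over HTTP. The snippet below is a minimal sketch and not taken from this PR. It assumes the server exposes an OpenAI-style `POST /embeddings` route with no URL prefix and returns a `data` list of `{"embedding", "index"}` items. The Swagger UI at `/docs` (linked in the log above) is the place to confirm the exact route and schema. The helper names `build_embedding_request` and `embed` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7997"  # port taken from the docker command above
MODEL_ID = "Snowflake/snowflake-arctic-embed-m"  # model id from the same command


def build_embedding_request(texts, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for the assumed /embeddings route."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, **kwargs):
    """Send the request and return one embedding (list of floats) per input text."""
    with urllib.request.urlopen(build_embedding_request(texts, **kwargs)) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response: {"data": [{"embedding": [...], "index": 0}, ...]}
    ordered = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]
```

With the container from the command above running, `embed(["What is an embedding server?"])` should return a list containing one vector. The sketch uses only the standard library so it can be pasted into a fresh environment.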
spacemanidol changed pull request status to merged
