Fix Sentence Transformers integration; currently uses mean pooling accidentally

#3 opened by tomaarsen

Hello @yliu279 and co-authors!

Preface

Congratulations on your paper & model releases! These seem very promising. Sadly, very little work has been done on code retrieval, which is quite a shame.
Thank you for advancing this domain!

Pull Request overview

  • Add modules.json, which tells Sentence Transformers to load the Pooling module configured in 1_Pooling/config.json.
  • Add sentence_bert_config.json so that Sentence Transformers knows the maximum sequence length.

Details

Sentence Transformers uses a modules.json file to "build" its models. These usually consist of two or three modules, often including a Transformer module (relying on transformers) and a Pooling module. You've already got the configuration for this Pooling module in 1_Pooling/config.json, but without the modules.json, Sentence Transformers won't realise that it has to look there. Instead, it will fall back to the "default setup", which means mean pooling.
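For reference, a standard two-module setup is declared roughly like this; the sketch below shows the common layout, not necessarily this repository's exact file:

```json
[
  { "idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer" },
  { "idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling" }
]
```

And sentence_bert_config.json only needs to carry the sequence length settings (the 512 below is an example value, not necessarily this model's actual limit):

```json
{
  "max_seq_length": 512,
  "do_lower_case": false
}
```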
As a result, there was a discrepancy between the outputs of Sentence Transformers and Transformers.
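As a quick sanity check, the two paths can be compared directly. The snippet below is a sketch: the model ID is a placeholder, and the CLS-style pooling is only an example stand-in for whatever 1_Pooling/config.json actually specifies.

```python
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer

model_id = "Salesforce/<model-id>"  # placeholder: use this repository's ID
texts = ["def bubble_sort(arr): ..."]

# Path 1: Sentence Transformers, which now reads modules.json
# and applies the pooling configured in 1_Pooling/config.json.
st_model = SentenceTransformer(model_id)
st_emb = st_model.encode(texts, convert_to_tensor=True)

# Path 2: plain Transformers, with the pooling applied by hand.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state

# Example: CLS-token pooling; substitute the configured pooling mode.
hf_emb = hidden[:, 0]

# Before this PR, Sentence Transformers silently fell back to mean pooling,
# so this comparison would fail; after it, the two paths should agree.
print(torch.allclose(st_emb.cpu(), hf_emb, atol=1e-5))
```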

Beyond the fixes, I also added the expected outputs to the code snippets in the README; I find that these help users understand what the model does to go from inputs to outputs.
Additionally, I added some tags so that this model shows up when people search for retrieval embedding models here: https://huggingface.co./models?library=sentence-transformers, similarly to what we've done for https://huggingface.co./Salesforce/SFR-Embedding-2_R.
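For reference, such tags live in the README's YAML front matter; a minimal sketch (the exact metadata in this PR may differ) looks like:

```yaml
---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
---
```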

Lastly, you might be able to increase the visibility of this model by adding it to your Salesforce SFR-Embedding collection: https://huggingface.co./collections/Salesforce/sfr-embedding-models-66abe671200408925487b6c8

- Tom Aarsen