Fix Sentence Transformers integration; currently uses mean pooling accidentally
Hello @yliu279 and co-authors!
### Preface
Congratulations on your paper & model releases! These seem very promising. Sadly, only very little work has been done on code retrieval so far; it's quite a shame.
Thank you for advancing this domain!
### Pull Request overview
- Add `modules.json`, which tells Sentence Transformers to use the Pooling module configured in `1_Pooling/config.json` (see the sketch right after this list).
- Add `sentence_bert_config.json` to tell Sentence Transformers about the maximum sequence length.
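For reference, a typical `modules.json` for a Transformer + Pooling setup looks roughly like this (the module types and paths follow the usual Sentence Transformers convention; this is a sketch, not a copy of the file added in this PR):

```json
[
    {
        "idx": 0,
        "name": "0",
        "path": "",
        "type": "sentence_transformers.models.Transformer"
    },
    {
        "idx": 1,
        "name": "1",
        "path": "1_Pooling",
        "type": "sentence_transformers.models.Pooling"
    }
]
```

And `sentence_bert_config.json` is usually just:

```json
{
    "max_seq_length": 512,
    "do_lower_case": false
}
```

where 512 is only a placeholder here; the actual file should use this model's true maximum sequence length.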
### Details
Sentence Transformers uses a `modules.json` file to "build" its model. These usually consist of 2 or 3 modules, often including a Transformer module (relying on `transformers`) and a Pooling module. You've already got the configuration for this Pooling module in `1_Pooling/config.json`, but without the `modules.json`, Sentence Transformers won't realise that it has to look there. Instead, it will create the "default setup", which means using mean pooling.
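Concretely, that fallback is equivalent to building the model by hand like this (a minimal sketch; `"model_id"` stands in for this repository's id):

```python
from sentence_transformers import SentenceTransformer, models

# Without a modules.json, Sentence Transformers constructs the "default setup":
# a Transformer module followed by a mean Pooling module.
transformer = models.Transformer("model_id")  # placeholder model id
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),
    pooling_mode="mean",  # ignores whatever 1_Pooling/config.json specifies
)
model = SentenceTransformer(modules=[transformer, pooling])
```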
As a result, there was a discrepancy between the outputs of Sentence Transformers and Transformers.
Beyond the fixes, I also added the outputs to the code snippets in the README; I find that these help users get a good understanding of how the model goes from inputs to outputs.
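For anyone reading along, those snippets follow the standard Sentence Transformers usage pattern; a hedged sketch (`"model_id"` is again a placeholder, and `model.similarity` assumes sentence-transformers v3.0 or newer):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("model_id")  # placeholder for this repository's id

queries = ["how to implement quicksort in Python?"]
passages = ["def quicksort(arr): ...", "def bubble_sort(arr): ..."]

query_embeddings = model.encode(queries)
passage_embeddings = model.encode(passages)
print(query_embeddings.shape, passage_embeddings.shape)  # (1, dim) (2, dim)

# Similarity scores between each query and each passage
similarities = model.similarity(query_embeddings, passage_embeddings)
print(similarities)
```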
Furthermore, I added some tags so that this model shows up when people search for retrieval embedding models here: https://huggingface.co./models?library=sentence-transformers, similarly to what we've done for https://huggingface.co./Salesforce/SFR-Embedding-2_R.
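For context, these tags live in the model card's YAML front matter; the values below are the typical ones for embedding models rather than an exact copy of this PR's metadata:

```yaml
---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
---
```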
Lastly, you might be able to increase the visibility of this model by adding it to your Salesforce SFR-Embedding collection: https://huggingface.co./collections/Salesforce/sfr-embedding-models-66abe671200408925487b6c8
- Tom Aarsen