license: apache-2.0
pipeline_tag: sentence-similarity
tags:
- cross-lingual
- multilingual
- question-answering
- retrieval
- sentence-similarity
- variational
VMSST
Published as a long paper at ACL 2023.
Contrastive learning has been successfully used for retrieval of semantically aligned sentences, but it often requires large batch sizes and carefully engineered heuristics to work well. In this paper, we instead propose a generative model for learning multilingual text embeddings which can be used to retrieve or score sentence pairs. Our model operates on parallel data in N languages and, through an approximation we introduce, efficiently encourages source separation in this multilingual setting, separating semantic information that is shared between translations from stylistic or language-specific variation. We show careful large-scale comparisons between contrastive and generation-based approaches for learning multilingual text embeddings, a comparison that has not been done to the best of our knowledge despite the popularity of these approaches. We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval––the last of which we introduce in this paper. Overall, our Variational Multilingual Source-Separation Transformer (VMSST) model outperforms both a strong contrastive and generative baseline on these tasks.
Checkpoints
T5X (Jax): https://storage.googleapis.com/gresearch/vmsst/vmsst-large-2048-t5x.zip
PyTorch: https://storage.googleapis.com/gresearch/vmsst/vmsst-large-2048-pytorch.zip
Usage
Installation
- Clone the following repository from Google Research.
git clone -b master --single-branch https://github.com/google-research/google-research.git
- Make sure
google-research
is the current directory:
cd google-research/vmsst
- Create and activate a new virtualenv:
python -m venv vmsst
source vmsst/bin/activate
- This repository is tested on Python 3.10+. Install required packages:
pip install -r requirements.txt
Test
To test that the checkpoint and installation are working as intended, run:
bash run.sh
The expected cosine similarity scores for the three sentences pairs are:
0.2573888301849365, 0.1563197821378708, and 0.28531330823898315.
Inference
To embed a list of sentences:
python score_sentence_pairs.py --sentence_pair_file test_data/test_sentence_pairs.tsv
To score a list of sentence pairs:
python embed_sentences.py --sentence_file test_data/test_sentences.txt
Citation
If you use our code or models your work please cite:
@article{wieting2022beyond,
title={Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval},
author={Wieting, John and Clark, Jonathan H and Cohen, William W and Neubig, Graham and Berg-Kirkpatrick, Taylor},
journal={arXiv preprint arXiv:2212.10726},
year={2022}
}