Model card for PM-AI/paraphrase-distilroberta-base-v2_de-en

For internal purposes and for testing, we adapted a monolingual paraphrasing model from Sentence Transformers to German and English via knowledge distillation. We chose sentence-transformers/paraphrase-distilroberta-base-v2 because, to our knowledge, no publicly available multilingual version of this model exists. In addition, it was trained on significantly more data than its predecessor: 83.3 million samples instead of 24.6 million.

Training

  1. Download of datasets
  2. Execution of knowledge distillation

Training Data

Datasets used, according to the official source:

  • AllNLI
  • sentence-compression
  • SimpleWiki
  • altlex
  • msmarco-triplets
  • quora_duplicates
  • coco_captions
  • flickr30k_captions
  • yahoo_answers_title_question
  • S2ORC_citation_pairs
  • stackexchange_duplicate_questions
  • wiki-atomic-edits

Training Execution

First, we downloaded several German-English parallel datasets via get_parallel_data_*.py.

These datasets are: Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, News-Commentary

Then we started knowledge distillation with make_multilingual_sys.py
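The idea behind this step can be illustrated with a toy example. The sketch below is not the training script. It only shows the general principle of MSE-based multilingual distillation: the student's embeddings of an English sentence and of its German translation are both pulled toward the teacher's embedding of the English sentence. The helper names (`mse`, `distillation_loss`) and all vector values are hypothetical.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    teacher_en: Sequence[float],
    student_en: Sequence[float],
    student_de: Sequence[float],
) -> float:
    """Loss for one sentence pair: both student outputs should match the teacher."""
    return mse(teacher_en, student_en) + mse(teacher_en, student_de)


# Toy 4-dimensional embeddings (made-up values; real embeddings are much larger).
teacher_en = [0.5, -1.0, 0.25, 0.0]  # teacher("Hello world")
student_en = [0.5, -1.0, 0.25, 0.0]  # student("Hello world"), already matches
student_de = [0.0, -1.0, 0.25, 1.0]  # student("Hallo Welt"), still off

loss = distillation_loss(teacher_en, student_en, student_de)
print(loss)  # 0.3125
```

Minimizing this quantity over many sentence pairs moves the German embedding toward the English one, which is what makes the student usable for both languages.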

Parameterization of training

  • Script: make_multilingual_sys.py
  • Datasets: Tatoeba, WikiMatrix, TED2020, OpenSubtitles, Europarl, News-Commentary
  • GPU: NVIDIA A40 (Driver Version: 515.48.07; CUDA Version: 11.7)
  • Batch Size: 64
  • Max Sequence Length: 256
  • Train Max Sentence Length: 600
  • Max Sentences Per Train File: 1000000
  • Teacher Model: sentence-transformers/paraphrase-distilroberta-base-v2
  • Student Model: xlm-roberta-base
  • Loss Function: MSE Loss
  • Learning Rate: 2e-5
  • Epochs: 20
  • Evaluation Steps: 10000
  • Warmup Steps: 10000
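Two of the parameters above limit how much parallel data is read from each file. As a rough illustration, and assuming they behave like the `max_sentence_length` and `max_sentences` arguments of `ParallelSentencesDataset.load_data` in Sentence Transformers (pairs containing a sentence longer than the character limit are skipped, and reading stops once the per-file cap is reached), the filtering could look like the sketch below. The function `load_pairs` and the sample lines are hypothetical and not part of the training script.

```python
from typing import Iterable, List, Optional, Tuple


def load_pairs(
    lines: Iterable[str],
    max_sentences: Optional[int] = 1_000_000,
    max_sentence_length: Optional[int] = 600,
) -> List[Tuple[str, str]]:
    """Read tab-separated 'english<TAB>german' lines with length and count limits."""
    pairs: List[Tuple[str, str]] = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip malformed lines
        # Length is measured in characters here, not in tokens.
        if max_sentence_length and max(len(p) for p in parts) > max_sentence_length:
            continue
        pairs.append((parts[0], parts[1]))
        # Stop once the per-file cap is reached.
        if max_sentences and len(pairs) >= max_sentences:
            break
    return pairs


sample = [
    "Good morning.\tGuten Morgen.\n",
    "x" * 601 + "\tzu lang\n",  # dropped: English side exceeds 600 characters
    "Thank you.\tDanke.\n",
    "See you soon.\tBis bald.\n",  # never read: cap of 2 is reached first
]

pairs = load_pairs(sample, max_sentences=2)
print(pairs)
```

Under the same assumption, Max Sequence Length (256) is a separate limit that the model applies in tokens, so a pair that passes the character filter can still be truncated during encoding.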

Acknowledgment

This work is a collaboration between Technical University of Applied Sciences Wildau (TH Wildau) and sense.ai.tion GmbH. You can contact us via:

This work was funded by the European Regional Development Fund (EFRE) and the State of Brandenburg. Project/Vorhaben: "ProFIT: Natürlichsprachliche Dialogassistenten in der Pflege".
