---
license: apache-2.0
---
|
# Introduction |
|
The automatic paraphrasing model described and used in the paper |
|
"[AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data](https://arxiv.org/abs/2010.04806)" (EMNLP 2020). |
|
|
|
# Training data |
|
A cleaned version of the ParaBank 2 dataset introduced in "[Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering](https://aclanthology.org/K19-1005/)". |
|
ParaBank 2 is a paraphrasing dataset constructed by back-translating the Czech portion of an English-Czech parallel corpus. |
|
We use a subset of 5 million sentence pairs with the highest dual conditional cross-entropy score (which corresponds to the highest paraphrasing quality), and use only one of the five paraphrases provided for each sentence. |
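
As a concrete illustration of this selection step, here is a minimal sketch; the record layout and scores are made up for the example and do not reflect the actual ParaBank 2 release format:

```python
# Hypothetical sketch of the subset selection described above: keep the
# highest-scoring pairs and only one paraphrase per source sentence.
# The record layout and scores are illustrative, not the real ParaBank 2 format.
records = [
    (0.93, "He left early.", ["He departed early.", "He went away soon."]),
    (0.41, "It rains a lot here.", ["Rain falls often in this place."]),
    (0.88, "She won the race.", ["She came first in the race."]),
]
TOP_K = 2  # stands in for the 5 million pairs used in practice

records.sort(key=lambda r: r[0], reverse=True)  # highest score first
subset = [(source, paraphrases[0]) for _, source, paraphrases in records[:TOP_K]]
```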
|
The cleaning process involved removing sentences that do not look like normal English sentences, e.g. those that contain URLs or too many special characters.
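
The exact cleaning rules are not spelled out here, so the following is a hypothetical sketch of such heuristics rather than the authors' actual script:

```python
# Hypothetical cleaning heuristics: reject sentences containing URLs or an
# excessive proportion of special characters. The thresholds and regexes are
# illustrative assumptions, not the authors' actual rules.
import re

URL_RE = re.compile(r"https?://|www\.")

def looks_like_normal_english(sentence: str, max_special_ratio: float = 0.1) -> bool:
    if URL_RE.search(sentence):
        return False
    allowed_punct = set(".,!?;:'\"-")
    specials = sum(
        1 for ch in sentence
        if not (ch.isalnum() or ch.isspace() or ch in allowed_punct)
    )
    return specials / max(len(sentence), 1) <= max_special_ratio

candidates = [
    "She finished the report before the deadline.",   # kept
    "Check out http://example.com for the details!",  # dropped: contains a URL
    "@@## $$ %% ^^ && **",                            # dropped: special characters
]
cleaned = [s for s in candidates if looks_like_normal_english(s)]
```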
|
|
|
# Training Procedure |
|
The model is fine-tuned for 4 epochs on the above-mentioned dataset, starting from the `facebook/bart-large` checkpoint.
|
We use a token-level cross-entropy loss calculated against the gold paraphrase sentence. To ensure that the model's output is grammatical, during training we use the back-translation of the Czech sentence as the input and the human-written English sentence as the output. Training is done with mini-batches of 1280 examples. For higher training efficiency, each mini-batch is constructed by grouping sentences of similar length together.
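
For illustration, here is a minimal sketch of this kind of fine-tuning setup with the Hugging Face `transformers` and `datasets` libraries. Only the starting checkpoint, the 4 epochs, the 1280-example mini-batches, and the length grouping come from the description above; everything else (column names, toy data, remaining hyperparameters) is an assumption, not the authors' actual training script:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

# Toy stand-in for the cleaned 5M-pair ParaBank 2 subset: the input is the
# back-translation of the Czech sentence, the target is the human-written
# English sentence.
pairs = Dataset.from_dict({
    "backtranslation": ["He did not show up to the meeting."],
    "human_reference": ["He didn't come to the meeting."],
})

def preprocess(batch):
    model_inputs = tokenizer(batch["backtranslation"], truncation=True)
    labels = tokenizer(text_target=batch["human_reference"], truncation=True)
    model_inputs["labels"] = labels["input_ids"]  # token-level cross-entropy targets
    return model_inputs

train_dataset = pairs.map(preprocess, batched=True, remove_columns=pairs.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="paraphraser-bart-large",
    num_train_epochs=4,
    per_device_train_batch_size=1280,  # the 1280-example mini-batch from the text
    group_by_length=True,              # batch sentences of similar length together
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```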
|
|
|
# How to use |
|
Sampling with `top_p=0.9` and a `temperature` between `0` and `1` usually results in good paraphrases. Higher temperatures make the paraphrases more diverse and more different from the input, but might slightly change the meaning of the original sentence.
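
For example, a minimal generation sketch with the Hugging Face `transformers` library (the model id below is a placeholder for this repository's checkpoint):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "<this-repository's-model-id>"  # placeholder, not a real model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

sentence = "What is the tallest building in New York?"
inputs = tokenizer(sentence, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,        # nucleus sampling, as recommended above
    temperature=0.7,  # between 0 and 1; higher values give more diverse outputs
    num_return_sequences=3,
    max_new_tokens=60,
)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```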
|
|
# Citation |
|
If you use this model in your work, please cite:
|
|
|
```
@inproceedings{xu-etal-2020-autoqa,
    title = "{A}uto{QA}: From Databases to {QA} Semantic Parsers with Only Synthetic Training Data",
    author = "Xu, Silei and Semnani, Sina and Campagna, Giovanni and Lam, Monica",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.31",
    pages = "422--434",
}
```