---
license: apache-2.0
---
|
# Introduction |
|
The automatic paraphrasing model described and used in the paper |
|
"[AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data](https://arxiv.org/abs/2010.04806)" (EMNLP 2020). |
|
|
|
# Training data |
|
A cleaned version of the ParaBank 2 dataset introduced in "[Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering](https://aclanthology.org/K19-1005/)". |
|
ParaBank 2 is a paraphrasing dataset constructed by back-translating the Czech portion of an English-Czech parallel corpus. |
|
We use a subset of 5 million sentence pairs with the highest dual conditional cross-entropy score (which corresponds to the highest paraphrasing quality), and use only one of the five paraphrases provided for each sentence. |
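
As a concrete illustration of this selection step, here is a minimal sketch; the record layout and scores are made up for the example and do not reflect the actual ParaBank 2 release format:

```python
# Hypothetical sketch of the subset selection described above: keep the
# highest-scoring pairs and only one paraphrase per source sentence.
# The record layout and scores are illustrative, not the real ParaBank 2 format.
records = [
    (0.93, "He left early.", ["He departed early.", "He went away soon."]),
    (0.41, "It rains a lot here.", ["Rain falls often in this place."]),
    (0.88, "She won the race.", ["She came first in the race."]),
]
TOP_K = 2  # stands in for the 5 million pairs used in practice

records.sort(key=lambda r: r[0], reverse=True)  # highest score first
subset = [(source, paraphrases[0]) for _, source, paraphrases in records[:TOP_K]]
```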
|
The cleaning process involved removing sentences that do not look like normal English sentences, e.g. those that contain URLs or too many special characters.
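
The exact cleaning rules are not spelled out here, so the following is a hypothetical sketch of such heuristics rather than the authors' actual script:

```python
# Hypothetical cleaning heuristics: reject sentences containing URLs or an
# excessive proportion of special characters. The thresholds and regexes are
# illustrative assumptions, not the authors' actual rules.
import re

URL_RE = re.compile(r"https?://|www\.")

def looks_like_normal_english(sentence: str, max_special_ratio: float = 0.1) -> bool:
    if URL_RE.search(sentence):
        return False
    allowed_punct = set(".,!?;:'\"-")
    specials = sum(
        1 for ch in sentence
        if not (ch.isalnum() or ch.isspace() or ch in allowed_punct)
    )
    return specials / max(len(sentence), 1) <= max_special_ratio

candidates = [
    "She finished the report before the deadline.",   # kept
    "Check out http://example.com for the details!",  # dropped: contains a URL
    "@@## $$ %% ^^ && **",                            # dropped: special characters
]
cleaned = [s for s in candidates if looks_like_normal_english(s)]
```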
|
|
|
# Training Procedure |
|
The model is fine-tuned for 4 epochs on the above-mentioned dataset, starting from the `facebook/bart-large` checkpoint.
|
We use a token-level cross-entropy loss calculated against the gold paraphrase sentence. To ensure that the model's output is grammatical, during training we use the back-translation of the Czech sentence as the input and the human-written English sentence as the output. Training is done with mini-batches of 1280 examples. For higher training efficiency, each mini-batch is constructed by grouping sentences of similar length together.
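
For illustration, here is a minimal sketch of this kind of fine-tuning setup with the Hugging Face `transformers` and `datasets` libraries. Only the starting checkpoint, the 4 epochs, the 1280-example mini-batches, and the length grouping come from the description above; everything else (column names, toy data, remaining hyperparameters) is an assumption, not the authors' actual training script:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

# Toy stand-in for the cleaned 5M-pair ParaBank 2 subset: the input is the
# back-translation of the Czech sentence, the target is the human-written
# English sentence.
pairs = Dataset.from_dict({
    "backtranslation": ["He did not show up to the meeting."],
    "human_reference": ["He didn't come to the meeting."],
})

def preprocess(batch):
    model_inputs = tokenizer(batch["backtranslation"], truncation=True)
    labels = tokenizer(text_target=batch["human_reference"], truncation=True)
    model_inputs["labels"] = labels["input_ids"]  # token-level cross-entropy targets
    return model_inputs

train_dataset = pairs.map(preprocess, batched=True, remove_columns=pairs.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="paraphraser-bart-large",
    num_train_epochs=4,
    per_device_train_batch_size=1280,  # the 1280-example mini-batch from the text
    group_by_length=True,              # batch sentences of similar length together
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```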
|
|
|
# How to use |
|
Sampling with `top_p=0.9` and a `temperature` between `0` and `1` usually results in good paraphrases. Higher temperatures make the paraphrases more diverse and more different from the input, but might slightly change the meaning of the original sentence.
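
For example, a minimal generation sketch with the Hugging Face `transformers` library (the model id below is a placeholder for this repository's checkpoint):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "<this-repository's-model-id>"  # placeholder, not a real model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

sentence = "What is the tallest building in New York?"
inputs = tokenizer(sentence, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,        # nucleus sampling, as recommended above
    temperature=0.7,  # between 0 and 1; higher values give more diverse outputs
    num_return_sequences=3,
    max_new_tokens=60,
)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```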
|
|
# Citation |
|
If you use this model in your work, please cite:
|
|
|
```
@inproceedings{xu-etal-2020-autoqa,
    title = "{A}uto{QA}: From Databases to {QA} Semantic Parsers with Only Synthetic Training Data",
    author = "Xu, Silei and Semnani, Sina and Campagna, Giovanni and Lam, Monica",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.31",
    pages = "422--434",
}
```