Update README
README.md
---
license: apache-2.0
---

# Introduction
This is the automatic paraphrasing model described and used in the paper "[AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data](https://arxiv.org/abs/2010.04806)" (EMNLP 2020).

# Training data
The training data is a cleaned version of the ParaBank 2 dataset introduced in "[Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering](https://aclanthology.org/K19-1005/)".
ParaBank 2 is a paraphrasing dataset constructed by back-translating the Czech portion of an English-Czech parallel corpus.
We use the subset of 5 million sentence pairs with the highest dual conditional cross-entropy scores (which correspond to the highest paraphrasing quality), and keep only one of the five paraphrases provided for each sentence.
Cleaning consisted of removing sentences that do not look like normal English, e.g. sentences that contain URLs or too many special characters.
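The exact rules are not spelled out here, but the filtering step might look roughly like the sketch below. The score ordering follows the description above, while the file name, the TSV layout, and the 0.2 special-character threshold are illustrative assumptions:

```python
import re

def looks_like_normal_english(sentence: str, max_special_ratio: float = 0.2) -> bool:
    """Heuristic: reject sentences containing URLs or too many special characters."""
    if re.search(r"https?://|www\.", sentence):
        return False
    # Characters outside letters, digits, whitespace, and common punctuation
    # count as "special".
    special = sum(
        1 for ch in sentence
        if not (ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-")
    )
    return special / max(len(sentence), 1) <= max_special_ratio

# Rank pairs by dual conditional cross-entropy score, keep the top 5 million,
# and use only the first paraphrase for each sentence.
# (The file name and column layout are assumptions, not the official release format.)
pairs = []
with open("parabank2.tsv", encoding="utf-8") as f:
    for line in f:
        score, source, first_paraphrase, *_ = line.rstrip("\n").split("\t")
        pairs.append((float(score), source, first_paraphrase))

pairs.sort(key=lambda p: p[0], reverse=True)
cleaned = [
    (src, tgt) for _, src, tgt in pairs[:5_000_000]
    if looks_like_normal_english(src) and looks_like_normal_english(tgt)
]
```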

# Training procedure
The model is fine-tuned for 4 epochs on the above-mentioned dataset, starting from the `facebook/bart-large` checkpoint.
We use a token-level cross-entropy loss computed against the gold paraphrase. To ensure the output of the model is grammatical, during training we use the back-translated (machine-generated English) sentence as the input and the original human-written English sentence as the output. Training uses mini-batches of 1280 examples; for higher training efficiency, each mini-batch is constructed by grouping sentences of similar length together.
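As a rough sketch, this setup maps onto the Hugging Face `Seq2SeqTrainer`, where `group_by_length` implements the similar-length batching. The dataset object and the split of the 1280-example mini-batch into per-device batch size times gradient-accumulation steps are assumptions:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

# Each example: input_ids = back-translated (machine-generated) sentence,
# labels = original human-written English sentence.
train_dataset = ...  # tokenized ParaBank 2 subset, prepared as described above

args = Seq2SeqTrainingArguments(
    output_dir="bart-large-paraphraser",
    num_train_epochs=4,              # fine-tune for 4 epochs
    per_device_train_batch_size=80,  # 80 * 16 = 1280 examples per mini-batch
    gradient_accumulation_steps=16,  # (the 80/16 split is an assumption)
    group_by_length=True,            # batch sentences of similar length together
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()  # labels give the default token-level cross-entropy loss
```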

# How to use
Sampling with `top_p=0.9` and a `temperature` between `0` and `1` usually produces good paraphrases. Higher temperatures make the paraphrases more diverse and more different from the input, but might slightly change the meaning of the original sentence.
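For example, with the `transformers` library (the repository id below is a placeholder; substitute the actual id under which this checkpoint is hosted):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "<this-model-repo>"  # placeholder: replace with the actual repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("show me restaurants rated at least 4 stars", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,   # sampling must be enabled for top_p / temperature to apply
    top_p=0.9,
    temperature=0.7,  # raise toward 1 for more diverse paraphrases
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```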

# Citation
If you use this model in your work, please use the following citation:

```