  ---
  license: apache-2.0
  ---
# Introduction
This is the automatic paraphrasing model described and used in the paper
"[AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data](https://arxiv.org/abs/2010.04806)" (EMNLP 2020).
 
# Training data
The training data is a cleaned version of the ParaBank 2 dataset introduced in "[Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering](https://aclanthology.org/K19-1005/)".
ParaBank 2 is a paraphrasing dataset constructed by back-translating the Czech portion of an English-Czech parallel corpus.
We use the subset of 5 million sentence pairs with the highest dual conditional cross-entropy score (which corresponds to the highest paraphrasing quality), and keep only one of the five paraphrases provided for each sentence.
The cleaning step removes sentences that do not look like natural English, for example sentences that contain URLs or too many special characters.
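As a rough sketch of what this kind of filtering looks like (the URL pattern and the 10% special-character threshold below are illustrative assumptions, not the actual cleaning rules):

```python
import re

URL_RE = re.compile(r"https?://|www\.")

def looks_like_natural_english(sentence: str, max_special_ratio: float = 0.1) -> bool:
    """Heuristic sentence filter; the rules and threshold are assumptions."""
    if URL_RE.search(sentence):
        return False
    # Count characters that are neither alphanumeric, whitespace, nor basic punctuation.
    special = sum(1 for c in sentence
                  if not (c.isalnum() or c.isspace() or c in ".,;:!?'\"-()"))
    return special / max(len(sentence), 1) <= max_special_ratio
```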
# Training procedure
The model is fine-tuned for 4 epochs on the dataset described above, starting from the `facebook/bart-large` checkpoint.
We use a token-level cross-entropy loss computed against the gold paraphrase. To ensure that the model's output is grammatical, during training we use the back-translated Czech sentence as the input and the human-written English sentence as the output. Training uses mini-batches of 1280 examples; for higher training efficiency, each mini-batch is constructed by grouping sentences of similar length together.
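A minimal sketch of this setup with the Hugging Face `Seq2SeqTrainer` is shown below. The toy dataset, the column names, and the 16 × 80 split of the 1280-example mini-batch are assumptions for illustration; only the checkpoint, epoch count, effective batch size, and length grouping come from the description above.

```python
from datasets import Dataset
from transformers import (
    BartForConditionalGeneration,
    BartTokenizerFast,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Tiny stand-in for the cleaned ParaBank 2 subset described above: the
# back-translated Czech side is the input, the human-written English side
# is the target.
dataset = Dataset.from_dict({
    "input": ["What is the closest cafe to here?"],
    "output": ["What is the nearest coffee shop to this place?"],
})

def preprocess(example):
    model_inputs = tokenizer(example["input"], truncation=True)
    model_inputs["labels"] = tokenizer(example["output"], truncation=True)["input_ids"]
    return model_inputs

dataset = dataset.map(preprocess, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="paraphraser-bart-large",
    num_train_epochs=4,              # 4 epochs, as described above
    per_device_train_batch_size=16,  # 16 x 80 gives the effective
    gradient_accumulation_steps=80,  # mini-batch of 1280 examples
    group_by_length=True,            # batch sentences of similar length together
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```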
# How to use
Using `top_p=0.9` and a `temperature` between `0` and `1` usually results in good paraphrases. Higher temperatures produce paraphrases that are more diverse and differ more from the input, but they might slightly change the meaning of the original sentence.
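For example, with the `transformers` library (the model identifier below is a placeholder for this repository's name on the Hugging Face Hub):

```python
from transformers import BartForConditionalGeneration, BartTokenizerFast

model_name = "<this-model-repository>"  # placeholder: use this repo's Hub id
tokenizer = BartTokenizerFast.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

inputs = tokenizer("How far is it to the nearest coffee shop?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,               # nucleus sampling, as recommended above
    temperature=0.7,         # between 0 and 1; higher = more diverse paraphrases
    num_return_sequences=3,  # sample several candidate paraphrases
    max_length=64,
)
for paraphrase in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(paraphrase)
```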
# Citation
If you use this model in your work, please use this citation:
 
```
@inproceedings{xu-etal-2020-autoqa,
  title     = "AutoQA: From Databases to QA Semantic Parsers with Only Synthetic Training Data",
  author    = "Xu, Silei and Semnani, Sina J. and Campagna, Giovanni and Lam, Monica S.",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  year      = "2020",
}
```