### Motivation
Without post-processing, the English-to-Romanian model `mbart-large-en-ro` gets a BLEU score of 26.8 on the WMT data.
With post-processing, it can score 37.
Here is the postprocessing code, stolen from @mjpost in this [issue](https://github.com/pytorch/fairseq/issues/1758)
### Instructions
Note: You need to have your test_generations.txt before you start this process.
(1) Set up `mosesdecoder` and `wmt16-scripts`
```bash
cd $HOME
git clone git@github.com:moses-smt/mosesdecoder.git
cd mosesdecoder
git clone git@github.com:rsennrich/wmt16-scripts.git
```
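Before moving on, it can help to confirm that the clones put the scripts where step (2) expects them. The following is a minimal sketch: `check_ro_scripts` is a hypothetical helper name, not part of either repository, and the relative paths are the ones referenced by the post-processing function in step (2).

```bash
# Hypothetical helper: report every script used in step (2) that is
# missing under the given root (defaults to $HOME/mosesdecoder).
check_ro_scripts () {
    local root=${1:-$HOME/mosesdecoder}
    local missing=0
    local rel
    for rel in \
        scripts/tokenizer/replace-unicode-punctuation.perl \
        scripts/tokenizer/normalize-punctuation.perl \
        scripts/tokenizer/remove-non-printing-char.perl \
        scripts/tokenizer/tokenizer.perl \
        wmt16-scripts/preprocess/remove-diacritics.py \
        wmt16-scripts/preprocess/normalise-romanian.py
    do
        if [ ! -f "$root/$rel" ]; then
            echo "missing: $root/$rel" >&2
            missing=1
        fi
    done
    # Non-zero exit status if at least one script was not found
    return $missing
}
```

Running `check_ro_scripts` with no argument checks `$HOME/mosesdecoder`; it prints nothing and returns success when every script is present.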
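(2) Define a function for post-processing.
It applies the same pipeline to the system output and the reference: replace Unicode punctuation, normalize punctuation, remove non-printing characters, normalize Romanian characters, remove diacritics, and tokenize.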
```bash
ro_post_process () {
    sys=$1   # system output, e.g. test_generations.txt
    ref=$2   # reference translations, e.g. test.target
    export MOSES_PATH=$HOME/mosesdecoder
    REPLACE_UNICODE_PUNCT=$MOSES_PATH/scripts/tokenizer/replace-unicode-punctuation.perl
    NORM_PUNC=$MOSES_PATH/scripts/tokenizer/normalize-punctuation.perl
    REM_NON_PRINT_CHAR=$MOSES_PATH/scripts/tokenizer/remove-non-printing-char.perl
    REMOVE_DIACRITICS=$MOSES_PATH/wmt16-scripts/preprocess/remove-diacritics.py
    NORMALIZE_ROMANIAN=$MOSES_PATH/wmt16-scripts/preprocess/normalise-romanian.py
    TOKENIZER=$MOSES_PATH/scripts/tokenizer/tokenizer.perl
    lang=ro
    # Run the same normalization and tokenization on both files.
    # Output goes to the current directory as <basename of input>.tok
    for file in "$sys" "$ref"; do
        cat "$file" \
        | "$REPLACE_UNICODE_PUNCT" \
        | "$NORM_PUNC" -l "$lang" \
        | "$REM_NON_PRINT_CHAR" \
        | "$NORMALIZE_ROMANIAN" \
        | "$REMOVE_DIACRITICS" \
        | "$TOKENIZER" -no-escape -l "$lang" \
        > "$(basename "$file").tok"
    done
    # compute BLEU on the already-tokenized files (sacrebleu tokenization disabled)
    cat "$(basename "$sys").tok" | sacrebleu -tok none -s none -b "$(basename "$ref").tok"
}
```
(3) Call the function on test_generations.txt and test.target
For example,
```bash
ro_post_process enro_finetune/test_generations.txt wmt_en_ro/test.target
```
This will print a new BLEU score and write the post-processed files to the current directory: `test_generations.txt.tok` and `test.target.tok`.
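BLEU is computed line by line against the reference, so the generations file and the target file need the same number of lines. A small sanity check along these lines can catch a truncated generations file before scoring. This is only a sketch: `same_line_count` is a hypothetical helper, not something provided by `sacrebleu` or the Moses scripts.

```bash
# Hypothetical helper: succeed only if both files exist and have the
# same number of lines.
same_line_count () {
    local sys=$1 ref=$2 n_sys n_ref
    if [ ! -f "$sys" ] || [ ! -f "$ref" ]; then
        echo "usage: same_line_count SYS_FILE REF_FILE" >&2
        return 2
    fi
    # tr strips the padding some wc implementations add around the count
    n_sys=$(wc -l < "$sys" | tr -d ' ')
    n_ref=$(wc -l < "$ref" | tr -d ' ')
    if [ "$n_sys" -ne "$n_ref" ]; then
        echo "line count mismatch: $sys has $n_sys, $ref has $n_ref" >&2
        return 1
    fi
}
```

For example, `same_line_count enro_finetune/test_generations.txt wmt_en_ro/test.target && ro_post_process enro_finetune/test_generations.txt wmt_en_ro/test.target` only runs the scoring when the two files line up.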