|
## Training a pointer-generator model on the Extreme Summarization dataset |
|
|
|
##### 1. Download the Extreme Summarization data and preprocess it |
|
|
|
Follow the instructions [here](https://github.com/EdinburghNLP/XSum) to obtain
the original Extreme Summarization dataset. You should have six files,
`{train,validation,test}.{document,summary}`.
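
As a quick sanity check (not part of the original instructions), you can verify that each document file pairs line-for-line with its summary file:

```bash
# Line counts should match within each split.
wc -l {train,validation,test}.{document,summary}
```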
|
|
|
##### 2. Create a vocabulary and extend it with source position markers |
|
|
|
```bash
vocab_size=10000
position_markers=1000
export LC_ALL=C
# Count word frequencies over the training data and keep the most frequent
# words. Four slots are reserved for fairseq's special symbols
# (<s>, <pad>, </s>, <unk>), hence vocab_size - 4.
cat train.document train.summary |
tr -s '[:space:]' '\n' |
sort |
uniq -c |
sort -k1,1bnr -k2 |
head -n "$((vocab_size - 4))" |
awk '{ print $2 " " $1 }' >dict.pg.txt
# Append the source position markers <unk-0> ... <unk-999> with zero counts.
python3 -c "[print('<unk-{}> 0'.format(n)) for n in range($position_markers)]" >>dict.pg.txt
```
|
|
|
This creates the file `dict.pg.txt`, which contains the 10k most frequent words,
followed by 1k source position markers:
|
|
|
```
the 4954867
. 4157552
, 3439668
to 2212159
a 1916857
of 1916820
and 1823350
...
<unk-0> 0
<unk-1> 0
<unk-2> 0
<unk-3> 0
<unk-4> 0
...
```
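
As another optional check, the dictionary should contain `(vocab_size - 4) + position_markers` entries, i.e. 10996 lines with the values above:

```bash
# (10000 - 4) words + 1000 position markers = 10996 lines.
wc -l dict.pg.txt
```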
|
|
|
##### 3. Preprocess the text data
|
|
|
```bash
./preprocess.py --source train.document --target train.summary --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out train.pg.src --target-out train.pg.tgt
./preprocess.py --source validation.document --target validation.summary --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out valid.pg.src --target-out valid.pg.tgt
./preprocess.py --source test.document --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out test.pg.src
```
|
|
|
The data should now contain `<unk-N>` tokens in place of out-of-vocabulary words. |
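
To spot-check this (an optional step), you can look for position markers in the preprocessed training source:

```bash
# Print a handful of markers standing in for out-of-vocabulary words.
grep -o '<unk-[0-9]*>' train.pg.src | head -n 5
```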
|
|
|
##### 4. Binarize the dataset
|
|
|
```bash
fairseq-preprocess \
    --source-lang src \
    --target-lang tgt \
    --trainpref train.pg \
    --validpref valid.pg \
    --destdir bin \
    --workers 60 \
    --srcdict dict.pg.txt \
    --joined-dictionary
```
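
If binarization succeeded, `bin` should now contain the dictionaries, a preprocessing log, and the binarized index/data files. The names below are typical output of `fairseq-preprocess` and may vary slightly across fairseq versions:

```bash
ls bin
# dict.src.txt  dict.tgt.txt  preprocess.log
# train.src-tgt.src.bin  train.src-tgt.src.idx  train.src-tgt.tgt.bin  train.src-tgt.tgt.idx
# valid.src-tgt.src.bin  valid.src-tgt.src.idx  valid.src-tgt.tgt.bin  valid.src-tgt.tgt.idx
```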
|
|
|
##### 5. Train a model
|
|
|
```bash
total_updates=20000
warmup_updates=500
lr=0.001
max_tokens=4096
update_freq=4
pointer_layer=-2

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train bin \
    --user-dir examples/pointer_generator/pointer_generator_src \
    --max-tokens "$max_tokens" \
    --task translation \
    --source-lang src --target-lang tgt \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --encoder-normalize-before \
    --decoder-normalize-before \
    --required-batch-size-multiple 1 \
    --arch transformer_pointer_generator \
    --alignment-layer "$pointer_layer" \
    --alignment-heads 1 \
    --source-position-markers 1000 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler inverse_sqrt --lr "$lr" --max-update "$total_updates" --warmup-updates "$warmup_updates" \
    --update-freq "$update_freq" \
    --skip-invalid-size-inputs-valid-test
```
|
|
|
Above we specify that our dictionary contains 1000 source position markers, and
that we want to use one attention head from the penultimate decoder layer for
pointing (`--alignment-layer -2`). Training should take around 5.5 hours on one
node with eight 32 GB V100 GPUs. The logged messages confirm that dictionary
indices above 10000 will be mapped to the `<unk>` embedding:
|
|
|
```
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | [src] dictionary: 11000 types
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | [tgt] dictionary: 11000 types
2020-09-24 20:43:53 | INFO | fairseq.data.data_utils | loaded 11332 examples from: bin/valid.src-tgt.src
2020-09-24 20:43:53 | INFO | fairseq.data.data_utils | loaded 11332 examples from: bin/valid.src-tgt.tgt
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | bin valid src-tgt 11332 examples
2020-09-24 20:43:53 | INFO | fairseq.models.transformer_pg | dictionary indices from 10000 to 10999 will be mapped to 3
```
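
The effective batch size in tokens is the product of the GPU count, `max_tokens`, and `update_freq`, so if you train on fewer GPUs you can raise `--update-freq` to compensate. A back-of-the-envelope check of this scaling rule (the single-GPU value of 32 is a hypothetical example, not a tested configuration):

```bash
# 8 GPUs x 4096 tokens x update_freq 4 = 131072 tokens per parameter update.
echo "8 GPUs: $((8 * 4096 * 4)) tokens/update"
# A single-GPU run would need --update-freq 32 to match.
echo "1 GPU:  $((1 * 4096 * 32)) tokens/update"
```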
|
|
|
##### 6. Summarize the test sequences
|
|
|
```bash
batch_size=32
beam_size=6
max_length=60
length_penalty=1.0

fairseq-interactive bin \
    --user-dir examples/pointer_generator/pointer_generator_src \
    --batch-size "$batch_size" \
    --task translation \
    --source-lang src --target-lang tgt \
    --path checkpoints/checkpoint_last.pt \
    --input test.pg.src \
    --buffer-size 200 \
    --max-len-a 0 \
    --max-len-b "$max_length" \
    --lenpen "$length_penalty" \
    --beam "$beam_size" \
    --skip-invalid-size-inputs-valid-test |
tee generate.out
# Hypothesis lines have the form "H-<id>\t<score>\t<tokens>"; keep the tokens.
grep ^H generate.out | cut -f 3- >generate.hyp
```
|
|
|
Now you should have the generated sequences in `generate.hyp`. They contain
`<unk-N>` tokens that the model copied from the source sequence. To retrieve
the original words, we need the unprocessed source sequences from
`test.document`.
|
##### 7. Process the generated output
|
|
|
Since we skipped inputs that were too long when producing `generate.hyp`, we
have to apply the same length filter when reading `test.document`, so that the
source documents and hypotheses stay aligned.
|
|
|
```bash
./postprocess.py \
    --source <(awk 'NF<1024' test.document) \
    --target generate.hyp \
    --target-out generate.hyp.processed
```
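
It's also worth verifying (again optional) that the filtered source and the hypotheses align line-for-line; mismatched counts would mean the `<unk-N>` replacement is reading the wrong source lines:

```bash
# Both commands should print the same number.
awk 'NF<1024' test.document | wc -l
wc -l <generate.hyp
```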
|
|
|
You'll now find the final sequences in `generate.hyp.processed`, with each
`<unk-N>` token replaced by the original word from the source sequence.
|
|
|
##### An example of a summarized sequence |
|
|
|
The original source document in `test.document`: |
|
|
|
> de roon moved to teesside in june 2016 for an initial # 8.8 m fee and played 33 premier league games last term . the netherlands international , 26 , scored five goals in 36 league and cup games during his spell at boro . meanwhile , manager garry monk confirmed the championship club 's interest in signing chelsea midfielder lewis baker . `` he 's a target and one of many that we 've had throughout the summer months , '' said monk . find all the latest football transfers on our dedicated page . |
|
|
|
The preprocessed source document in `test.pg.src`:
|
|
|
> de \<unk-1> moved to \<unk-4> in june 2016 for an initial # \<unk-12> m fee and played 33 premier league games last term . the netherlands international , 26 , scored five goals in 36 league and cup games during his spell at boro . meanwhile , manager garry monk confirmed the championship club 's interest in signing chelsea midfielder lewis baker . `` he 's a target and one of many that we 've had throughout the summer months , '' said monk . find all the latest football transfers on our dedicated page . |
|
|
|
The generated summary in `generate.hyp`: |
|
|
|
> middlesbrough striker \<unk> de \<unk-1> has joined spanish side \<unk> on a season-long loan . |
|
|
|
The generated summary after postprocessing in `generate.hyp.processed`: |
|
|
|
> middlesbrough striker \<unk> de roon has joined spanish side \<unk> on a season-long loan . |
|
|