Better results than the reported results and some questions about the model

#1
by JingFan

Hi there,

my colleague @dennlinger and I are from the Institute of Computer Science at Heidelberg University, where we are currently investigating the performance of German abstractive summarizers. We are very interested in your model and have evaluated it on the MLSUM test set (all samples, no filtering). We found that our results (see table below) are better than the ones you reported in the model card.

| Dataset (generation parameters) | ROUGE-1 F1 | ROUGE-2 F1 | ROUGE-L F1 |
|---|---|---|---|
| MLSUM test (`max_length=354`, `min_length=13`, `do_sample=False`, `truncation=True`) | 0.1882 | 0.0553 | 0.1448 |

Length parameters were obtained based on statistics from the MLSUM training set. We further do not explicitly prepend the "summarize: " prefix to any of the samples, FWIW.
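
For transparency, our evaluation roughly corresponds to the following sketch; the model identifier is a placeholder (not necessarily the exact checkpoint name), and the input budget of 512 tokens is an assumption on our side:

```python
# Rough sketch of our evaluation setup; the model identifier is a placeholder
# and max_length=512 for the input is an assumption, not a confirmed setting.
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import evaluate

model_name = "your-org/your-german-summarizer"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

test_set = load_dataset("mlsum", "de", split="test")  # full test split, no filtering
rouge = evaluate.load("rouge")

predictions, references = [], []
for sample in test_set:
    inputs = tokenizer(sample["text"], truncation=True, max_length=512, return_tensors="pt")
    summary_ids = model.generate(
        **inputs,
        max_length=354,   # from MLSUM training-set statistics
        min_length=13,
        do_sample=False,  # deterministic decoding
    )
    predictions.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
    references.append(sample["summary"])

print(rouge.compute(predictions=predictions, references=references))
```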
Given the differing results, we had some further questions about your model:

  1. For the data preprocessing, only the records with no more than 94 summary tokens were selected. Could we ask why you chose 94? Is this somehow related to the SwissText dataset?
  2. As a follow-up, did you also perform similar filtering on the validation and test set (potentially explaining the different results we obtain)?
  3. There are no parameters specified for the generation of summaries in the test set (aside from "no beams"). Does this imply that you simply performed greedy search for the summary without specification of any further parameters (like minimum/maximum length)?
  4. We randomly checked some source articles against the generated summaries and found frequent repetition and semantic incoherence in the generated summaries. The figure below shows your model's outputs for the first five articles in the MLSUM test set. Do you have any insights into why this might occur?

*[Figure: your model's outputs for the first five articles in the MLSUM test set]*

We would appreciate any comments and feedback!

Best wishes,

Dennis and Jing

Deutsche Telekom AG org

Hi,
sorry for the late reply.
Unfortunately, the computer on which I trained the models is not accessible right now, so I'm afraid I'll have to ask you for a little more patience.

> For the data preprocessing, only the records with no more than 94 summary tokens were selected. Could we ask why you chose 94? Is this somehow related to the SwissText dataset?

I ran some experiments to find values for the input and output sequence lengths.
I had to balance them somehow to avoid OOMs on the GPUs.

These models cannot be trained with FP16 due to a strange bug :-(

94 is 96 - 2; the -2 accounts for the start and end tokens, if I remember correctly.
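
To illustrate the idea, a filter along those lines could look roughly like the sketch below; the tokenizer checkpoint and the dataset used here are placeholders for illustration, not necessarily my original setup:

```python
# Sketch of a summary-length filter: 94 content tokens plus 2 special tokens
# stay within a 96-token target budget. Tokenizer and dataset are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")  # placeholder
train_set = load_dataset("mlsum", "de", split="train")

def short_enough(example, max_summary_tokens=94):
    # count only the content tokens of the reference summary
    token_ids = tokenizer(example["summary"], add_special_tokens=False)["input_ids"]
    return len(token_ids) <= max_summary_tokens

filtered_train = train_set.filter(short_enough)
print(len(train_set), "->", len(filtered_train))
```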

> As a follow-up, did you also perform similar filtering on the validation and test set (potentially explaining the different results we obtain)?

I'd have to check that in the code, which I can't access right now...

> There are no parameters specified for the generation of summaries in the test set (aside from "no beams"). Does this imply that you simply performed greedy search for the summary without specification of any further parameters (like minimum/maximum length)?

For evaluation (and training) I used this script:
https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization.py

Maybe you will find your answers there?
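
For reference, an invocation of that script could look roughly like the one below; all of the flag values are placeholders rather than my actual settings, which I would have to look up on the machine I currently cannot access:

```python
# Hypothetical invocation of the Hugging Face example script; every flag value
# below is a placeholder, not the setting actually used for this model.
import subprocess

subprocess.run(
    [
        "python", "run_summarization.py",
        "--model_name_or_path", "google/mt5-small",  # placeholder base checkpoint
        "--dataset_name", "mlsum",
        "--dataset_config_name", "de",               # German portion of MLSUM
        "--do_train", "--do_eval", "--do_predict",
        "--predict_with_generate",                   # generate summaries for ROUGE
        "--max_source_length", "512",
        "--max_target_length", "96",                 # cf. the 96 - 2 = 94 filter above
        "--num_beams", "1",                          # 1 beam = greedy decoding
        "--output_dir", "./mt5-mlsum-de",
        "--overwrite_output_dir",
    ],
    check=True,
)
```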

> Do you have any insights into why this might occur?

Thanks for sharing these insights.

No, unfortunately not.
Do you have any idea how to avoid it?

PS:
I plan to train more summarization models based on our new German T5 model:
https://huggingface.co./GermanT5/t5-efficient-gc4-german-base-nl36

If you have the time and interest, I would be happy to work with you on it. Maybe you have ideas about which datasets could be used?

If you like, I can set up a conference call for us after my vacation?

Hi Philip,
thanks so much for your response! It's also very interesting to hear about the FP16 issue; I wasn't aware of it, so good to know.

As for the repetitions, we have since experimented with n-gram repetition penalties (either no_repeat_ngram_size or, less strictly, repetition_penalty, although the latter is less transparent to set as a parameter). On paper, the ROUGE scores did not move significantly, but at least this avoids issues such as the one above quite well. I still think we'd have to investigate a bit more to say for sure that this doesn't cause other problems, but it seems to be a quick fix for very frequent repetitions.
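
Concretely, what we tried corresponds roughly to this sketch (the model identifier is a placeholder and the parameter values are simply our starting points, not tuned recommendations):

```python
# Sketch of generation with an n-gram repetition block; model identifier and
# parameter values are placeholders for illustration.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "your-org/your-german-summarizer"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = "Hier steht der Artikeltext ..."  # any MLSUM source article
inputs = tokenizer(article, truncation=True, max_length=512, return_tensors="pt")

summary_ids = model.generate(
    **inputs,
    max_length=354,
    min_length=13,
    do_sample=False,
    no_repeat_ngram_size=3,    # block any repeated 3-gram in the output
    # repetition_penalty=1.2,  # softer alternative, but less transparent to tune
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```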

As for the length filtering, one follow-up question would be whether it also applies to the test set, or did you use the (unfiltered) full test set for your scoring runs?

And finally, great to hear that you are working on some newer models! I think Stefan is already aware of my dataset Klexikon; however, its texts are generally too long for the standard 512-token input (and often also for the 512-token output). I'm working on a weakly paragraph-aligned variant that might be suitable for your training purposes, although I expect there is some content overlap in the source articles, given that we also use Wikipedia as input text, similar to the SwissText challenge data. However, the data is quite high quality, as I've manually checked about 10% of the dataset, so let me know if you have any particular questions about it.

Otherwise, we have used the German subset of the wiki_lingua dataset (careful, there are two different versions on the Hugging Face Hub: the original and the re-crawled GEM version). However, the target "summaries" are often extremely short and, I would say, not very suitable for natural language summarization.
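
For reference, loading the two corpora should look roughly like this; the dataset identifiers and config names below are written from memory, so please double-check them before relying on them:

```python
# Hedged sketch for loading the datasets mentioned above; the dataset IDs and
# config names are best-effort from memory and may need double-checking.
from datasets import load_dataset

# Klexikon: long German article/summary pairs (often well beyond 512 tokens)
klexikon = load_dataset("dennlinger/klexikon", split="train")

# Original WikiLingua release, German portion (target summaries are very short);
# a separately re-crawled version also exists under the GEM organization.
wiki_lingua_de = load_dataset("wiki_lingua", "german", split="train")

print(klexikon)
print(wiki_lingua_de)
```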

Feel free to set up a call after you are back from vacation, I'm very keen to hear more!
Best,
Dennis
