title: Reflections on Kaggle competition [CommonLit Readability Prize]
desc: Reflecting on what worked and what didn't
published: true
date_published: 2022-01-05T00:00:00.000Z
tags: kaggle nlp
The CommonLit Readability Prize was my second Kaggle competition, as well as my worst leaderboard finish. In retrospect, it was a great learning opportunity and I'm still only a little disappointed that I didn't rank higher. Here are some details about what I learned, what I was impressed by, and what I would do differently in the future.
![Hh1718, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons open books](https://upload.wikimedia.org/wikipedia/commons/7/74/Open_books_stacked.jpg)
Competition Details
The task for the competition was to rate the readability of a passage of text, essentially assigning a grade level based on how difficult the passage is to read. For 99% of the participants this looked like a straightforward regression problem, but some interesting alternative approaches emerged after the competition ended. Transformers were, of course, necessary to do well. The competition drew over 3000 teams, which I attribute to the large prize pool and to students who would normally be in school having extra free time over the summer.
Defining Readability
Readability is pretty subjective, so the hosts got many teachers from grades 3-12 to do pairwise comparisons between book excerpts. That is, each teacher was given two texts and asked to decide which one was relatively harder or easier to read. After collecting over 100k pairwise comparisons, they used a Bradley-Terry (BT) model to derive a single readability score for every excerpt.
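To make that concrete, here is a minimal sketch of how Bradley-Terry scores can be fit from pairwise judgments with plain gradient ascent. The comparison data, item count, and hyperparameters below are made up for illustration; the hosts' actual pipeline is not public in this level of detail.

```python
import numpy as np

# Hypothetical data: each pair (i, j) means excerpt i was judged EASIER (more readable)
# than excerpt j. The real competition used on the order of 100k such teacher judgments.
comparisons = [(0, 1), (0, 2), (1, 2), (3, 2), (0, 3)]
n_excerpts = 4

def fit_bradley_terry(pairs, n_items, lr=0.1, n_iters=2000):
    """Fit Bradley-Terry scores by maximizing the log-likelihood with gradient ascent.

    Model: P(i preferred over j) = sigmoid(s_i - s_j)
    """
    scores = np.zeros(n_items)
    for _ in range(n_iters):
        grad = np.zeros(n_items)
        for i, j in pairs:
            p = 1.0 / (1.0 + np.exp(-(scores[i] - scores[j])))
            # d/ds_i log P(i over j) = 1 - p, and d/ds_j = -(1 - p)
            grad[i] += 1.0 - p
            grad[j] -= 1.0 - p
        grad -= 0.01 * scores      # small L2 penalty keeps scores finite if an item always wins/loses
        scores += lr * grad
        scores -= scores.mean()    # BT scores are only identified up to a shift, so anchor the mean at 0
    return scores

print(fit_bradley_terry(comparisons, n_excerpts))
```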
Approaches
Since each training example was basically a text and a number (the readability score), most participants framed it as a regression task: give the model a passage, and it outputs a score. Many of the winning solutions used this framing, but after the competition ended, a few people shared how they had implemented pairwise models.
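For reference, a minimal version of that regression setup, assuming the Hugging Face transformers library and a RoBERTa checkpoint (which many teams used), might look like this:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# num_labels=1 turns the classification head into a single-output regression head;
# problem_type="regression" makes the built-in loss MSE during fine-tuning.
model_name = "roberta-base"  # example checkpoint, not a specific winning solution
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=1,
    problem_type="regression",
)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
score = model(**inputs).logits.squeeze()  # one readability score per passage
```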
In essence, the pairwise model takes two texts as input and outputs a score for how much harder or easier one text is than the other. If one of the two texts has a known readability score, such as a text from the training data, the unlabeled text can be assigned a score from the known score plus the model's predicted relative difference in readability.
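A rough sketch of what such a pairwise model could look like; the architecture, pooling choice, and names here are my own illustration, not the exact models from the posts linked below.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class PairwiseReadability(nn.Module):
    """Encode two passages and predict how much harder the second is than the first."""

    def __init__(self, model_name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Linear(2 * hidden, 1)

    def encode(self, **inputs):
        # Mean-pool the last hidden state as a passage embedding (one common choice)
        out = self.encoder(**inputs).last_hidden_state
        mask = inputs["attention_mask"].unsqueeze(-1)
        return (out * mask).sum(1) / mask.sum(1)

    def forward(self, inputs_a, inputs_b):
        emb_a = self.encode(**inputs_a)
        emb_b = self.encode(**inputs_b)
        # Signed relative difficulty of passage b versus passage a
        return self.head(torch.cat([emb_a, emb_b], dim=-1)).squeeze(-1)


# At inference time, pair an unlabeled passage with an anchor passage whose score
# is known from the training data:
#   predicted_score = anchor_score + model(anchor_inputs, unlabeled_inputs)
```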
I'm probably not doing the explanation justice, so please refer to the following posts.
- Chris Deotte (@cdeotte) explaining how he used BT to score passages.
- Abhishek Thakur's (@abhishek) notebook and post about his pairwise model.
- User @Allohvk giving more details about BT as a loss function.
I don't think the pairwise approach ended up surpassing the regression approach, but it could potentially be useful in future competitions, such as Jigsaw Rate Severity of Toxic Comments, which asks participants to rank the relative toxicity of comments.
To summarize the various techniques that the top solutions used, here is a quick list.
- In-domain pre-training to adapt language models to this type of text.
  - People used Project Gutenberg and other freely available literature.
- Pseudo-labeling unlabeled text and then training on the pseudo-labels.
  - The unlabeled text came from sources similar to those used for in-domain pre-training.
- Ensembling models.
  - Some of the top approaches used over 10 different models with different architectures, training schemes, and sizes.
- Adjusting predictions to have the same mean score as the training data (sketched after this list).
- Using SVR on top of model predictions (also sketched below).
- Not using dropout when fine-tuning (more on this in the next section).
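As promised above, here is a rough sketch of the two post-processing tricks, mean adjustment and SVR stacking. The array shapes, fold structure, and SVR hyperparameters are placeholders, not values from any winning solution.

```python
import numpy as np
from sklearn.svm import SVR

def adjust_mean(preds, train_targets):
    """Shift predictions so their mean matches the mean of the training targets."""
    return preds + (train_targets.mean() - preds.mean())

# Fake data standing in for out-of-fold predictions from 5 transformer models
n_train, n_test, n_models = 2000, 1000, 5
oof_preds = np.random.randn(n_train, n_models)
train_targets = np.random.randn(n_train)
test_preds = np.random.randn(n_test, n_models)

# Fit an SVR on the stacked model predictions, then apply the mean adjustment
svr = SVR(C=1.0, kernel="rbf")
svr.fit(oof_preds, train_targets)
final = adjust_mean(svr.predict(test_preds), train_targets)
```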
The magic of no dropout
One interesting discovery during this competition was the bizarre effect of dropout when using transformer models for regression. Dropout is typically considered essential for preventing overfitting, but when a transformer is fine-tuned for regression, dropout actually hurts performance. It seems hard to believe, but it made a substantial difference. Some users did some digging and found published articles arguing that dropout is fine for classification because the magnitude of the outputs does not really matter: as long as the relative values produce the right answer when you take the largest one, it doesn't matter how big or small each output is. With regression, the magnitude of the output is precisely what you need fine control over. Here is a discussion on it.
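In practice, the trick amounted to zeroing out the dropout probabilities before fine-tuning. A minimal sketch with the Hugging Face transformers API (the checkpoint name is just an example):

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

# Disable dropout in both the hidden layers and the attention probabilities
# before fine-tuning the regression head.
config = AutoConfig.from_pretrained(
    "roberta-base",
    num_labels=1,
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
)
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", config=config)
```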
Lessons Learned
Looking back, I can identify two main reasons why I struggled: working alone and refusing to use public notebooks. Due to a combination of factors I decided to fly solo, which meant I learned a lot, but it also meant I moved slowly and had little support when I ran into problems. Avoiding the public notebooks was useful for learning purposes, but it wasted a lot of valuable time when good public notebooks and models were already available. I realized how foolish that decision was when I read that the first-place finisher had built on the public notebooks to make even better models. I'm still a little salty that I didn't do better, but I'm taking it as a learning opportunity and moving forward. Onto the next one!