# WSDM Cup 2023 BERT Checkpoints

This repo contains the checkpoints from our submissions to two WSDM Cup 2023 tasks: [Pre-training for Web Search](https://aistudio.baidu.com/aistudio/competition/detail/536/0/leaderboard) and [Unbiased Learning for Web Search](https://aistudio.baidu.com/aistudio/competition/detail/534/0/leaderboard).

## Paper released

Please refer to our papers for details on this competition:

- Task 1, Unbiased Learning to Rank: [Multi-Feature Integration for Perception-Dependent Examination-Bias Estimation](https://arxiv.org/pdf/2302.13756.pdf)
- Task 2, Pre-training for Web Search: [Pretraining De-Biased Language Model with Large-scale Click Logs for Document Ranking](https://arxiv.org/pdf/2302.13498.pdf)

## Method Overview

- Pre-train BERT with MLM and CTR prediction loss (or multi-task CTR prediction loss).
- Fine-tune BERT with pairwise ranking loss.
- Obtain prediction scores from the different BERT models.
- Use ensemble learning to combine BERT features and sparse features.

Details will be updated in the submission paper.

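The fine-tuning step in the overview relies on a pairwise ranking loss. The exact formulation is not stated in this README, so the following is only a minimal sketch that assumes a RankNet-style logistic loss over document pairs from the same query; the function names `pairwise_logistic_loss` and `batch_pairwise_loss` are hypothetical, and the papers linked above may use a different form (for example, a margin-based loss).

```python
import math


def pairwise_logistic_loss(pos_score: float, neg_score: float) -> float:
    """Assumed RankNet-style loss: log(1 + exp(-(s_pos - s_neg)))."""
    diff = pos_score - neg_score
    # Numerically stable softplus(-diff).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def batch_pairwise_loss(scores, labels):
    """Mean loss over all pairs (i, j) of one query where labels[i] > labels[j]."""
    losses = [
        pairwise_logistic_loss(scores[i], scores[j])
        for i in range(len(scores))
        for j in range(len(scores))
        if labels[i] > labels[j]
    ]
    # A query with no preference pairs contributes no loss.
    return sum(losses) / len(losses) if losses else 0.0
```

Under this assumption, a model that scores the more relevant document higher gets a smaller loss, so `batch_pairwise_loss([2.0, 1.0, 0.0], [2, 1, 0])` is lower than the same call with the scores reversed. In practice the scores would come from the BERT ranking head rather than hand-written floats.
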
#### BERT features

##### 1) Model details

[Checkpoints Download Here](https://huggingface.co/lixsh6/wsdm23_pretrain/tree/main)

| Index | Model Flag | Method | Pretrain step | Finetune step | DCG on leaderboard |
| ----- | ---------- | ------ | ------------- | ------------- | ------------------ |
| 1 | large_group2_wwm_from_unw4625K | M1 | 1700K | 5130 | 11.96214 |
| 2 | large_group2_wwm_from_unw4625K | M1 | 1700K | 5130 | NAN |
| 3 | base_group2_wwm | M2 | 2150K | 5130 | ~11.32363 |
| 4 | large_group2_wwm_from_unw4625K | M1 | 590K | 5130 | 11.94845 |
| 5 | large_group2_wwm_from_unw4625K | M1 | 1700K | 4180 | NAN |
| 6 | large_group2_mt_pretrain | M3 | 1940K | 5130 | NAN |

##### 2) Method details

| Method | Model Layers | Details |
| ------ | ------------ | ------- |
| M1 | 24 | WWM & CTR prediction as pretraining tasks |
| M2 | 12 | WWM & CTR prediction as pretraining tasks |
| M3 | 24 | WWM & Multi-task CTR prediction as pretraining tasks |

## Contacts

- Xiangsheng Li: [lixsh6@gmail.com](mailto:lixsh6@gmail.com)
- Xiaoshu Chen: [xschenranker@gmail.com](mailto:xschenranker@gmail.com)