Sebastian Hofstätter committed on
Commit
6d34d25
1 Parent(s): bad2ce0

Initial Model & Readme

README.md CHANGED
@@ -1,3 +1,39 @@
 ---
 license: apache-2.0
+language: "en"
+tags:
+- bag-of-words
+- dense-passage-retrieval
+- knowledge-distillation
+datasets:
+- ms_marco
 ---
+
+# Uni-ColBERTer (Dim: 1) for Passage Retrieval
+
+If you want to know more about our (Uni-)ColBERTer architecture, check out our paper: https://arxiv.org/abs/2203.13088 🎉
+
+For more information, source code, and a minimal usage example, please visit: https://github.com/sebastian-hofstaetter/colberter
+
+## Limitations & Bias
+
+- The model is trained only on English text.
+
+- The model inherits social biases from both DistilBERT and MS MARCO.
+
+- The model is trained only on the relatively short passages of MS MARCO (avg. 60 words in length), so it may struggle with longer text.
+
+## Citation
+
+If you use our model checkpoint, please cite our work as:
+
+```
+@article{Hofstaetter2022_colberter,
+  author = {Sebastian Hofst{\"a}tter and Omar Khattab and Sophia Althammer and Mete Sertkan and Allan Hanbury},
+  title = {Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction},
+  publisher = {arXiv},
+  url = {https://arxiv.org/abs/2203.13088},
+  doi = {10.48550/ARXIV.2203.13088},
+  year = {2022},
+}
+```
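The README defers to the GitHub repository for the actual minimal usage example. As a rough, non-authoritative sketch of the overall flow: the `ColBERTer` model class is not part of `transformers`, so the commented-out import and loading call below are assumptions about the code in https://github.com/sebastian-hofstaetter/colberter; only the tokenizer loading uses a standard `transformers` API.

```python
from transformers import AutoTokenizer

# "." = a local clone of this model repository (the Hugging Face repo id also works).
tokenizer = AutoTokenizer.from_pretrained(".")

# The ColBERTer implementation lives in the authors' GitHub repository;
# the import path and from_pretrained-style loader below are hypothetical.
# from colberter import ColBERTer
# model = ColBERTer.from_pretrained(".")

query = tokenizer("who wrote the declaration of independence", return_tensors="pt")
passage = tokenizer(
    "Thomas Jefferson drafted the Declaration of Independence in 1776.",
    return_tensors="pt",
)
# score = model(query, passage)  # exact call signature is defined in the repo code
```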
config.json ADDED
@@ -0,0 +1,18 @@
+{
+  "aggregate_unique_ids": true,
+  "architectures": [
+    "ColBERTer"
+  ],
+  "bert_model": "distilbert-base-uncased",
+  "compress_to_exact_mini_mode": true,
+  "compression_dim": 32,
+  "dual_loss": true,
+  "model_type": "ColBERT",
+  "retrieval_compression_dim": 128,
+  "return_vecs": false,
+  "second_compress_dim": 1,
+  "torch_dtype": "float32",
+  "trainable": true,
+  "transformers_version": "4.12.0",
+  "use_contextualized_stopwords": true
+}
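A plausible reading of this config (hedged, since the field semantics are defined by the ColBERTer code, not by `transformers`): `bert_model` names the DistilBERT backbone, `retrieval_compression_dim` the size of the dense retrieval vector, and `second_compress_dim: 1` the one-dimensional whole-word vectors that give this "Uni / Dim: 1" checkpoint its name. A minimal sketch for inspecting the file with plain Python, assuming a local clone of this repository:

```python
import json

# Read the config.json shown in the diff above.
with open("config.json") as f:
    cfg = json.load(f)

# Print the fields most relevant to the compression setup of this checkpoint.
for key in ("bert_model", "compression_dim", "retrieval_compression_dim",
            "second_compress_dim", "use_contextualized_stopwords"):
    print(f"{key}: {cfg[key]}")
```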
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9509bd5319affe32c3ec101af8960edd527be036a21aab0344f3e1a1a684ef33
+size 265984167
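`pytorch_model.bin` is stored via Git LFS, so the diff only shows the three-line pointer file; the actual weights have to be fetched separately (e.g. with `git lfs pull`). Once fetched, the download can be checked against the size and SHA-256 recorded in the pointer, for example:

```python
import hashlib
import os

# Values taken from the LFS pointer file above.
EXPECTED_SHA256 = "9509bd5319affe32c3ec101af8960edd527be036a21aab0344f3e1a1a684ef33"
EXPECTED_SIZE = 265984167  # bytes

path = "pytorch_model.bin"

# If the size is only a few hundred bytes, the LFS content was never pulled.
assert os.path.getsize(path) == EXPECTED_SIZE, "size mismatch -- is this still the LFS pointer?"

sha = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)

assert sha.hexdigest() == EXPECTED_SHA256, "checksum mismatch"
print("pytorch_model.bin matches the recorded LFS oid and size")
```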
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "distilbert-base-uncased", "tokenizer_class": "DistilBertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff